# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR renames categories of llama_stack loggers.
This PR aligns logging categories as per the package name, as well as
reviews from initial
https://github.com/meta-llama/llama-stack/pull/2868. This is a follow up
to #3061.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Replaces https://github.com/meta-llama/llama-stack/pull/2868
Part of https://github.com/meta-llama/llama-stack/issues/2865
cc @leseb @rhuss
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
This PR adds a step in pre-commit to enforce using `llama_stack` logger.
Currently, various parts of the code base uses different loggers. As a
custom `llama_stack` logger exist and used in the codebase, it is better
to standardize its utilization.
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>
# What does this PR do?
The goal of this PR is code base modernization.
Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)
Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worth of note can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py` where reflection code was updated
to deal with "newer" types.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
Move around bits. This makes the copies from llama-models _much_ easier
to maintain and ensures we don't entangle meta-reference specific
tidbits into llama-models code even by accident.
Also, kills the meta-reference-quantized-gpu distro and rolls
quantization deps into meta-reference-gpu.
## Test Plan
```
LLAMA_MODELS_DEBUG=1 \
with-proxy llama stack run meta-reference-gpu \
--env INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct \
--env INFERENCE_CHECKPOINT_DIR=<DIR> \
--env MODEL_PARALLEL_SIZE=4 \
--env QUANTIZATION_TYPE=fp8_mixed
```
Start a server with and without quantization. Point integration tests to
it using:
```
pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config http://localhost:8321 --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```