Commit graph

18 commits

Author SHA1 Message Date
Sébastien Han
b45cc42202
chore: remove usage of load_tiktoken_bpe
The `load_tiktoken_bpe()` function depends on blobfile to load
tokenizer.model files. However, blobfile brings in pycryptodomex, which
is primarily used for JWT signing in GCP - functionality we don’t
require, as we always load tokenizers from local files. pycryptodomex
implements its own cryptographic primitives, which are known to be
problematic and insecure. While blobfile could potentially switch to the
more secure PyCA cryptography library, the project appears inactive, so
this transition may not happen soon. Fortunately, `load_tiktoken_bpe()`
is a simple function that just reads a BPE file and returns a dictionary
mapping byte sequences to their mergeable ranks. It’s straightforward
enough for us to implement ourselves.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-27 20:47:06 +02:00
raghotham
5a422e236c
chore: make cprint write to stderr (#2250)
Also do sys.exit(1) in case of errors
2025-05-24 23:39:57 -07:00
ehhuang
664161c462
fix: llama4 tool use prompt fix (#2103)
Tests:
LLAMA_STACK_CONFIG=http://localhost:5002 pytest -s -v
tests/integration/inference --safety-shield meta-llama/Llama-Guard-3-8B
--vision-model meta-llama/Llama-4-Scout-17B-16E-Instruct --text-model
meta-llama/Llama-4-Scout-17B-16E-Instruct

LLAMA_STACK_CONFIG=http://localhost:5002 pytest -s -v
tests/integration/inference --safety-shield meta-llama/Llama-Guard-3-8B
--vision-model Llama-4-Maverick-17B-128E-Instruct --text-model
Llama-4-Maverick-17B-128E-Instruct

Co-authored-by: Eric Huang <erichuang@fb.com>
2025-05-06 22:18:31 -07:00
Ihar Hrachyshka
9e6561a1ec
chore: enable pyupgrade fixes (#1806)
# What does this PR do?

The goal of this PR is code base modernization.

Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)

Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worth of note can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py` where reflection code was updated
to deal with "newer" types.

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-05-01 14:23:50 -07:00
Sébastien Han
dc94433072
feat(pre-commit): enhance pre-commit hooks with additional checks (#2014)
# What does this PR do?

Add several new pre-commit hooks to improve code quality and security:

- no-commit-to-branch: prevent direct commits to protected branches like
`main`
- check-yaml: validate YAML files
- detect-private-key: prevent accidental commit of private keys
- requirements-txt-fixer: maintain consistent requirements.txt format
and sorting
- mixed-line-ending: enforce LF line endings to avoid mixed line endings
- check-executables-have-shebangs: ensure executable scripts have
shebangs
- check-json: validate JSON files
- check-shebang-scripts-are-executable: verify shebang scripts are
executable
- check-symlinks: validate symlinks and report broken ones
- check-toml: validate TOML files mainly for pyproject.toml

The respective fixes have been included.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-30 11:35:49 -07:00
ehhuang
0266b20535
docs: update prompt_format.md for llama4 (#2035)
torchrun --nproc_per_node=8 scripts/generate_prompt_format.py
meta-llama/Llama-4-Scout-17B-16E-Instruct ~/local/checkpoints/<path>/
llama_stack.models.llama.llama4.prompts
llama_stack/models/llama/llama4/prompt_format.md

Co-authored-by: Eric Huang <erichuang@fb.com>
2025-04-25 15:52:15 -07:00
ehhuang
1b2e116a2a
fix: tool call encoded twice (#2034)
# What does this PR do?


## Test Plan
LLAMA_STACK_CONFIG=http://localhost:5002 pytest -s -v
tests/integration/inference --safety-shield meta-llama/Llama-Guard-3-8B
--vision-model meta-llama/Llama-4-Scout-17B-16E-Instruct --text-model
meta-llama/Llama-4-Scout-17B-16E-Instruct
2025-04-25 13:16:16 -07:00
ehhuang
29072f40ab
feat: new system prompt for llama4 (#2031)
Tests:

LLAMA_STACK_CONFIG=http://localhost:5002 pytest -s -v
tests/integration/inference --safety-shield meta-llama/Llama-Guard-3-8B
--vision-model meta-llama/Llama-4-Scout-17B-16E-Instruct --text-model
meta-llama/Llama-4-Scout-17B-16E-Instruct

Co-authored-by: Eric Huang <erichuang@fb.com>
2025-04-25 11:29:08 -07:00
ehhuang
2976b5d992
fix: OAI compat endpoint for meta reference inference provider (#1962)
Test plan:
python tests/verifications/generate_report.py --providers
fireworks,together,llama_meta_ref,openai

Co-authored-by: Eric Huang <erichuang@fb.com>
2025-04-17 11:16:04 -07:00
Ashwin Bharambe
f34f22f8c7
feat: add batch inference API to llama stack inference (#1945)
# What does this PR do?

This PR adds two methods to the Inference API:
- `batch_completion`
- `batch_chat_completion`

The motivation is for evaluations targeting a local inference engine
(like meta-reference or vllm) where batch APIs provide for a substantial
amount of acceleration.

Why did I not add this to `Api.batch_inference` though? That just
resulted in a _lot_ more book-keeping given the structure of Llama
Stack. Had I done that, I would have needed to create a notion of a
"batch model" resource, setup routing based on that, etc. This does not
sound ideal.

So what's the future of the batch inference API? I am not sure. Maybe we
can keep it for true _asynchronous_ execution. So you can submit
requests, and it can return a Job instance, etc.

## Test Plan

Run meta-reference-gpu using:
```bash
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000
export MODEL_PARALLEL_SIZE=4
export MAX_BATCH_SIZE=32
export MAX_SEQ_LEN=6144

LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu
```

Then run the batch inference test case.
2025-04-12 11:41:12 -07:00
Ashwin Bharambe
70a7e4d51e fix: unhide python_start, python_end 2025-04-11 20:30:44 -07:00
Jiawen Liu
36a31fe5dd
fix: on-the-fly int4 quantize parameter (#1920)
Mirror to https://github.com/meta-llama/llama-models/pull/324 with some
clean up

```
with-proxy pip install -e .
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct
export QUANTIZATION_TYPE=int4_mixed
with-proxy llama stack build --run --template meta-reference-gpu
```

# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)
2025-04-09 15:00:12 -07:00
Ashwin Bharambe
e2299291c4
fix: Mirror llama4 rope scaling fixes, small model simplify (#1917)
See:
- https://github.com/meta-llama/llama-models/pull/322
- https://github.com/meta-llama/llama-models/pull/320
2025-04-09 11:28:45 -07:00
Ashwin Bharambe
8001c30a4f fix: meta reference + llama4 tokenizer fix 2025-04-09 00:46:32 -07:00
Ashwin Bharambe
530d4bdfe1
refactor: move all llama code to models/llama out of meta reference (#1887)
# What does this PR do?

Move around bits. This makes the copies from llama-models _much_ easier
to maintain and ensures we don't entangle meta-reference specific
tidbits into llama-models code even by accident.

Also, kills the meta-reference-quantized-gpu distro and rolls
quantization deps into meta-reference-gpu.

## Test Plan

```
LLAMA_MODELS_DEBUG=1 \
  with-proxy llama stack run meta-reference-gpu \
  --env INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct \
   --env INFERENCE_CHECKPOINT_DIR=<DIR> \
   --env MODEL_PARALLEL_SIZE=4 \
   --env QUANTIZATION_TYPE=fp8_mixed
```

Start a server with and without quantization. Point integration tests to
it using:

```
pytest -s -v  tests/integration/inference/test_text_inference.py \
   --stack-config http://localhost:8321 --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
2025-04-07 15:03:58 -07:00
Hardik Shah
28e262ecdc
feat: make multi-turn tool call tests work with llama4 (#1886)
Running full Tool Calling required some updates to work e2e.
- Remove `python_start` and `python_end` tags 
- Tool Call messages and Tool Resposne messages should end with
`<|eom|>`
- System prompt needed updates 
```
You are a helpful assisant who can can answer general questions or invoke tools when necessary.
In addition to tool calls, you should also augment your responses by using the tool outputs.
```

### Test Plan 
- Start server with meta-reference 
```
LLAMA_STACK_DISABLE_VERSION_CHECK=1 LLAMA_MODELS_DEBUG=1 INFERENCE_MODEL=meta-llama/$MODEL  llama stack run meta-reference-gpu 
``` 
- Added **NEW** tests with 5 test cases for multi-turn tool calls 
```
pytest -s -v --stack-config http://localhost:8321 tests/integration/inference/test_text_inference.py --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
``` 
- Also verified all vision and agent tests pass
2025-04-06 19:14:21 -07:00
Ashwin Bharambe
3f92b2bf85 fix: kill the usage of python_start and python_end tokens 2025-04-05 19:00:26 -07:00
Ashwin Bharambe
b8f1561956
feat: introduce llama4 support (#1877)
As title says. Details in README, elsewhere.
2025-04-05 11:53:35 -07:00