Commit graph

22 commits

Author SHA1 Message Date
Ashwin Bharambe
ef0736527d
feat(tools)!: substantial clean up of "Tool" related datatypes (#3627)
This is a sweeping change to clean up some gunk around our "Tool"
definitions.

First, we had two types `Tool` and `ToolDef`. The first of these was a
"Resource" type for the registry but we had stopped registering tools
inside the Registry long back (and only registered ToolGroups.) The
latter was for specifying tools for the Agents API. This PR removes the
former and adds an optional `toolgroup_id` field to the latter.

Secondly, as pointed out by @bbrowning in
https://github.com/llamastack/llama-stack/pull/3003#issuecomment-3245270132,
we were doing a lossy conversion from a full JSON schema from the MCP
tool specification into our ToolDefinition to send it to the model.
There is no necessity to do this -- we ourselves aren't doing any
execution at all but merely passing it to the chat completions API which
supports this. By doing this (and by doing it poorly), we encountered
limitations like not supporting array items, or not resolving $refs,
etc.

To fix this, we replaced the `parameters` field by `{ input_schema,
output_schema }` which can be full blown JSON schemas.

Finally, there were some types in our llama-related chat format
conversion which needed some cleanup. We are taking this opportunity to
clean those up.

This PR is a substantial breaking change to the API. However, given our
window for introducing breaking changes, this suits us just fine. I will
be landing a concurrent `llama-stack-client` change as well since API
shapes are changing.
2025-10-02 15:12:03 -07:00
Mustafa Elbehery
c3b2b06974
refactor(logging): rename llama_stack logger categories (#3065)
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR renames categories of llama_stack loggers.

This PR aligns logging categories as per the package name, as well as
reviews from initial
https://github.com/meta-llama/llama-stack/pull/2868. This is a follow up
to #3061.

<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->

Replaces https://github.com/meta-llama/llama-stack/pull/2868
Part of https://github.com/meta-llama/llama-stack/issues/2865

cc @leseb @rhuss

Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
2025-08-21 17:31:04 -07:00
Mustafa Elbehery
3f8df167f3
chore(pre-commit): add pre-commit hook to enforce llama_stack logger usage (#3061)
# What does this PR do?

This PR adds a step in pre-commit to enforce using `llama_stack` logger.

Currently, various parts of the code base uses different loggers. As a
custom `llama_stack` logger exist and used in the codebase, it is better
to standardize its utilization.

Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>
2025-08-20 07:15:35 -04:00
Sébastien Han
c20388c424
ci: add python package build test (#2457)
# What does this PR do?

We now test a package build on every PRs.

Closes: https://github.com/meta-llama/llama-stack/issues/2406

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-06-19 18:57:32 +05:30
Sébastien Han
1c0c6e1e17
chore: remove usage of load_tiktoken_bpe (#2276) 2025-06-02 07:33:37 -07:00
raghotham
5a422e236c
chore: make cprint write to stderr (#2250)
Also do sys.exit(1) in case of errors
2025-05-24 23:39:57 -07:00
ehhuang
664161c462
fix: llama4 tool use prompt fix (#2103)
Tests:
LLAMA_STACK_CONFIG=http://localhost:5002 pytest -s -v
tests/integration/inference --safety-shield meta-llama/Llama-Guard-3-8B
--vision-model meta-llama/Llama-4-Scout-17B-16E-Instruct --text-model
meta-llama/Llama-4-Scout-17B-16E-Instruct

LLAMA_STACK_CONFIG=http://localhost:5002 pytest -s -v
tests/integration/inference --safety-shield meta-llama/Llama-Guard-3-8B
--vision-model Llama-4-Maverick-17B-128E-Instruct --text-model
Llama-4-Maverick-17B-128E-Instruct

Co-authored-by: Eric Huang <erichuang@fb.com>
2025-05-06 22:18:31 -07:00
Ihar Hrachyshka
9e6561a1ec
chore: enable pyupgrade fixes (#1806)
# What does this PR do?

The goal of this PR is code base modernization.

Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)

Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worth of note can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py` where reflection code was updated
to deal with "newer" types.

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-05-01 14:23:50 -07:00
Sébastien Han
dc94433072
feat(pre-commit): enhance pre-commit hooks with additional checks (#2014)
# What does this PR do?

Add several new pre-commit hooks to improve code quality and security:

- no-commit-to-branch: prevent direct commits to protected branches like
`main`
- check-yaml: validate YAML files
- detect-private-key: prevent accidental commit of private keys
- requirements-txt-fixer: maintain consistent requirements.txt format
and sorting
- mixed-line-ending: enforce LF line endings to avoid mixed line endings
- check-executables-have-shebangs: ensure executable scripts have
shebangs
- check-json: validate JSON files
- check-shebang-scripts-are-executable: verify shebang scripts are
executable
- check-symlinks: validate symlinks and report broken ones
- check-toml: validate TOML files mainly for pyproject.toml

The respective fixes have been included.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-30 11:35:49 -07:00
ehhuang
0266b20535
docs: update prompt_format.md for llama4 (#2035)
torchrun --nproc_per_node=8 scripts/generate_prompt_format.py
meta-llama/Llama-4-Scout-17B-16E-Instruct ~/local/checkpoints/<path>/
llama_stack.models.llama.llama4.prompts
llama_stack/models/llama/llama4/prompt_format.md

Co-authored-by: Eric Huang <erichuang@fb.com>
2025-04-25 15:52:15 -07:00
ehhuang
1b2e116a2a
fix: tool call encoded twice (#2034)
# What does this PR do?


## Test Plan
LLAMA_STACK_CONFIG=http://localhost:5002 pytest -s -v
tests/integration/inference --safety-shield meta-llama/Llama-Guard-3-8B
--vision-model meta-llama/Llama-4-Scout-17B-16E-Instruct --text-model
meta-llama/Llama-4-Scout-17B-16E-Instruct
2025-04-25 13:16:16 -07:00
ehhuang
29072f40ab
feat: new system prompt for llama4 (#2031)
Tests:

LLAMA_STACK_CONFIG=http://localhost:5002 pytest -s -v
tests/integration/inference --safety-shield meta-llama/Llama-Guard-3-8B
--vision-model meta-llama/Llama-4-Scout-17B-16E-Instruct --text-model
meta-llama/Llama-4-Scout-17B-16E-Instruct

Co-authored-by: Eric Huang <erichuang@fb.com>
2025-04-25 11:29:08 -07:00
ehhuang
2976b5d992
fix: OAI compat endpoint for meta reference inference provider (#1962)
Test plan:
python tests/verifications/generate_report.py --providers
fireworks,together,llama_meta_ref,openai

Co-authored-by: Eric Huang <erichuang@fb.com>
2025-04-17 11:16:04 -07:00
Ashwin Bharambe
f34f22f8c7
feat: add batch inference API to llama stack inference (#1945)
# What does this PR do?

This PR adds two methods to the Inference API:
- `batch_completion`
- `batch_chat_completion`

The motivation is for evaluations targeting a local inference engine
(like meta-reference or vllm) where batch APIs provide for a substantial
amount of acceleration.

Why did I not add this to `Api.batch_inference` though? That just
resulted in a _lot_ more book-keeping given the structure of Llama
Stack. Had I done that, I would have needed to create a notion of a
"batch model" resource, setup routing based on that, etc. This does not
sound ideal.

So what's the future of the batch inference API? I am not sure. Maybe we
can keep it for true _asynchronous_ execution. So you can submit
requests, and it can return a Job instance, etc.

## Test Plan

Run meta-reference-gpu using:
```bash
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000
export MODEL_PARALLEL_SIZE=4
export MAX_BATCH_SIZE=32
export MAX_SEQ_LEN=6144

LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu
```

Then run the batch inference test case.
2025-04-12 11:41:12 -07:00
Ashwin Bharambe
70a7e4d51e fix: unhide python_start, python_end 2025-04-11 20:30:44 -07:00
Jiawen Liu
36a31fe5dd
fix: on-the-fly int4 quantize parameter (#1920)
Mirror to https://github.com/meta-llama/llama-models/pull/324 with some
clean up

```
with-proxy pip install -e .
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct
export QUANTIZATION_TYPE=int4_mixed
with-proxy llama stack build --run --template meta-reference-gpu
```

# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)
2025-04-09 15:00:12 -07:00
Ashwin Bharambe
e2299291c4
fix: Mirror llama4 rope scaling fixes, small model simplify (#1917)
See:
- https://github.com/meta-llama/llama-models/pull/322
- https://github.com/meta-llama/llama-models/pull/320
2025-04-09 11:28:45 -07:00
Ashwin Bharambe
8001c30a4f fix: meta reference + llama4 tokenizer fix 2025-04-09 00:46:32 -07:00
Ashwin Bharambe
530d4bdfe1
refactor: move all llama code to models/llama out of meta reference (#1887)
# What does this PR do?

Move around bits. This makes the copies from llama-models _much_ easier
to maintain and ensures we don't entangle meta-reference specific
tidbits into llama-models code even by accident.

Also, kills the meta-reference-quantized-gpu distro and rolls
quantization deps into meta-reference-gpu.

## Test Plan

```
LLAMA_MODELS_DEBUG=1 \
  with-proxy llama stack run meta-reference-gpu \
  --env INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct \
   --env INFERENCE_CHECKPOINT_DIR=<DIR> \
   --env MODEL_PARALLEL_SIZE=4 \
   --env QUANTIZATION_TYPE=fp8_mixed
```

Start a server with and without quantization. Point integration tests to
it using:

```
pytest -s -v  tests/integration/inference/test_text_inference.py \
   --stack-config http://localhost:8321 --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
2025-04-07 15:03:58 -07:00
Hardik Shah
28e262ecdc
feat: make multi-turn tool call tests work with llama4 (#1886)
Running full Tool Calling required some updates to work e2e.
- Remove `python_start` and `python_end` tags 
- Tool Call messages and Tool Resposne messages should end with
`<|eom|>`
- System prompt needed updates 
```
You are a helpful assisant who can can answer general questions or invoke tools when necessary.
In addition to tool calls, you should also augment your responses by using the tool outputs.
```

### Test Plan 
- Start server with meta-reference 
```
LLAMA_STACK_DISABLE_VERSION_CHECK=1 LLAMA_MODELS_DEBUG=1 INFERENCE_MODEL=meta-llama/$MODEL  llama stack run meta-reference-gpu 
``` 
- Added **NEW** tests with 5 test cases for multi-turn tool calls 
```
pytest -s -v --stack-config http://localhost:8321 tests/integration/inference/test_text_inference.py --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
``` 
- Also verified all vision and agent tests pass
2025-04-06 19:14:21 -07:00
Ashwin Bharambe
3f92b2bf85 fix: kill the usage of python_start and python_end tokens 2025-04-05 19:00:26 -07:00
Ashwin Bharambe
b8f1561956
feat: introduce llama4 support (#1877)
As title says. Details in README, elsewhere.
2025-04-05 11:53:35 -07:00