# What does this PR do?
Adds a new OpenAI-compatible endpoint for the embeddings API:
`/openai/v1/embeddings`
Added providers for OpenAI, LiteLLM and SentenceTransformer.
## Test Plan
```
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -sv tests/integration/inference/test_openai_embeddings.py --embedding-model all-MiniLM-L6-v2,text-embedding-3-small,gemini/text-embedding-004
```
# What does this PR do?
This adds a check to ensure we don't attempt to concatenate `None + str`
or `str + None` when building up our arguments for streaming tool calls
in the Responses API.
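For illustration, a minimal sketch of the kind of guard this adds (hypothetical helper, not the actual Responses implementation):
```python
def append_arguments(buffered: str | None, delta: str | None) -> str:
    # Treat a missing value on either side as an empty string so we never
    # attempt `None + str` or `str + None` while accumulating tool call args.
    return (buffered or "") + (delta or "")
```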
## Test Plan
All existing tests pass with this change.
Unit tests:
```
python -m pytest -s -v \
tests/unit/providers/agents/meta_reference/test_openai_responses.py
```
Integration tests:
```
llama stack run llama_stack/templates/together/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
python -m pytest -s -v \
tests/integration/agents/test_openai_responses.py \
--text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Verification tests:
```
llama stack run llama_stack/templates/together/run.yaml
pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Additionally, the manual example using Codex CLI from #2325 now succeeds
instead of throwing a 500 error.
Closes #2325
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
* Added support for a PostgreSQL inference store
* Added an 'oracle' template that demos how to configure PostgreSQL stores
(except for telemetry, which is not currently supported)
## Test Plan
```
llama stack build --template oracle --image-type conda --run
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -s -v tests/integration/ \
  --text-model accounts/fireworks/models/llama-v3p3-70b-instruct -k 'inference_store'
```
# What does this PR do?
Updates SambaNova inference to set `strict: false` in `json_schema`
structured output.
## Test Plan
```
pytest -s -v tests/integration/inference/test_text_inference.py \
  --stack-config=sambanova \
  --text-model=sambanova/Meta-Llama-3.3-70B-Instruct
```
# What does this PR do?
Adds an import for all of the template modules before the executor to
prevent deadlock
Closes #2278
## Test Plan
```
# Run the pre-commit multiple times and verify the deadlock doesn't occur
for i in {1..10}; do pre-commit run --all-files; done
```
We must store the full (re-hydrated) input, not just the original input,
in the Response object. Of course, this is not very space efficient and
we should likely find a better storage scheme so that we can only store
unique entries in the database and then re-hydrate them efficiently
later. But that can be done safely later.
Closes https://github.com/meta-llama/llama-stack/issues/2299
## Test Plan
Unit test
# What does this PR do?
Previously prompt guard was hard-coded to require CUDA, which prevented
it from being used on an instance without CUDA support.
This PR allows prompt guard to be configured to use either cpu or cuda.
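For context, the device-selection logic is roughly of this shape (hedged sketch assuming PyTorch; the actual config field names may differ):
```python
import torch

def resolve_device(configured_device: str = "auto") -> str:
    # Use CUDA only when it is actually available; otherwise fall back to CPU
    # so prompt guard can run on GPU-less instances.
    if configured_device in ("auto", "cuda"):
        return "cuda" if torch.cuda.is_available() else "cpu"
    return configured_device
```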
Closes [#2133](https://github.com/meta-llama/llama-stack/issues/2133)
## Test Plan (Edited after incorporating suggestion)
1) Started a stack configured with prompt guard on a system without a GPU
and validated that prompt guard could be used through the APIs.
2) Validated on a system with a GPU (but without llama stack) that the
Python code selecting between CPU and CUDA support returned the right value
when a CUDA device was available.
3) Ran the unit tests as per
https://github.com/meta-llama/llama-stack/blob/main/tests/unit/README.md
---------
Signed-off-by: Michael Dawson <mdawson@devrus.com>
# What does this PR do?
Use a more common pattern and known terminology from the ecosystem,
where Route is more established than Endpoint.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Followup of https://github.com/meta-llama/llama-stack/pull/2287. We must
use `--group` when running commands with uv.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The providers list is missing post_training. Add that column and
`HuggingFace`, `TorchTune`, and `NVIDIA NEMO` as supported providers.
Also point to these providers in docs/source/providers/index.md and
describe their basic functionality.
There are other missing provider types here as well, but this is a start.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
Fix a bug in openai_compat where choices are not indexed correctly.
## Test Plan
Added a new test.
Rerun the failed inference_store tests:
```
llama stack run fireworks --image-type conda
pytest -s -v tests/integration/ --stack-config http://localhost:8321 -k \
  'test_inference_store' --text-model meta-llama/Llama-3.3-70B-Instruct --count 10
```
# What does this PR do?
Changed the test so it no longer requires a tool_call in the output, while
keeping the tools params there as a smoke test.
## Test Plan
Used llama3.3 from fireworks (same as CI)
<img width="1433" alt="image"
src="https://github.com/user-attachments/assets/1e5fca98-9b4f-402e-a0bc-d9f910f2c207"
/>
Also ran with the ollama distro and a 3b model.
# What does this PR do?
The previous `[project.optional-dependencies]` was misrepresenting what
the packages were. They were NOT optional dependencies to the project
but development dependencies. Unlike optional dependencies, development
dependencies are local-only and will not be included in the project
requirements when published to PyPI or other indexes. As such,
development dependencies are not included in the [project] table.
Additionally, the dev group is synced by default.
Source:
https://docs.astral.sh/uv/concepts/projects/dependencies/#development-dependencies
Signed-off-by: Sébastien Han <seb@redhat.com>
This adds initial streaming support to the Responses API.
This PR makes sure that the _first_ inference call made to chat
completions streams out.
There's more to be done:
- tool call output tokens need to stream out when possible
- we need to loop through multiple rounds of inference and they all need
to stream out.
## Test Plan
Added a test. Executed as:
```
FIREWORKS_API_KEY=... \
pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
--provider=stack:fireworks --model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Then, started a llama stack fireworks distro and tested against it like
this:
```
OPENAI_API_KEY=blah \
pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
--base-url http://localhost:8321/v1/openai/v1 \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
# What does this PR do?
Handles the case where the vllm config `tls_verify` is set to `false` or
`true`.
Closes: https://github.com/meta-llama/llama-stack/issues/2283
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
It's not used anywhere in the build process. It's an ancient artifact from
an old attempt at using sub-packages to build distros.
## Test Plan
N/A
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This is not a core dependency of the distro server. It's only necessary
when using `inline::rag-runtime` or `inline::meta-reference` providers.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Fixes an issue where running `llama stack build --template ollama
--image-type venv --run` fails with a TypeError when validating external
providers directory paths.
The error occurs because `os.path.exists()` is called with `Path(None)`
instead of converting it to a string first. This change ensures
consistent handling of `None` values for `external_providers_dir` across
both build and
[run](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/cli/stack/run.py#L134)
commands by using `str()` conversion before path validation.
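A minimal sketch of the kind of guard involved (illustrative only, not the exact CLI code):
```python
import os

def providers_dir_exists(external_providers_dir) -> bool:
    # Guard against None and convert to str before the filesystem check,
    # mirroring how the run command already handles this value.
    return external_providers_dir is not None and os.path.exists(str(external_providers_dir))
```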
## Test Plan
```bash
INFERENCE_MODEL=llama3.2:3b uv run --with llama-stack llama stack build --template ollama --image-type venv --run
```
Command completes successfully without TypeError
Two somewhat annoying fixes:
- we are going to always index tools for non-MCP toolgroups (like we used
to do), because there are random assumptions in our tests, etc., and I
don't want to fix them right now
- we need to handle the funny case of toolgroups like
`builtin::rag/knowledge_search` where we added the tool name to use in
the toolgroup itself.
# What does this PR do?
The `tls_verify` can now receive a path to a certificate file if the
endpoint requires it.
Signed-off-by: Sébastien Han <seb@redhat.com>
When registering a MCP endpoint, we cannot list tools (like we used to)
since the MCP endpoint may be behind an auth wall. Registration can
happen much sooner (via run.yaml).
Instead, we do listing only when the _user_ actually calls listing.
Furthermore, we cache the list in-memory in the server. Currently, the
cache is not invalidated -- we may want to periodically re-list for MCP
servers. Note that they must call `list_tools` before calling
`invoke_tool` -- we use this critically.
This will enable us to list MCP servers in run.yaml
## Test Plan
Existing tests, updated tests accordingly.
Getting this error from pypi of late
```
'python-requests/2.32.3 User-Agents are currently blocked from accessing JSON release resources. A cluster is apparently crawling all project/release resources resulting in excess cache misses. Please contact admin@pypi.org if you have information regarding what this software may be.'
```
The test depends on llama's tool calling ability. In the CI, we run with
a small ollama model.
The fix might be to check for either message or function_call because
the model is flaky and we aren't really testing that behavior?
# What does this PR do?
This fixes a high vulnerable CVE in `setuptools`:
https://github.com/advisories/GHSA-5rjg-fvgr-3xxf
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
TSIA
`--enable-ui` to enable
## Test Plan
`llama stack run dev --image-type conda --enable-ui`
`localhost:8322` shows UI
`llama stack run dev --image-type conda`
`localhost:8322` does not work
# What does this PR do?
This is not part of the official OpenAI API, but we'll use this for the
logs UI.
In order to support more filtering options, I'm adopting the newly
introduced sql store in place of the kv store.
## Test Plan
Added integration/unit tests.
Having to run (and re-run) a server while running verifications can be
annoying while you are iterating on code. This makes it so you can use
the library client -- and because it is OpenAI client compatible, it all
works.
## Test Plan
```
pytest -s -v tests/verifications/openai_api/test_responses.py \
--provider=stack:together \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
The most interesting MCP servers are those with an authorization wall in
front of them. This PR uses the existing `provider_data` mechanism, used
for passing provider API keys, to pass MCP access tokens (in fact,
arbitrary headers in the style of the OpenAI Responses API) from the
client through to the MCP server.
```
class MCPProviderDataValidator(BaseModel):
    # mcp_endpoint => list of headers to send
    mcp_headers: dict[str, list[str]] | None = None
```
Note how we must stuff the headers for all MCP endpoints into a single
"MCPProviderDataValidator". Unlike existing providers (e.g., Together
and Fireworks for inference) where we could name the provider api keys
clearly (`together_api_key`, `fireworks_api_key`), we cannot name these
keys for MCP. We have a single generic MCP provider which can serve
multiple "toolgroups". So we use a dict to combine all the headers for
all MCP endpoints you may want to use in an agentic call.
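As a hedged usage sketch (the endpoint URL and token are placeholders, and the provider-data header name follows the existing convention but may differ in detail), a client could pass the headers like this:
```python
import json

import httpx

# Map each MCP endpoint to the headers that should be forwarded to it.
provider_data = {
    "mcp_headers": {
        "https://mcp.example.com/sse": ["Authorization: Bearer <mcp-access-token>"],
    }
}

resp = httpx.post(
    "http://localhost:8321/v1/openai/v1/responses",
    headers={"X-LlamaStack-Provider-Data": json.dumps(provider_data)},
    json={"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct", "input": "hello"},
)
```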
## Test Plan
See the added integration test for usage.
# What does this PR do?
Since https://github.com/meta-llama/llama-stack/pull/2193 switched to the
openai sdk, we need to strip the 'openai/' prefix from the model_id.
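In other words, roughly (sketch, not the provider's exact code):
```python
def strip_openai_prefix(model_id: str) -> str:
    # e.g. "openai/gpt-4o-mini" -> "gpt-4o-mini" before calling the OpenAI SDK
    return model_id.removeprefix("openai/")
```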
## Test Plan
start server with openai provider and send a chat completion call
# What does this PR do?
* Provide sqlite implementation of the APIs introduced in
https://github.com/meta-llama/llama-stack/pull/2145.
* Introduced a SqlStore API: llama_stack/providers/utils/sqlstore/api.py
and the first Sqlite implementation
* Pagination support will be added in a future PR.
## Test Plan
Unit test on sql store:
<img width="1005" alt="image"
src="https://github.com/user-attachments/assets/9b8b7ec8-632b-4667-8127-5583426b2e29"
/>
Integration test:
```
INFERENCE_MODEL="llama3.2:3b-instruct-fp16" llama stack build --template ollama --image-type conda --run
```
```
LLAMA_STACK_CONFIG=http://localhost:5001 INFERENCE_MODEL="llama3.2:3b-instruct-fp16" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-fp16" -k 'inference_store and openai'
```
# What does this PR do?
Includes a SambaNova safety adapter that uses the SambaNova cloud-served
Meta-Llama-Guard-3-8B, plus minor updates to the SambaNova docs.
## Test Plan
```
pytest -s -v tests/integration/safety/test_safety.py \
  --stack-config=sambanova --safety-shield=sambanova/Meta-Llama-Guard-3-8B
```
# What does this PR do?
We now only print the 'active' routes, not all the possible routes. This
is based on the distribution server config by looking at enabled APIs
and their respective providers.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Not sure why it passed CI earlier...
Strangely, only 24 workflows ran on
https://github.com/meta-llama/llama-stack/pull/2216, so the test never
ran...
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This PR introduces support for keyword-based FTS5 search with BM25
relevance scoring. It changes the existing EmbeddingIndex base class to
support `search_mode` and `query_str` parameters that can be used for
keyword-based search implementations.
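For illustration, this is roughly what FTS5 keyword search with BM25 ranking looks like at the SQLite level (standalone sketch, not the provider's actual schema; assumes SQLite is compiled with FTS5):
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(chunk_id, content)")
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?)",
    [("c0", "Sentence 0 from document 0"), ("c5", "Sentence 5 from document 0")],
)
# bm25() returns a relevance score where lower is better, so order ascending.
rows = conn.execute(
    "SELECT chunk_id, content, bm25(chunks) AS score "
    "FROM chunks WHERE chunks MATCH ? ORDER BY score LIMIT 5",
    ("sentence",),
).fetchall()
```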
## Test Plan
run
```
pytest llama_stack/providers/tests/vector_io/test_sqlite_vec.py -v -s --tb=short --disable-warnings --asyncio-mode=auto
```
Output:
```
pytest llama_stack/providers/tests/vector_io/test_sqlite_vec.py -v -s --tb=short --disable-warnings --asyncio-mode=auto
/Users/vnarsing/miniconda3/envs/stack-client/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
====================================================== test session starts =======================================================
platform darwin -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /Users/vnarsing/miniconda3/envs/stack-client/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.16', 'Platform': 'macOS-14.7.4-arm64-arm-64bit', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.3', 'anyio': '4.8.0'}}
rootdir: /Users/vnarsing/go/src/github/meta-llama/llama-stack
configfile: pyproject.toml
plugins: html-4.1.1, metadata-3.1.1, asyncio-0.25.3, anyio-4.8.0
asyncio: mode=auto, asyncio_default_fixture_loop_scope=None
collected 7 items
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_add_chunks PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_query_chunks_vector PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_query_chunks_fts PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_chunk_id_conflict PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_register_vector_db PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_unregister_vector_db PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_generate_chunk_id PASSED
```
For reference, with the implementation, the fts table looks like below:
```
Chunk ID: 9fbc39ce-c729-64a2-260f-c5ec9bb2a33e, Content: Sentence 0 from document 0
Chunk ID: 94062914-3e23-44cf-1e50-9e25821ba882, Content: Sentence 1 from document 0
Chunk ID: e6cfd559-4641-33ba-6ce1-7038226495eb, Content: Sentence 2 from document 0
Chunk ID: 1383af9b-f1f0-f417-4de5-65fe9456cc20, Content: Sentence 3 from document 0
Chunk ID: 2db19b1a-de14-353b-f4e1-085e8463361c, Content: Sentence 4 from document 0
Chunk ID: 9faf986a-f028-7714-068a-1c795e8f2598, Content: Sentence 5 from document 0
Chunk ID: ef593ead-5a4a-392f-7ad8-471a50f033e8, Content: Sentence 6 from document 0
Chunk ID: e161950f-021f-7300-4d05-3166738b94cf, Content: Sentence 7 from document 0
Chunk ID: 90610fc4-67c1-e740-f043-709c5978867a, Content: Sentence 8 from document 0
Chunk ID: 97712879-6fff-98ad-0558-e9f42e6b81d3, Content: Sentence 9 from document 0
Chunk ID: aea70411-51df-61ba-d2f0-cb2b5972c210, Content: Sentence 0 from document 1
Chunk ID: b678a463-7b84-92b8-abb2-27e9a1977e3c, Content: Sentence 1 from document 1
Chunk ID: 27bd63da-909c-1606-a109-75bdb9479882, Content: Sentence 2 from document 1
Chunk ID: a2ad49ad-f9be-5372-e0c7-7b0221d0b53e, Content: Sentence 3 from document 1
Chunk ID: cac53bcd-1965-082a-c0f4-ceee7323fc70, Content: Sentence 4 from document 1
```
Query results:
```
Result 1: Sentence 5 from document 0
Result 2: Sentence 5 from document 1
Result 3: Sentence 5 from document 2
```
---------
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
# What does this PR do?
* remove requirements.txt to use pyproject.toml as the source of truth
* update relevant docs
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Use a composite action to avoid similar steps repetitions and
centralization of the defaults.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The cache_ttl config value is not in fact tied to the lifetime of any of
the keys; it represents the time interval between runs of our key cache
refresher.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Kubernetes since 1.20 exposes a JWKS endpoint that we can use with our
recent OAuth2 implementation.
The CI test has been kept intact for validation.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
feat(quota): add server‑side per‑client request quotas (requires auth)
Unrestricted usage can lead to runaway costs and fragmented client-side
workarounds. This commit introduces a native quota mechanism to the
server, giving operators a unified, centrally managed throttle for
per-client requests—without needing extra proxies or custom client
logic. This helps contain cloud-compute expenses, enables fine-grained
usage control, and simplifies deployment and monitoring of Llama Stack
services. Quotas are fully opt-in and have no effect unless explicitly
configured.
Note that quotas are fully opt-in and require authentication to be
enabled. 'sqlite' is the only supported quota `type` at this time; any
other `type` will be rejected. The only supported `period` is 'day'.
Highlights:
- Adds `QuotaMiddleware` to enforce per-client request quotas:
- Uses `Authorization: Bearer <client_id>` (from
AuthenticationMiddleware)
- Tracks usage via a SQLite-based KV store
- Returns 429 when the quota is exceeded
- Extends `ServerConfig` with a `quota` section (type + config)
- Enforces strict coupling: quotas require authentication or the server
will fail to start
Behavior changes:
- Quotas are disabled by default unless explicitly configured
- SQLite defaults to `./quotas.db` if no DB path is set
- The server requires authentication when quotas are enabled
To enable per-client request quotas in `run.yaml`, add:
```
server:
port: 8321
auth:
provider_type: "custom"
config:
endpoint: "https://auth.example.com/validate"
quota:
type: sqlite
config:
db_path: ./quotas.db
limit:
max_requests: 1000
period: day
```
Closes #2093
## Test Plan
Signed-off-by: Wen Liang <wenliang@redhat.com>
Co-authored-by: Wen Liang <wenliang@redhat.com>
# What does this PR do?
```
llama stack rm llamastack-test
```
#225
## Test Plan
# What does this PR do?
This adds an alternative option to the oauth_token auth provider that
can be used with existing authorization services which support token
introspection as defined in RFC 7662. This could be useful where token
revocation needs to be handled or where opaque tokens (or other non-JWT
formatted tokens) are used.
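At its core, RFC 7662 introspection is a POST to the authorization server, roughly like this (sketch; the endpoint URL and client credentials are placeholders, and the provider's config keys may differ):
```python
import httpx

async def introspect_token(token: str) -> bool:
    # RFC 7662: the "active" field of the response says whether the token is
    # currently valid (i.e. not expired or revoked).
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            "https://auth.example.com/realms/demo/protocol/openid-connect/token/introspect",
            data={"token": token},
            auth=("llama-stack", "<client-secret>"),  # placeholder client credentials
        )
        resp.raise_for_status()
        return bool(resp.json().get("active", False))
```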
## Test Plan
Tested against keycloak
Signed-off-by: Gordon Sim <gsim@redhat.com>
# What does this PR do?
This PR adds a lock to coordinate concurrent coroutines passing through
the jwt verification. As _refresh_jwks() was setting _jwks to an empty
dict then repopulating it, having multiple coroutines doing this
concurrently risks losing keys. The PR also builds the updated dict as a
separate object and assigns it to _jwks once completed. This avoids
impacting any coroutines using the key set as it is being updated.
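A minimal sketch of the described pattern (not the exact provider code):
```python
import asyncio

class JWKSCache:
    def __init__(self, fetch_keys):
        self._fetch_keys = fetch_keys  # async callable returning the JWKS key list
        self._jwks: dict[str, dict] = {}
        self._lock = asyncio.Lock()

    async def refresh(self) -> None:
        # Serialize refreshes and build the replacement dict as a separate
        # object, so readers never see a half-populated key set.
        async with self._lock:
            new_jwks = {key["kid"]: key for key in await self._fetch_keys()}
            self._jwks = new_jwks
```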
Signed-off-by: Gordon Sim <gsim@redhat.com>
# What does this PR do?
Add support for "instructions" to the responses API. Instructions
provide a way to swap out system (or developer) messages in new
responses.
## Test Plan
unit tests added
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
When launching a fine-tuning job, an upcoming version of NeMo Customizer
will expect the `config` name to be formatted as
`namespace/name@version`. Here, `config` is a reference to a model +
additional metadata. There could be multiple `config`s that reference
the same base model.
This PR updates NVIDIA's `supervised_fine_tune` to simply pass the
`model` param as-is to NeMo Customizer. Currently, it expects a
specific, allowlisted llama model (i.e. `meta/Llama3.1-8B-Instruct`) and
converts it to the provider format (`meta/llama-3.1-8b-instruct`).
## Test Plan
From a notebook, I built an image with my changes:
```
!llama stack build --template nvidia --image-type venv
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
client = LlamaStackAsLibraryClient("nvidia")
client.initialize()
```
And could successfully launch a job:
```
response = client.post_training.supervised_fine_tune(
job_uuid="",
model="meta/llama-3.2-1b-instruct@v1.0.0+A100", # Model passed as-is to Customimzer
...
)
job_id = response.job_uuid
print(f"Created job with ID: {job_id}")
Output:
Created job with ID: cust-Jm4oGmbwcvoufaLU4XkrRU
```
---------
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
chore: Updated readme
## Test Plan
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
If a user has previously serialized data into their vector store without
the `metadata_token_count` in the chunk, the `query` method will fail with
a server error. This fixes that edge case by returning 0 when the key is
not detected. This solution is suboptimal, but I think it's better to
understate the token size rather than recalculate it and add unnecessary
complexity to the retrieval code.
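The fix boils down to a defaulted lookup along these lines (illustrative; the exact field holding the count may differ):
```python
def metadata_tokens(chunk_metadata: dict) -> int:
    # Chunks serialized before this field existed simply contribute 0 tokens,
    # understating rather than recalculating the token count.
    return chunk_metadata.get("metadata_token_count", 0)
```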
## Test Plan
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
This extracts the W3C trace context headers (traceparent and tracestate)
from incoming requests, stuffs them as attributes on the spans we
create, and uses them within the tracing provider implementation to
actually wrap our spans in the proper context.
What this means in practice is that when a client (such as an OpenAI
client) is instrumented to create these traces, we'll continue that
distributed trace within Llama Stack as opposed to creating our own root
span that breaks the distributed trace between client and server.
It's slightly awkward to do this in Llama Stack because our Tracing API
knows nothing about opentelemetry, W3C trace headers, etc - that's only
knowledge the specific provider implementation has. So, that's why the
trace headers get extracted in the server code but not actually used
until the provider implementation forms the proper context.
This also centralizes how we were adding the `__root__` and
`__root_span__` attributes, as those two were being added in different
parts of the code instead of from a single place.
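Conceptually, the server-side extraction looks something like this (hedged sketch, not the actual Llama Stack code):
```python
def trace_context_attributes(headers: dict[str, str]) -> dict[str, str]:
    # Pull the W3C trace context headers off the incoming request; the tracing
    # provider later uses them to attach our spans to the caller's trace.
    attrs = {}
    for name in ("traceparent", "tracestate"):
        value = headers.get(name)
        if value:
            attrs[name] = value
    return attrs
```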
Closes #2097
## Test Plan
This was tested manually using the helpful scripts from #2097. I
verified that Llama Stack properly joined the client's span when the
client was instrumented for distributed tracing, and that Llama Stack
properly started its own root span when the incoming request was not
part of an existing trace.
Here's an example of the joined spans:

Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
The `external_config_dir` configuration parameter is not being passed to
the `BuildConfig` for `LlamaStackAsLibraryClient`.
This prevents _plugin_ providers from being loaded when `llama-stack` is
used as a library.
## Test Plan
I ran `LlamaStackAsLibraryClient` with a configuration file that
contained `external_config_dir` and related configuration.
It does not work without this change: _external_ providers are not
resolved.
It does work with this change 👍
# What does this PR do?
This PR introduces APIs to retrieve past chat completion requests, which
will be used in the LS UI.
Our current `Telemetry` is ill-suited for this purpose as it's untyped
so we'd need to filter by obscure attribute names, making it brittle.
Since these APIs are 'provided by stack' and don't need to be
implemented by inference providers, we introduce a new InferenceProvider
class, containing the existing inference protocol, which is implemented
by inference providers.
The APIs are OpenAI-compliant, with an additional `input_messages`
field.
## Test Plan
This PR just adds the APIs and marks them provided_by_stack.
Start the stack server -> it doesn't crash.
This PR adds a notion of `principal` (aka some kind of persistent
identity) to the authentication infrastructure of the Stack. Until now
we only used access attributes ("claims" in the more standard OAuth /
OIDC setup) but we need the notion of a User fundamentally as well.
(Thanks @rhuss for bringing this up.)
This value is not yet _used_ anywhere downstream but will be used to
segregate access to resources.
In addition, the PR introduces a built-in JWT token validator so the
Stack does not need to contact an authentication provider to validate
the authorization and can merely check the signed token for the represented
claims. Public keys are refreshed via the configured JWKS server. This
Auth Provider should overwhelmingly be considered the default given the
seamless integration it offers with OAuth setups.
Closes #2162
## Test Plan
run `llama stack build --image-name ollama --image-type
<venv/conda/container> --config llama_stack/templates/ollama/build.yaml`
and verify venv | conda | container are built.
# What does this PR do?
Adds an inline HF SFTTrainer provider. Alongside torchtune, this is a
super popular option for running training jobs. The config allows a user
to specify some key fields such as a model, chat_template, device, etc.
The provider comes with one recipe, `finetune_single_device`, which works
both with and without LoRA.
Any model that is a valid HF identifier can be given and the model will
be pulled.
This has been tested so far with CPU and MPS device types, but should be
compatible with CUDA out of the box.
The provider processes the given dataset into the proper format,
establishes the various steps per epoch, steps per save, steps per eval,
sets a sane SFTConfig, and runs n_epochs of training.
If checkpoint_dir is None, no model is saved. If there is a checkpoint
dir, a model is saved every `save_steps` and at the end of training.
## Test Plan
Re-enabled the post_training integration test suite with a singular test
that loads the simpleqa dataset
(https://huggingface.co/datasets/llamastack/simpleqa) and a tiny granite
model (https://huggingface.co/ibm-granite/granite-3.3-2b-instruct). The
test now uses the llama stack client and the proper post_training API, and
runs one step with a batch_size of 1. This test runs on CPU on the
Ubuntu runner so it needs to be a small batch and a single step.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Fixes #2188
## Test Plan
`INFERENCE_MODEL=meta-llama/Llama-3.3-70B-Instruct llama stack build
--image-name ollama --image-type conda --template ollama --run` without
error
# What does this PR do?
start_stack.sh was using --yaml-config, which is deprecated.
A bunch of distro docs also mentioned --yaml-config. Replaces all
instances and logic for --yaml-config with --config.
Resolves #2189
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
It may not always be desirable to listen on all interfaces, which is the
default. As an example, by listening instead only on a loopback
interface, the server cannot be reached except from within the host it
is run on. This PR makes this configurable, through a CLI option, an env
var or an entry on the config file.
## Test Plan
I ran a server with and without the added CLI argument to verify that
the argument is used if provided, but the default is as it was before if
not.
Signed-off-by: Gordon Sim <gsim@redhat.com>
# What does this PR do?
Fixes #2121
This implementation splits responsibility between the litellm and openai
libraries:
| Inference Method | Implementation Source |
|----------------------------|--------------------------|
| completion | LiteLLMOpenAIMixin |
| chat_completion | LiteLLMOpenAIMixin |
| embedding | LiteLLMOpenAIMixin |
| batch_completion | LiteLLMOpenAIMixin |
| batch_chat_completion | LiteLLMOpenAIMixin |
| openai_completion | AsyncOpenAI |
| openai_chat_completion | AsyncOpenAI |
## Test Plan
smoke test with -
```
$ OPENAI_API_KEY=$LLAMA_API_KEY OPENAI_BASE_URL=https://api.llama.com/compat/v1 llama stack build --image-type conda --image-name openai --providers inference=remote::openai --run
$ llama-stack-client models register Llama-4-Scout-17B-16E-Instruct-FP8
$ curl "http://localhost:8321/v1/openai/v1/chat/completions" -H "Content-Type: application/json" \ -d '{
"model": "Llama-4-Scout-17B-16E-Instruct-FP8",
"messages": [
{"role": "user", "content": "Hello Llama! Can you give me a quick intro?"}
]
}'
{"id":"AmPwrrkc5JgVjejPdIPrpT2","choices":[{"finish_reason":"stop","index":0,"logprobs":{"content":null,"refusal":null},"message":{"content":"Hello! I'm Llama, a Meta-designed model that adapts to your conversational style. Whether you need quick answers, deep dives into ideas, or just want to vent, joke, or brainstorm—I'm here for it. What’s on your mind?","refusal":"","role":"assistant","annotations":null,"audio":null,"function_call":null,"tool_calls":null,"id":"AmPwrrkc5JgVjejPdIPrpT2"}}],"created":1747410061,"model":"Llama-4-Scout-17B-16E-Instruct-FP8","object":"chat.completions","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":54,"prompt_tokens":22,"total_tokens":76,"completion_tokens_details":null,"prompt_tokens_details":null}}
```
and run full test suite.
# What does this PR do?
## Test Plan
```
LLAMA_STACK_CONFIG=http://localhost:8321 pytest \
  tests/integration/tool_runtime/test_rag_tool.py \
  --embedding-model text-embedding-3-small
```
# What does this PR do?
This PR adds a few enhancements:
- Dark mode
- A dark mode icon
- Adds a link to the API documentation
- Adds prettier and a linter to the code
- Aligning the default text
- Linted the code
## Before:

## After (dark mode):

## Test Plan
Related to https://github.com/meta-llama/llama-stack/issues/2085
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
We want this to be a "flagship" distribution we can advertise to a
segment of users to get started quickly. This distro should package a
bunch of remote providers and some cheap inline providers so they get a
solid "AI Platform in a box" setup instantly.
# What does this PR do?
This fixes an issue in how we used the tool_call_buf from streaming tool
calls in the remote-vllm provider where it would end up concatenating
parameters from multiple different tool call results instead of
aggregating the results from each tool call separately.
It also fixes an issue found while digging into that where we were
accidentally mixing the json string form of tool call parameters with
the string representation of the python form, which meant we'd end up
with single quotes in what should be double-quoted json strings.
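The gist of the fix is keeping a separate buffer per tool call index instead of one shared buffer (sketch with hypothetical names):
```python
from collections import defaultdict

# tool call index -> accumulated JSON argument string for that call only
tool_call_bufs: dict[int, str] = defaultdict(str)

def on_tool_call_delta(index: int, arguments_delta: str | None) -> None:
    # Aggregate each tool call's streamed arguments separately so parameters
    # from different tool calls never get concatenated together.
    tool_call_bufs[index] += arguments_delta or ""
```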
Closes #1120
## Test Plan
The following tests are now passing 100% for the remote-vllm provider,
where some of the test_text_inference were failing before this change:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/inference/test_text_inference.py --text-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/inference/test_vision_inference.py --vision-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
```
All but one of the agent tests are passing (including the multi-tool
one). See the PR at https://github.com/vllm-project/vllm/pull/17917 and
a gist at
https://gist.github.com/bbrowning/4734240ce96b4264340caa9584e47c9e for
changes needed there, which will have to get made upstream in vLLM.
Agent tests:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/agents/test_agents.py --text-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
We added:
* make sure docstrings are present with 'params' and 'returns'
* fail if someone sets 'returns: None'
* fix the failing APIs
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
currently the "default" dir for external providers is
`/etc/llama-stack/providers.d`
This dir is not used anywhere nor created.
Switch to a more friendly `~/.llama/providers.d/`
This allows external providers to actually create this dir and/or
populate it upon installation, `pip` cannot create directories in `etc`.
If a user does not specify a dir, default to this one
see https://github.com/containers/ramalama-stack/issues/36
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Currently the website only displays the "latest" version. This is
because our config and workflow do not include version information. This
PR adds missing version info.
---------
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
This allows users to print the usage information for this script:
```
📚 Llama-Stack Deployment Script
Description:
This script sets up and deploys Llama-Stack with Ollama integration in containers.
It handles both Docker and Podman runtimes and includes automatic platform detection.
Usage:
install.sh [OPTIONS]
Options:
-p, --port PORT Server port for Llama-Stack (default: 8321)
-o, --ollama-port PORT Ollama service port (default: 11434)
-m, --model MODEL Model alias to use (default: llama3.2:3b)
-i, --image IMAGE Server image (default: llamastack/distribution-ollama:0.2.2)
-t, --timeout SECONDS Service wait timeout in seconds (default: 300)
-h, --help Show this help message
For more information:
Documentation: https://llama-stack.readthedocs.io/
GitHub: https://github.com/meta-llama/llama-stack
Report issues:
https://github.com/meta-llama/llama-stack/issues
```
---------
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This PR allows users to customize the template used for chunks when
inserted into the context. Additionally, this enables metadata injection
into the context of an LLM for RAG. This makes a naive and crude
assumption that each chunk should include the metadata, which is
obviously redundant when multiple chunks are returned from the same
document. In order to remove any sort of duplication of chunks, we'd
have to make much more significant changes so this is a reasonable first
step that unblocks users requesting this enhancement in
https://github.com/meta-llama/llama-stack/issues/1767.
In the future, this can be extended to support citations.
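For instance, a query config might override the template along these lines (hedged sketch; the import path is assumed from the file list below and the exact signature may differ):
```python
from llama_stack.apis.tools.rag_tool import RAGQueryConfig  # path assumed from this PR

# The template must keep the {index} and {chunk.content} placeholders.
config = RAGQueryConfig(chunk_template="Result {index}\nContent: {chunk.content}\n")
```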
List of Changes:
- `llama_stack/apis/tools/rag_tool.py`
- Added `chunk_template` field in `RAGQueryConfig`.
- Added `field_validator` to validate the `chunk_template` field in
`RAGQueryConfig`.
- Ensured the `chunk_template` field includes placeholders `{index}` and
`{chunk.content}`.
- Updated the `query` method to use the `chunk_template` for formatting
chunk text content.
- `llama_stack/providers/inline/tool_runtime/rag/memory.py`
- Modified the `insert` method to pass `doc.metadata` for chunk
creation.
- Enhanced the `query` method to format results using `chunk_template`
and exclude unnecessary metadata fields like `token_count`.
- `llama_stack/providers/utils/memory/vector_store.py`
- Updated `make_overlapped_chunks` to include metadata serialization and
token count for both content and metadata.
- Added error handling for metadata serialization issues.
- `pyproject.toml`
- Added `pydantic.field_validator` as a recognized `classmethod`
decorator in the linting configuration.
- `tests/integration/tool_runtime/test_rag_tool.py`
- Refactored test assertions to separate `assert_valid_chunk_response`
and `assert_valid_text_response`.
- Added integration tests to validate `chunk_template` functionality
with and without metadata inclusion.
- Included a test case to ensure `chunk_template` validation errors are
raised appropriately.
- `tests/unit/rag/test_vector_store.py`
- Added unit tests for `make_overlapped_chunks`, verifying chunk
creation with overlapping tokens and metadata integrity.
- Added tests to handle metadata serialization errors, ensuring proper
exception handling.
- `docs/_static/llama-stack-spec.html`
- Added a new `chunk_template` field of type `string` with a default
template for formatting retrieved chunks in RAGQueryConfig.
- Updated the `required` fields to include `chunk_template`.
- `docs/_static/llama-stack-spec.yaml`
- Introduced `chunk_template` field with a default value for
RAGQueryConfig.
- Updated the required configuration list to include `chunk_template`.
- `docs/source/building_applications/rag.md`
- Documented the `chunk_template` configuration, explaining how to
customize metadata formatting in RAG queries.
- Added examples demonstrating the usage of the `chunk_template` field
in RAG tool queries.
- Highlighted default values for `RAG` agent configurations.
# Resolves https://github.com/meta-llama/llama-stack/issues/1767
## Test Plan
Updated both `test_vector_store.py` and `test_rag_tool.py` and tested
end-to-end with a script.
I also tested the quickstart to enable this and specified this metadata:
```python
document = RAGDocument(
document_id="document_1",
content=source,
mime_type="text/html",
metadata={"author": "Paul Graham", "title": "How to do great work"},
)
```
Which produced the output below:

This highlights the usefulness of the additional metadata. Notice how
the metadata is redundant for different chunks of the same document. I
think we can update that in a subsequent PR.
## Documentation
I've added a brief comment about this in the documentation to outline
this to users and updated the API documentation.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Introduces scaffolding for Llama Stack's UI. Created with next.js and
https://ui.shadcn.com/.
1. Initialized directory with `npx shadcn@latest init`
2. Added sidebar component `npx shadcn@latest add sidebar` and added
menu items for chat completions and responses.
3. Placeholder pages for each.
## Test Plan
`npm run dev`
<img width="1058" alt="image"
src="https://github.com/user-attachments/assets/5695a53f-e22e-418e-80d1-5bf0ae9b6fe8"
/>
# What does this PR do?
In the Responses API, we convert incoming response requests to chat
completion requests. When streaming the resulting chunks of those chat
completion requests, inference providers that use OpenAI clients will
often return a `type=None` value in the tool call parts of the response.
This causes issues when we try to dump and load that response into our
pydantic model, because type cannot be None in the Responses API model
we're loading these into.
So, strip the "type" field, if present, off those chat completion tool
call results before dumping and loading them as our typed pydantic
models, which will apply our default value for that type field.
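In essence the fix does something like this before re-validating the streamed chunks (hedged sketch with a stand-in pydantic model):
```python
from pydantic import BaseModel

class ToolCallDelta(BaseModel):
    # Stand-in for the typed Responses model the chunk is loaded into.
    type: str = "function"
    arguments: str | None = None

raw = {"type": None, "arguments": '{"city": "Paris"}'}  # some providers stream type=None
if raw.get("type") is None:
    raw.pop("type", None)  # let the model's default value apply instead
validated = ToolCallDelta.model_validate(raw)
```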
## Test Plan
This was found via manual testing of the Responses API with codex, where
I was getting errors in some tool call situations. I added a unit test
to simulate this scenario and verify the fix, as well as manual codex
testing to verify the fix.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Note: the openai provider exposes the litellm-specific model names to
the user. This change is compatible with that. The litellm names should
be deprecated.
# What does this PR do?
Closes #2113.
Closes #1783.
Fixes a bug in handling the end of tool execution request stream where
no `finish_reason` is provided by the model.
## Test Plan
1. Ran existing unit tests
2. Added a dedicated test verifying correct behavior in this edge case
3. Ran the code snapshot from #2113
# What does this PR do?
Most third-party actions use hashes for pinning, but not all.
Do proper hash pinning on all remaining actions that use tags.
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
Closes #2111.
Fixes an error causing Llama Stack to just return `<tool_call>` and
complete the turn without actually executing the tool. See the issue
description for more detail.
## Test Plan
1) Ran existing unit tests
2) Added a dedicated test verifying correct behavior in this edge case
3) Ran the code snapshot from #2111
The provider selling point *is* using the same provider for both.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
Prevent it from returning results about 'LT Wright Maverick Scout'
knives. Ultimately we want the word "model" in the returned results;
putting "llm" in the search term makes this more likely.
Closes: #2150
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
This is a combination of what was previously 3 separate PRs - #2069,
#2075, and #2083. It turns out all 3 of those are needed to land a
working function calling Responses implementation. The web search
builtin tool was already working, but this wires in support for custom
function calling.
I ended up combining all three into one PR because they all had lots of
merge conflicts, both with each other but also with #1806 that just
landed. And, because landing any of them individually would have only
left a partially working implementation merged.
The new things added here are:
* Storing of input items from previous responses and restoring of those
input items when adding previous responses to the conversation state
* Handling of multiple input item messages roles, not just "user"
messages.
* Support for custom tools passed into the Responses API to enable
function calling outside of just the builtin websearch tool.
Closes #2074. Closes #2080.
## Test Plan
### Unit Tests
Several new unit tests were added, and they all pass. Ran via:
```
python -m pytest -s -v tests/unit/providers/agents/meta_reference/test_openai_responses.py
```
### Responses API Verification Tests
I ran our verification run.yaml against multiple providers to ensure we
were getting a decent pass rate. Specifically, I ensured the new custom
tool verification test passed across multiple providers and that the
multi-turn examples passed across at least some of the providers (some
providers struggle with the multi-turn workflows still).
Running the stack setup for verification testing:
```
llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml
```
Together, passing 100% as an example:
```
pytest -s -v 'tests/verifications/openai_api/test_responses.py' --provider=together-llama-stack
```
## Documentation
We will need to start documenting the OpenAI APIs, but for now the
Responses stuff is still rapidly evolving so delaying that.
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
reduces duplication and centralizes information to be easier to find for
contributors
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
While adding other tests, I came across this and wasn’t sure how useful
it is. It doesn’t seem to be exercised anywhere in CI, but I figured I’d
fix it anyway. Happy to remove it if preferred. :)
## Test Plan
Run:
```
uv run pytest tests/integration/inference --stack-config=ollama --report=test_report.md -v --text-model="llama3.2:3b" --embedding-model=all-MiniLM-L6-v2
```
Look at the produced `test_report.md`.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This adds a config option for a CA to be specified with which client
certs are verified. If specified, client certs are required. This offers
a simple way of securing access to the server.
(Note: at present it is not possible to access the details of the client
certificate using uvicorn (unless it was monkey patched). Though there
is a defined TLS extension for ASGI, this is not implemented in uvicorn
pending a review and likely change to the specification. See
https://github.com/encode/uvicorn/pull/1119 and
https://github.com/django/asgiref/issues/466. Without access to the DN
it isn't possible to set user access attributes for a mutually
authenticated TLS connection, so more fine-grained access control is
not yet possible).
## Test Plan
Used proposed config option to specify a CA and verified that the server
can only be accessed with a valid client certificate.
Signed-off-by: Gordon Sim <gsim@redhat.com>
# What does this PR do?
The ollama provider was using an older variant of the code to convert
incoming parameters from the OpenAI API completions and chat completion
endpoints into requests that get sent to the backend provider over its
own OpenAI client. This updates it to use the common
`prepare_openai_completion_params` method used elsewhere, which takes
care of removing stray `None` values even for nested structures.
Without this, some other parameters, even if they have values of `None`,
make their way to ollama and actually influence its inference output as
opposed to when those parameters are not sent at all.
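The shared helper essentially filters out `None`-valued parameters before they reach the backend client; a simplified sketch of that behavior:
```python
def drop_none(value):
    # Recursively remove None entries (including in nested structures) so
    # unset parameters are omitted from the request rather than sent as nulls.
    if isinstance(value, dict):
        return {k: drop_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_none(v) for v in value if v is not None]
    return value
```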
## Test Plan
This passes tests/integration/inference/test_openai_completion.py and
fixes the issue found in #2098, which was tested via manual curl
requests crafted a particular way.
Closes #2098
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This new check will fail if some webmethods are missing the ellipsis:
```
API Method Return Type Validation Errors:
Method Api.eval.job_result does not contain ellipsis (...) in its implementation
Method Api.agents.create_agent_turn does not contain ellipsis (...) in its implementation
Method Api.agents.create_openai_response does not contain ellipsis (...) in its implementation
Method Api.eval.evaluate_rows does not contain ellipsis (...) in its implementation
Method Api.eval.run_eval does not contain ellipsis (...) in its implementation
```
Unless not implemented.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This PR adds stubs to the end of functions create_agent_turn,
create_openai_response and job_result.
## Test Plan
Ran provided unit tests
```
$ INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
CHROMADB_URL=http://localhost:8000 \
llama stack build --image-type conda --image-name llama \
--providers vector_io=remote::chromadb,inference=remote::ollama \
--run
...
File ".../llama_stack/providers/remote/vector_io/chroma/chroma.py", line 31, in <module>
ChromaClientType = chromadb.AsyncHttpClient | chromadb.PersistentClient
TypeError: unsupported operand type(s) for |: 'function' and 'function'
```
Issue: AsyncHttpClient and PersistentClient are functions that return
AsyncClientAPI and ClientAPI types, respectively. `|` cannot be used to
construct a type from functions.
Previously the code was Union[AsyncHttpClient, PersistentClient], which
did not trigger an error.
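One plausible shape of the fix, building the alias from the returned API types instead of the factory functions (exact import location may vary by chromadb version):
```python
from chromadb.api import AsyncClientAPI, ClientAPI

# Union of the client types actually returned by AsyncHttpClient/PersistentClient.
ChromaClientType = AsyncClientAPI | ClientAPI
```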
# What does this PR do?
Closes #2135
# What does this PR do?
As per docs [1], since python 3.11 wait_for() raises TimeoutError. Since
we currently support python 3.10+, we have to catch both.
[1]:
https://docs.python.org/3.12/library/asyncio-task.html#asyncio.wait_for
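The hardening amounts to catching both exception types (on 3.11+ they are aliases, so this is harmless there); a minimal sketch:
```python
import asyncio

async def wait_with_timeout(coro, timeout: float):
    try:
        return await asyncio.wait_for(coro, timeout=timeout)
    except (asyncio.TimeoutError, TimeoutError):
        # Python 3.10 raises asyncio.TimeoutError; 3.11+ raises the builtin.
        return None
```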
## Test Plan
No explicit testing; just code hardening to reflect docs.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
This PR fixes the behavior of the `/tool-runtime/rag-tool/query`
endpoint when invoked with an empty `vector_db_ids` parameter.
As of now, it simply returns an empty result, which leads to a
misleading error message from the server and makes it difficult and
time-consuming to detect the problem with the input parameter.
The proposed fix is to return an indicative error message in this case.
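The fix amounts to an early guard along these lines (illustrative sketch; the message matches the patched output shown below):
```python
def validate_vector_db_ids(vector_db_ids: list[str]) -> None:
    # Fail fast instead of silently returning an empty result.
    if not vector_db_ids:
        raise ValueError(
            "No vector DBs were provided to the RAG tool. Please provide at least one DB."
        )
```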
## Test Plan
Running the following script:
```
agent = Agent(
client,
model=MODEL_ID,
instructions=SYSTEM_PROMPT,
tools=[
dict(
name="builtin::rag/knowledge_search",
args={
"vector_db_ids": [],
},
)
],
)
response = agent.create_turn(
messages=[
{
"role": "user",
"content": "How to install OpenShift?",
}
],
session_id=agent.create_session(f"rag-session")
)
```
results in the following error message in the non-patched version:
```
{"type": "function", "name": "knowledge_search", "parameters": {"query": "installing OpenShift"}}400: Invalid value: Tool call result (id: 494b8020-90bb-449b-aa76-10960d6b2cc2, name: knowledge_search) does not have any content
```
and in the following one in the patched version:
```
{"type": "function", "name": "knowledge_search", "parameters": {"query": "installing OpenShift"}}400: Invalid value: No vector DBs were provided to the RAG tool. Please provide at least one DB.
```
# What does this PR do?
Adds the API to query metrics from telemetry.
## Test Plan
llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml
---------
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
We are dropping configuration via CLI flags almost entirely. If any
server configuration has to be tweaked, it must be done through the server
section in the run.yaml.
This is unfortunately a breaking change for whoever was using:
* `--tls-*`
* `--disable_ipv6`
`--port` stays around and gets special treatment since we believe it's
common for developers to change the port for quick experimentation.
Closes: https://github.com/meta-llama/llama-stack/issues/1076
## Test Plan
Simply do `llama stack run <config>` nothing should break :)
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
In our OpenAI API verification tests, some providers were still calling
tools even when `tool_choice="none"` was passed in the chat completion
requests. Because they aren't all respecting `tool_choice` properly,
this adjusts our routing implementation to remove the `tools` and
`tool_choice` from the request if `tool_choice="none"` is passed in so
that it does not attempt to call any of those tools. Adjusting this in
the router fixes this across all providers.
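Conceptually the routing change is just (simplified sketch):
```python
def adjust_request_params(params: dict) -> dict:
    # Some providers still call tools despite tool_choice="none", so drop both
    # fields from the request before forwarding it.
    if params.get("tool_choice") == "none":
        params.pop("tools", None)
        params.pop("tool_choice", None)
    return params
```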
This also cleans up the non-streaming together.ai responses for tools,
ensuring it returns `None` instead of an empty list when there are no
tool calls, to exactly match the OpenAI API responses in that case.
## Test Plan
I observed existing failures in our OpenAI API verification suite - see
https://github.com/bbrowning/llama-stack-tests/blob/main/openai-api-verification/2025-04-27.md#together-llama-stack
for the failing `test_chat_*_tool_choice_none` tests. All streaming and
non-streaming variants were failing across all 3 tested models.
After this change, all of those 6 failing tests are now passing with no
regression in the other tests.
I verified this via:
```
llama stack run --image-type venv \
tests/verifications/openai-api-verification-run.yaml
```
```
python -m pytest -s -v \
'tests/verifications/openai_api/test_chat_completion.py' \
--provider=together-llama-stack
```
The entire verification suite is not 100% on together.ai yet, but it's
getting closer.
This also increased the pass rate for fireworks.ai, and did not regress
the groq or openai tests at all.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Switch the SambaNova inference adapter to LiteLLM usage to simplify
integration and solve issues with the current adapter when streaming and
tool calling; models and templates updated.
## Test Plan
```
pytest -s -v tests/integration/inference/test_text_inference.py \
  --stack-config=sambanova \
  --text-model=sambanova/Meta-Llama-3.3-70B-Instruct
pytest -s -v tests/integration/inference/test_vision_inference.py \
  --stack-config=sambanova \
  --vision-model=sambanova/Llama-3.2-11B-Vision-Instruct
```
# What does this PR do?
We don't have GA workflow enabled to proceed with automation so I am
doing this manually again.
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
Checks for RAGDocument of type InterleavedContent
I noticed when stepping through the code that the supported types for
`RAGDocument` included `InterleavedContent` as a content type. This type
is not checked before `doc.content` is regex-matched, which would cause a
runtime error. This change adds an explicit check for the type.
The only other part that I'm unclear on is how to handle the
`ImageContent` type since this would always just return `<image>` which
seems like an undesired behavior. Should the `InterleavedContent` type
be removed from `RAGDocument` and replaced with `URI | str`?
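For illustration, the added check amounts to something like the following sketch (the function name and error message are illustrative, not the exact implementation):
```python
# Sketch: only plain string content can safely be regex matched; anything else
# should fail with a clear error instead of a confusing one at match time.
def content_as_text(content) -> str:
    if isinstance(content, str):
        return content
    raise ValueError(
        f"Unsupported RAGDocument content type {type(content).__name__}; expected a string"
    )


print(content_as_text("OpenShift installation guide"))  # ok
# content_as_text(["interleaved", "content"])           # raises ValueError
```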
## Test Plan
[//]: # (## Documentation)
---------
Signed-off-by: Kevin <kpostlet@redhat.com>
# What does this PR do?
Mainly tried to cover the entire llama_stack/apis directory; we only
have one left. Some excludes were just no-ops.
Signed-off-by: Sébastien Han <seb@redhat.com>
# Issue
Closes#2073
# What does this PR do?
- Removes the `datasets.rst` from the list of document urls as it no
longer exists in torchtune. Referenced PR:
https://github.com/pytorch/torchtune/pull/1781
- Added a step to run `uv sync`. Previously, I would get the following
error:
```
➜ llama-stack git:(remove-deprecated-rst) uv venv --python 3.10
source .venv/bin/activate
Using CPython 3.10.13 interpreter at: /usr/bin/python3.10
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate
(llama-stack) ➜ llama-stack git:(remove-deprecated-rst) INFERENCE_MODEL=llama3.2:3b llama stack build --template ollama --image-type venv --run
zsh: llama: command not found...
```
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
To test: Run through `rag_agent` example in the `detailed_tutorial.md`
file.
[//]: # (## Documentation)
# What does this PR do?
Revert a change that by mistake forced efficiency_config on torchtune
provider
users.
```
fix: Don't require efficiency_config for torchtune
It was enforced by mistake when
0751a960a5 merged.
Other asserts made sense in that the code was written, potentially, to
always expect a non-None value. But not efficiency_config.
```
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
Don't use unicode characters in the codebase. ASCII-only is preferred
for compatibility or readability reasons
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This PR introduces a reusable GitHub Actions workflow for pulling and
running an Ollama model, with caching to avoid repeated downloads.
[//]: # (If resolving an issue, uncomment and update the line below)
Closes: #1949
## Test Plan
1. Trigger a workflow that uses the Ollama setup. Confirm that:
- The model is pulled successfully.
- It is placed in the correct directory, the official one for now (not
~ollama/.ollama/models as per the comment, so this still needs confirming).
2. Re-run the same workflow to validate that:
- The model is restored from the cache.
- Execution succeeds with the cached model.
[//]: # (## Documentation)
# What does this PR do?
For issue #[2010](https://github.com/meta-llama/llama-stack/issues/2010).
Currently, if we try to connect the Llama Stack server to a remote
Milvus instance that has TLS enabled, the connection fails because TLS
support is not implemented in the Llama Stack codebase. As a result,
users are unable to use secured Milvus deployments out of the box.
After adding this, users will be able to connect to a remote::milvus
instance that has TLS enabled.
If TLS is enabled:
```
vector_io:
- provider_id: milvus
provider_type: remote::milvus
config:
uri: "http://<host>:<port>"
token: "<user>:<password>"
secure: True
server_pem_path: "path/to/server.pem"
```
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
I have already tested it by connecting to a TLS-enabled Milvus instance,
and I was able to start the Llama Stack server.
# What does this PR do?
**Fixes** #1959
HuggingFace provides several loading paths that the datasets library can
use. My theory on why the test previously failed intermittently is that
`load_dataset(...)` may try several options, such as the local cache,
the Hugging Face Hub, or a dataset script, and one of those options
works inconsistently in the CI.
The HuggingFace datasets library relies on the `transformers` package to
load certain datasets such as `llamastack/simpleqa`, and by adding the
package, we can see the dataset is loaded consistently via the Hugging
Face Hub.
Please see PR in my fork demonstrating over 7 consecutive passes:
https://github.com/ChristianZaccaria/llama-stack/pull/1
**Some References:**
- https://github.com/huggingface/transformers/issues/8690
- https://huggingface.co/docs/datasets/en/loading
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Add fixtures for SqliteKVStore, DiskDistributionRegistry and
CachedDiskDistributionRegistry. And use them in tests that had all been
duplicating similar setups.
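The pattern is the usual pytest one: build the store once in a fixture and inject it into each test. A generic, self-contained illustration using only sqlite3 and pytest (not the actual llama-stack fixtures):
```python
import sqlite3

import pytest


@pytest.fixture
def kvstore(tmp_path):
    # One throwaway SQLite-backed store per test instead of repeating this
    # setup in every test module.
    conn = sqlite3.connect(tmp_path / "kvstore.db")
    conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
    yield conn
    conn.close()


def test_roundtrip(kvstore):
    kvstore.execute("INSERT INTO kv VALUES (?, ?)", ("model:foo", "registered"))
    (value,) = kvstore.execute(
        "SELECT value FROM kv WHERE key = ?", ("model:foo",)
    ).fetchone()
    assert value == "registered"
```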
## Test Plan
unit tests continue to run
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
We've disabled it for a while given that this hasn't worked as well as
expected given the frequent changes of llama_stack_client and how this
requires both repos to be in sync.
## Test Plan
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
- Clarified best practices for using `# noqa` and `# type: ignore`,
requiring justification comments
- Improved formatting for readability
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Several fixes to ensure the script runs properly on macOS & Podman:
- Automates Podman VM startup on macOS
- Fixes host-gateway handling
- Adds explicit ARM64 platform overrides (this also fixes the platform
warning on Docker)
- Switches health checks to in-container exec calls to avoid Podman
timeouts
- Minor formatting nits
# (Closes#2064 )
## Test Plan
- Manual testing on macOS and Podman
# What does this PR do?
When converting OpenAI message content for the "system" and "assistant"
roles to Llama Stack inference APIs (used for some providers when
dealing with Llama models via OpenAI API requests to get proper prompt /
tool handling), we were not properly converting any non-string content.
I discovered this while running the new Responses API verification suite
against the Fireworks provider, but instead of fixing it as part of the
ongoing work there, I split it out into a separate PR.
This fixes that, by using the `openai_content_to_content` helper we used
elsewhere to ensure content parts were mapped properly.
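Conceptually, the fix routes every role through the same content conversion instead of assuming a plain string; a simplified sketch (the part shapes follow OpenAI text content parts, and the helper here is illustrative):
```python
# Sketch: OpenAI message content may be a string or a list of content parts.
# Previously only the plain-string case was handled for system/assistant roles.
def content_to_text(content) -> str:
    if content is None:
        return ""
    if isinstance(content, str):
        return content
    # list of parts, e.g. [{"type": "text", "text": "..."}]
    return "".join(
        part.get("text", "") for part in content if part.get("type") == "text"
    )


parts = [{"type": "text", "text": "You are a "}, {"type": "text", "text": "helpful assistant."}]
print(content_to_text(parts))  # "You are a helpful assistant."
```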
## Test Plan
I added a couple of new tests to `test_openai_compat` to reproduce this
issue and validate its fix. I ran those as below:
```
python -m pytest -s -v tests/unit/providers/utils/inference/test_openai_compat.py
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
The builtin implementation of code interpreter is not robust and has a
really weak sandboxing shell (the `bubblewrap` container). Given the
availability of better MCP code interpreter servers coming up, we should
use them instead of baking an implementation into the Stack and
expanding the vulnerability surface to the rest of the Stack.
This PR only does the removal. We will add examples with how to
integrate with MCPs in subsequent ones.
## Test Plan
Existing tests.
# What does this PR do?
The goal of this PR is code base modernization.
Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)
Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worth noting can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py` where reflection code was updated
to deal with "newer" types.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
Nullable param type is not supported, e.g. ['string', 'null'], since it
fails type validation.
Tests:
Run inference with:
```
messages:
  - content: You are a helpful assistant that can use tools to get information.
    role: system
  - content: What's the temperature in San Francisco in celsius?
    role: user
tools:
  - function:
      description: Get current temperature for a given location.
      name: get_weather
      parameters:
        additionalProperties: false
        properties:
          location:
            description: "City and country e.g. Bogot\xE1, Colombia"
            type: string
          unit:
            description: "Unit of temperature, default to celsius"
            type: [string, "null"] # <= nullable type
        required:
          - location
        type: object
    type: function
```
Co-authored-by: Eric Huang <erichuang@fb.com>
# What does this PR do?
Add support for the temperature to the responses API
## Test Plan
Manually tested simple case
unit tests added for simple case and tool calls
Signed-off-by: Derek Higgins <derekh@redhat.com>
All merges produced by github are pushes to main, which makes the check
fail. The check is local by design, not meant for CI.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
When the result of a ToolCall gets passed back into vLLM for the model
to handle the tool call result (as is often the case in agentic
tool-calling workflows), we forgot to handle the case where BuiltinTool
calls are not string values but instead instances of the BuiltinTool
enum. This fixes that, properly converting those enums to string values
before trying to serialize them into an OpenAI chat completion request
to vLLM.
PR #1931 fixed a bug where we weren't passing these tool calling results
back into vLLM, but as a side-effect it created this serialization bug
when using BuiltinTools.
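The conversion itself is small; a sketch of the idea (the enum below is a stand-in for the real `BuiltinTool`):
```python
from enum import Enum


class BuiltinTool(Enum):  # stand-in for the real enum
    brave_search = "brave_search"
    wolfram_alpha = "wolfram_alpha"


def tool_name_as_str(tool_name) -> str:
    # ToolCall.tool_name may be a plain string or a BuiltinTool member;
    # serialize both the same way before building the request for vLLM.
    return tool_name.value if isinstance(tool_name, Enum) else tool_name


print(tool_name_as_str(BuiltinTool.brave_search))  # "brave_search"
print(tool_name_as_str("get_weather"))             # "get_weather"
```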
Closes#2070
## Test Plan
I added a new unit test to the openai_compat unit tests to cover this
scenario, ensured the new test failed before this fix, and all the
existing tests there plus the new one passed with this fix.
```
python -m pytest -s -v tests/unit/providers/utils/inference/test_openai_compat.py
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Add several new pre-commit hooks to improve code quality and security:
- no-commit-to-branch: prevent direct commits to protected branches like
`main`
- check-yaml: validate YAML files
- detect-private-key: prevent accidental commit of private keys
- requirements-txt-fixer: maintain consistent requirements.txt format
and sorting
- mixed-line-ending: enforce LF line endings to avoid mixed line endings
- check-executables-have-shebangs: ensure executable scripts have
shebangs
- check-json: validate JSON files
- check-shebang-scripts-are-executable: verify shebang scripts are
executable
- check-symlinks: validate symlinks and report broken ones
- check-toml: validate TOML files mainly for pyproject.toml
The respective fixes have been included.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Partial revert of fa68ded07c.
This commit ensures users know where their new templates are generated
and how to run the newly built distro locally.
Discussion on Discord:
1351652390
## Test Plan
Did a local run - let me know if we want any unit testing covering this

## Documentation
Updated "Zero to Hero" guide with new output
---------
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
- Added new Ruff lint rules to detect ambiguous or non-ASCII characters:
- Added per-file ignores where Unicode usage is still required.
- Fixed whatever had to be fixed
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
When running a Llama Stack server and invoking the
`/v1/safety/run-shield` endpoint, the NVIDIA Guardrails endpoint in some
cases errors with a `422: Unprocessable Entity` due to malformed input.
For example, given a request body like:
```
{
"model": "test",
"messages": [
{ "role": "user", "content": "You are stupid." }
]
}
```
`convert_pydantic_to_json_value` converts the message to:
```
{ "role": "user", "content": "You are stupid.", "context": null }
```
Which causes NVIDIA Guardrails to return an error `HTTPError: 422 Client
Error: Unprocessable Entity for url:
http://nemo.test/v1/guardrail/checks`, because `context` shouldn't be
included in the body.
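One way to avoid the 422, sketched here with plain dicts rather than the provider's actual conversion code, is to drop `None`-valued fields after converting the Pydantic message:
```python
# Sketch: drop keys whose value is None so optional fields like "context"
# never reach the Guardrails request body.
def strip_none_fields(message: dict) -> dict:
    return {key: value for key, value in message.items() if value is not None}


converted = {"role": "user", "content": "You are stupid.", "context": None}
print(strip_none_fields(converted))  # {'role': 'user', 'content': 'You are stupid.'}
```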
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
I ran the Llama Stack server locally and manually verified that the
endpoint now succeeds.
```
message = {"role": "user", "content": "You are stupid."}
response = client.safety.run_shield(messages=[message], shield_id=shield_id, params={})
```
Server logs:
```
14:29:09.656 [START] /v1/safety/run-shield
INFO: 127.0.0.1:54616 - "POST /v1/safety/run-shield HTTP/1.1" 200 OK
14:29:09.918 [END] /v1/safety/run-shield [StatusCode.OK] (262.26ms
```
[//]: # (## Documentation)
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
Replaced `${env.OTEL_SERVICE_NAME:\u200B}` and similar variants with
properly formatted `${env.OTEL_SERVICE_NAME:}` across all YAML templates
and TelemetryConfig. This prevents silent parsing issues and ensures
consistent environment variable resolution.
Slipped in https://github.com/meta-llama/llama-stack/pull/2058
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
* pull the embedding model so that it's not pulled during the distro
server startup sequence
* cache the models
* collect logs at the end of the workflow
Signed-off-by: Sébastien Han <seb@redhat.com>
Distribution Template Codegen was broken
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
* new workflow job **build-ubi9-container-distribution**
* runs on the default `ubuntu-latest` runner
* uses the existing `dev` template
* invokes `uv run llama stack build` with `.container_base =
"registry.access.redhat.com/ubi9/ubi-minimal:latest"`
* inspects the resulting image to verify its entrypoint
# (Closes#1994)
## Test Plan
- CI now includes the `build-ubi9-container-distribution` job and will
turn green when that job passes on changes to build files
# What does this PR do?
The telemetry provider config is the only one that leverages the env var
`SQLITE_DB_PATH` for pointing to persistent data in the respective
templates, whereas usually `SQLITE_STORE_DIR` is used.
This PR modifies the `sqlite_db_path` in various telemetry configuration
files to use the environment variable `SQLITE_STORE_DIR` instead of
`SQLITE_DB_PATH`. This change ensures that _only_ the SQLITE_STORE_DIR
needs to be set to point to a different persistence location for
providers.
All references to `SQLITE_DB_PATH` have been removed.
Another improvement could be to move `sqlite_db_path` to `db_path` in
the telemetry provider config, to align with the other provider
configurations. That could be done by another PR (if wanted).
# What does this PR do?
This builds on top of
https://github.com/meta-llama/llama-stack/pull/2037 to include some
additional changes to fix integration tests builds.
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
In our OpenAI API verification tests, ollama was still calling tools
even when `tool_choice="none"` was passed in its chat completion
requests. Because ollama isn't respecting `tool_choice` properly, this
adjusts our provider implementation to remove the `tools` from the
request if `tool_choice="none"` is passed in so that it does not attempt
to call any of those tools.
## Test Plan
I tested this with a couple of Llama models, using both our OpenAI
completions integration tests and our verification test suites.
### OpenAI Completions / Chat Completions integration tests
These all passed before, and still do.
```
INFERENCE_MODEL="llama3.2:3b-instruct-fp16" \
llama stack build --template ollama --image-type venv --run
```
```
LLAMA_STACK_CONFIG=http://localhost:8321 \
python -m pytest -v \
tests/integration/inference/test_openai_completion.py \
--text-model "llama3.2:3b-instruct-fp16"
```
### OpenAI API Verification test suite
test_chat_*_tool_choice_none OpenAI API verification tests pass now,
when they failed before.
See
https://github.com/bbrowning/llama-stack-tests/blob/main/openai-api-verification/2025-04-27.md#ollama-llama-stack
for an example of these failures from a recent nightly CI run.
```
INFERENCE_MODEL="llama3.3:70b-instruct-q3_K_M" \
llama stack build --template ollama --image-type venv --run
```
```
cat <<-EOF > tests/verifications/conf/ollama-llama-stack.yaml
base_url: http://localhost:8321/v1/openai/v1
api_key_var: OPENAI_API_KEY
models:
- llama3.3:70b-instruct-q3_K_M
model_display_names:
llama3.3:70b-instruct-q3_K_M: Llama-3.3-70B-Instruct
test_exclusions:
llama3.3:70b-instruct-q3_K_M:
- test_chat_non_streaming_image
- test_chat_streaming_image
- test_chat_multi_turn_multiple_images
EOF
```
```
python -m pytest -s -v \
'tests/verifications/openai_api/test_chat_completion.py' \
--provider=ollama-llama-stack
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This PR updates how the `AgentType` gets set using the radio button on
the tools page of the playground. This change is needed because, with
the current implementation, the chat interface resets after every input,
preventing users from having a multi-turn conversation with the agent.
## Test Plan
Run the Playground without these changes:
```bash
streamlit run llama_stack/distribution/ui/app.py
```
Navigate to the tools page and attempt to have a multi-turn
conversation. You should see the conversation reset after asking a
second question.
Repeat the steps above with these changes and you will see that it works
as expected when asking the agent multiple questions.
Signed-off-by: Michael Clifford <mcliffor@redhat.com>
# What does this PR do?
This provides an initial [OpenAI Responses
API](https://platform.openai.com/docs/api-reference/responses)
implementation. The API is not yet complete, and this is more a
proof-of-concept to show how we can store responses in our key-value
stores and use them to support the Responses API concepts like
`previous_response_id`.
## Test Plan
I've added a new
`tests/integration/openai_responses/test_openai_responses.py` as part of
a test-driven development for this new API. I'm only testing this
locally with the remote-vllm provider for now, but it should work with
any of our inference providers since the only API it requires out of the
inference provider is the `openai_chat_completion` endpoint.
```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack build --template remote-vllm --image-type venv --run
```
```
LLAMA_STACK_CONFIG="http://localhost:8321" \
python -m pytest -v \
tests/integration/openai_responses/test_openai_responses.py \
--text-model "meta-llama/Llama-3.2-3B-Instruct"
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
This commit adds a new authentication system to the Llama Stack server
with support for Kubernetes and custom authentication providers. Key
changes include:
- Implemented KubernetesAuthProvider for validating Kubernetes service
account tokens
- Implemented CustomAuthProvider for validating tokens against external
endpoints - this is the same code that was already present.
- Added test for Kubernetes
- Updated server configuration to support authentication settings
- Added documentation for authentication configuration and usage
The authentication system supports:
- Bearer token validation
- Kubernetes service account token validation
- Custom authentication endpoints
## Test Plan
Setup a Kube cluster using Kind or Minikube.
Run a server with:
```
server:
port: 8321
auth:
provider_type: kubernetes
config:
api_server_url: http://url
ca_cert_path: path/to/cert (optional)
```
Run:
```
curl -s -L -H "Authorization: Bearer $(kubectl create token my-user)" http://127.0.0.1:8321/v1/providers
```
Or replace "my-user" with your service account.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Implementation of the NeMo Datastore register and unregister APIs.
Open Issues:
- provider_id gets set to `localfs` in client.datasets.register() as it
is specified in routing_tables.py: DatasetsRoutingTable (see #1860).
Currently I have passed `"provider_id": "nvidia"` in metadata and parsed
that in `DatasetsRoutingTable`.
(Not the best approach, but just a quick workaround to make it work for
now.)
## Test Plan
- Unit test cases: `pytest
tests/unit/providers/nvidia/test_datastore.py`
```bash
========================================================== test session starts ===========================================================
platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0, asyncio-0.26.0, nbval-0.11.0, metadata-3.1.1, html-4.1.1, cov-6.1.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 2 items
tests/unit/providers/nvidia/test_datastore.py .. [100%]
============================================================ warnings summary ============================================================
====================================================== 2 passed, 1 warning in 0.84s ======================================================
```
cc: @dglogo, @mattf, @yanxi0830
# What does this PR do?
Add installation script for Llama Stack Meta Reference distro (Docker
only).
# Closes#1374
## Test Plan
./install.sh
---------
Co-authored-by: Sébastien Han <seb@redhat.com>
This resolves a new critical severity on h11. See
https://access.redhat.com/security/cve/cve-2025-43859. We should
consider releasing a new patch with this fix.
This was updated via:
```
uv add "h11>=0.16.0"
uv export --frozen --no-hashes --no-emit-project --output-file=requirements.txt
```
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
There are new changes in the repo that require adding some additional
functions to the inference provider, which this fixes. We also need one
additional param to pass some extra arguments to watsonx.ai.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
---------
Co-authored-by: Sajikumar JS <sajikumar.js@ibm.com>
# What does this PR do?
This addresses 2 bugs I ran into when launching a fine-tuning job with
the NVIDIA Adapter:
1. Session handling in `_make_request` helper function returns an error.
```
INFO: 127.0.0.1:55831 - "POST /v1/post-training/supervised-fine-tune HTTP/1.1" 500 Internal Server Error
16:11:45.643 [END] /v1/post-training/supervised-fine-tune [StatusCode.OK] (270.44ms)
16:11:45.643 [ERROR] Error executing endpoint route='/v1/post-training/supervised-fine-tune' method='post'
Traceback (most recent call last):
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 201, in endpoint
return await maybe_await(value)
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 161, in maybe_await
return await value
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/providers/remote/post_training/nvidia/post_training.py", line 408, in supervised_fine_tune
response = await self._make_request(
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/providers/remote/post_training/nvidia/post_training.py", line 98, in _make_request
async with self.session.request(method, url, params=params, json=json, **kwargs) as response:
File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/aiohttp/client.py", line 1425, in __aenter__
self._resp: _RetType = await self._coro
File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/aiohttp/client.py", line 579, in _request
handle = tm.start()
File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/aiohttp/helpers.py", line 587, in start
return self._loop.call_at(when, self.__call__)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/base_events.py", line 724, in call_at
self._check_closed()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/base_events.py", line 510, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
```
Note: This only occurred when initializing the client like so:
```
client = LlamaStackClient(
base_url="http://0.0.0.0:8321"
)
response = client.post_training.supervised_fine_tune(...) # Returns error
```
I didn't run into this issue when using the library client:
```
client = LlamaStackAsLibraryClient("nvidia")
client.initialize()
response = client.post_training.supervised_fine_tune(...) # Works fine
```
2. The `algorithm_config` param in `supervised_fine_tune` is parsed as a
`dict` when run from unit tests, but a Pydantic model when invoked using
the Llama Stack client. So, the call fails outside of unit tests:
```
INFO: 127.0.0.1:54024 - "POST /v1/post-training/supervised-fine-tune HTTP/1.1" 500 Internal Server Error
21:14:02.315 [END] /v1/post-training/supervised-fine-tune [StatusCode.OK] (71.18ms)
21:14:02.314 [ERROR] Error executing endpoint route='/v1/post-training/supervised-fine-tune' method='post'
Traceback (most recent call last):
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 205, in endpoint
return await maybe_await(value)
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 164, in maybe_await
return await value
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/providers/remote/post_training/nvidia/post_training.py", line 407, in supervised_fine_tune
"adapter_dim": algorithm_config.get("adapter_dim"),
File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/pydantic/main.py", line 891, in __getattr__
raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')
AttributeError: 'LoraFinetuningConfig' object has no attribute 'get'
```
The code assumes `algorithm_config` should be `dict`, so I just handle
both cases.
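Handling both shapes can be as simple as normalizing to a dict first; a sketch (the config class here is a stand-in named after the one in the traceback, not the real implementation):
```python
from pydantic import BaseModel


class LoraFinetuningConfig(BaseModel):  # stand-in named after the class in the traceback
    adapter_dim: int = 8


def as_dict(algorithm_config) -> dict:
    # Unit tests pass a plain dict; the Llama Stack client passes a Pydantic model.
    if isinstance(algorithm_config, BaseModel):
        return algorithm_config.model_dump()
    return algorithm_config


print(as_dict({"adapter_dim": 16})["adapter_dim"])                   # 16
print(as_dict(LoraFinetuningConfig(adapter_dim=32))["adapter_dim"])  # 32
```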
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
1. I ran a local Llama Stack server with the necessary env vars:
```
llama stack run llama_stack/templates/nvidia/run.yaml --port 8321 --env ...
```
And invoked `supervised_fine_tune` to confirm neither of the errors
above occur.
```
client = LlamaStackClient(
base_url="http://0.0.0.0:8321"
)
response = client.post_training.supervised_fine_tune(...)
```
2. I confirmed the unit tests still pass: `./scripts/unit-tests.sh
tests/unit/providers/nvidia/test_supervised_fine_tuning.py`
[//]: # (## Documentation)
---------
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
## Test Plan
LLAMA_STACK_CONFIG=http://localhost:5002 pytest -s -v
tests/integration/inference --safety-shield meta-llama/Llama-Guard-3-8B
--vision-model meta-llama/Llama-4-Scout-17B-16E-Instruct --text-model
meta-llama/Llama-4-Scout-17B-16E-Instruct
adding the --gpu all flag to Docker run commands
for meta-reference-gpu distributions ensures models are loaded into GPU
instead of CPU.
Remove docs for meta-reference-quantized-gpu
The distribution was removed in #1887
but these files were left behind.
Fixes: #1798
# What does this PR do?
Fixes doc to add --gpu all command to docker run
[//]: # (If resolving an issue, uncomment and update the line below)
Closes#1798
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
verified in docker documentation but untested
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Introduce a `.coveragerc` file to omit:
- test files (*/tests/*)
- provider code (*/llama_stack/providers/*)
- template files (*/llama_stack/templates/*)
- virtual environment (.venv/*)
This ensures coverage reports focus on core application logic (API and
CLI).
Note: I'm opening this for discussion as well - we might decide to
ignore more and/or re-add some directories!
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
IBM watsonx.ai added as an inference provider
([#1741](https://github.com/meta-llama/llama-stack/issues/1741)).
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
---------
Co-authored-by: Sajikumar JS <sajikumar.js@ibm.com>
# What does this PR do?
Enhances the user experience in the `llama stack build` command by
adding interactive TAB completion for image type selection. This ensures
the UX consistency with other parts of the CLI that already support tab
completion, such as provider selection, providing a more intuitive and
discoverable interface for users.
<img width="1531" alt="image"
src="https://github.com/user-attachments/assets/12161d45-451d-4820-b34d-7ea4decf810f"
/>
# What does this PR do?
This PR improves the Tools page in the LlamaStack Playground UI by
enhancing the readability of the active tool list shown in the sidebar.
- Previously, active tools were displayed in a flat JSON array with
verbose identifiers (e.g., builtin::code_interpreter:code_interpreter).
- This PR updates the logic to group tools by their toolgroup (e.g.,
builtin::websearch) and renders each tool name in a simplified,
human-readable format (e.g., web_search).
- This change improves usability when working with multiple toolgroups,
especially in configurations involving MCP tools or complex tool
identifiers.
Before and After Comparison:
**Before**

**After**

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
- Followed the [LlamaStack UI Developer Setup
instructions](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/distribution/ui)
- Ran the Streamlit UI via: `uv run --with "[.ui]" streamlit run
llama_stack/distribution/ui/app.py`
- Selected multiple built-in toolgroups (e.g., code_interpreter,
websearch, wolfram_alpha) from the sidebar.
[//]: # (## Documentation)
# What does this PR do?
Adding the nbformat version fixes this issue. Not sure exactly why this
needs to be done, but this version field was rewritten to the bottom of
the notebook file when I changed its name while trying to get to the
bottom of this. When I opened it on GitHub the issue was no longer present.
Closes#1837
## Test Plan
N/A
# What does this PR do?
Adds custom model registration functionality to NVIDIAInferenceAdapter,
which lets inference happen on:
- post-training models
- non-Llama models in the API Catalogue (behind
https://integrate.api.nvidia.com and endpoints compatible with
AsyncOpenAI)
## Example Usage:
```python
from llama_stack.apis.models import Model, ModelType
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
client = LlamaStackAsLibraryClient("nvidia")
_ = client.initialize()
client.models.register(
model_id=model_name,
model_type=ModelType.llm,
provider_id="nvidia"
)
response = client.inference.chat_completion(
model_id=model_name,
messages=[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Write a limerick about the wonders of GPU computing."}],
)
```
## Test Plan
```bash
pytest tests/unit/providers/nvidia/test_supervised_fine_tuning.py
========================================================== test session starts ===========================================================
platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0
collected 6 items
tests/unit/providers/nvidia/test_supervised_fine_tuning.py ...... [100%]
============================================================ warnings summary ============================================================
../miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076
/home/ubuntu/miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076: PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'contentEncoding'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
warn(
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================== 6 passed, 1 warning in 1.51s ======================================================
```
[//]: # (## Documentation)
Updated Readme.md
cc: @dglogo, @sumitb, @mattf
# What does this PR do?
This PR adds support for NVIDIA's NeMo Evaluator API to the Llama Stack
eval module. The integration enables users to evaluate models via the
Llama Stack interface.
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
1. Added unit tests and successfully ran from root of project:
`./scripts/unit-tests.sh tests/unit/providers/nvidia/test_eval.py`
```
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_cancel PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_result PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_status PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_register_benchmark PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_run_eval PASSED
```
2. Verified I could build the Llama Stack image: `LLAMA_STACK_DIR=$(pwd)
llama stack build --template nvidia --image-type venv`
Documentation added to
`llama_stack/providers/remote/eval/nvidia/README.md`
---------
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
This expands the `test_sse` test suite and fixes some edge cases with
bugs in our SSE error handling to ensure streaming clients always get a
proper error response.
First, we handle the case where a client disconnects before we actually
start streaming the response back. Previously we only handled the case
where a client disconnected as we were streaming the response, but there
was an edge case where a client disconnecting before we streamed any
response back did not trigger our logic to cleanly handle that
disconnect.
Second, we handle the case where an error is thrown from the server
before the actual async generator gets created from the provider. This
happens in scenarios like the newly merged OpenAI API input validation,
where we eagerly raise validation errors before returning the async
generator object that streams the responses back.
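In outline, the provider's generator has to be created inside the same try/except that handles streaming, so errors raised before any chunk is produced still become a proper SSE error event; a simplified, framework-free sketch (not the actual server code):
```python
import asyncio
import json


async def sse_generator(create_stream):
    try:
        # Creating the provider stream can itself raise (e.g. input validation),
        # so it must happen inside the try block, not only the iteration.
        stream = await create_stream()
        async for chunk in stream:
            yield f"data: {json.dumps(chunk)}\n\n"
    except asyncio.CancelledError:
        # Client disconnected, possibly before any chunk was sent; just stop.
        return
    except Exception as exc:  # broad on purpose: every failure becomes an SSE error event
        yield f"data: {json.dumps({'error': {'message': str(exc)}})}\n\n"


async def main():
    async def failing_stream():
        raise ValueError("invalid tool_choice")

    async for event in sse_generator(failing_stream):
        print(event, end="")


asyncio.run(main())
```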
## Test Plan
Tested via:
```
python -m pytest -s -v tests/unit/server/test_sse.py
```
Both test cases failed before, and passed afterwards. The test cases
were written based on me experimenting with actual clients that would do
bad things like randomly disconnect or send invalid input in streaming
mode and I hit these two cases, where things were misbehaving in our
error handling.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Include the tool call details with the chat when doing Rag with Remote
vllm
Fixes: #1929
With this PR the tool call is included in the chat returned to vLLM, and
the model (meta-llama/Llama-3.1-8B-Instruct) then returns the answer as
expected.
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Remove `distributions/**` from integration, external provider, and unit
tests
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
N/A
[//]: # (## Documentation)
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Update External Providers CI to not run on changes to docs, rfcs, and
scripts
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
This PR addresses the content dominance problem that frequently arises
with multiple models when executing queries with the RAG tool. When the
retrieved content is too large, it disproportionately influences the
generation process, causing the model to ignore the original question
and to provide meaningless comments on the retrieved information
instead.
This situation is especially common with agentic RAG, which is the
standard way of doing RAG in Llama Stack, since directly manipulating
the prompt combining the query with the retrieved content is not
possible.
This PR appends a grounding message to the results returned by the
knowledge search tool, reminding the model about the original query and
the purpose of the inference call. This makes the problem significantly
less likely to occur.
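Mechanically, the change just appends one extra text item to the tool's results; a sketch of the idea (the exact grounding wording used by the tool is not reproduced here):
```python
# Sketch: after assembling the retrieved chunks, remind the model of the
# original query so the retrieved text does not dominate the generation.
def with_grounding(results: list[str], query: str) -> list[str]:
    grounding = (
        "The above results were retrieved to help answer the user's query: "
        f"{query!r}. Use them to answer that query; do not merely summarize them."
    )
    return [*results, grounding]


chunks = ["OpenShift can be installed with the Assisted Installer...", "..."]
for item in with_grounding(chunks, "How to install OpenShift?"):
    print(item)
```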
## Test Plan
Running the following script before the fix demonstrates the content
dominance problem, where the model insists on commenting on the retrieved
content and refuses to address the question.
Running the script after the fix results in the correct answer.
```
import os
import uuid
from llama_stack_client import Agent, AgentEventLogger, RAGDocument, LlamaStackClient
# the server endpoint
LLAMA_STACK_SERVER_URL = "http://localhost:8321"
# inference settings
MODEL_ID = ""meta-llama/Llama-3.1-8B-Instruct"
SYSTEM_PROMPT = "You are a helpful assistant. "
# RAG settings
VECTOR_DB_EMBEDDING_MODEL = "all-MiniLM-L6-v2"
VECTOR_DB_EMBEDDING_DIMENSION = 384
VECTOR_DB_CHUNK_SIZE = 512
# initialize the server connection
client = LlamaStackClient(base_url=os.environ.get("LLAMA_STACK_ENDPOINT", LLAMA_STACK_SERVER_URL))
# init the RAG retrieval parameters
vector_db_id = f"test_vector_db_{uuid.uuid4()}"
vector_providers = [
provider for provider in client.providers.list() if provider.api == "vector_io"
]
vector_provider_to_use = vector_providers[0]
# define and register the document collection to be used
client.vector_dbs.register(
vector_db_id=vector_db_id,
embedding_model=VECTOR_DB_EMBEDDING_MODEL,
embedding_dimension=VECTOR_DB_EMBEDDING_DIMENSION,
provider_id=vector_provider_to_use.provider_id,
)
# ingest the documents into the newly created document collection
urls = [
("https://www.openshift.guide/openshift-guide-screen.pdf", "application/pdf"),
]
documents = [
RAGDocument(
document_id=f"num-{i}",
content=url,
mime_type=url_type,
metadata={},
)
for i, (url, url_type) in enumerate(urls)
]
client.tool_runtime.rag_tool.insert(
documents=documents,
vector_db_id=vector_db_id,
chunk_size_in_tokens=VECTOR_DB_CHUNK_SIZE,
)
queries = [
"How to install OpenShift?",
]
# initializing the agent
agent = Agent(
client,
model=MODEL_ID,
instructions=SYSTEM_PROMPT,
# we make our agent aware of the RAG tool by including builtin::rag/knowledge_search in the list of tools
tools=[
dict(
name="builtin::rag/knowledge_search",
args={
"vector_db_ids": [vector_db_id], # list of IDs of document collections to consider during retrieval
},
)
],
)
for prompt in queries:
print(f"User> {prompt}")
# create a new turn with a new session ID for each prompt
response = agent.create_turn(
messages=[
{
"role": "user",
"content": prompt,
}
],
session_id=agent.create_session(f"rag-session_{uuid.uuid4()}")
)
# print the response, including tool calls output
for log in AgentEventLogger().log(response):
print(log.content, end='')
```
As part of the build process, we now include the generated run.yaml
(based of the provided build configuration file) into the container. We
updated the entrypoint to use this run configuration as well.
Given this simple distribution configuration:
```
# build.yaml
version: '2'
distribution_spec:
description: Use (an external) Ollama server for running LLM inference
providers:
inference:
- remote::ollama
vector_io:
- inline::faiss
safety:
- inline::llama-guard
agents:
- inline::meta-reference
telemetry:
- inline::meta-reference
eval:
- inline::meta-reference
datasetio:
- remote::huggingface
- inline::localfs
scoring:
- inline::basic
- inline::llm-as-judge
- inline::braintrust
tool_runtime:
- remote::brave-search
- remote::tavily-search
- inline::code-interpreter
- inline::rag-runtime
- remote::model-context-protocol
- remote::wolfram-alpha
container_image: "registry.access.redhat.com/ubi9"
image_type: container
image_name: test
```
Build it:
```
llama stack build --config build.yaml
```
Run it:
```
podman run --rm \
-p 8321:8321 \
-e OLLAMA_URL=http://host.containers.internal:11434 \
--name llama-stack-server \
localhost/leseb-test:0.2.2
```
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
the ramalama team has decided to rename their external provider
`ramalama-stack` (more catchy!). Update docs accordingly
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
When clients called the OpenAI API with invalid input that wasn't
caught by our own Pydantic API validation but instead only caught by the
backend inference provider, that backend inference provider was
returning a HTTP 400 error. However, we were wrapping that into a HTTP
500 error, obfuscating the actual issue from calling clients and
triggering OpenAI client retry logic.
This change adjusts our existing `translate_exception` method in
`server.py` to wrap `openai.BadRequestError` as HTTP 400 errors, passing
through the string representation of the error message to the calling
user so they can see the actual input validation error and correct it. I
tried changing this in a few other places, but ultimately
`translate_exception` was the only real place to handle this for both
streaming and non-streaming requests across all inference providers that
use the OpenAI server APIs.
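The shape of the change in `translate_exception` is roughly the following sketch, written with stand-in exception classes rather than the real FastAPI/OpenAI types:
```python
# Sketch with stand-ins: map a provider-side bad-request error to HTTP 400
# instead of a generic 500, passing the message through to the caller.
class BadRequestError(Exception):  # stand-in for openai.BadRequestError
    pass


class HTTPError(Exception):
    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def translate_exception(exc: Exception) -> HTTPError:
    if isinstance(exc, BadRequestError):
        # Surface the provider's validation message so callers can fix their input.
        return HTTPError(status_code=400, detail=str(exc))
    return HTTPError(status_code=500, detail="Internal server error: An unexpected error occurred.")


print(translate_exception(BadRequestError("messages must not be empty")).status_code)  # 400
```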
This also tightens up our validation a bit for the OpenAI chat
completions API, to catch empty `messages` parameters, invalid
`tool_choice` parameters, invalid `tools` items, or passing
`tool_choice` when `tools` isn't given.
Lastly, this extends our OpenAI API chat completions verifications to
also check for consistent input validation across providers. Providers
behind Llama Stack should automatically pass all the new tests due to
the input validation added here, but some of the providers fail this
test when not run behind Llama Stack due to differences in how they
handle input validation and errors.
(Closes#1951)
## Test Plan
To test this, start an OpenAI API verification stack:
```
llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml
```
Then, run the new verification tests with your provider(s) of choice:
```
python -m pytest -s -v \
tests/verifications/openai_api/test_chat_completion.py \
--provider openai-llama-stack
python -m pytest -s -v \
tests/verifications/openai_api/test_chat_completion.py \
--provider together-llama-stack
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This PR adds the name of the tool that is used by the agent on the
"tools" page of the playground. See image below for an example.

## Test Plan
Run the playground and navigate to the tools page. There users can see
that this additional text is present when tools are invoked and absent
when they are not.
```
streamlit run llama_stack/distribution/ui/app.py
```
Signed-off-by: Michael Clifford <mcliffor@redhat.com>
# What does this PR do?
Previously, when a streaming client would disconnect before we were
finished streaming the entire response, an error like the below would
get raised from the `sse_generator` function in
`llama_stack/distribution/server/server.py`:
```
AttributeError: 'coroutine' object has no attribute 'aclose'. Did you mean: 'close'?
```
This was because we were calling `aclose` on a coroutine instead of the
awaited value from that coroutine. This change fixes that, so that we
save off the awaited value and then can call `aclose` on it if we
encounter an `asyncio.CancelledError`, like we see when a client
disconnects before we're finished streaming.
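The core of the fix is to await the coroutine once and keep a handle to the resulting async generator, so `aclose` is called on the right object; a stripped-down sketch (not the actual `sse_generator` implementation):
```python
import asyncio


async def sse_generator(event_gen_coroutine):
    # Await the coroutine first; aclose() exists on the resulting async
    # generator, not on the coroutine object itself.
    event_gen = await event_gen_coroutine
    try:
        async for event in event_gen:
            yield f"data: {event}\n\n"
    except asyncio.CancelledError:
        # Client disconnected mid-stream; close the underlying generator cleanly.
        await event_gen.aclose()
        raise


async def main():
    async def make_events():
        async def events():
            yield "one"
            yield "two"

        return events()

    async for line in sse_generator(make_events()):
        print(line, end="")


asyncio.run(main())
```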
The other changes in here are to add a simple set of tests for the happy
path of our SSE streaming and this client disconnect path.
That unfortunately requires adding one more dependency into our unit
test section of pyproject.toml since `server.py` requires loading some
of the telemetry code for me to test this functionality.
## Test Plan
I wrote the tests in `tests/unit/server/test_sse.py` first, verified the
client disconnected test failed before my change, and that it passed
afterwards.
```
python -m pytest -s -v tests/unit/server/test_sse.py
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Add examples for how to define RAGDocuments. Not sure if this is the
best place for these docs. @raghotham Please advise
## Test Plan
None, documentation
[//]: # (## Documentation)
Signed-off-by: Kevin <kpostlet@redhat.com>
# What does this PR do?
Closes#1968.
The asynchronous client in `VLLMInferenceAdapter` is now initialized
directly before first use and not in `VLLMInferenceAdapter.initialize`.
This prevents issues arising due to accessing an expired event loop from
a completed `asyncio.run`.
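The pattern is plain lazy initialization: construct the client on first use, inside whatever event loop is serving the request, rather than in `initialize`. A generic sketch of that pattern (not the adapter's exact code):
```python
class LazyClientAdapter:
    """Sketch: defer client construction until the first request."""

    def __init__(self, base_url: str):
        self.base_url = base_url
        self._client = None

    @property
    def client(self):
        # Created lazily so it is bound to the event loop serving the request,
        # not to a loop that may already have been closed by asyncio.run().
        if self._client is None:
            self._client = self._create_client()
        return self._client

    def _create_client(self):
        # In the real adapter this would be an async OpenAI-compatible client
        # pointed at the vLLM server; a placeholder keeps the sketch runnable.
        return object()


adapter = LazyClientAdapter("http://localhost:8000/v1")
assert adapter.client is adapter.client  # constructed once, then reused
```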
## Test Plan
Ran unit tests, including `test_remote_vllm.py`.
Ran the code snippet mentioned in #1968.
---------
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Now, tool outputs and retrieved chunks from the vector DB (i.e.,
everything except for the actual model reply) are hidden under an
expander form when presented to the user.
# Test Plan
Navigate to the RAG page in the Playground UI.
# What does this PR do?
The together inference provider was throwing a stack trace every time it
shut down, as it was trying to call a non-existent `close` method on the
AsyncTogether client. While fixing that, I also adjusted its shutdown
logic to close the OpenAI client if we've created one of those, as that
client does have a `close` method.
In testing that, I also realized we were defaulting to treating all
requests as streaming requests instead of defaulting to non-streaming.
So, this flips that default to non-streaming to match how the other
providers work.
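Roughly, the shutdown logic now looks like this (a sketch with illustrative attribute names, not the provider's exact code):
```python
import asyncio


class TogetherLikeProvider:
    """Sketch: only close the client that actually supports close()."""

    def __init__(self):
        self._openai_client = None  # created lazily elsewhere; may stay None

    async def shutdown(self):
        # The AsyncTogether client has no close() method, so nothing to do for
        # it; the OpenAI client does, so close it if we ever created one.
        if self._openai_client is not None:
            await self._openai_client.close()
            self._openai_client = None


asyncio.run(TogetherLikeProvider().shutdown())  # no-op when no client was created
```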
## Test Plan
I tested this by ensuring the together inference provider no longer
spits out a long stack trace when shutting it down and by running the
OpenAI API chat completion verification suite to ensure the change in
default streaming logic didn't mess anything else up.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This PR cleans up the sidebar on the tools page of the playground in the
following ways:
* created a clearer hierarchy of configuration options and tool
selections.
* Removed the `mcp::` or `builtin::` prefixes from the tool selection
buttons.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
Run the playground and see the updated sidebar does not cause any new
errors.
```
streamlit run llama_stack/distribution/ui/app.py
```
[//]: # (## Documentation)
Signed-off-by: Michael Clifford <mcliffor@redhat.com>
37da47ef8e (diff-4d7c51b1efe9043e44439a949dfd92e5827321b34082903477fd04876edb7552)
Pydantic was updated from v1 to v2 in this commit, which caused this
breaking change.
# What does this PR do?
Part of #1857
This won't fix the Validation error with the example, but it will
correctly supply user with a proper error rather than a 5xx code.
Signed-off-by: Kevin <kpostlet@redhat.com>
# What does this PR do?
We were passing a dict into the compat mixin for OpenAI Completions when
using Llama models with Fireworks, and that was breaking some strong
typing code that was added in openai_compat.py. We shouldn't have been
converting these params to a dict in that case anyway, so this adjusts
things to pass the params in as their actual original types when calling
the OpenAIChatCompletionToLlamaStackMixin.
## Test Plan
All of the fireworks provider verification tests were failing due to
some OpenAI compatibility cleanup in #1962. The changes in that PR were
good to make, and this just cleans up the fireworks provider code to
stop passing in untyped dicts to some of those `openai_compat.py`
methods since we have the original strongly-typed parameters we can pass
in.
```
llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml
```
```
python -m pytest -s -v tests/verifications/openai_api/test_chat_completion.py --provider=fireworks-llama-stack
```
Before this PR, all of the fireworks OpenAI verification tests were
failing. Now, most of them are passing.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
- Update NVIDIA documentation links to GA docs
- Remove reference to notebooks until merged
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
This is helpful when debugging issues with vLLM + Llama Stack after this
PR https://github.com/vllm-project/vllm/pull/15593
---------
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
NVIDIA Inference provider was using the ModelRegistryHelper to map input
model ids to provider model ids. This updates it to use the model_store.
## Test Plan
`LLAMA_STACK_CONFIG=http://localhost:8321 uv run pytest -v
tests/integration/inference/{test_embedding.py,test_text_inference.py,test_openai_completion.py}
--embedding-model nvidia/llama-3.2-nv-embedqa-1b-v2
--text-model=meta-llama/Llama-3.1-70B-Instruct`
# What does this PR do?
Fixes the UBI 9 container build failure (`error: command 'gcc' failed`
when installing `polyleven`, `faiss`, etc.) by installing the missing
compiler toolchain:
- `python3.11-devel`, `gcc`, and `make` added to the UBI 9 `dnf install` line.
### Closes#1970
## Test Plan
- Build a distro with an UBI image
# What does this PR do?
This was not caught as part of the CI build:
dd62a2388c.
[This PR](https://github.com/meta-llama/llama-stack/pull/1354) was too
old and didn't include the additional CI builds yet.
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
- Adds a note about unexpected Brave Search output appearing even when
Tavily Search is called. This behavior is expected for now and is a work
in progress https://github.com/meta-llama/llama-stack/issues/1229. The
note aims to clear any confusion for new users.
- Adds two example scripts demonstrating how to build an agent using:
1. WebSearch tool
2. WolframAlpha tool
These examples provide new users with an instant understanding of how to
integrate these tools.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
Tested these example scripts using following steps:
step 1. `ollama run llama3.2:3b-instruct-fp16 --keepalive 60m`
step 2.
```
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
export LLAMA_STACK_PORT=8321
```
step 3: `llama stack run --image-type conda
~/llama-stack/llama_stack/templates/ollama/run.yaml`
step 4: run the example script with your api keys.
expected output:


[//]: # (## Documentation)
Test plan:
python tests/verifications/generate_report.py --providers
fireworks,together,llama_meta_ref,openai
Co-authored-by: Eric Huang <erichuang@fb.com>
# What does this PR do?
Allow users to name an agent and use the name in telemetry instead of
relying on randomly generated agent_ids. This improves the developer
experience by making it easier to find specific agents in telemetry
logs.
Closes#1832
## Test Plan
- Added tests to verify the agent name is properly stored and retrieved
- Ran `uv run -- pytest -v
tests/integration/telemetry/test_telemetry.py::test_agent_name_filtering`
from the root of the project and made sure the tests pass
- Ran `uv run -- pytest -v
tests/integration/telemetry/test_telemetry.py::test_agent_query_spans`
to verify existing code without agent names still works correctly
## Use Example
```
agent = Agent(
llama_stack_client,
model=text_model_id,
name="CustomerSupportAgent", # New parameter
instructions="You are a helpful customer support assistant"
)
session_id = agent.create_session(f"test-session-{uuid4()}")
```
## Implementation Notes
- Agent names are optional string parameters with no additional
validation
- Names are not required to be unique - multiple agents can have the
same name
- The agent_id remains the unique identifier for an agent
---------
Co-authored-by: raghotham <raghotham@gmail.com>
# What does this PR do?
Some of our multi-turn verification tests were failing because I had
accidentally marked content as a required field in the OpenAI chat
completion request assistant messages, but it's actually optional. It is
required for messages from other roles, but assistant is explicitly
allowed to be optional.
Similarly, the assistant message tool_calls field should default to None
instead of an empty list.
These two changes get the openai-llama-stack verification test back to
100% passing, just like it passes 100% when not behind Llama Stack. They
also increase the pass rate of some of the other providers in the
verification test, but don't get them to 100%.
## Test Plan
I started a Llama Stack server setup to run all the verification tests
(requires OPENAI_API_KEY env variable)
```
llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml
```
Then, I manually ran the verification tests to see which were failing,
fix them, and ran them again after these changes to ensure they were all
passing.
```
python -m pytest -s -v tests/verifications/openai_api/test_chat_completion.py --provider=openai-llama-stack
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Add NVIDIA platform docs that serve as a starting point for Llama Stack
users and explains all supported microservices.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
---------
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
This PR handles the case where a Customization Job's status is
`unknown`. Since we don't map `unknown` to a valid `JobStatus`, the
PostTraining provider throws an exception when fetching/listing a job.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
`./scripts/unit-tests.sh
tests/unit/providers/nvidia/test_supervised_fine_tuning.py` succeeds
[//]: # (## Documentation)
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
Fixes a crash that occurred when building a stack as a container image
via the interactive wizard without supplying --template or --config.
- Root cause: template_or_config was None; only the container path
relies on that parameter, which later reaches subprocess.run() and
triggers
`TypeError: expected str, bytes or os.PathLike object, not NoneType.`
- Change: in `_run_stack_build_command_from_build_config` we now fall
back to the freshly‑written build‑spec file whenever both optional
sources are missing. Also adds a spy‑based unit test that asserts a
valid string path is passed to build_image() for container builds.
### Closes#1976
## Test Plan
- New unit test: test_build_path.py. Monkey‑patches build_image,
captures the fourth argument, and verifies it is a real path
- Manual smoke test:
```
llama stack build --image-type container
# answer wizard prompts
```
Build proceeds into Docker without raising the previous TypeError.
## Future Work
Harmonise `build_image` arguments so every image type receives the same
inputs, eliminating this asymmetric special‑case.
# What does this PR do?
Build failures are hard to read; sometimes we get errors like:
```
Error building stack: 'key'
```
Which are difficult to debug without a proper trace.
## Test Plan
If `llama stack build` fails you get a traceback now.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This PR lets users select an existing vector DB to use with their agent on the
tools page of the playground. The drop-down menu that lets users select
a vector DB only appears when the RAG tool is selected. Without this change,
there is no way for a user to specify which vector DB they want their RAG tool
to use on the tools page. I have intentionally left the RAG options
sparse here since the full RAG options are exposed on the RAG page.
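A rough Streamlit sketch of the wiring (the toolgroup id and widget labels are assumptions, and `client.vector_dbs.list()` is the llama-stack-client call I would expect here; adjust to the actual page code):
```python
import streamlit as st
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
selected_tools = st.multiselect("Tools", ["builtin::rag", "builtin::websearch"])

if "builtin::rag" in selected_tools:
    # Only show the vector DB selector when the RAG tool is enabled.
    vector_dbs = client.vector_dbs.list()
    selected_vector_db = st.selectbox(
        "Select Vector DB", [vdb.identifier for vdb in vector_dbs]
    )
```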
## Test Plan
Without these changes the RAG tool will throw the following error:
`name: knowledge_search) does not have any content `
With these changes the RAG tool works as expected.
Signed-off-by: Michael Clifford <mcliffor@redhat.com>
# What does this PR do?
Adds `meta/llama-3.2-1b-instruct` to list of models that NeMo Customizer
can fine-tune. This is the model our example notebooks typically use for
fine-tuning.
## Test Plan
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
This PR persists the theme preference across page navigation.
Currently, if the default theme is detected, it is used.
But if a user flips **_the default theme_** and goes to a new page, the
theme will switch back to the default.
This resolves that issue.
## Test Plan
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Fixes: #1955
Since 0.2.0, vLLM gets an empty list (vs `None` in 0.1.9 and
before) when there are no tools configured, which causes the issue
described in #1955. This patch avoids sending the 'tools' param to
vLLM altogether instead of sending an empty list.
It also adds a small unit test to avoid regressions.
The OpenAI
[specification](https://platform.openai.com/docs/api-reference/chat/create)
does not explicitly state that the list cannot be empty, but I found this
out through experimentation and it might depend on the actual remote
vLLM. In any case, as this parameter is Optional, it's best to skip it
altogether if there are no tools configured.
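The gist of the fix, as a hedged sketch (parameter assembly is simplified relative to the actual provider code):
```python
def build_chat_request(model: str, messages: list[dict], tools: list[dict] | None) -> dict:
    params: dict = {"model": model, "messages": messages}
    if tools:  # skip both None and the empty list
        params["tools"] = tools
    return params
```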
Signed-off-by: Daniel Alvarez <dalvarez@redhat.com>
# What does this PR do?
This PR adds a `max_tokens` slider to playground tools page. I have
found that in some instances the llama stack server throws a 500 error
if the max_tokens value is not explicitly set in the agent's
`sampling_params`. This PR uses the same implementation of the
`max_tokens` slider from the chat page and includes it on the tools
page.
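A minimal sketch of the slider wiring (labels and ranges are illustrative):
```python
import streamlit as st

max_tokens = st.slider("Max Tokens", min_value=1, max_value=4096, value=512)

sampling_params = {
    "strategy": {"type": "greedy"},
    "max_tokens": max_tokens,  # set explicitly so the server never sees it unset
}
```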
## Test Plan
1. Attempting to call a tool without these changes results in a `500:
Internal server error: An unexpected error occurred`.
2. Attempting to call a tool with these changes results in the expected
output.
Signed-off-by: Michael Clifford <mcliffor@redhat.com>
# What does this PR do?
This PR adds instructions to set up a vLLM remote endpoint for the
vllm-remote Llama Stack distribution.
## Test Plan
* Verified with manual tests of the configured vllm-remote against vllm
endpoint running on the system with Intel GPU
* Also verified with ci pytests (see cmdline below). Test passes in the
same capacity as it does on the A10 Nvidia setup (some tests do fail
which seems to be known issues with vllm remote llama stack
distribution)
```
pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config=http://localhost:5001 \
--text-model=meta-llama/Llama-3.2-3B-Instruct
```
CC: @ashwinb
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
# What does this PR do?
Allow users to specify only the providers they want in the `llama stack
build` command. If a user wants a non-interactive build but doesn't want
to use a template, `--providers` lets them specify something
like `--providers inference=remote::ollama` for a distro with just
Ollama.
## Test Plan
`llama stack build --providers inference=remote::ollama --image-type
venv`
<img width="1084" alt="Screenshot 2025-03-20 at 9 34 14 AM"
src="https://github.com/user-attachments/assets/502b5fa2-edab-4267-a595-4f987204a6a9"
/>
`llama stack run --image-type venv
/Users/charliedoern/projects/Documents/llama-stack/venv-run.yaml`
<img width="1149" alt="Screenshot 2025-03-20 at 9 35 19 AM"
src="https://github.com/user-attachments/assets/433765f3-6b7f-4383-9241-dad085b69228"
/>
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
## Test Plan
(myenv) ➜ llama-stack python tests/verifications/generate_report.py
--providers fireworks,together,openai --run-tests
f27f617629/tests/verifications/REPORT.md
## What does this PR do?
This PR improves the server's request routing logic by ensuring built-in
FastAPI paths such as `/docs`, `/redoc`, `/openapi.json`,
`/favicon.ico`, and `/static` bypass the custom `TracingMiddleware`.
This prevents unnecessary tracing logic for documentation and static
file requests, ensuring better performance and cleaner logs.
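Conceptually, the bypass looks something like this (a sketch of an ASGI middleware, not the exact server code; the path list mirrors the routes named above):
```python
FASTAPI_BUILTIN_PATHS = ("/docs", "/redoc", "/openapi.json", "/favicon.ico", "/static")

class TracingMiddleware:
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        path = scope.get("path", "")
        if scope["type"] != "http" or path.startswith(FASTAPI_BUILTIN_PATHS):
            # Documentation and static asset requests skip tracing entirely.
            return await self.app(scope, receive, send)
        # ... start a trace span, attach x-trace-id, then forward the request ...
        return await self.app(scope, receive, send)
```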
Additionally, it adds proper metadata (`title`, `description`, and
`version`) to the FastAPI application initialization and updates the
requirements document accordingly.
[//]: # (Closes#1822 )
---
## Test Plan
- Ran the server locally with `uvicorn` using the provided `run.yaml`
config
- Verified that:
- FastAPI docs (`/docs`, `/redoc`) load correctly without triggering the
custom tracing middleware
- All other routes still go through the middleware and trace logic
- Application metadata appears as expected in the OpenAPI docs
To reproduce:
1. Start the server with `python server.py --template <template-name>`
2. Navigate to `/docs` and `/redoc`
3. Confirm that no extra trace headers are added for those routes
4. Confirm other API endpoints behave as expected and include
`x-trace-id` in the response headers
---
Froze the requirements file to include many of the other libraries that
have been added in the past few releases to make install easier.
---------
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Ollama's CLI supports running models via commands such as `ollama run
llama3.2`. This syntax does not work with the INFERENCE_MODEL llamastack
var, as specifying a tag such as 'latest' is currently required.
This commit checks whether the 'latest' model is available and uses
that model if a user passes a model name without a tag but the 'latest' tag
is available in Ollama.
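The fallback logic, sketched (helper name and structure are illustrative):
```python
def resolve_ollama_model(requested: str, available: set[str]) -> str:
    if requested in available:
        return requested
    if ":" not in requested and f"{requested}:latest" in available:
        # e.g. "llama3.2" resolves to "llama3.2:latest"
        return f"{requested}:latest"
    raise ValueError(
        f"Model '{requested}' is not available in Ollama. "
        f"Available models: {', '.join(sorted(available))}"
    )
```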
## Test Plan
Behavior pre-code change
```bash
$ INFERENCE_MODEL=llama3.2 llama stack build --template ollama --image-type venv --run
...
INFO 2025-04-08 13:42:42,842 llama_stack.providers.remote.inference.ollama.ollama:80 inference: checking
connectivity to Ollama at `http://beanlab1.bss.redhat.com:11434`...
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/server/server.py", line 502, in <module>
main()
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/server/server.py", line 401, in main
impls = asyncio.run(construct_stack(config))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/asyncio/base_events.py", line 691, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/stack.py", line 222, in construct_stack
await register_resources(run_config, impls)
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/stack.py", line 99, in register_resources
await method(**obj.model_dump())
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
result = await method(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 294, in register_model
registered_model = await self.register_object(model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 228, in register_object
registered_obj = await register_object_with_provider(obj, p)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 77, in register_object_with_provider
return await p.register_model(obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
result = await method(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/providers/remote/inference/ollama/ollama.py", line 315, in register_model
raise ValueError(
ValueError: Model 'llama3.2' is not available in Ollama. Available models: llama3.2:latest
++ error_handler 108
++ echo 'Error occurred in script at line: 108'
Error occurred in script at line: 108
++ exit 1
```
Behavior post-code change
```bash
$ INFERENCE_MODEL=llama3.2 llama stack build --template ollama --image-type venv --run
...
INFO 2025-04-08 13:58:17,365 llama_stack.providers.remote.inference.ollama.ollama:80 inference: checking
connectivity to Ollama at `http://beanlab1.bss.redhat.com:11434`...
WARNING 2025-04-08 13:58:18,190 llama_stack.providers.remote.inference.ollama.ollama:317 inference: Imprecise provider
resource id was used but 'latest' is available in Ollama - using 'llama3.2:latest'
INFO 2025-04-08 13:58:18,191 llama_stack.providers.remote.inference.ollama.ollama:308 inference: Pulling embedding
model `all-minilm:latest` if necessary...
INFO 2025-04-08 13:58:18,799 __main__:478 server: Listening on ['::', '0.0.0.0']:8321
INFO: Started server process [28378]
INFO: Waiting for application startup.
INFO 2025-04-08 13:58:18,803 __main__:148 server: Starting up
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
...
```
## Documentation
Did not document this anywhere but happy to do so if there is an
appropriate place
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
Now a separate thread is started to execute training jobs. Training
requests now return job ID before the job completes. (Which fixes API
timeouts for any jobs that take longer than a minute.)
Note: the scheduler code is meant to be spun out in the future into a
common provider service that can be reused for different APIs and
providers. It is also expected to back the /jobs API proposed here:
https://github.com/meta-llama/llama-stack/discussions/1238
Hence its somewhat generalized form which is expected to simplify its
adoption elsewhere in the future.
Note: this patch doesn't attempt to implement missing APIs (e.g. cancel
or job removal). This work will belong to follow-up PRs.
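In spirit, the scheduler does something like the following (a stripped-down sketch; the real module is more general than this):
```python
import threading
import uuid

class Scheduler:
    def __init__(self):
        self.jobs: dict[str, str] = {}  # job_uuid -> status

    def schedule(self, job_fn) -> str:
        job_uuid = f"job-{uuid.uuid4()}"
        self.jobs[job_uuid] = "scheduled"

        def _run():
            self.jobs[job_uuid] = "in_progress"
            try:
                job_fn()
                self.jobs[job_uuid] = "completed"
            except Exception:
                self.jobs[job_uuid] = "failed"

        threading.Thread(target=_run, daemon=True).start()
        return job_uuid  # handed back to the caller before the job finishes
```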
## Test Plan
Added unit tests for the scheduler module. For the API coverage, did
manual testing and was able to run a training cycle on GPU. The initial
call returned job ID before the training completed, as (now) expected.
Artifacts are returned as expected.
```
JobArtifactsResponse(checkpoints=[{'identifier': 'meta-llama/Llama-3.2-3B-Instruct-sft-0', 'created_at': '2025-03-07T22:45:19.892714', 'epoch': 0, 'post_training_job_id': 'test-job2ee77104-2fd3-4a4e-84cf-f83f8b8f1f50', 'path': '/home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0', 'training_metrics': None}], job_uuid='test-job2ee77104-2fd3-4a4e-84cf-f83f8b8f1f50')
```
The integration test is currently disabled for the provider. I will look
into how it can be enabled in a different PR / issue context.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
TLDR: Changes needed to get 100% passing tests for OpenAI API
verification tests when run against Llama Stack with the `together`,
`fireworks`, and `openai` providers. And `groq` is better than before,
at 88% passing.
This cleans up the OpenAI API support for image message types
(specifically `image_url` types) and handling of the `response_format`
chat completion parameter. Both of these required a few more Pydantic
model definitions in our Inference API, just to move from the
not-quite-right stubs I had in place to something fleshed out to match
the actual OpenAI API specs.
As part of testing this, I also found and fixed a bug in the litellm
implementation of openai_completion and openai_chat_completion, so the
providers based on those should actually be working now.
The method `prepare_openai_completion_params` in
`llama_stack/providers/utils/inference/openai_compat.py` was improved to
actually recursively clean up input parameters, including handling of
lists, dicts, and dumping of Pydantic models to dicts. These changes
were required to get to 100% passing tests on the OpenAI API
verification against the `openai` provider.
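The recursive clean-up is roughly this (a simplified sketch of the idea behind `prepare_openai_completion_params`, not the exact implementation):
```python
from pydantic import BaseModel

def _clean(value):
    if isinstance(value, BaseModel):
        return _clean(value.model_dump(exclude_none=True))
    if isinstance(value, dict):
        return {k: _clean(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [_clean(v) for v in value if v is not None]
    return value

def prepare_params(**kwargs):
    # Drop None values and dump Pydantic models to plain dicts, recursively.
    return {k: _clean(v) for k, v in kwargs.items() if v is not None}
```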
With the above, the together.ai provider was passing as well as it is
without Llama Stack. But, since we have Llama Stack in the middle, I
took the opportunity to clean up the together.ai provider so that it now
also passes the OpenAI API spec tests we have at 100%. That means
together.ai is now passing our verification test better when using an
OpenAI client talking to Llama Stack than it is when hitting together.ai
directly, without Llama Stack in the middle.
And, another round of work for Fireworks to improve translation of
incoming OpenAI chat completion requests to Llama Stack chat completion
requests gets the fireworks provider passing at 100%. The server-side
fireworks.ai tool calling support with OpenAI chat completions and Llama
4 models isn't great yet, but by pointing the OpenAI clients at Llama
Stack's API we can clean things up and get everything working as
expected for Llama 4 models.
## Test Plan
### OpenAI API Verification Tests
I ran the OpenAI API verification tests as below and 100% of the tests
passed.
First, start a Llama Stack server that runs the `openai` provider with
the `gpt-4o` and `gpt-4o-mini` models deployed. There's not a template
setup to do this out of the box, so I added a
`tests/verifications/openai-api-verification-run.yaml` to do this.
First, ensure you have the necessary API key environment variables set:
```
export TOGETHER_API_KEY="..."
export FIREWORKS_API_KEY="..."
export OPENAI_API_KEY="..."
```
Then, run a Llama Stack server that serves up all these providers:
```
llama stack run \
--image-type venv \
tests/verifications/openai-api-verification-run.yaml
```
Finally, generate a new verification report against all these providers,
both with and without the Llama Stack server in the middle.
```
python tests/verifications/generate_report.py \
--run-tests \
--provider \
together \
fireworks \
groq \
openai \
together-llama-stack \
fireworks-llama-stack \
groq-llama-stack \
openai-llama-stack
```
You'll see that most of the configurations with Llama Stack in the
middle now pass at 100%, even though some of them do not pass at 100%
when hitting the backend provider's API directly with an OpenAI client.
### OpenAI Completion Integration Tests with vLLM:
I also ran the smaller `test_openai_completion.py` test suite (that's
not yet merged with the verification tests) on multiple of the
providers, since I had to adjust the method signature of
openai_chat_completion a bit and thus had to touch lots of these
providers to match. Here's the tests I ran there, all passing:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```
### OpenAI Completion Integration Tests with ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```
### OpenAI Completion Integration Tests with together.ai
```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" llama stack build --template together --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct-Turbo"
```
### OpenAI Completion Integration Tests with fireworks.ai
```
INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" llama stack build --template fireworks --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.1-8B-Instruct"
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from
5.4.0 to 5.4.1.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/astral-sh/setup-uv/releases">astral-sh/setup-uv's
releases</a>.</em></p>
<blockquote>
<h2>v5.4.1 🌈 Add support for pep440 version specifiers</h2>
<h2>Changes</h2>
<p>With this release you can also use <a
href="https://peps.python.org/pep-0440/#version-specifiers">pep440
version specifiers</a> as <code>required-version</code> in
files <code>uv.toml</code>, <code>pyproject.toml</code> and in the
<code>version</code> input:</p>
<pre lang="yaml"><code>- name: Install a pep440-specifier-satisfying
version of uv
uses: astral-sh/setup-uv@v5
with:
version: ">=0.4.25,<0.5"
</code></pre>
<h2>🐛 Bug fixes</h2>
<ul>
<li>Add support for pep440 version identifiers <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/353">#353</a>)</li>
</ul>
<h2>🧰 Maintenance</h2>
<ul>
<li>chore: update known checksums for 0.6.10 @<a
href="https://github.com/apps/github-actions">github-actions[bot]</a>
(<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/345">#345</a>)</li>
</ul>
<h2>📚 Documentation</h2>
<ul>
<li>Add pep440 to docs header <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/355">#355</a>)</li>
<li>Fix glob syntax link <a
href="https://github.com/flying-sheep"><code>@flying-sheep</code></a>
(<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/349">#349</a>)</li>
<li>Add link to supported glob patterns <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/348">#348</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="0c5e2b8115"><code>0c5e2b8</code></a>
Add pep440 to docs header (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/355">#355</a>)</li>
<li><a
href="794ea9455c"><code>794ea94</code></a>
Add support for pep440 version identifiers (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/353">#353</a>)</li>
<li><a
href="2d49baf2b6"><code>2d49baf</code></a>
chore: update known checksums for 0.6.10 (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/345">#345</a>)</li>
<li><a
href="4fa25599ce"><code>4fa2559</code></a>
Fix glob syntax link (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/349">#349</a>)</li>
<li><a
href="224dce1d79"><code>224dce1</code></a>
Add link to supported glob patterns (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/348">#348</a>)</li>
<li>See full diff in <a
href="22695119d7...0c5e2b8115">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
Currently the instructions for Llama 4 take quite some space before
people can see the overview and other sections about Llama Stack. Moving
this to a collapsed section would make it less verbose.
workflow -
0. Checkout
1. Install uv
2. Install Ollama
3. Pull Ollama image
4. Start Ollama in background
5. Set Up Environment and Install Dependencies
6. Wait for Ollama to start
7. Start Llama Stack server in background
8. Wait for Llama Stack server to be ready
9. Run Integration Tests
changes -
(4) starts the loading of the Ollama model; it does not start Ollama.
The model will be loaded when used, so this step is removed.
(6) is handled in (2), so this step is removed.
(2) is renamed to reflect its dual purpose.
* feat: function tools in OpenAI Responses by @bbrowning in https://github.com/meta-llama/llama-stack/pull/2094, getting closer to ready. Streaming is the next missing piece.
* feat: Adding support for customizing chunk context in RAG insertion and querying by @franciscojavierarceo in https://github.com/meta-llama/llama-stack/pull/2134
* feat: scaffolding for Llama Stack UI by @ehhuang in https://github.com/meta-llama/llama-stack/pull/2149, more to come in the coming releases.
---
# v0.2.6
Published on: 2025-05-12T18:06:52Z
---
# v0.2.5
Published on: 2025-05-04T20:16:49Z
---
# v0.2.4
Published on: 2025-04-29T17:26:01Z
## Highlights
* One-liner to install and run Llama Stack yay! by @reluctantfuturist in https://github.com/meta-llama/llama-stack/pull/1383
* support for NVIDIA NeMo datastore by @raspawar in https://github.com/meta-llama/llama-stack/pull/1852
* (yuge!) Kubernetes authentication by @leseb in https://github.com/meta-llama/llama-stack/pull/1778
* (yuge!) OpenAI Responses API by @bbrowning in https://github.com/meta-llama/llama-stack/pull/1989
* add api.llama provider, llama-guard-4 model by @ashwinb in https://github.com/meta-llama/llama-stack/pull/2058
---
# v0.2.3
Published on: 2025-04-25T22:46:21Z
## Highlights
* OpenAI compatible inference endpoints and client-SDK support. `client.chat.completions.create()` now works.
* significant improvements and functionality added to the NVIDIA distribution
* many improvements to the test verification suite.
* new inference providers: Ramalama, IBM WatsonX
* many improvements to the Playground UI
---
# v0.2.2
Published on: 2025-04-13T01:19:49Z
## Main changes
- Bring Your Own Provider (@leseb) - use out-of-tree provider code to execute the distribution server
- OpenAI compatible inference API in progress (@bbrowning)
> [!CAUTION]
> Before pushing your changes, make sure that the pre-commit hooks have passed successfully.
## Running tests
You can find the Llama Stack testing documentation [here](tests/README.md).
## Running unit tests
You can run the unit tests by running:
```bash
source .venv/bin/activate
./scripts/unit-tests.sh
```
If you'd like to run for a non-default version of Python (currently 3.10), pass the `PYTHON_VERSION` variable as follows:
```
source .venv/bin/activate
PYTHON_VERSION=3.13 ./scripts/unit-tests.sh
```
## Running integration tests
You can run integration tests following the instructions [here](tests/integration/README.md).
## Adding a new dependency to the project
## Coding Style
* Comments should provide meaningful insights into the code. Avoid filler comments that simply describe the next step, as they create unnecessary clutter, same goes for docstrings.
* Prefer comments to clarify surprising behavior and/or relationships between parts of the code rather than explain what the next line of code does.
* When catching exceptions, prefer using a specific exception type rather than a broad catch-all like `Exception`.
* Error messages should be prefixed with "Failed to ..."
* 4 spaces for indentation rather than tabs
* When using `# noqa` to suppress a style or linter warning, include a comment explaining the justification for bypassing the check.
* When using `# type: ignore` to suppress a mypy warning, include a comment explaining the justification for bypassing the check.
* Don't use unicode characters in the codebase. ASCII-only is preferred for compatibility or readability reasons.
## Common Tasks
If you are making changes to the documentation at [https://llama-stack.readthedocs.io/en/latest/](https://llama-stack.readthedocs.io/en/latest/), you can use the following command to build the documentation and preview your changes. You will need [Sphinx](https://www.sphinx-doc.org/en/master/) and the readthedocs theme.
```bash
# This rebuilds the documentation pages.
uv run --group docs make -C docs/ html
# This will start a local server (usually at http://127.0.0.1:8000) that automatically rebuilds and refreshes when you make changes to the documentation.
uv run --group docs sphinx-autobuild docs/source docs/build/html --write-all
```
### Update API Documentation
If you modify or add new API endpoints, update the API documentation accordingly. You can do this by running the following command:
```bash
uv run ./docs/openapi_generator/run_openapi_generator.sh
```
The generated API documentation will be available in `docs/_static/`. Make sure to review the changes before committing.
We released [Version 0.2.0](https://github.com/meta-llama/llama-stack/releases/tag/v0.2.0) with support for the Llama 4 herd of models released by Meta.
You can now run Llama 4 models on Llama Stack.
<details>
<summary>👋 Click here to see how to run Llama 4 models on Llama Stack </summary>
\
*Note you need 8xH100 GPU-host to run these models*
As more providers start supporting Llama 4, you can use them in Llama Stack as well. We are adding to the list. Stay tuned!
</details>
### 🚀 One-Line Installer 🚀
To try Llama Stack locally, run:
```bash
curl -LsSf https://github.com/meta-llama/llama-stack/raw/main/install.sh | sh
```
### Overview
Llama Stack standardizes the core building blocks that simplify AI application development. It codifies best practices across the Llama ecosystem. More specifically, it provides
### API Providers
Here is a list of the various API providers and available distributions that can help developers get started easily with Llama Stack.
Here's a collection of comprehensive guides, examples, and resources for building AI applications with Llama Stack. For the complete documentation, visit our [ReadTheDocs page](https://llama-stack.readthedocs.io/en/latest/index.html).
## Render locally
From the llama-stack root directory, run the following command to render the docs locally:
```bash
uv run --group docs sphinx-autobuild docs/source docs/build/html --write-all
```
You can open up the docs in your browser at http://localhost:8000
For more strongly typed interaction, use the typed dicts found [here](https://github.com/meta-llama/llama-stack-client-python/blob/38cd91c9e396f2be0bec1ee96a19771582ba6f17/src/llama_stack_client/types/shared_params/document.py).
The tool requires an API key which can be provided either in the configuration or through the request header `X-LlamaStack-Provider-Data`. The format of the header is `{"<provider_name>_api_key": <your api key>}`.
> **NOTE:** When using Tavily Search and Bing Search, the inference output will still display "Brave Search." This is because Llama models have been trained with Brave Search as a built-in tool. Tavily and Bing are just being used in lieu of Brave Search.
#### Code Interpreter
The Code Interpreter allows execution of Python code within a controlled environment.
- Secure execution environment using `bwrap` sandboxing
- Matplotlib support for generating plots
- Disabled dangerous system operations
- Configurable execution timeouts
> ⚠️ Important: The code interpreter tool can operate in a controlled environment locally or on Podman containers. To ensure proper functionality in containerized environments:
> - The container requires privileged access (e.g., --privileged).
> - Users without sufficient permissions may encounter permission errors. (`bwrap: Can't mount devpts on /newroot/dev/pts: Permission denied`)
> - 🔒 Security Warning: Privileged mode grants elevated access and bypasses security restrictions. Use only in local, isolated, or controlled environments.
#### WolframAlpha
The WolframAlpha tool provides access to computational knowledge through the WolframAlpha API.
Features:
- Context retrieval with token limits
> **Note:** By default, llama stack run.yaml defines toolgroups for web search, wolfram alpha and rag, that are provided by tavily-search, wolfram-alpha and rag providers.
This guide will walk you through the process of adding a new API provider to Llama Stack.
- Begin by reviewing the [core concepts](../concepts/index.md) of Llama Stack and choose the API your provider belongs to (Inference, Safety, VectorIO, etc.)
- Determine the provider type ({repopath}`Remote::llama_stack/providers/remote` or {repopath}`Inline::llama_stack/providers/inline`). Remote providers make requests to external services, while inline providers execute implementation locally.
- Add your provider to the appropriate {repopath}`Registry::llama_stack/providers/registry/`. Specify pip dependencies necessary.
- Update any distribution {repopath}`Templates::llama_stack/templates/` `build.yaml` and `run.yaml` files if they should include your provider by default. Run {repopath}`./scripts/distro_codegen.py` if necessary. Note that `distro_codegen.py` will fail if the new provider causes any distribution template to attempt to import provider-specific dependencies. This usually means the distribution's `get_distribution_template()` code path should only import any necessary Config or model alias definitions from each provider and not the provider's actual implementation.
Here are some example PRs to help you get started:
Unit tests are located in {repopath}`tests/unit`. Provider-specific unit tests are located in {repopath}`tests/unit/providers`. These tests are all run automatically as part of the CI process.
Consult {repopath}`tests/unit/README.md` for more details on how to run the tests manually.
Llama Stack supports external providers that live outside of the main codebase. This allows you to create and maintain your own providers independently or use community-provided providers.
To build a distribution with external providers, you need to:
1. Configure the `external_providers_dir` in your build configuration file:
```yaml
# Example my-external-stack.yaml with external providers
version: '2'
distribution_spec:
  description: Custom distro for CI tests
  providers:
    inference:
    - remote::custom_ollama
  # Add more providers as needed
image_type: container
image_name: ci-test
# Path to external provider implementations
external_providers_dir: ~/.llama/providers.d
```
Here's an example for a custom Ollama provider:
```yaml
adapter:
  adapter_type: custom_ollama
  pip_packages:
  - ollama
  - aiohttp
  - llama-stack-provider-ollama # This is the provider package
```
The `pip_packages` section lists the Python packages required by the provider, as well as the
provider package itself. The package must be available on PyPI or can be provided from a local
directory or a git repository (git must be installed on the build environment).
2. Build your distribution using the config file:
```
llama stack build --config my-external-stack.yaml
```
For more information on external providers, including directory structure, provider types, and implementation requirements, see the [External Providers documentation](../providers/external.md).
:::
:::{tab-item} Building Container
```{admonition} Podman Alternative
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
INFO: 2401:db00:35c:2d2b:face:0:c9:0:54678 - "GET /models/list HTTP/1.1" 200 OK
```
### Listing Distributions
Using the list command, you can view all existing Llama Stack distributions, including stacks built from templates, from scratch, or using custom configuration files.
```
llama stack list -h
usage: llama stack list [-h]
list the build stacks
options:
-h, --help show this help message and exit
```
Example Usage
```
llama stack list
```
### Removing a Distribution
Use the remove command to delete a distribution you've previously built.
```
llama stack rm -h
usage: llama stack rm [-h] [--all] [name]
Remove the build stack
positional arguments:
name Name of the stack to delete (default: None)
options:
-h, --help show this help message and exit
--all, -a Delete all stacks (use with caution) (default: False)
```
Example
```
llama stack rm llamastack-test
```
To keep your environment organized and avoid clutter, consider using `llama stack list` to review old or unused distributions and `llama stack rm <name>` to delete them when they’re no longer needed.
Let's break this down into the different sections. The first section specifies the set of APIs that the stack server will serve:
What's with the `provider_model_id` field? This is an identifier for the model inside the provider's model catalog. Contrast it with `model_id` which is the identifier for the same model for Llama Stack's purposes. For example, you may want to name "llama3.2:vision-11b" as "image_captioning_model" when you use it in your Stack interactions. When omitted, the server will set `provider_model_id` to be the same as `model_id`.
## Server Configuration
The `server` section configures the HTTP server that serves the Llama Stack APIs:
```yaml
server:
port: 8321 # Port to listen on (default: 8321)
tls_certfile: "/path/to/cert.pem" # Optional: Path to TLS certificate for HTTPS
tls_keyfile: "/path/to/key.pem" # Optional: Path to TLS key for HTTPS
```
### Authentication Configuration
The `auth` section configures authentication for the server. When configured, all API requests must include a valid Bearer token in the Authorization header:
```
Authorization: Bearer <token>
```
The server supports multiple authentication providers:
#### OAuth 2.0/OpenID Connect Provider with Kubernetes
The Kubernetes cluster must be configured to use a service account for authentication.
This will download jar files in your gradle cache in a directory like `~/.gradle/caches/modules-2/files-2.1/com.llama.llamastack/`
Include the ExecuTorch library by:
1. Download the `download-prebuilt-et-lib.sh` script file from the [llama-stack-client-kotlin-client-local](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release/llama-stack-client-kotlin-client-local/download-prebuilt-et-lib.sh) directory to your local machine.
2. Move the script to the top level of your Android app where the `app` directory resides.
3. Run `sh download-prebuilt-et-lib.sh` to create an `app/libs` directory and download the `executorch.aar` in that path. This generates an ExecuTorch library for the XNNPACK delegate.
4. Add the `executorch.aar` dependency in your `build.gradle.kts` file:
```
dependencies {
  ...
}
```
See other dependencies for the local RAG in Android app [README](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release/examples/android_app#quick-start).
## Llama Stack APIs in Your Android App
Breaking down the demo app, this section will show the core pieces that are used to initialize and run inference with Llama Stack using the Kotlin library.
Make sure you have access to a watsonx API Key. You can get one by referring [watsonx.ai](https://www.ibm.com/docs/en/masv-and-l/maximo-manage/continuous-delivery?topic=setup-create-watsonx-api-key).
## Running Llama Stack with watsonx
You can do this via Conda (build code), venv or Docker which has a pre-built image.
### Via Docker
This method allows you to get started quickly without having to build the distribution code.
The only difference vs. the `meta-reference-gpu` distribution is that it has support for more efficient inference -- with fp8, int4 quantization, etc.
Note that you need access to NVIDIA GPUs to run this distribution. This distribution is not compatible with CPU-only machines or machines with AMD GPUs.
### Environment Variables
The following environment variables can be configured:
- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
- `INFERENCE_MODEL`: Inference model loaded into the Meta Reference server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `INFERENCE_CHECKPOINT_DIR`: Directory containing the Meta Reference model checkpoint (default: `null`)
## Prerequisite: Downloading Models
Please use `llama model list --downloaded` to check that you have llama model checkpoints downloaded in `~/.llama` before proceeding. See [installation guide](https://llama-stack.readthedocs.io/en/latest/references/llama_cli_reference/download_models.html) here to download the models. Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
Make sure you have access to an NVIDIA API Key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). Use this key for the `NVIDIA_API_KEY` environment variable.
### Deploy NeMo Microservices Platform
The NVIDIA NeMo microservices platform supports end-to-end microservice deployment of a complete AI flywheel on your Kubernetes cluster through the NeMo Microservices Helm Chart. Please reference the [NVIDIA NeMo Microservices documentation](https://docs.nvidia.com/nemo/microservices/latest/about/index.html) for platform prerequisites and instructions to install and deploy the platform.
## Supported Services
Each Llama Stack API corresponds to a specific NeMo microservice. The core microservices (Customizer, Evaluator, Guardrails) are exposed by the same endpoint. The platform components (Data Store) are each exposed by separate endpoints.
### Inference: NVIDIA NIM
NVIDIA NIM is used for running inference with registered models. There are two ways to access NVIDIA NIMs:
1. Hosted (default): Preview APIs hosted at https://integrate.api.nvidia.com (Requires an API key)
2. Self-hosted: NVIDIA NIMs that run on your own infrastructure.
The deployed platform includes the NIM Proxy microservice, which is the service that provides access to your NIMs (for example, to run inference on a model). Set the `NVIDIA_BASE_URL` environment variable to use your NVIDIA NIM Proxy deployment.
### Datasetio API: NeMo Data Store
The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposes APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use the client to interact with Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint.
See the [NVIDIA Datasetio docs](/llama_stack/providers/remote/datasetio/nvidia/README.md) for supported features and example usage.
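For example, since the Data Store speaks the Hugging Face Hub API, the standard `HfApi` client can talk to it directly (the repo and file names below are placeholders, not required values):
```python
import os
from huggingface_hub import HfApi

hf_api = HfApi(endpoint=os.environ["NVIDIA_DATASETS_URL"], token="")
hf_api.upload_file(
    path_or_fileobj="training_data.jsonl",
    path_in_repo="training/training_data.jsonl",
    repo_id="default/sample-dataset",  # placeholder dataset repo
    repo_type="dataset",
)
```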
### Eval API: NeMo Evaluator
The NeMo Evaluator microservice supports evaluation of LLMs. Launching an Evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint.
See the [NVIDIA Eval docs](/llama_stack/providers/remote/eval/nvidia/README.md) for supported features and example usage.
### Post-Training API: NeMo Customizer
The NeMo Customizer microservice supports fine-tuning models. You can reference [this list of supported models](/llama_stack/providers/remote/post_training/nvidia/models.py) that can be fine-tuned using Llama Stack. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint.
See the [NVIDIA Post-Training docs](/llama_stack/providers/remote/post_training/nvidia/README.md) for supported features and example usage.
### Safety API: NeMo Guardrails
The NeMo Guardrails microservice sits between your application and the LLM, and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint.
See the NVIDIA Safety docs for supported features and example usage.
## Deploying models
In order to use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`.
Note: For improved inference speeds, we need to use NIM with `fast_outlines` guided decoding system (specified in the request body). This is the default if you deployed the platform with the NeMo Microservices Helm Chart.
This NIM deployment should take approximately 10 minutes to go live. [See the docs](https://docs.nvidia.com/nemo/microservices/latest/get-started/tutorials/deploy-nims.html) for more information on how to deploy a NIM and verify it's available for inference.
You can also remove a deployed NIM to free up GPU resources, if needed.
## Setting up vLLM server
In the following sections, we'll use AMD, NVIDIA or Intel GPUs to serve as hardware accelerators for the vLLM
server, which acts as both the LLM inference provider and the safety provider. Note that vLLM also
[supports many other hardware accelerators](https://docs.vllm.ai/en/latest/getting_started/installation.html) and
that we only use GPUs here for demonstration purposes. Note that if you run into issues, you can include the environment variable `--env VLLM_DEBUG_LOG_API_SERVER_RESPONSE=true` (available in vLLM v0.8.3 and above) in the `docker run` command to enable log response from API server for debugging.
### Setting up vLLM server on AMD GPU
### Setting up vLLM server on Intel GPU
Refer to [vLLM Documentation for XPU](https://docs.vllm.ai/en/v0.8.2/getting_started/installation/gpu.html?device=xpu) to get a vLLM endpoint. In addition to the vLLM-side setup, which guides you through installing vLLM from source or self-building a vLLM Docker container, Intel provides a prebuilt vLLM container to use on systems with Intel GPUs supported by the PyTorch XPU backend:
If you are using Llama Stack Safety / Shield APIs, then you will need to also run another instance of a vLLM with a corresponding safety model like `meta-llama/Llama-Guard-3-1B` using a script like:
Now you are ready to run Llama Stack with vLLM as the inference provider. You can do this via Conda (build code) or Docker which has a pre-built image.
Make sure you have access to a SambaNova API Key. You can get one by visiting [SambaNova.ai](http://cloud.sambanova.ai?utm_source=llamastack&utm_medium=external&utm_campaign=cloud_signup).
## Running Llama Stack with SambaNova
You can do this via Conda (build code) or Docker which has a pre-built image.
### Via Docker
This method allows you to get started quickly without having to build the distribution code.
To enable external providers, you need to configure the `external_providers_dir` in your Llama Stack configuration. This directory should contain your external provider specifications:
Here's a list of known external providers that you can use with Llama Stack:
| Name | Description | API | Type | Repository |
|------|-------------|-----|------|------------|
| KubeFlow Training | Train models with KubeFlow | Post Training | Remote | [llama-stack-provider-kft](https://github.com/opendatahub-io/llama-stack-provider-kft) |
| KubeFlow Pipelines | Train models with KubeFlow Pipelines | Post Training | Inline **and** Remote | [llama-stack-provider-kfp-trainer](https://github.com/opendatahub-io/llama-stack-provider-kfp-trainer) |
[HuggingFace SFTTrainer](https://huggingface.co/docs/trl/en/sft_trainer) is an inline post training provider for Llama Stack. It allows you to run supervised fine tuning on a variety of models using many datasets.
## Features
- Simple access through the post_training API
- Fully integrated with Llama Stack
- GPU support, CPU support, and MPS support (MacOS Metal Performance Shaders)
## Usage
To use the HF SFTTrainer in your Llama Stack project, follow these steps:
1. Configure your Llama Stack project to use this provider.
2. Kick off a SFT job using the Llama Stack post_training API.
## Setup
You can access the HuggingFace trainer via the `ollama` distribution:
[NVIDIA NEMO](https://developer.nvidia.com/nemo-framework) is a remote post training provider for Llama Stack. It provides enterprise-grade fine-tuning capabilities through NVIDIA's NeMo Customizer service.
## Features
- Enterprise-grade fine-tuning capabilities
- Support for LoRA and SFT fine-tuning
- Integration with NVIDIA's NeMo Customizer service
- Support for various NVIDIA-optimized models
- Efficient training with NVIDIA hardware acceleration
## Usage
To use NVIDIA NEMO in your Llama Stack project, follow these steps:
1. Configure your Llama Stack project to use this provider.
2. Set up your NVIDIA API credentials.
3. Kick off a fine-tuning job using the Llama Stack post_training API.
## Setup
You'll need to set the following environment variables:
[TorchTune](https://github.com/pytorch/torchtune) is an inline post training provider for Llama Stack. It provides a simple and efficient way to fine-tune language models using PyTorch.
## Features
- Simple access through the post_training API
- Fully integrated with Llama Stack
- GPU support and single device capabilities.
- Support for LoRA
## Usage
To use TorchTune in your Llama Stack project, follow these steps:
1. Configure your Llama Stack project to use this provider.
2. Kick off a fine-tuning job using the Llama Stack post_training API.
## Setup
You can access the TorchTune trainer by writing your own yaml pointing to the provider:
```yaml
post_training:
- provider_id: torchtune
  provider_type: inline::torchtune
  config: {}
```
You can then build and run your own stack with this provider.
## Run Training
You can access the provider and the `supervised_fine_tune` method via the post_training API:
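For example, a hedged sketch of kicking off a job from the client (argument names and config fields here are indicative only and may differ between client versions; check your client's reference docs):
```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

job = client.post_training.supervised_fine_tune(
    job_uuid="my-sft-job",
    model="meta-llama/Llama-3.2-3B-Instruct",
    training_config={
        # illustrative training settings
        "n_epochs": 1,
        "data_config": {"dataset_id": "my-dataset", "batch_size": 1, "shuffle": False},
        "optimizer_config": {"optimizer_type": "adamw", "lr": 1e-5, "weight_decay": 0.01, "num_warmup_steps": 0},
    },
    algorithm_config={"type": "LoRA", "rank": 8, "alpha": 16},  # illustrative LoRA settings
    hyperparam_search_config={},
    logger_config={},
    checkpoint_dir="",
)
print(job.job_uuid)  # returned before training completes
```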