# What does this PR do?
Don't use unicode characters in the codebase. ASCII-only is preferred
for compatibility or readability reasons
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This PR introduces a reusable GitHub Actions workflow for pulling and
running an Ollama model, with caching to avoid repeated downloads.
[//]: # (If resolving an issue, uncomment and update the line below)
Closes: #1949
## Test Plan
1. Trigger a workflow that uses the Ollama setup. Confirm that:
- The model is pulled successfully.
- It is placed in the correct directory, official at the moment (not
~ollama/.ollama/models as per comment so need to confirm this).
2. Re-run the same workflow to validate that:
- The model is restored from the cache.
- Execution succeeds with the cached model.
[//]: # (## Documentation)
# What does this PR do?
For the Issue :-
#[2010](https://github.com/meta-llama/llama-stack/issues/2010)
Currently, if we try to connect the Llama stack server to a remote
Milvus instance that has TLS enabled, the connection fails because TLS
support is not implemented in the Llama stack codebase. As a result,
users are unable to use secured Milvus deployments out of the box.
After adding this , the user will be able to connect to remote::Milvus
which is TLS enabled .
if TLS enabled :-
```
vector_io:
- provider_id: milvus
provider_type: remote::milvus
config:
uri: "http://<host>:<port>"
token: "<user>:<password>"
secure: True
server_pem_path: "path/to/server.pem"
```
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
I have already tested it by connecting to a Milvus instance which is TLS
enabled and i was able to start llama stack server .
# What does this PR do?
**Fixes** #1959
HuggingFace provides several loading paths that the datasets library can
use. My theory on why the test would previously fail intermittently is
because when calling `load_dataset(...)`, it may be trying several
options such as local cache, Hugging Face Hub, or a dataset script, or
other. There's one of these options that seem to work inconsistently in
the CI.
The HuggingFace datasets library relies on the `transformers` package to
load certain datasets such as `llamastack/simpleqa`, and by adding the
package, we can see the dataset is loaded consistently via the Hugging
Face Hub.
Please see PR in my fork demonstrating over 7 consecutive passes:
https://github.com/ChristianZaccaria/llama-stack/pull/1
**Some References:**
- https://github.com/huggingface/transformers/issues/8690
- https://huggingface.co/docs/datasets/en/loading
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Add fixtures for SqliteKVStore, DiskDistributionRegistry and
CachedDiskDistributionRegistry. And use them in tests that had all been
duplicating similar setups.
## Test Plan
unit tests continue to run
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
We've disabled it for a while given that this hasn't worked as well as
expected given the frequent changes of llama_stack_client and how this
requires both repos to be in sync.
## Test Plan
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
- Clarified best practices for using `# noqa` and `# type: ignore`,
requiring justification comments
- Improved formatting for readability
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Several fixes to ensure the script runs properly on macOS & Podman:
- Automates Podman VM startup on macOS
- Fixes host-gateway handling
- Adds explicit ARM64 platform overrides (this also fixes the platform
warning on Docker)
- Switches health checks to in-container exec calls to avoid Podman
timeouts
- Minor formatting nits
# (Closes#2064 )
## Test Plan
- Manual testing on macOS and Podman
# What does this PR do?
When converting OpenAI message content for the "system" and "assistant"
roles to Llama Stack inference APIs (used for some providers when
dealing with Llama models via OpenAI API requests to get proper prompt /
tool handling), we were not properly converting any non-string content.
I discovered this while running the new Responses AI verification suite
against the Fireworks provider, but instead of fixing it as part of some
ongoing work there split this out into a separate PR.
This fixes that, by using the `openai_content_to_content` helper we used
elsewhere to ensure content parts were mapped properly.
## Test Plan
I added a couple of new tests to `test_openai_compat` to reproduce this
issue and validate its fix. I ran those as below:
```
python -m pytest -s -v tests/unit/providers/utils/inference/test_openai_compat.py
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
The builtin implementation of code interpreter is not robust and has a
really weak sandboxing shell (the `bubblewrap` container). Given the
availability of better MCP code interpreter servers coming up, we should
use them instead of baking an implementation into the Stack and
expanding the vulnerability surface to the rest of the Stack.
This PR only does the removal. We will add examples with how to
integrate with MCPs in subsequent ones.
## Test Plan
Existing tests.
# What does this PR do?
The goal of this PR is code base modernization.
Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)
Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worth of note can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py` where reflection code was updated
to deal with "newer" types.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
Nullable param type is not supported, e.g. ['string', 'null'], since it
fails type validation.
Tests:
Run inference with
messages:
- content: You are a helpful assistant that can use tools to get
information.
role: system
- content: What's the temperature in San Francisco in celsius?
role: user
tools:
- function:
description: Get current temperature for a given location.
name: get_weather
parameters:
additionalProperties: false
properties:
location:
description: "City and country e.g. Bogot\xE1, Colombia"
type: string
unit:
description: "Unit of temperature, default to celsius"
type: [string, "null"] # <= nullable type
required:
- location
type: object
type: function
Co-authored-by: Eric Huang <erichuang@fb.com>
# What does this PR do?
Add support for the temperature to the responses API
## Test Plan
Manually tested simple case
unit tests added for simple case and tool calls
Signed-off-by: Derek Higgins <derekh@redhat.com>
All merges produced by github are pushes to main, which makes the check
fail. The check is local by design, not meant for CI.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
When the result of a ToolCall gets passed back into vLLM for the model
to handle the tool call result (as is often the case in agentic
tool-calling workflows), we forgot to handle the case where BuiltinTool
calls are not string values but instead instances of the BuiltinTool
enum. This fixes that, properly converting those enums to string values
before trying to serialize them into an OpenAI chat completion request
to vLLM.
PR #1931 fixed a bug where we weren't passing these tool calling results
back into vLLM, but as a side-effect it created this serialization bug
when using BuiltinTools.
Closes#2070
## Test Plan
I added a new unit test to the openai_compat unit tests to cover this
scenario, ensured the new test failed before this fix, and all the
existing tests there plus the new one passed with this fix.
```
python -m pytest -s -v tests/unit/providers/utils/inference/test_openai_compat.py
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Add several new pre-commit hooks to improve code quality and security:
- no-commit-to-branch: prevent direct commits to protected branches like
`main`
- check-yaml: validate YAML files
- detect-private-key: prevent accidental commit of private keys
- requirements-txt-fixer: maintain consistent requirements.txt format
and sorting
- mixed-line-ending: enforce LF line endings to avoid mixed line endings
- check-executables-have-shebangs: ensure executable scripts have
shebangs
- check-json: validate JSON files
- check-shebang-scripts-are-executable: verify shebang scripts are
executable
- check-symlinks: validate symlinks and report broken ones
- check-toml: validate TOML files mainly for pyproject.toml
The respective fixes have been included.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Partial revert of fa68ded07c
this commit ensures users know where their new templates are generated
and how to run the newly built distro locally
discussion on Discord:
1351652390
## Test Plan
Did a local run - let me know if we want any unit testing covering this

## Documentation
Updated "Zero to Hero" guide with new output
---------
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
- Added new Ruff lint rules to detect ambiguous or non-ASCII characters:
- Added per-file ignores where Unicode usage is still required.
- Fixed whatever had to be fixed
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
When running a Llama Stack server and invoking the
`/v1/safety/run-shield` endpoint, the NVIDIA Guardrails endpoint in some
cases errors with a `422: Unprocessable Entity` due to malformed input.
For example, given an request body like:
```
{
"model": "test",
"messages": [
{ "role": "user", "content": "You are stupid." }
]
}
```
`convert_pydantic_to_json_value` converts the message to:
```
{ "role": "user", "content": "You are stupid.", "context": null }
```
Which causes NVIDIA Guardrails to return an error `HTTPError: 422 Client
Error: Unprocessable Entity for url:
http://nemo.test/v1/guardrail/checks`, because `context` shouldn't be
included in the body.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
I ran the Llama Stack server locally and manually verified that the
endpoint now succeeds.
```
message = {"role": "user", "content": "You are stupid."}
response = client.safety.run_shield(messages=[message], shield_id=shield_id, params={})
```
Server logs:
```
14:29:09.656 [START] /v1/safety/run-shield
INFO: 127.0.0.1:54616 - "POST /v1/safety/run-shield HTTP/1.1" 200 OK
14:29:09.918 [END] /v1/safety/run-shield [StatusCode.OK] (262.26ms
```
[//]: # (## Documentation)
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
Replaced `${env.OTEL_SERVICE_NAME:\u200B}` and similar variants with
properly formatted `${env.OTEL_SERVICE_NAME:}` across all YAML templates
and TelemetryConfig. This prevents silent parsing issues and ensures
consistent environment variable resolution.
Slipped in https://github.com/meta-llama/llama-stack/pull/2058
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
* pull the embedding model so that it's not pulled during the distro
server startup sequence
* cache the models
* collect logs at the end of the workflow
Signed-off-by: Sébastien Han <seb@redhat.com>
Distribution Template Codegen was broken
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
* new workflow job **build-ubi9-container-distribution**
* runs on the default `ubuntu-latest` runner
* uses the existing `dev` template
* invokes `uv run llama stack build` with `.container_base =
"registry.access.redhat.com/ubi9/ubi-minimal:latest"`
* inspects the resulting image to verify its entrypoint
# (Closes#1994)
## Test Plan
- CI now includes the `build-ubi9-container-distribution` job and will
turn green when that job passes on changes to build files
# What does this PR do?
The telemetry provider configs is the only one who leverages the env var
`SQLITE_DB_PATH` for pointing to persistent data in the respective
templates, whereas usually `SQLITE_STORE_DIR` is used.
This PR modifies the `sqlite_db_path` in various telemetry configuration
files to use the environment variable `SQLITE_STORE_DIR` instead of
`SQLITE_DB_PATH`. This change ensures that _only_ the SQLITE_STORE_DIR
needs to be set to point to a different persistence location for
providers.
All references to `SQLITE_DB_PATH` have been removed.
Another improvement could be to move `sqlite_db_path` to `db_path` in
the telemetry provider config, to align with the other provider
configurations. That could be done by another PR (if wanted).
# What does this PR do?
This builds on top of
https://github.com/meta-llama/llama-stack/pull/2037 to include some
additional changes to fix integration tests builds.
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
In our OpenAI API verification tests, ollama was still calling tools
even when `tool_choice="none"` was passed in its chat completion
requests. Because ollama isn't respecting `tool_choice` properly, this
adjusts our provider implementation to remove the `tools` from the
request if `tool_choice="none"` is passed in so that it does not attempt
to call any of those tools.
## Test Plan
I tested this with a couple of Llama models, using both our OpenAI
completions integration tests and our verification test suites.
### OpenAI Completions / Chat Completions integration tests
These all passed before, and still do.
```
INFERENCE_MODEL="llama3.2:3b-instruct-fp16" \
llama stack build --template ollama --image-type venv --run
```
```
LLAMA_STACK_CONFIG=http://localhost:8321 \
python -m pytest -v \
tests/integration/inference/test_openai_completion.py \
--text-model "llama3.2:3b-instruct-fp16"
```
### OpenAI API Verification test suite
test_chat_*_tool_choice_none OpenAI API verification tests pass now,
when they failed before.
See
https://github.com/bbrowning/llama-stack-tests/blob/main/openai-api-verification/2025-04-27.md#ollama-llama-stack
for an example of these failures from a recent nightly CI run.
```
INFERENCE_MODEL="llama3.3:70b-instruct-q3_K_M" \
llama stack build --template ollama --image-type venv --run
```
```
cat <<-EOF > tests/verifications/conf/ollama-llama-stack.yaml
base_url: http://localhost:8321/v1/openai/v1
api_key_var: OPENAI_API_KEY
models:
- llama3.3:70b-instruct-q3_K_M
model_display_names:
llama3.3:70b-instruct-q3_K_M: Llama-3.3-70B-Instruct
test_exclusions:
llama3.3:70b-instruct-q3_K_M:
- test_chat_non_streaming_image
- test_chat_streaming_image
- test_chat_multi_turn_multiple_images
EOF
```
```
python -m pytest -s -v \
'tests/verifications/openai_api/test_chat_completion.py' \
--provider=ollama-llama-stack
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This PR updates how the `AgentType` gets set using the radio button on
the tools page of the playground. This change is needed due to the fact
with its current implementation, the chat interface will resets after
every input, preventing users from having a multi-turn conversation with
the agent.
## Test Plan
Run the Playground without these changes:
```bash
streamlit run llama_stack/distribution/ui/app.py
```
Navigate to the tools page and attempt to have a multi-turn
conversation. You should see the conversation reset after asking a
second question.
Repeat the steps above with these changes and you will see that it works
as expected when asking the agent multiple questions.
Signed-off-by: Michael Clifford <mcliffor@redhat.com>
# What does this PR do?
This provides an initial [OpenAI Responses
API](https://platform.openai.com/docs/api-reference/responses)
implementation. The API is not yet complete, and this is more a
proof-of-concept to show how we can store responses in our key-value
stores and use them to support the Responses API concepts like
`previous_response_id`.
## Test Plan
I've added a new
`tests/integration/openai_responses/test_openai_responses.py` as part of
a test-driven development for this new API. I'm only testing this
locally with the remote-vllm provider for now, but it should work with
any of our inference providers since the only API it requires out of the
inference provider is the `openai_chat_completion` endpoint.
```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack build --template remote-vllm --image-type venv --run
```
```
LLAMA_STACK_CONFIG="http://localhost:8321" \
python -m pytest -v \
tests/integration/openai_responses/test_openai_responses.py \
--text-model "meta-llama/Llama-3.2-3B-Instruct"
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
This commit adds a new authentication system to the Llama Stack server
with support for Kubernetes and custom authentication providers. Key
changes include:
- Implemented KubernetesAuthProvider for validating Kubernetes service
account tokens
- Implemented CustomAuthProvider for validating tokens against external
endpoints - this is the same code that was already present.
- Added test for Kubernetes
- Updated server configuration to support authentication settings
- Added documentation for authentication configuration and usage
The authentication system supports:
- Bearer token validation
- Kubernetes service account token validation
- Custom authentication endpoints
## Test Plan
Setup a Kube cluster using Kind or Minikube.
Run a server with:
```
server:
port: 8321
auth:
provider_type: kubernetes
config:
api_server_url: http://url
ca_cert_path: path/to/cert (optional)
```
Run:
```
curl -s -L -H "Authorization: Bearer $(kubectl create token my-user)" http://127.0.0.1:8321/v1/providers
```
Or replace "my-user" with your service account.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Implemetation of NeMO Datastore register, unregister API.
Open Issues:
- provider_id gets set to `localfs` in client.datasets.register() as it
is specified in routing_tables.py: DatasetsRoutingTable
see: #1860
Currently I have passed `"provider_id":"nvidia"` in metadata and have
parsed that in `DatasetsRoutingTable`
(Not the best approach, but just a quick workaround to make it work for
now.)
## Test Plan
- Unit test cases: `pytest
tests/unit/providers/nvidia/test_datastore.py`
```bash
========================================================== test session starts ===========================================================
platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0, asyncio-0.26.0, nbval-0.11.0, metadata-3.1.1, html-4.1.1, cov-6.1.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 2 items
tests/unit/providers/nvidia/test_datastore.py .. [100%]
============================================================ warnings summary ============================================================
====================================================== 2 passed, 1 warning in 0.84s ======================================================
```
cc: @dglogo, @mattf, @yanxi0830
# What does this PR do?
Add installation script for Llama Stack Meta Reference distro (Docker
only).
# Closes#1374
## Test Plan
./instal.sh
---------
Co-authored-by: Sébastien Han <seb@redhat.com>
This resolves a new critical severity on h11. See
https://access.redhat.com/security/cve/cve-2025-43859. We should
consider releasing a new patch with this fix.
This was updated via:
```
uv add "h11>=0.16.0"
uv export --frozen --no-hashes --no-emit-project --output-file=requirements.txt
```
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>