# What does this PR do?
The rag-runtime tool requires the files API as a dependency, but the
NVIDIA distribution was missing the files provider configuration. Thus,
when running:
```
llama stack build --distro nvidia --image-type venv
```
And then:
```
llama stack run {path_to_distribution_config} --image-type venv
```
It would raise an error:
```
RuntimeError: Failed to resolve 'tool_runtime' provider 'rag-runtime' of type 'inline::rag-runtime': required dependency 'files' is not available. Please add a 'files' provider to your configuration or check if the provider is properly configured.
```
This PR fixes the issue by adding the missing files provider to the
NVIDIA distribution.
## Test Plan
N/A
# What does this PR do?
Adds a write worker queue for writes to the inference store. This avoids
overwhelming request processing with slow inference writes.
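For context, here is a rough sketch of the pattern (not this PR's actual
code), assuming an asyncio-based server and a store object exposing an
async `write(record)` method:
```python
import asyncio


class InferenceWriteQueue:
    """Hedged sketch: request handlers enqueue records and return immediately,
    while background workers drain the queue and perform the slow store writes."""

    def __init__(self, store, num_workers: int = 2):
        self.store = store            # assumed: object with an async write(record) method
        self.num_workers = num_workers
        self.queue: asyncio.Queue = asyncio.Queue()
        self._workers: list[asyncio.Task] = []

    async def start(self) -> None:
        # Spawn the background workers inside the running event loop.
        self._workers = [
            asyncio.create_task(self._worker()) for _ in range(self.num_workers)
        ]

    def enqueue(self, record: dict) -> None:
        # Called on the request path; never blocks on the store.
        self.queue.put_nowait(record)

    async def _worker(self) -> None:
        while True:
            record = await self.queue.get()
            try:
                await self.store.write(record)
            finally:
                self.queue.task_done()

    async def flush(self) -> None:
        # Wait for all queued writes to land (useful for tests and shutdown).
        await self.queue.join()
```
The key point is that request handlers only pay the cost of `put_nowait`,
while the store writes happen on the worker tasks.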
## Test Plan
Benchmark:
```
cd /docs/source/distributions/k8s-benchmark
# start mock server
python openai-mock-server.py --port 8000
# start stack server
LLAMA_STACK_LOGGING="all=WARNING" uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml
# run benchmark script
uv run python3 benchmark.py --duration 120 --concurrent 50 --base-url=http://localhost:8321/v1/openai/v1 --model=vllm-inference/meta-llama/Llama-3.2-3B-Instruct
```
## RPS from 21 -> 57
# What does this PR do?
Add Kubernetes authentication provider support
- Add KubernetesAuthProvider class for token validation using Kubernetes
SelfSubjectReview API
- Add KubernetesAuthProviderConfig with configurable API server URL, TLS
settings, and claims mapping
- Implement authentication via POST requests to
/apis/authentication.k8s.io/v1/selfsubjectreviews endpoint
- Add support for parsing Kubernetes SelfSubjectReview response format
to extract user information
- Add KUBERNETES provider type to AuthProviderType enum
- Update create_auth_provider factory function to handle 'kubernetes'
provider type
- Add comprehensive unit tests for KubernetesAuthProvider functionality
- Add documentation with configuration examples and usage instructions
The provider validates tokens by sending SelfSubjectReview requests to
the Kubernetes API server and extracts user information from the
userInfo structure in the response.
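For illustration, a rough sketch of the token-validation call (not the
provider's exact implementation), assuming an `httpx`-based async client
and a configurable API server URL:
```python
import httpx


# Hedged sketch of validating a bearer token via the Kubernetes
# SelfSubjectReview API; not the provider's exact code.
async def validate_token(api_server_url: str, token: str, verify_tls: bool = True) -> dict:
    url = f"{api_server_url}/apis/authentication.k8s.io/v1/selfsubjectreviews"
    payload = {"apiVersion": "authentication.k8s.io/v1", "kind": "SelfSubjectReview"}
    headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

    async with httpx.AsyncClient(verify=verify_tls) as client:
        response = await client.post(url, json=payload, headers=headers)
        response.raise_for_status()  # invalid tokens surface as 401/403 here

    # The API server echoes back who the token belongs to.
    user_info = response.json().get("status", {}).get("userInfo", {})
    return {
        "username": user_info.get("username"),
        "uid": user_info.get("uid"),
        "groups": user_info.get("groups", []),
    }
```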
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
What this verifies:
- Authentication header validation
- Token validation against the Kubernetes SelfSubjectReview API endpoint
- Error handling for invalid tokens and HTTP errors
- Request payload structure and headers
```
python -m pytest tests/unit/server/test_auth.py -k "kubernetes" -v
```
Signed-off-by: Akram Ben Aissi <akram.benaissi@gmail.com>
# What does this PR do?
Fixes issues found while moving the docs to GitHub Pages.
## Test Plan
```
uv run --group docs sphinx-autobuild docs/source docs/build/html --write-all
```
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR removes `init()` from `LlamaStackAsLibrary`.
Previously, `client.initialize()` had to be invoked by the user.
To improve the developer experience and avoid runtime errors, this PR
initializes `LlamaStackAsLibrary` implicitly when the client is used.
It also prevents multiple initializations of the same client, while
maintaining backward compatibility.
This PR does the following
- Automatic initialization: the constructor calls `initialize_impl()`
automatically.
- The client is fully initialized after `__init__` completes.
- Prevents repeated initialization after the client has been
successfully initialized.
- The `initialize()` method still exists but is now a no-op.
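A usage sketch of the new behavior (the import path and distro name
below are assumptions, not taken from this PR):
```python
from llama_stack.core.library_client import LlamaStackAsLibraryClient  # import path assumed

# The constructor now performs initialization implicitly; no explicit
# client.initialize() call is needed anymore.
client = LlamaStackAsLibraryClient("starter")  # distro name is illustrative

models = client.models.list()  # usable immediately after construction
client.initialize()            # still allowed for backward compatibility, but a no-op
```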
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
fixes https://github.com/meta-llama/llama-stack/issues/2946
---------
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
Adds flexible CORS (Cross-Origin Resource Sharing) configuration support
to the FastAPI
server with both local development and explicit configuration modes:
- **Local development mode**: `cors: true` enables localhost-only access
with regex
pattern `https?://localhost:\d+`
- **Explicit configuration mode**: Specific origins configuration with
credential support
and validation
- Prevents insecure combinations (wildcards with credentials)
- FastAPI CORSMiddleware integration via `model_dump()`
Addresses the need for configurable CORS policies to support web
frontends and
cross-origin API access while maintaining security.
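As a rough sketch (the config model and field names are illustrative,
not this PR's exact schema), the explicit configuration mode maps onto
FastAPI's `CORSMiddleware` roughly like this:
```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel


# Illustrative config model; the actual schema in this PR may differ.
class CORSConfig(BaseModel):
    allow_origins: list[str] = []
    allow_origin_regex: str | None = None
    allow_methods: list[str] = ["*"]
    allow_headers: list[str] = ["*"]
    allow_credentials: bool = False


app = FastAPI()

# Explicit configuration mode: specific origins plus credentials.
cors = CORSConfig(allow_origins=["https://example.com"], allow_credentials=True)

# Reject the insecure wildcard-with-credentials combination.
if cors.allow_credentials and "*" in cors.allow_origins:
    raise ValueError("wildcard origins cannot be combined with credentials")

# Hand the validated config to FastAPI via model_dump(), as described above.
app.add_middleware(CORSMiddleware, **cors.model_dump())

# Local development mode (`cors: true`) would roughly correspond to:
# CORSConfig(allow_origin_regex=r"https?://localhost:\d+")
```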
Closes #2119
## Test Plan
1. Ran Unit Tests.
2. Manual tests: FastAPI middleware integration with actual HTTP
requests
- Local development mode localhost access validation
- Explicit configuration mode origins validation
- Preflight OPTIONS request handling
Some screenshots of manual tests.
<img width="1920" height="927" alt="image"
src="https://github.com/user-attachments/assets/79322338-40c7-45c9-a9ea-e3e8d8e2f849"
/>
<img width="1911" height="1037" alt="image"
src="https://github.com/user-attachments/assets/1683524e-b0c9-48c9-a0a5-782e949cde01"
/>
cc: @leseb @rhuss @franciscojavierarceo
# What does this PR do?
Refactors the OpenAI responses implementation by extracting streaming and tool execution logic into separate modules. This improves code organization by:
1. Creating a new `StreamingResponseOrchestrator` class in `streaming.py` to handle the streaming response generation logic
2. Moving tool execution functionality to a dedicated `ToolExecutor` class in `tool_executor.py`
## Test Plan
Existing tests
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
Remove pure venv (without uv) references in docs
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
As the title says. Distributions is in, Templates is out.
`llama stack build --template` --> `llama stack build --distro`. For
backward compatibility, the previous option is kept but results in a
warning.
Updated `server.py` to remove the "config_or_template" backward
compatibility since it has been a couple releases since that change.
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR is responsible for removal of Conda support in Llama Stack
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes #2539
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
**Description**
This PR removes some of the warnings when uv builds the docs
- When generating docs, errors appear about `.md` files not being
included in any toctree. ~~Adding content to the `providers-gen.py` file
that adds `--- orphan: true ---` to each file.~~ Added a toctree
generator to the `providers-gen.py` file, which gets rid of these errors
in the builds.
- Deletes the `_openai_compat` files, extension of PR #2849
- Adds the `files` APIs section to the `providers` toctree on the index
page
- Manually adds `--- orphan: true ---` to the advanced APIs. I'll try
to find a way to modify the providers code gen so it adds this
automatically, but this fixes the errors for now.
- Adds the `testing.md` to the `contributing` toctree
- Adds `starting_llama_stack_server.md` to `distributions` toctree
There are some other warnings I'm still looking at, but this PR gets
rid of most of the toctree errors.
There's also an issue with the actual distribution-codegen that I can
investigate in another PR. Opened a bug for it here: #2873
We tried to always keep Ollama enabled. However, doing so makes the
provider implementation half-assed -- should it error when it cannot
connect to Ollama or not? What happens during periodic model refresh?
Etc. Instead do the same thing we do for vLLM -- use the `OLLAMA_URL` to
conditionally enable the provider.
## Test Plan
Run `uv run llama stack build --template starter --image-type venv
--run` with and without `OLLAMA_URL` set. Verify using
`llama-stack-client provider list` that ollama is correctly enabled.
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
Updates provider template from outdated `ollama` to `starter`
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes: #2839
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
- Added ability to specify `required_scope` when declaring an API. This
is part of the `@webmethod` decorator.
- If auth is enabled, a user can access an API only if
`user.attributes['scope']` includes the `required_scope`
- We add `required_scope='telemetry.read'` to the telemetry read APIs.
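For illustration, declaring a scoped endpoint might look like the sketch
below (the import path, route, and signature are assumptions, not copied
from this PR):
```python
from llama_stack.schema_utils import webmethod  # import path assumed


class TelemetryAPI:
    @webmethod(route="/telemetry/traces", method="POST", required_scope="telemetry.read")
    async def query_traces(self, limit: int = 100) -> list[dict]:
        # With server.auth enabled, the request is rejected with 403 unless the
        # authenticated user's attributes include the "telemetry.read" scope.
        ...
```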
## Test Plan
CI with added tests
1. Enable server.auth with github token
2. Observe `client.telemetry.query_traces()` returns 403
# What does this PR do?
This PR improves documentation clarity around run.yaml file usage. It
adds comprehensive guidance to help users understand that generated
run.yaml files are templates meant to be customized for production use,
not used as-is.
## Changes
- Add new documentation section on customizing run.yaml files
- Clarify that generated run.yaml files are templates, not production
configs
- Add guidance on customization best practices and common scenarios
- Update existing documentation to reference customization guide
- Improve clarity around run.yaml file usage for better user experience
## Test Plan
- Verified new documentation file exists at correct location
- Confirmed documentation is properly integrated into the toctree
structure
- Checked all internal links use correct paths and reference existing
files
- Validated references are added to relevant existing documentation
files
- Documentation build testing will be handled by CI environment
# What does this PR do?
Updates some broken or outdated links pointing to the Android Demo App
Signed-off-by: Jorge Garcia Oncins <jgarciao@redhat.com>
# What does this PR do?
- fix env variables
- use GPU for vLLM
- add eks/apply.py for AWS
- add template to set the HF secret
## Test Plan
```
bash apply.sh
```
Co-authored-by: Eric Huang <erichuang@fb.com>
# What does this PR do?
The `nvidia` distro was previously collapsed into the `starter` distro.
However, the `nvidia` distro was setup specifically to use NVIDIA NeMo
microservices as providers for all APIs and not just inference, which
means it was doing quite a bit more than what the `starter` distro
covers today.
We should work with our friends at NVIDIA to determine the best place to
maintain this distro long-term, but for now this restores the `nvidia`
distro and its docs back to where they were so that things continue to
work for their users.
## Test Plan
I ensured the `nvidia` distro could build and run, at least to the
point of complaining that I didn't provide the necessary API keys.
```
uv run llama stack build --template nvidia --image-type venv
uv run llama stack run llama_stack/templates/nvidia/run.yaml
```
I also made sure the docs website built and looks reasonable, with the
`nvidia` distro docs at the same URL it was previously (because it has
incoming links from official NVIDIA NeMo docs, among other places).
```
uv run --group docs sphinx-autobuild docs/source docs/build/html --write-all
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
- we are using `all-minilm:l6-v2`, but the model we download from Ollama
is `all-minilm:latest`
  - latest: https://ollama.com/library/all-minilm:latest 1b226e2802db
  - l6-v2: https://ollama.com/library/all-minilm:l6-v2 1b226e2802db
- even though they are currently exactly the same model, if
[all-minilm:l12-v2](https://ollama.com/library/all-minilm:l12-v2) is
updated, "latest" might no longer match l6-v2
- the only change in this PR is to pin the model id in ollama
- also update detailed_tutorial to use "starter" in place of the
deprecated "ollama".
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
```
>INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
>llama stack build --run --template ollama --image-type venv
...
Build Successful!
You can find the newly-built template here: /home/wenzhou/zdtsw-forking/lls/llama-stack/llama_stack/templates/ollama/run.yaml
....
- metadata:
embedding_dimension: 384
model_id: all-MiniLM-L6-v2
model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType
- embedding
provider_id: ollama
provider_model_id: all-minilm:l6-v2
...
```
test
```
>llama-stack-client inference chat-completion --message "Write me a 2-sentence poem about the moon"
INFO:httpx:HTTP Request: GET http://localhost:8321/v1/models "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/openai/v1/chat/completions "HTTP/1.1 200 OK"
OpenAIChatCompletion(
id='chatcmpl-04f99071-3da2-44ba-a19f-03b5b7fc70b7',
choices=[
OpenAIChatCompletionChoice(
finish_reason='stop',
index=0,
message=OpenAIChatCompletionChoiceMessageOpenAIAssistantMessageParam(
role='assistant',
content="Here is a 2-sentence poem about the moon:\n\nSilver crescent in the midnight sky,\nLuna's gentle face, a beauty to the eye.",
name=None,
tool_calls=None,
refusal=None,
annotations=None,
audio=None,
function_call=None
),
logprobs=None
)
],
created=1751644429,
model='llama3.2:3b-instruct-fp16',
object='chat.completion',
service_tier=None,
system_fingerprint='fp_ollama',
usage={'completion_tokens': 33, 'prompt_tokens': 36, 'total_tokens': 69, 'completion_tokens_details': None, 'prompt_tokens_details': None}
)
```
---------
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
# What does this PR do?
* Removes a bunch of distros
* The removed distros have been folded into the "starter" distribution
* A doc for "starter" has been added
* Partially reverts https://github.com/meta-llama/llama-stack/pull/2482
since inference providers are disabled by default and can be turned on
manually via env variable.
* Disables safety in starter distro
Closes: https://github.com/meta-llama/llama-stack/issues/2502.
~Needs: https://github.com/meta-llama/llama-stack/pull/2482 for Ollama
to work properly in the CI.~
TODO:
- [ ] We can only update `install.sh` when we get a new release.
- [x] Update providers documentation
- [ ] Update notebooks to reference starter instead of ollama
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
- add model_type in example
- change "Memory" to "VectorIO" as column name
- update index.md and README.md
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
run pre-commit to catch changes.
---------
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
We were not using conditionals correctly: conditionals can only be used
when the env variable is set, so `${env.ENVIRONMENT:+}` would return
None if ENVIRONMENT is not set.
If you want to create a conditional value, you need to use
`${env.ENVIRONMENT:=}`; this will pick the value of ENVIRONMENT if set,
otherwise it will return None.
Closes: https://github.com/meta-llama/llama-stack/issues/2564
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
- follow-up to https://github.com/meta-llama/llama-stack/issues/2110:
the documentation needs updating, since "container" is not a valid value
for `--image-type`
- chore: updates from standard output
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
# What does this PR do?
This commit significantly improves the environment variable substitution
functionality in Llama Stack configuration files:
* The version field in configuration files has been changed from string
to integer type for better type consistency across build and run
configurations.
* The environment variable substitution system for ${env.FOO:} was fixed
and properly returns an error
* The environment variable substitution system for ${env.FOO+} returns
None instead of an empty string, which better matches the type
annotations in config fields.
* The system includes automatic type conversion for boolean, integer,
and float values.
* The error messages have been enhanced to provide clearer guidance when
environment variables are missing, including suggestions for using
default values or conditional syntax.
* Comprehensive documentation has been added to the configuration guide
explaining all supported syntax patterns, best practices, and runtime
override capabilities.
* Multiple provider configurations have been updated to use the new
conditional syntax for optional API keys, making the system more
flexible for different deployment scenarios. The telemetry configuration
has been improved to properly handle optional endpoints with appropriate
validation, ensuring that required endpoints are specified when their
corresponding sinks are enabled.
* There were many instances of ${env.NVIDIA_API_KEY:} that should have
caused the code to fail. However, due to a bug, the distro server was
still being started, and early validation wasn’t triggered. As a result,
failures were likely being handled downstream by the providers. I’ve
maintained similar behavior by using ${env.NVIDIA_API_KEY:+}, though I
believe this is incorrect for many configurations. I’ll leave it to each
provider to correct it as needed.
* Environment variable substitution now uses the same syntax as Bash
parameter expansion.
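To make the semantics concrete, a toy sketch (not the project's
implementation) of the two expansion forms described above:
```python
import os

# Toy illustration, mirroring Bash parameter expansion:
#   ${env.NAME:=default} -> NAME's value when set, otherwise "default"
#   ${env.NAME:+value}   -> "value" when NAME is set, otherwise None (not "")

def expand_default(name: str, default: str) -> str:
    return os.environ.get(name, default)

def expand_conditional(name: str, value: str) -> str | None:
    return value if os.environ.get(name) else None

os.environ["OLLAMA_URL"] = "http://localhost:11434"
os.environ.pop("NVIDIA_API_KEY", None)

print(expand_conditional("OLLAMA_URL", "ollama"))      # -> "ollama"
print(expand_conditional("NVIDIA_API_KEY", "secret"))  # -> None, not an empty string
print(expand_default("SOME_PORT", "8321"))             # -> "8321" (fallback used)
```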
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Our starter distro required Ollama to be running (and a large list of
models available in that Ollama) to successfully start. This adjusts
things so that Ollama does not have to be running to use the starter
template / distro.
To accomplish this, a few changes were needed:
* The Ollama provider is now configurable whether it raises an Exception
or just logs a warning when it cannot reach the Ollama server on
startup. The default is to raise an exception (same as previous
behavior), but in the starter template we adjust this to just log a
warning so that we can bring the stack up without needing a running
Ollama server.
* The starter template no longer specifies a default list of models for
Ollama, as any models specified there need to actually be pulled and
available in Ollama. Instead, it adds a new
`OLLAMA_INFERENCE_MODEL` environment variable where users can provide an
optional model to register with the Ollama provider on startup.
Additional models can also be registered at runtime via the typical
`models.register(...)` call (see the sketch after this list).
* The vLLM template was adjusted to also allow an optional
`VLLM_INFERENCE_MODEL` specified on startup, so that the behavior
between vLLM and Ollama was consistent here to make it easy to get up
and running quickly.
* The default vector store was changed from sqlite-vec to faiss.
sqlite-vec can be enabled by setting the `ENABLE_SQLITE_VEC` environment
variable, like we do for chromadb and pgvector. This is due to
sqlite-vec not shipping proper arm64 binaries, like we previously fixed
in #1530 for the ollama distribution.
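As a hedged illustration of the runtime registration path (the endpoint
URL and model identifiers below are placeholders):
```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Register an additional Ollama-served model at runtime; the identifiers
# are placeholders, not values shipped with the starter template.
client.models.register(
    model_id="llama3.2:3b-instruct-fp16",
    provider_id="ollama",
    provider_model_id="llama3.2:3b-instruct-fp16",
    model_type="llm",
)
```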
## Test Plan
With this change, the following scenarios now work with the starter
template that did not before:
* no Ollama running
* Ollama running but not all of the Llama models pulled locally
* Ollama running with a custom model registered on startup
* vLLM running with a custom model registered on startup
* running the starter template on linux/arm64, like when running
containers on Mac without rosetta emulation
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>