- Add provider_config_resolver to resolve backend references in provider configs
- Providers can now reference persistence.backends instead of duplicating kvstore configs
- Supports kvstore, metadata_store, persistence_store, responses_store references
- Updated ci-tests to use backend references with namespaces
- Eliminates massive config duplication across providers
- Both SqliteKVStoreConfig and SqliteSqlStoreConfig use type='sqlite'
- Pydantic cannot distinguish them in a union
- Solution: Custom validator parses backends based on which stores reference them
- Metadata store requires KVStore, inference/conversations require SqlStore
- Separate kvstore/sqlstore backends in configs for clarity
- Add StoresConfig to group all store references under persistence.stores
- Use single 'default' backend instead of separate metadata_backend/inference_backend
- Update resolver to access persistence.stores.{metadata,inference,conversations}
- All SQLite distributions now use single store.db file with shared backend
## Summary
Introduce `ExtraBodyField` annotation to enable parameters that arrive
via extra_body in client SDKs but are accessible server-side with full
typing.
These parameters are documented in OpenAPI specs under
**`x-llama-stack-extra-body-params`** but excluded from generated SDK
signatures.
Add `shields` parameter to `create_openai_response` as the first
implementation using this pattern.
## Test Plan
- added an integration test which checks that shields parameter passed
via extra_body reaches server implementation
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
# What does this PR do?
Initial implementation for `Conversations` and `ConversationItems` using
`AuthorizedSqlStore` with endpoints to:
- CREATE
- UPDATE
- GET/RETRIEVE/LIST
- DELETE
Set `level=LLAMA_STACK_API_V1`.
NOTE: This does not currently incorporate changes for Responses, that'll
be done in a subsequent PR.
Closes https://github.com/llamastack/llama-stack/issues/3235
## Test Plan
- Unit tests
- Integration tests
Also comparison of [OpenAPI spec for OpenAI
API](https://github.com/openai/openai-openapi/tree/manual_spec)
```bash
oasdiff breaking --fail-on ERR docs/static/llama-stack-spec.yaml https://raw.githubusercontent.com/openai/openai-openapi/refs/heads/manual_spec/openapi.yaml --strip-prefix-base "/v1/openai/v1" \
--match-path '(^/v1/openai/v1/conversations.*|^/conversations.*)'
```
Note I still have some uncertainty about this, I borrowed this info from
@cdoern on https://github.com/llamastack/llama-stack/pull/3514 but need
to spend more time to confirm it's working, at the moment it suggests it
does.
UPDATE on `oasdiff`, I investigated the OpenAI spec further and it looks
like currently the spec does not list Conversations, so that analysis is
useless. Noting for future reference.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
remove unused chat_completion implementations
vllm features ported -
- requires max_tokens be set, use config value
- set tool_choice to none if no tools provided
## Test Plan
ci
This is a sweeping change to clean up some gunk around our "Tool"
definitions.
First, we had two types `Tool` and `ToolDef`. The first of these was a
"Resource" type for the registry but we had stopped registering tools
inside the Registry long back (and only registered ToolGroups.) The
latter was for specifying tools for the Agents API. This PR removes the
former and adds an optional `toolgroup_id` field to the latter.
Secondly, as pointed out by @bbrowning in
https://github.com/llamastack/llama-stack/pull/3003#issuecomment-3245270132,
we were doing a lossy conversion from a full JSON schema from the MCP
tool specification into our ToolDefinition to send it to the model.
There is no necessity to do this -- we ourselves aren't doing any
execution at all but merely passing it to the chat completions API which
supports this. By doing this (and by doing it poorly), we encountered
limitations like not supporting array items, or not resolving $refs,
etc.
To fix this, we replaced the `parameters` field by `{ input_schema,
output_schema }` which can be full blown JSON schemas.
Finally, there were some types in our llama-related chat format
conversion which needed some cleanup. We are taking this opportunity to
clean those up.
This PR is a substantial breaking change to the API. However, given our
window for introducing breaking changes, this suits us just fine. I will
be landing a concurrent `llama-stack-client` change as well since API
shapes are changing.
# What does this PR do?
inference/rerank is the one route in the API intended to not be
deprecated. Level it as v1alpha.
Additionally, remove `experimental` and opt to instead use `v1alpha`
which itself implies an experimental state based on the original
proposal
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Mirroring the same changes that was used for inference_store:
https://github.com/llamastack/llama-stack/pull/3383
Will follow up with a shared internal API for managing these write
queues.
## Test Plan
existing tests
# What does this PR do?
APIs removed:
- POST /v1/batch-inference/completion
- POST /v1/batch-inference/chat-completion
- POST /v1/inference/batch-completion
- POST /v1/inference/batch-chat-completion
note -
- batch-completion & batch-chat-completion were only implemented for
inference=inline::meta-reference
- batch-inference were not implemented
the commit to be reverted is an public api behavior change to something
we should not support.
instead of allowing silent updates (the caller cannot see the log
messages), we should be sending an error to the caller that they must
first unregister the model before reusing the same name w/ a different
backend.
# What does this PR do?
When listing (and lazily indexing) tools, it's possible for an error to
get thrown by individual toolgroups if for example an MCP toolgroup is
unable to connect to its `mcp_endpoint`.
This logs a warning in the server when that happens, logs a full stack
trace of the error if debug logging is enabled, and just returns the
list of tools from all working toolgroups instead of throwing an error
to the client when a single toolgroup is temporarily or permanently
misbehaving.
The exception to the above is authentication errors, which we
specifically send all the way back to the client as that's how we
indicate to the client that it needs to provide authentication data for
the remote MCP servers.
Closes#2540
## Test Plan
A new unit test was added to test this exception handling, which is run
as part of our regular test suite but also manually run to specifically
verify this fix via:
```
uv run pytest -sv --asyncio-mode=auto \
tests/unit/distribution/routers/test_routing_tables.py
```
To verify the additional debug logging is printing properly:
```
LLAMA_STACK_LOGGING=core=debug \
uv run pytest -sv --asyncio-mode=auto \
tests/unit/distribution/routers/test_routing_tables.py
```
The mcp integration tests were run as below (and by CI):
```
ollama run llama3.2:3b
ENABLE_OLLAMA="ollama" \
OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=starter \
uv run pytest -sv tests/integration/tool_runtime/test_mcp.py \
--text-model meta-llama/Llama-3.2-3B-Instruct
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Rather than have a single `LLAMA_STACK_VERSION`, we need to have a
`_V1`, `_V1ALPHA`, and `_V1BETA` constant.
This also necessitated addition of `level` to the `WebMethod` so that
routing can be handeled properly.
For backwards compat, the `v1` routes are being kept around and marked
as `deprecated`. When used, the server will log a deprecation warning.
Deprecation log:
<img width="1224" height="134" alt="Screenshot 2025-09-25 at 2 43 36 PM"
src="https://github.com/user-attachments/assets/0cc7c245-dafc-48f0-be99-269fb9a686f9"
/>
move:
1. post_training to `v1alpha` as it is under heavy development and not
near its final state
2. eval: job scheduling is not implemented. Relies heavily on the
datasetio API which is under development missing implementations of
specific routes indicating the structure of those routes might change.
Additionally eval depends on the `inference` API which is going to be
deprecated, eval will likely need a major API surface change to conform
to using completions properly
implements leveling in #3317
note: integration tests will fail until the SDK is regenerated with
v1alpha/inference as opposed to v1/inference
## Test Plan
existing tests should pass with newly generated schema. Conformance will
also pass as these routes are not the ones we currently test for
stability
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
error:
5099094847
## Test Plan
GITHUB_ACTIONS=true BUILD_PLATFORM=linux/amd64 USE_COPY_NOT_MOUNT=true
LLAMA_STACK_DIR=. uv run --with llama-stack llama stack build --distro
starter --image-type container --image-name ehhuang/distribution-starter
succeeds
- Catch Errors from providers without API keys during model refresh
- Log as warning instead of exception to avoid a scary startup
Closes: #3492
Error message are now warnings instead of several tracebacks
```
INFO 2025-09-19 16:06:55,228 llama_stack.providers.utils.inference.inference_store:74 inference_store: Write queue disabled for SQLite to avoid
concurrency issues
WARNING 2025-09-19 16:06:59,362 llama_stack.providers.utils.inference.openai_mixin:327 providers::utils: Failed to list models for anthropic: API key
is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"anthropic_api_key": "<API_KEY>"},
or in the provider config.
WARNING 2025-09-19 16:06:59,364 llama_stack.providers.utils.inference.openai_mixin:327 providers::utils: Failed to list models for gemini: API key is
not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"gemini_api_key": "<API_KEY>"}, or in
the provider config.
WARNING 2025-09-19 16:06:59,367 llama_stack.providers.utils.inference.openai_mixin:327 providers::utils: Failed to list models for groq: API key is
not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"groq_api_key": "<API_KEY>"}, or in
the provider config.
WARNING 2025-09-19 16:06:59,372 llama_stack.providers.utils.inference.openai_mixin:327 providers::utils: Failed to list models for sambanova: API key
is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"sambanova_api_key": "<API_KEY>"},
or in the provider config.
INFO 2025-09-19 16:06:59,533 llama_stack.core.utils.config_resolution:45 core: Using file path:
```
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
As shown in #3421, we can scale stack to handle more RPS with k8s
replicas. This PR enables multi process stack with uvicorn --workers so
that we can achieve the same scaling without being in k8s.
To achieve that we refactor main to split out the app construction
logic. This method needs to be non-async. We created a new `Stack` class
to house impls and have a `start()` method to be called in lifespan to
start background tasks instead of starting them in the old
`construct_stack`. This way we avoid having to manage an event loop
manually.
## Test Plan
CI
> uv run --with llama-stack python -m llama_stack.core.server.server
benchmarking/k8s-benchmark/stack_run_config.yaml
works.
> LLAMA_STACK_CONFIG=benchmarking/k8s-benchmark/stack_run_config.yaml uv
run uvicorn llama_stack.core.server.server:create_app --port 8321
--workers 4
works.
# What does this PR do?
currently `RemoteProviderSpec` has an `AdapterSpec` embedded in it.
Remove `AdapterSpec`, and put its leftover fields into
`RemoteProviderSpec`.
Additionally, many of the fields were duplicated between
`InlineProviderSpec` and `RemoteProviderSpec`. Move these to
`ProviderSpec` so they are shared.
Fixup the distro codegen to use `RemoteProviderSpec` directly rather
than `remote_provider_spec` which took an AdapterSpec and returned a
full provider spec
## Test Plan
existing distro tests should pass.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Fixes this warning in llama stack build:
```bash
WARNING 2025-09-15 15:29:02,197 llama_stack.core.distribution:149 core: Failed to import module prompts: No module named
'llama_stack.providers.registry.prompts'"
```
## Test Plan
Test added
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Modified the code in registry.py.
The key changes are:
1. Removed the `return False` statement
2. Added a warning log message that includes the object type,
identifier, and provider_id for better debugging.
3. The method now continues with the registration process instead of
early returning.
---------
Co-authored-by: Omar Abdelwahab <omara@fb.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR provides functionality for users to unregister ScoringFn and
Benchmark resources for `scoring` and `eval` APIs.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes#3051
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Updated integration and unit tests via CI workflow
# What does this PR do?
Adds a write worker queue for writes to inference store. This avoids
overwhelming request processing with slow inference writes.
## Test Plan
Benchmark:
```
cd /docs/source/distributions/k8s-benchmark
# start mock server
python openai-mock-server.py --port 8000
# start stack server
LLAMA_STACK_LOGGING="all=WARNING" uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml
# run benchmark script
uv run python3 benchmark.py --duration 120 --concurrent 50 --base-url=http://localhost:8321/v1/openai/v1 --model=vllm-inference/meta-llama/Llama-3.2-3B-Instruct
```
## RPS from 21 -> 57
# What does this PR do?
- Use BackgroundLogger when logging metric events.
- Reuse event loop in BackgroundLogger
## Test Plan
```
cd /docs/source/distributions/k8s-benchmark
# start mock server
python openai-mock-server.py --port 8000
# start stack server
LLAMA_STACK_LOGGING="all=WARNING" uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml
# run benchmark script
uv run python3 benchmark.py --duration 120 --concurrent 50 --base-url=http://localhost:8321/v1/openai/v1 --model=vllm-inference/meta-llama/Llama-3.2-3B-Instruct
```
### RPS from 57 -> 62
# What does this PR do?
Fix fireworks chat completion broken due to telemetry expecting
response.usage
Closes https://github.com/llamastack/llama-stack/issues/3391
## Test Plan
1. `uv run --with llama-stack llama stack build --distro starter
--image-type venv --run`
Try
```
curl -X POST http://0.0.0.0:8321/v1/openai/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "fireworks/accounts/fireworks/models/llama-v3p1-8b-instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'
```
```
{"id":"chatcmpl-ee922a08-0df0-4974-b0d3-b322113e8bc0","choices":[{"message":{"role":"assistant","content":"Hello! How can I assist you today?","name":null,"tool_calls":null},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","created":1757456375,"model":"fireworks/accounts/fireworks/models/llama-v3p1-8b-instruct"}%
```
Without fix fails as mentioned in
https://github.com/llamastack/llama-stack/issues/3391
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
This PR adds support for OpenAI Prompts API.
Note, OpenAI does not explicitly expose the Prompts API but instead
makes it available in the Responses API and in the [Prompts
Dashboard](https://platform.openai.com/docs/guides/prompting#create-a-prompt).
I have added the following APIs:
- CREATE
- GET
- LIST
- UPDATE
- Set Default Version
The Set Default Version API is made available only in the Prompts
Dashboard and configures which prompt version is returned in the GET
(the latest version is the default).
Overall, the expected functionality in Responses will look like this:
```python
from openai import OpenAI
client = OpenAI()
response = client.responses.create(
prompt={
"id": "pmpt_68b0c29740048196bd3a6e6ac3c4d0e20ed9a13f0d15bf5e",
"version": "2",
"variables": {
"city": "San Francisco",
"age": 30,
}
}
)
```
### Resolves https://github.com/llamastack/llama-stack/issues/3276
## Test Plan
Unit tests added. Integration tests can be added after client
generation.
## Next Steps
1. Update Responses API to support Prompt API
2. I'll enhance the UI to implement the Prompt Dashboard.
3. Add cache for lower latency
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Add Kubernetes authentication provider support
- Add KubernetesAuthProvider class for token validation using Kubernetes
SelfSubjectReview API
- Add KubernetesAuthProviderConfig with configurable API server URL, TLS
settings, and claims mapping
- Implement authentication via POST requests to
/apis/authentication.k8s.io/v1/selfsubjectreviews endpoint
- Add support for parsing Kubernetes SelfSubjectReview response format
to extract user information
- Add KUBERNETES provider type to AuthProviderType enum
- Update create_auth_provider factory function to handle 'kubernetes'
provider type
- Add comprehensive unit tests for KubernetesAuthProvider functionality
- Add documentation with configuration examples and usage instructions
The provider validates tokens by sending SelfSubjectReview requests to
the Kubernetes API server and extracts user information from the
userInfo structure in the response.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
What This Verifies:
Authentication header validation
Token validation with Kubernetes SelfSubjectReview and kubernetes server
API endpoint
Error handling for invalid tokens and HTTP errors
Request payload structure and headers
```
python -m pytest tests/unit/server/test_auth.py -k "kubernetes" -v
```
Signed-off-by: Akram Ben Aissi <akram.benaissi@gmail.com>
# What does this PR do?
This change migrates the VectorDB id generation to Vector Stores.
This is a breaking change for **_some users_** that may have application
code using the `vector_db_id` parameter in the request of the VectorDB
protocol instead of the `VectorDB.identifier` in the response.
By default we will now create a Vector Store every time we register a
VectorDB. The caveat with this approach is that this maps the
`vector_db_id` → `vector_store.name`. This is a reasonable tradeoff to
transition users towards OpenAI Vector Stores.
As an added benefit, registering VectorDBs will result in them appearing
in the VectorStores admin UI.
### Why?
This PR makes the `POST` API call to `/v1/vector-dbs` swap the
`vector_db_id` parameter in the **request body** into the VectorStore's
name field and sets the `vector_db_id` to the generated vector store id
(e.g., `vs_038247dd-4bbb-4dbb-a6be-d5ecfd46cfdb`).
That means that users would have to do something like follows in their
application code:
```python
res = client.vector_dbs.register(
vector_db_id='my-vector-db-id',
embedding_model='ollama/all-minilm:l6-v2',
embedding_dimension=384,
)
vector_db_id = res.identifier
```
And then the rest of their code would behave, including `VectorIO`'s
insert protocol using `vector_db_id` in the request.
An alternative implementation would be to just delete the `vector_db_id`
parameter in `VectorDB` but the end result would still require users
having to write `vector_db_id = res.identifier` since
`VectorStores.create()` generates the ID for you.
So this approach felt the easiest way to migrate users towards
VectorStores (subsequent PRs will be added to trigger `files.create()`
and `vector_stores.files.create()`).
## Test Plan
Unit tests and integration tests have been added.
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
What does this PR do?
Fixes error handling when MCP server connections fail. Instead of
returning generic 500 errors, now provides
descriptive error messages with proper HTTP status codes.
Closes#3107
Test Plan
Before fix:
curl -X GET
"http://localhost:8321/v1/tool-runtime/list-tools?tool_group_id=bad-mcp-server"
Returns: {"detail": "Internal server error: An unexpected error
occurred."} (500)
After fix:
curl -X GET
"http://localhost:8321/v1/tool-runtime/list-tools?tool_group_id=bad-mcp-server"
Returns: {"error": {"detail": "Failed to connect to MCP server at
http://localhost:9999/sse: Connection
refused"}} (502)
Tests:
- Added unit test for ConnectionError → 502 translation
- Manually tested with unreachable MCP servers (connection refused)
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR is eliminating hardcoded status codes: `409` CONFLICT and `404`
NOT_FOUND in `server.py` using `httpx` built-in constants. This
implementation will follow the existing structure to improve
readability, extensibility and developer experience. This is already was
implemented in #3131
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
`./scripts/unit-tests.sh`
# What does this PR do?
Sometimes the stream don't have chunks with finish_reason, e.g. canceled
stream, which throws a pydantic error as OpenAIChoice.finish_reason: str
## Test Plan
observe no more such error when benchmarking
Generated with CC:
Replace cryptic KeyError with clear, actionable error message that
shows:
- Which API the failing provider belongs to
- The provider ID and type that's failing
- Which dependency is missing
- Clear instructions on how to fix the issue
## Test plan
Use a run config with Agents API and no safety provider
Before: KeyError: <Api.safety: 'safety'>
After: Failed to resolve 'agents' provider 'meta-reference' of type
'inline::meta-reference': required dependency 'safety' is not available.
Please add a 'safety' provider to your configuration or check if the
provider is properly configured.
# What does this PR do?
BFCL scoring function is not supported, removing it.
Also minor fixes as the llama stack run is broken for open-benchmark for
test plan verification
1. Correct the model paths for supported models
2. Fix another issue as there is no `provider_id` for DatasetInput but
logger assumes it exists.
```
File "/Users/swapna942/llama-stack/llama_stack/core/stack.py", line 332, in construct_stack
await register_resources(run_config, impls)
File "/Users/swapna942/llama-stack/llama_stack/core/stack.py", line 108, in register_resources
logger.debug(f"registering {rsrc.capitalize()} {obj} for provider {obj.provider_id}")
^^^^^^^^^^^^^^^
File "/Users/swapna942/llama-stack/.venv/lib/python3.13/site-packages/pydantic/main.py", line 991, in __getattr__
raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')
AttributeError: 'DatasetInput' object has no attribute 'provider_id'
```
## Test Plan
```llama stack build --distro open-benchmark --image-type venv``` and run the server succeeds
Issue Link: https://github.com/llamastack/llama-stack/issues/3282
# What does this PR do?
During env var replacement, we're implicitly converting all config types
to their apparent types (e.g., "true" to True, "123" to 123). This may
be arguably useful for when doing an env var substitution, as those are
always strings, but we should definitely avoid touching config values
that have explicit types and are uninvolved in env var substitution.
## Test Plan
Unit
The `trl` dependency brings in `accelerate` which brings in nvidia
dependencies for torch. We cannot have that in the starter distro. As
such, no CPU-only post-training for the huggingface provider.