A few changes to the storage layer to ensure we reduce unnecessary
contention arising out of our design choices (and letting the database
layer do its correct thing):
- SQL stores now share a single `SqlAlchemySqlStoreImpl` per backend,
and `kvstore_impl` caches instances per `(backend, namespace)`. This
avoids spawning multiple SQLite connections for the same file, reducing
lock contention and aligning the cache story for all backends.
- Added an async upsert API (with SQLite/Postgres dialect inserts) and
routed it through `AuthorizedSqlStore`, then switched conversations and
responses to call it. Using native `ON CONFLICT DO UPDATE` eliminates
the insert-then-update retry window that previously caused long WAL lock
retries.
### Test Plan
Existing tests, added a unit test for `upsert()`
# What does this PR do?
- Updates `/vector_stores/{vector_store_id}/files/{file_id}/content` to
allow returning `embeddings` and `metadata` using the `extra_query`
- Updates the UI accordingly to display them.
- Update UI to support CRUD operations in the Vector Stores section and
adds a new modal exposing the functionality.
- Updates Vector Store update to fail if a user tries to update Provider
ID (which doesn't make sense to allow)
```python
In [1]: client.vector_stores.files.content(
vector_store_id=vector_store.id,
file_id=file.id,
extra_query={"include_embeddings": True, "include_metadata": True}
)
Out [1]: FileContentResponse(attributes={}, content=[Content(text='This is a test document to check if embeddings are generated properly.\n', type='text', embedding=[0.33760684728622437, ...,], chunk_metadata={'chunk_id': '62a63ae0-c202-f060-1b86-0a688995b8d3', 'document_id': 'file-27291dbc679642ac94ffac6d2810c339', 'source': None, 'created_timestamp': 1762053437, 'updated_timestamp': 1762053437, 'chunk_window': '0-13', 'chunk_tokenizer': 'DEFAULT_TIKTOKEN_TOKENIZER', 'chunk_embedding_model': 'sentence-transformers/nomic
-ai/nomic-embed-text-v1.5', 'chunk_embedding_dimension': 768, 'content_token_count': 13, 'metadata_token_count': 9}, metadata={'filename': 'test-embedding.txt', 'chunk_id': '62a63ae0-c202-f060-1b86-0a688995b8d3', 'document_id': 'file-27291dbc679642ac94ffac6d2810c339', 'token_count': 13, 'metadata_token_count': 9})], file_id='file-27291dbc679642ac94ffac6d2810c339', filename='test-embedding.txt')
```
Screenshots of UI are displayed below:
### List Vector Store with Added "Create New Vector Store"
<img width="1912" height="491" alt="Screenshot 2025-11-06 at 10 47
25 PM"
src="https://github.com/user-attachments/assets/a3a3ddd9-758d-4005-ac9c-5047f03916f3"
/>
### Create New Vector Store
<img width="1918" height="1048" alt="Screenshot 2025-11-06 at 10 47
49 PM"
src="https://github.com/user-attachments/assets/b4dc0d31-696f-4e68-b109-27915090f158"
/>
### Edit Vector Store
<img width="1916" height="1355" alt="Screenshot 2025-11-06 at 10 48
32 PM"
src="https://github.com/user-attachments/assets/ec879c63-4cf7-489f-bb1e-57ccc7931414"
/>
### Vector Store Files Contents page (with Embeddings)
<img width="1914" height="849" alt="Screenshot 2025-11-06 at 11 54
32 PM"
src="https://github.com/user-attachments/assets/3095520d-0e90-41f7-83bd-652f6c3fbf27"
/>
### Vector Store Files Contents Details page (with Embeddings)
<img width="1916" height="1221" alt="Screenshot 2025-11-06 at 11 55
00 PM"
src="https://github.com/user-attachments/assets/e71dbdc5-5b49-472b-a43a-5785f58d196c"
/>
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
Tests added for Middleware extension and Provider failures.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
the inspect API lacked any mechanism to get all
non-deprecated APIs (v1, v1alpha, v1beta)
change default to this behavior
'v1' filter can be used for user' wanting a list
of stable APIs
## Test Plan
1. pull the PR
2. launch a LLS server
3. run `curl http://beanlab3.bss.redhat.com:8321/v1/inspect/routes`
4. note there are APIs for `v1`, `v1alpha`, and `v1beta` but no
deprecated APIs
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
Delete ~2,000 lines of dead code from the old bespoke inference API that
was replaced by OpenAI-only API. This includes removing unused type
conversion functions, dead provider methods, and event_logger.py.
Clean up imports across the codebase to remove references to deleted
types. This eliminates unnecessary
code and dependencies, helping isolate the API package as a
self-contained module.
This is the last interdependency between the .api package and "exterior"
packages, meaning that now every other package in llama stack imports
the API, not the other way around.
## Test Plan
this is a structural change, no tests needed.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
This dependency has been bothering folks for a long time (cc @leseb). We
really needed it due to "library client" which is primarily used for our
tests and is not a part of the Stack server. Anyone who needs to use the
library client can certainly install `llama-stack-client` in their
environment to make that work.
Updated the notebook references to install `llama-stack-client`
additionally when setting things up.
# What does this PR do?
Remove circular dependency by moving tracing from API protocol
definitions
to router implementation layer.
This gets us closer to having a self contained API package with no other
cross-cutting dependencies to other parts of the llama stack codebase.
To the best of our ability, the llama_stack.api should only be type and
protocol definitions.
Changes:
- Create apis/common/tracing.py with marker decorator (zero core
dependencies)
- Add the _new_ `@telemetry_traceable` marker decorator to 11 protocol
classes
- Apply actual tracing in core/resolver.py in `instantiate_provider`
based on protocol marker
- Move MetricResponseMixin from core to apis (it's an API response type)
- APIs package is now self-contained with zero core dependencies
The tracing functionality remains identical - actual trace_protocol from
core
is applied to router implementations at runtime when both telemetry is
enabled
and the protocol has the `__marked_for_tracing__` marker.
## Test Plan
Manual integration test confirms identical behavior to main branch:
```bash
llama stack list-deps --format uv starter | sh
export OLLAMA_URL=http://localhost:11434
llama stack run starter
curl -X POST http://localhost:8321/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "ollama/gpt-oss:20b",
"messages": [{"role": "user", "content": "Say hello"}],
"max_tokens": 10}'
```
Verified identical between main and this branch:
- trace_id present in response
- metrics array with prompt_tokens, completion_tokens, total_tokens
- Server logs show trace_protocol applied to all routers
Existing telemetry integration tests (tests/integration/telemetry/) validate
trace context propagation and span attributes.
relates to #3895
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
- when create vector store is called without chunk strategy, we actually
the strategy used so that the value is persisted instead of
strategy='None'
## Test Plan
updated tests
## What does this PR do?
The starter distribution now comes with all the required packages to
support persistent stores—like the agent store, metadata, and
inference—using PostgreSQL. Users can enable PostgreSQL support by
setting the `ENABLE_POSTGRES_STORE=1` environment variable.
This PR consolidates the functionality from the removed `postgres-demo`
distribution into the starter distribution, reducing maintenance
overhead.
**Closes: #2619**
**Supersedes: #2851** (rebased and updated)
## Changes Made
1. **Added PostgreSQL support to starter distribution**
- New `run-with-postgres-store.yaml` configuration
- Automatic config switching via `ENABLE_POSTGRES_STORE` environment
variable
- Removed separate `postgres-demo` distribution
2. **Updated to new build system**
- Integrated postgres switching logic into Containerfile entrypoint
- Uses new `storage_backends` and `storage_stores` API
- Properly configured both PostgreSQL KV store and SQL store
3. **Updated dependencies**
- Added `psycopg2-binary` and `asyncpg` to starter distribution
- All postgres-related dependencies automatically included
## How to Use
### With Docker (PostgreSQL):
```bash
docker run \
-e ENABLE_POSTGRES_STORE=1 \
-e POSTGRES_HOST=your_postgres_host \
-e POSTGRES_PORT=5432 \
-e POSTGRES_DB=llamastack \
-e POSTGRES_USER=llamastack \
-e POSTGRES_PASSWORD=llamastack \
-e OPENAI_API_KEY=your_key \
llamastack/distribution-starter
```
### PostgreSQL environment variables:
- `POSTGRES_HOST`: Postgres host (default: `localhost`)
- `POSTGRES_PORT`: Postgres port (default: `5432`)
- `POSTGRES_DB`: Postgres database name (default: `llamastack`)
- `POSTGRES_USER`: Postgres username (default: `llamastack`)
- `POSTGRES_PASSWORD`: Postgres password (default: `llamastack`)
## Test Plan
All pre-commit hooks pass (mypy, ruff, distro-codegen)
`llama stack list-deps starter` confirms psycopg2-binary is included
Storage configuration correctly uses PostgreSQL backends
Container builds successfully with postgres support
## Credits
Original work by @leseb in #2851. Rebased and updated by @r-bit-rry to
work with latest main.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Sébastien Han @leseb
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
1. Make telemetry tests as easy as possible for users by expanding the
`SpanStub` data class and creating the `MetricStub` dataclass as a way
to consistently marshal telemetry data in test fixtures and unmarshal
and handle it in tests.
2. Structure server and client tests to always follow the same standards
for consistent testing experience by using the `SpanStub` and
`MetricStub` data class objects.
3. Enable Metrics Testing for completions endpoint
4. Correct token metrics to use histograms instead of counts to capture
tokens per request rather than a cumulative count of tokens over the
lifecycle of the server.
## Test Plan
These are tests
RAG aka file search is implemented via the Responses API by specifying
the file-search tool. The backend implementation remains unchanged. This
PR merely removes the directly exposed API surface which allowed users
to directly perform searches from the client.
This facility is now available via the `client.vector_store.search()`
OpenAI compatible API.
The llama-stack-client now uses /`v1/openai/v1/models` which returns
OpenAI-compatible model objects with 'id' and 'custom_metadata' fields
instead of the Resource-style 'identifier' field. Updated api_recorder
to handle the new endpoint and modified tests to access model metadata
appropriately. Deleted stale model recordings for re-recording.
**NOTE: CI will be red on this one since it is dependent on
https://github.com/llamastack/llama-stack-client-python/pull/291/files
landing. I verified locally that it is green.**
# What does this PR do?
This API hasn't received any traction and close to zero interest from
the community. Let's revisit in the future if things change.
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
We need to remove `/v1/openai/v1` paths shortly. There is one trouble --
our current `/v1/openai/v1/models` endpoint provides different data than
`/v1/models`. Unfortunately our tests target the latter (llama-stack
customized) behavior. We need to get to true OpenAI compatibility.
This is step 1: adding `custom_metadata` field to `OpenAIModel` that
includes all the extra stuff we add in the native `/v1/models` response.
This can be extracted on the consumer end by look at
`__pydantic_extra__` or other similar fields.
This PR:
- Adds `custom_metadata` field to `OpenAIModel` class in
`src/llama_stack/apis/models/models.py`
- Modified `openai_list_models()` in
`src/llama_stack/core/routing_tables/models.py` to populate
custom_metadata
Next Steps
1. Update stainless client to use `/v1/openai/v1/models` instead of
`/v1/models`
2. Migrate tests to read from `custom_metadata`
3. Remove `/v1/openai/v1/` prefix entirely and consolidate to single
`/v1/models` endpoint
# What does this PR do?
Allow filtering for v1alpha, v1beta, deprecated and v1. Backward
incompatible change since by default it only returns v1 apis now.
## Test Plan
added unit test
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR fixes the handling of the external_providers_dir configuration
field to align with its ongoing deprecation, in favor of the provider
`module` specification approach.
It addresses the issue in #3950, where using the default provided
run.yaml config resulted in the `external_providers_dir` parameter being
set to the literal string `None`, and crashing the llama-stack server
when starting.
<!-- If resolving an issue, uncomment and update the line below -->
Closes#3950
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
- Built a new container image from `podman build . -f
containers/Containerfile --build-arg DISTRO_NAME=starter --tag
llama-stack:starter`
- Tested it locally with `podman run -it localhost/llama-stack:starter`
- Tested it on an OpenShift 4.19 cluster, deployed via the
llama-stack-k8s-operator.
Signed-off-by: Doug Edgar <dedgar@redhat.com>
## Summary
When users provide API keys via `X-LlamaStack-Provider-Data` header,
`models.list()` now returns models they can access from those providers,
not just pre-registered models from the registry.
This complements the routing fix from f88416ef8 which enabled inference
calls with `provider_id/model_id` format for unregistered models. Users
can now discover which models are available to them before making
inference requests.
The implementation reuses
`NeedsRequestProviderData.get_request_provider_data()` to validate
credentials, then dynamically fetches models from providers without
caching them since they're user-specific. Registry models take
precedence to respect any pre-configured aliases.
## Test Script
```python
#!/usr/bin/env python3
import json
import os
from openai import OpenAI
# Test 1: Without provider_data header
client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="dummy")
models = client.models.list()
anthropic_without = [m.id for m in models.data if m.id and "anthropic" in m.id]
print(f"Without header: {len(models.data)} models, {len(anthropic_without)} anthropic")
# Test 2: With provider_data header containing Anthropic API key
anthropic_api_key = os.environ["ANTHROPIC_API_KEY"]
client_with_key = OpenAI(
base_url="http://localhost:8321/v1/openai/v1",
api_key="dummy",
default_headers={
"X-LlamaStack-Provider-Data": json.dumps({"anthropic_api_key": anthropic_api_key})
}
)
models_with_key = client_with_key.models.list()
anthropic_with = [m.id for m in models_with_key.data if m.id and "anthropic" in m.id]
print(f"With header: {len(models_with_key.data)} models, {len(anthropic_with)} anthropic")
print(f"Anthropic models: {anthropic_with}")
assert len(anthropic_with) > len(anthropic_without), "Should have more anthropic models with API key"
print("\n✓ Test passed!")
```
Run with a stack that has Anthropic provider configured (but without API
key in config):
```bash
ANTHROPIC_API_KEY=sk-ant-... python test_provider_data_models.py
```
Remove unused methods that became obsolete after d266c59c: o
_compute_and_log_token_usage
o _count_tokens
o stream_tokens_and_compute_metrics
o count_tokens_and_compute_metrics
These methods are no longer referenced anywhere in the codebase
following the removal of deprecated inference.chat_completion
implementations.
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
This PR enables routing of fully qualified model IDs of the form
`provider_id/model_id` even when the models are not registered with the
Stack.
Here's the situation: assume a remote inference provider which works
only when users provide their own API keys via
`X-LlamaStack-Provider-Data` header. By definition, we cannot list
models and hence update our routing registry. But because we _require_ a
provider ID in the models now, we can identify which provider to route
to and let that provider decide.
Note that we still try to look up our registry since it may have a
pre-registered alias. Just that we don't outright fail when we are not
able to look it up.
Also, updated inference router so that the responses have the _exact_
model that the request had.
## Test Plan
Added an integration test
Closes#3929
---------
Co-authored-by: ehhuang <ehhuang@users.noreply.github.com>
Fixes mypy type errors in provider utilities and testing infrastructure:
- `mcp.py`: Cast incompatible client types, wrap image data properly
- `batches.py`: Rename walrus variable to avoid shadowing
- `api_recorder.py`: Use cast for Pydantic field annotation
No functional changes.
---------
Co-authored-by: Claude <noreply@anthropic.com>
Fixes mypy type errors in OpenTelemetry integration:
- Add type aliases for AttributeValue and Attributes
- Add helper to filter None values from attributes (OpenTelemetry
doesn't accept None)
- Cast metric and tracer objects to proper types
- Update imports after refactoring
No functional changes.
## Summary
- `preserve_contexts_async_generator` left `PROVIDER_DATA_VAR` (and
other context vars) populated after a streaming generator completed on
HEAD~1, so the asyncio context for request N+1 started with request N's
provider payload.
- FastAPI dependencies and middleware execute before
`request_provider_data_context` rebinds the header data, meaning
auth/logging hooks could observe a prior tenant's credentials or treat
them as authenticated. Traces and any background work that inspects the
context outside the `with` block leak as well—this is a real security
regression, not just a CLI artifact.
- The wrapper now restores each tracked `ContextVar` to the value it
held before the iteration (falling back to clearing when necessary)
after every yield and when the generator terminates, so provider data is
wiped while callers that set their own defaults keep them.
## Test Plan
- `uv run pytest tests/unit/core/test_provider_data_context.py -q`
- `uv run pytest tests/unit/distribution/test_context.py -q`
Both suites fail on HEAD~1 and pass with this change.
Migrates package structure to src/ layout following Python packaging
best practices.
All code moved from `llama_stack/` to `src/llama_stack/`. Public API
unchanged - imports remain `import llama_stack.*`.
Updated build configs, pre-commit hooks, scripts, and GitHub workflows
accordingly. All hooks pass, package builds cleanly.
**Developer note**: Reinstall after pulling: `pip install -e .`