Removing the debug logging that was added to diagnose signature mismatch errors.
The logging served its purpose - it helped us identify that the error was coming
from api_recorder.py patched methods, not the actual provider implementations.
With the root cause now fixed in api_recorder.py, this debug logging is no longer
needed and can be safely removed to keep the code clean.
The ACTUAL root cause of the signature mismatch errors was found!
The api_recorder.py module patches tool runtime invoke_tool methods for test
recording/replay, but the patched methods were missing the new 'authorization'
parameter. The debug logging revealed:
Object method: patched_tavily_invoke_tool (from api_recorder module)
Object method's module: llama_stack.testing.api_recorder
Changes made:
1. Updated _patched_tool_invoke_method() to accept authorization parameter
2. Updated patched_tavily_invoke_tool() signature to include authorization
3. Added debug logging to resolver to help identify similar issues in the future
This fix ensures that when tests run in record/replay mode, the patched methods
preserve the full signature including the authorization parameter, allowing the
protocol compliance checks to pass.
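A minimal sketch of the fix pattern for the patched methods; the wrapper and recorder names are illustrative, and only `invoke_tool` and the `authorization` parameter come from the change described above:

```python
# Sketch only: shows how a record/replay wrapper can preserve the full protocol
# signature. The recorder object and its record() context manager are hypothetical.
import functools


def patch_invoke_tool(runtime_cls, recorder):
    original = runtime_cls.invoke_tool

    @functools.wraps(original)  # copies metadata; the explicit parameter list below is what matters
    async def patched_invoke_tool(self, tool_name, kwargs, authorization=None):
        # The wrapper must accept every parameter the protocol declares,
        # otherwise the signature-compliance check fails.
        with recorder.record(tool_name, kwargs):
            return await original(self, tool_name, kwargs, authorization=authorization)

    runtime_cls.invoke_tool = patched_invoke_tool
```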
Adding comprehensive debug logging to understand what's causing the persistent
signature mismatch errors in CI. The logging will show:
- Provider class name and module
- Both protocol and object signatures
- The actual method object
- The method's source module
This will help us identify if the issue is:
1. A cached module being loaded
2. A parent class overriding the method
3. Some other source of the wrong signature
Once we see the debug output, we can pinpoint the exact root cause.
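A generic sketch of the kind of debug logging described (not the exact code added), comparing the protocol signature with the concrete object's method:

```python
import inspect
import logging

logger = logging.getLogger(__name__)


def log_signature_mismatch(protocol_cls, impl_obj, method_name: str) -> None:
    """Log enough context to tell where a mismatching method really comes from."""
    proto_method = getattr(protocol_cls, method_name)
    obj_method = getattr(impl_obj, method_name)

    logger.debug("Provider class: %s (module %s)", type(impl_obj).__name__, type(impl_obj).__module__)
    logger.debug("Protocol signature: %s", inspect.signature(proto_method))
    logger.debug("Object signature:   %s", inspect.signature(obj_method))
    logger.debug("Object method: %r", obj_method)
    logger.debug("Object method's module: %s", getattr(obj_method, "__module__", "<unknown>"))
```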
The auto-routing layer was missing the authorization parameter:
- ToolRuntimeRouter.invoke_tool() now accepts and passes authorization
- ToolRuntimeRouter.list_runtime_tools() now accepts and passes authorization
- ToolGroupsRoutingTable.list_tools() now accepts and forwards authorization
- ToolGroupsRoutingTable._index_tools() now accepts and uses authorization
This fixes the '__autorouted__' provider signature mismatch error in CI.
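A minimal sketch of the routing change; the routing-table lookup helper and any argument names other than `authorization` are illustrative:

```python
class ToolRuntimeRouter:
    """Routes tool runtime calls to the provider that owns the tool (sketch)."""

    def __init__(self, routing_table):
        self.routing_table = routing_table

    async def invoke_tool(self, tool_name: str, kwargs: dict, authorization: str | None = None):
        provider = await self.routing_table.get_provider_impl(tool_name)
        # Forward authorization instead of silently dropping it, so the routed
        # call matches the protocol signature.
        return await provider.invoke_tool(tool_name=tool_name, kwargs=kwargs, authorization=authorization)

    async def list_runtime_tools(self, tool_group_id: str | None = None, authorization: str | None = None):
        provider = await self.routing_table.get_provider_impl(tool_group_id)
        return await provider.list_runtime_tools(tool_group_id=tool_group_id, authorization=authorization)
```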
Updated all ToolRuntime provider implementations to match the protocol signature:
- BraveSearchToolRuntimeImpl
- TavilySearchToolRuntimeImpl
- BingSearchToolRuntimeImpl
- WolframAlphaToolRuntimeImpl
- MemoryToolRuntimeImpl
This fixes the signature mismatch error in CI where protocol had 'authorization' parameter but implementations didn't.
- Add authorization parameter to Tool Runtime API signatures (list_runtime_tools, invoke_tool)
- Update MCP provider implementation to use authorization from request body instead of provider-data
- Deprecate mcp_authorization and mcp_headers from provider-data (MCPProviderDataValidator now empty)
- Update all Tool Runtime tests to pass authorization as request body parameter
- Responses API already uses request body authorization (no changes needed)
This provides a single, consistent way to pass MCP authentication tokens across both APIs, addressing reviewer feedback about avoiding multiple configuration paths.
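An illustrative client call showing the request-body parameter, assuming the client surface mirrors the API (the exact method shape and tool name are placeholders):

```python
# Illustrative only: pass the MCP token as the request-body `authorization`
# parameter instead of via provider-data headers.
result = client.tool_runtime.invoke_tool(
    tool_name="list_issues",            # hypothetical MCP tool name
    kwargs={"repo": "llamastack/llama-stack"},
    authorization="mcp_token_xyz789",   # request-body parameter added by this change
)
```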
A few changes to the storage layer to reduce unnecessary contention
arising from our design choices (and to let the database layer do the
right thing):
- SQL stores now share a single `SqlAlchemySqlStoreImpl` per backend,
and `kvstore_impl` caches instances per `(backend, namespace)`. This
avoids spawning multiple SQLite connections for the same file, reducing
lock contention and aligning the cache story for all backends.
- Added an async upsert API (with SQLite/Postgres dialect inserts) and
routed it through `AuthorizedSqlStore`, then switched conversations and
responses to call it. Using native `ON CONFLICT DO UPDATE` eliminates
the insert-then-update retry window that previously caused long WAL lock
retries.
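A minimal sketch of the dialect-aware upsert described in the second bullet, using SQLAlchemy's native `ON CONFLICT DO UPDATE`; the table and column handling here is illustrative, not the actual store implementation:

```python
from sqlalchemy import Table
from sqlalchemy.dialects import postgresql, sqlite
from sqlalchemy.ext.asyncio import AsyncEngine


async def upsert(engine: AsyncEngine, table: Table, row: dict, key_columns: list[str]) -> None:
    """Insert the row, or update it in place if the key already exists."""
    dialect = engine.dialect.name
    insert_fn = postgresql.insert if dialect == "postgresql" else sqlite.insert
    stmt = insert_fn(table).values(**row)
    # Update every non-key column with the incoming value on conflict.
    update_cols = {c: stmt.excluded[c] for c in row if c not in key_columns}
    stmt = stmt.on_conflict_do_update(index_elements=key_columns, set_=update_cols)

    async with engine.begin() as conn:
        await conn.execute(stmt)
```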
### Test Plan
Existing tests, added a unit test for `upsert()`
Fixes issues in the storage system by guaranteeing immediate durability
for responses and ensuring background writers stay alive. Three related
fixes:
* Responses to the OpenAI-compatible API now write directly to
Postgres/SQLite inside the request instead of detouring through an async
queue that might never drain; this restores the expected
read-after-write behavior and removes the "response not found" races
reported by users.
* The access-control shim was stamping owner_principal/access_attributes
as SQL NULL, which Postgres interprets as non-public rows; fixing it to
use the empty-string/JSON-null pattern means conversations and responses
stored without an authenticated user stay queryable (matching SQLite).
* The inference-store queue remains for batching, but its worker tasks
now start lazily on the live event loop so server startup doesn't cancel
them—writes keep flowing even when the stack is launched via llama stack
run.
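A sketch of the lazy worker-start pattern from the last bullet, assuming a simple asyncio queue-based writer (class and method names are illustrative):

```python
import asyncio


class InferenceWriteQueue:
    """Batches writes; workers are created on the loop that is actually serving requests."""

    def __init__(self, num_workers: int = 2):
        self.queue: asyncio.Queue = asyncio.Queue()
        self.num_workers = num_workers
        self._workers: list[asyncio.Task] = []

    async def enqueue(self, item) -> None:
        self._ensure_workers()  # start lazily, on the live event loop
        await self.queue.put(item)

    def _ensure_workers(self) -> None:
        # Creating tasks at import/startup time can bind them to a loop that is
        # torn down before serving begins; creating them on first use avoids that.
        if not self._workers:
            self._workers = [asyncio.create_task(self._worker()) for _ in range(self.num_workers)]

    async def _worker(self) -> None:
        while True:
            item = await self.queue.get()
            await self._write(item)
            self.queue.task_done()

    async def _write(self, item) -> None:
        ...  # actual DB write
```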
Closes #4115
### Test Plan
Added a matrix entry to test our "base" suite against Postgres as the
store.
# What does this PR do?
- Updates `/vector_stores/{vector_store_id}/files/{file_id}/content` to
allow returning `embeddings` and `metadata` via the `extra_query` parameter.
- Updates the UI accordingly to display them.
- Updates the UI to support CRUD operations in the Vector Stores section and
adds a new modal exposing the functionality.
- Updates the Vector Store update operation to fail if a user tries to change
the Provider ID (which doesn't make sense to allow).
```python
In [1]: client.vector_stores.files.content(
vector_store_id=vector_store.id,
file_id=file.id,
extra_query={"include_embeddings": True, "include_metadata": True}
)
Out[1]: FileContentResponse(attributes={}, content=[Content(text='This is a test document to check if embeddings are generated properly.\n', type='text',
  embedding=[0.33760684728622437, ...,],
  chunk_metadata={'chunk_id': '62a63ae0-c202-f060-1b86-0a688995b8d3', 'document_id': 'file-27291dbc679642ac94ffac6d2810c339', 'source': None,
    'created_timestamp': 1762053437, 'updated_timestamp': 1762053437, 'chunk_window': '0-13', 'chunk_tokenizer': 'DEFAULT_TIKTOKEN_TOKENIZER',
    'chunk_embedding_model': 'sentence-transformers/nomic-ai/nomic-embed-text-v1.5', 'chunk_embedding_dimension': 768, 'content_token_count': 13,
    'metadata_token_count': 9},
  metadata={'filename': 'test-embedding.txt', 'chunk_id': '62a63ae0-c202-f060-1b86-0a688995b8d3', 'document_id': 'file-27291dbc679642ac94ffac6d2810c339',
    'token_count': 13, 'metadata_token_count': 9})],
  file_id='file-27291dbc679642ac94ffac6d2810c339', filename='test-embedding.txt')
```
Screenshots of UI are displayed below:
### List Vector Store with Added "Create New Vector Store"
<img width="1912" height="491" alt="Screenshot 2025-11-06 at 10 47
25 PM"
src="https://github.com/user-attachments/assets/a3a3ddd9-758d-4005-ac9c-5047f03916f3"
/>
### Create New Vector Store
<img width="1918" height="1048" alt="Screenshot 2025-11-06 at 10 47
49 PM"
src="https://github.com/user-attachments/assets/b4dc0d31-696f-4e68-b109-27915090f158"
/>
### Edit Vector Store
<img width="1916" height="1355" alt="Screenshot 2025-11-06 at 10 48
32 PM"
src="https://github.com/user-attachments/assets/ec879c63-4cf7-489f-bb1e-57ccc7931414"
/>
### Vector Store Files Contents page (with Embeddings)
<img width="1914" height="849" alt="Screenshot 2025-11-06 at 11 54
32 PM"
src="https://github.com/user-attachments/assets/3095520d-0e90-41f7-83bd-652f6c3fbf27"
/>
### Vector Store Files Contents Details page (with Embeddings)
<img width="1916" height="1221" alt="Screenshot 2025-11-06 at 11 55
00 PM"
src="https://github.com/user-attachments/assets/e71dbdc5-5b49-472b-a43a-5785f58d196c"
/>
## Test Plan
Tests added for Middleware extension and Provider failures.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
The inspect API lacked any mechanism to get all
non-deprecated APIs (v1, v1alpha, v1beta).
This change makes that the default behavior.
The 'v1' filter can be used by users wanting a list
of only stable APIs.
## Test Plan
1. pull the PR
2. launch a LLS server
3. run `curl http://beanlab3.bss.redhat.com:8321/v1/inspect/routes`
4. note there are APIs for `v1`, `v1alpha`, and `v1beta` but no
deprecated APIs
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
Delete ~2,000 lines of dead code from the old bespoke inference API that
was replaced by the OpenAI-only API. This includes removing unused type
conversion functions, dead provider methods, and event_logger.py.
Clean up imports across the codebase to remove references to deleted
types. This eliminates unnecessary
code and dependencies, helping isolate the API package as a
self-contained module.
This is the last interdependency between the .api package and "exterior"
packages, meaning that now every other package in llama stack imports
the API, not the other way around.
## Test Plan
this is a structural change, no tests needed.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# Problem
The Responses API uses the max_tool_calls parameter to limit the number of
tool calls that can be generated in a response. Currently, the LLS
implementation of the Responses API does not support this parameter.
# What does this PR do?
This pull request adds the max_tool_calls field to the response object
definition and updates the inline provider. It also ensures that:
- the total number of calls to built-in and MCP tools does not exceed
max_tool_calls
- an error is thrown if max_tool_calls < 1 (behavior seen with the
OpenAI Responses API, but we can change this if needed)
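An illustrative request exercising the new field through the OpenAI-compatible client (model name and base URL are placeholders):

```python
from openai import OpenAI

# Points at a running llama stack server's OpenAI-compatible endpoint (placeholder URL).
client = OpenAI(base_url="http://localhost:8321/v1", api_key="not-needed")

response = client.responses.create(
    model="ollama/llama3.2:3b",   # placeholder model
    input="What's the weather in Tokyo and in Paris?",
    tools=[{"type": "web_search"}],
    max_tool_calls=1,             # caps built-in/MCP tool calls; values < 1 are rejected
)
print(response.output)
```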
Closes #[3563](https://github.com/llamastack/llama-stack/issues/3563)
## Test Plan
- Tested manually for changes in the model response w.r.t. the supplied
max_tool_calls field.
- Added integration tests to test invalid max_tool_calls parameter.
- Added integration tests to check max_tool_calls parameter with
built-in and function tools.
- Added integration tests to check max_tool_calls parameter in the
returned response object.
- Recorded OpenAI Responses API behavior using a sample script:
https://github.com/s-akhtar-baig/llama-stack-examples/blob/main/responses/src/max_tool_calls.py
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Adds OCI GenAI PaaS models for the OpenAI chat completion endpoints.
## Test Plan
In an OCI tenancy with access to GenAI PaaS, perform the following
steps:
1. Ensure you have IAM policies in place to use the service (check the docs
included in this PR)
2. For local development, [setup OCI
cli](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm)
and configure the CLI with your region, tenancy, and auth
[here](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliconfigure.htm)
3. Once configured, go through the llama-stack setup and run llama-stack
(uses config-based auth) like:
```bash
OCI_AUTH_TYPE=config_file \
OCI_CLI_PROFILE=CHICAGO \
OCI_REGION=us-chicago-1 \
OCI_COMPARTMENT_OCID=ocid1.compartment.oc1..aaaaaaaa5...5a \
llama stack run oci
```
4. Hit the `models` endpoint to list models after the server is running:
```bash
curl http://localhost:8321/v1/models | jq
...
{
"identifier": "meta.llama-4-scout-17b-16e-instruct",
"provider_resource_id": "ocid1.generativeaimodel.oc1.us-chicago-1.am...q",
"provider_id": "oci",
"type": "model",
"metadata": {
"display_name": "meta.llama-4-scout-17b-16e-instruct",
"capabilities": [
"CHAT"
],
"oci_model_id": "ocid1.generativeaimodel.oc1.us-chicago-1.a...q"
},
"model_type": "llm"
},
...
```
5. Use the "display_name" field as the model name in a
`/chat/completions` request:
```bash
# Streaming result
curl -X POST http://localhost:8321/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta.llama-4-scout-17b-16e-instruct",
"stream": true,
"temperature": 0.9,
"messages": [
{
"role": "system",
"content": "You are a funny comedian. You can be crass."
},
{
"role": "user",
"content": "Tell me a funny joke about programming."
}
]
}'
# Non-streaming result
curl -X POST http://localhost:8321/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta.llama-4-scout-17b-16e-instruct",
"stream": false,
"temperature": 0.9,
"messages": [
{
"role": "system",
"content": "You are a funny comedian. You can be crass."
},
{
"role": "user",
"content": "Tell me a funny joke about programming."
}
]
}'
```
6. Try out other models from the `/models` endpoint.
Mark all register_* / unregister_* APIs as deprecated across models,
shields, tool groups, datasets, benchmarks, and scoring functions. This
is the first step toward moving resource mutations to an `/admin`
namespace as outlined in
https://github.com/llamastack/llama-stack/issues/3809#issuecomment-3492931585.
The deprecation flag will be reflected in the OpenAPI schema to warn API
users that these endpoints are being phased out. Next step will be
implementing the `/admin` route namespace for these resource management
operations.
- `register_model` / `unregister_model`
- `register_shield` / `unregister_shield`
- `register_tool_group` / `unregister_toolgroup`
- `register_dataset` / `unregister_dataset`
- `register_benchmark` / `unregister_benchmark`
- `register_scoring_function` / `unregister_scoring_function`
Added Field(exclude=True) to mcp_authorization field to ensure tokens
are NEVER exposed in:
- API responses (model_dump())
- JSON serialization (model_dump_json())
- Logs
- Any Pydantic serialization
This prevents accidental token leakage through:
- Error messages
- Debug logs
- API response payloads
- Monitoring/telemetry systems
The field is still accessible within the application code but will be
automatically excluded from all Pydantic serialization operations.
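A small self-contained demo of the Pydantic behavior relied on here; the model below is illustrative, not the actual llama-stack class:

```python
from pydantic import BaseModel, Field


class MCPProviderData(BaseModel):
    server_url: str
    # Excluded from every serialization path: model_dump(), model_dump_json(),
    # and anything built on them (API responses, logs, telemetry payloads).
    mcp_authorization: dict[str, str] | None = Field(default=None, exclude=True)


data = MCPProviderData(
    server_url="http://mcp-server.com",
    mcp_authorization={"http://mcp-server.com": "mcp_token_xyz789"},
)
print(data.mcp_authorization)  # still usable inside application code
print(data.model_dump())       # {'server_url': 'http://mcp-server.com'} -- token never serialized
```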
# What does this PR do?
This PR fixes a bug in LlamaStack 0.3.0 where vector stores created via
the OpenAI-compatible API (`POST /v1/vector_stores`) would fail with
`VectorStoreNotFoundError` after server restart when attempting
operations like `vector_io.insert()` or `vector_io.query()`.
The bug affected **6 vector IO providers**: `pgvector`, `sqlite_vec`,
`chroma`, `milvus`, `qdrant`, and `weaviate`.
Created with the assistance of: claude-4.5-sonnet
## Root Cause
All affected providers had a broken
`_get_and_cache_vector_store_index()` method that:
1. Did not load existing vector stores from persistent storage during
initialization
2. Attempted to use `vector_store_table` (which was either `None` or a
`KVStore` without the required `get_vector_store()` method)
3. Could not reload vector stores after server restart or cache miss
## Solution
This PR implements a consistent pattern across all 6 providers:
1. **Load vector stores during initialization** - Pre-populate the cache
from KV store on startup
2. **Fix lazy loading** - Modified `_get_and_cache_vector_store_index()`
to load directly from KV store instead of relying on
`vector_store_table`
3. **Remove broken dependency** - Eliminated reliance on the
`vector_store_table` pattern
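A hedged sketch of the pattern applied across the providers; the key prefix, range-scan call, and helper names are assumptions based on the description above, not the exact provider code:

```python
# Illustrative only: key prefix, kvstore range scan, and helpers are assumptions.
OPENAI_VECTOR_STORES_PREFIX = "openai_vector_stores:"


class VectorIOAdapter:
    async def initialize(self) -> None:
        # 1. Pre-populate the cache from the KV store on startup.
        stored = await self.kvstore.values_in_range(
            OPENAI_VECTOR_STORES_PREFIX, OPENAI_VECTOR_STORES_PREFIX + "\xff"
        )
        for raw in stored:
            index = self._index_from_json(raw)
            self.cache[index.vector_store_id] = index

    async def _get_and_cache_vector_store_index(self, vector_store_id: str):
        # 2. On a cache miss (e.g. after restart), reload directly from the KV
        #    store instead of relying on the broken vector_store_table path.
        if vector_store_id in self.cache:
            return self.cache[vector_store_id]
        raw = await self.kvstore.get(OPENAI_VECTOR_STORES_PREFIX + vector_store_id)
        if raw is None:
            raise VectorStoreNotFoundError(vector_store_id)
        index = self._index_from_json(raw)
        self.cache[vector_store_id] = index
        return index
```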
## Testing steps
### 1.1 Configure the stack
Create or use an existing configuration with a vector IO provider.
**Example `run.yaml`:**
```yaml
vector_io_store:
- provider_id: pgvector
provider_type: remote::pgvector
config:
host: localhost
port: 5432
db: llamastack
user: llamastack
password: llamastack
inference:
- provider_id: sentence-transformers
provider_type: inline::sentence-transformers
config:
model: sentence-transformers/all-MiniLM-L6-v2
```
### 1.2 Start the server
```bash
llama stack run run.yaml --port 5000
```
Wait for the server to fully start. You should see:
```
INFO: Started server process
INFO: Application startup complete
```
---
## Step 2: Create a Vector Store
### 2.1 Create via API
```bash
curl -X POST http://localhost:5000/v1/vector_stores \
-H "Content-Type: application/json" \
-d '{
"name": "test-persistence-store",
"extra_body": {
"embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
"embedding_dimension": 384,
"provider_id": "pgvector"
}
}' | jq
```
### 2.2 Expected Response
```json
{
"id": "vs_a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d",
"object": "vector_store",
"name": "test-persistence-store",
"status": "completed",
"created_at": 1730304000,
"file_counts": {
"total": 0,
"completed": 0,
"in_progress": 0,
"failed": 0,
"cancelled": 0
},
"usage_bytes": 0
}
```
**Save the `id` field** (e.g.,
`vs_a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d`) — you’ll need it for the next
steps.
---
## Step 3: Insert Data (Before Restart)
### 3.1 Insert chunks into the vector store
```bash
export VS_ID="vs_a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d"
curl -X POST http://localhost:5000/vector-io/insert \
-H "Content-Type: application/json" \
-d "{
\"vector_store_id\": \"$VS_ID\",
\"chunks\": [
{
\"content\": \"Python is a high-level programming language known for its readability.\",
\"metadata\": {\"source\": \"doc1\", \"page\": 1}
},
{
\"content\": \"Machine learning enables computers to learn from data without explicit programming.\",
\"metadata\": {\"source\": \"doc2\", \"page\": 1}
},
{
\"content\": \"Neural networks are inspired by biological neurons in the brain.\",
\"metadata\": {\"source\": \"doc3\", \"page\": 1}
}
]
}"
```
### 3.2 Expected Response
Status: **200 OK**
Response: *Empty or success confirmation*
---
## Step 4: Query Data (Before Restart – Baseline)
### 4.1 Query the vector store
```bash
curl -X POST http://localhost:5000/vector-io/query \
-H "Content-Type: application/json" \
-d "{
\"vector_store_id\": \"$VS_ID\",
\"query\": \"What is machine learning?\"
}" | jq
```
### 4.2 Expected Response
```json
{
"chunks": [
{
"content": "Machine learning enables computers to learn from data without explicit programming.",
"metadata": {"source": "doc2", "page": 1}
},
{
"content": "Neural networks are inspired by biological neurons in the brain.",
"metadata": {"source": "doc3", "page": 1}
}
],
"scores": [0.85, 0.72]
}
```
**Checkpoint:** Works correctly before restart.
---
## Step 5: Restart the Server (Critical Test)
### 5.1 Stop the server
In the terminal where it’s running:
```
Ctrl + C
```
Wait for:
```
Shutting down...
```
### 5.2 Restart the server
```bash
llama stack run run.yaml --port 5000
```
Wait for:
```
INFO: Started server process
INFO: Application startup complete
```
The vector store cache is now empty, but data should persist.
---
## Step 6: Verify Vector Store Exists (After Restart)
### 6.1 List vector stores
```bash
curl http://localhost:5000/v1/vector_stores | jq
```
### 6.2 Expected Response
```json
{
"object": "list",
"data": [
{
"id": "vs_a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d",
"name": "test-persistence-store",
"status": "completed"
}
]
}
```
**Checkpoint:** Vector store should be listed.
---
## Step 7: Insert Data (After Restart – THE BUG TEST)
### 7.1 Insert new chunks
```bash
curl -X POST http://localhost:5000/vector-io/insert \
-H "Content-Type: application/json" \
-d "{
\"vector_store_id\": \"$VS_ID\",
\"chunks\": [
{
\"content\": \"This chunk was inserted AFTER the server restart.\",
\"metadata\": {\"source\": \"post-restart\", \"test\": true}
}
]
}"
```
### 7.2 Expected Results
**With Fix (Correct):**
```
Status: 200 OK
Response: Success
```
**Without Fix (Bug):**
```json
{
"detail": "VectorStoreNotFoundError: Vector Store 'vs_a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d' not found."
}
```
**Critical Test:** If insertion succeeds, the fix works.
---
## Step 8: Query Data (After Restart – Verification)
### 8.1 Query all data
```bash
curl -X POST http://localhost:5000/vector-io/query \
-H "Content-Type: application/json" \
-d "{
\"vector_store_id\": \"$VS_ID\",
\"query\": \"restart\"
}" | jq
```
### 8.2 Expected Response
```json
{
"chunks": [
{
"content": "This chunk was inserted AFTER the server restart.",
"metadata": {"source": "post-restart", "test": true}
}
],
"scores": [0.95]
}
```
**Checkpoint:** Both old and new data are queryable.
---
## Step 9: Multiple Restart Test (Extra Verification)
### 9.1 Restart again
```bash
Ctrl + C
llama stack run run.yaml --port 5000
```
### 9.2 Query after restart
```bash
curl -X POST http://localhost:5000/vector-io/query \
-H "Content-Type: application/json" \
-d "{
\"vector_store_id\": \"$VS_ID\",
\"query\": \"programming\"
}" | jq
```
**Expected:** Works correctly across multiple restarts.
---------
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
Based on user feedback, improved comments to distinguish between
the two security layers:
1. PRIMARY: Line 89 - Architectural prevention
- get_request_provider_data() only reads from request body
- Never accesses HTTP Authorization header
- This is what actually prevents inference token leakage
2. SECONDARY: Lines 97-104 - Validation prevention
- Rejects Authorization in mcp_headers dict
- Enforces using dedicated mcp_authorization field
- Prevents users from misusing the API
The previous comment was misleading: it suggested the validation prevented
inference token leakage, when the architecture already ensures that
isolation.
Adds inline documentation to help users understand:
- How to structure provider_data in HTTP requests
- Where to place mcp_headers vs mcp_authorization
- Security requirements (no Authorization in headers)
- Token format requirements (without Bearer prefix)
- Example usage with multiple MCP endpoints
Completes the TODO for extracting authorization from a dedicated field.
What changed:
- Added mcp_authorization field to MCPProviderDataValidator
- Updated get_headers_from_request() to extract from mcp_authorization
- Authorization is now properly isolated per MCP endpoint
API usage example:
{
"provider_data": {
"mcp_headers": {
"http://mcp-server.com": {
"X-Trace-ID": "trace-123"
}
},
"mcp_authorization": {
"http://mcp-server.com": "mcp_token_xyz789"
}
}
}
Security guarantees:
- Authorization cannot be in mcp_headers (validation rejects it)
- Each MCP endpoint gets its own dedicated token
- No cross-service token leakage possible
Addresses reviewer concern about token isolation between services.
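A simplified sketch of the per-endpoint extraction described above (the helper shape and request-context handling are simplified):

```python
def get_headers_from_request(provider_data: dict, mcp_server_url: str) -> dict[str, str]:
    """Build headers for one MCP endpoint from validated provider data (simplified)."""
    headers: dict[str, str] = {}

    # Non-auth headers (tracing, routing, ...) keyed by endpoint URL.
    for url, extra in (provider_data.get("mcp_headers") or {}).items():
        if url == mcp_server_url:
            headers.update(extra)

    # Authorization comes only from the dedicated per-endpoint field, so a
    # token for one MCP server can never leak to another.
    token = (provider_data.get("mcp_authorization") or {}).get(mcp_server_url)
    if token:
        headers["Authorization"] = f"Bearer {token}"  # tokens are supplied without the Bearer prefix

    return headers
```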
The remote provider now rejects Authorization headers in mcp_headers
to prevent accidentally passing inference tokens to MCP servers.
This makes the remote provider consistent with the inline provider:
- Both reject Authorization in headers dict
- Both require dedicated authorization parameter
- Prevents token leakage across service boundaries
Related changes:
- Added validation in get_headers_from_request()
- Throws ValueError if Authorization found in mcp_headers
- Added TODO for dedicated authorization field in provider_data
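A minimal sketch of the validation described above (simplified; in the provider this check lives inside get_headers_from_request()):

```python
def reject_authorization_in_headers(mcp_headers: dict[str, dict[str, str]]) -> None:
    """Refuse tokens smuggled through the generic headers dict."""
    for url, headers in (mcp_headers or {}).items():
        for key in headers:
            if key.lower() == "authorization":
                raise ValueError(
                    f"Authorization header found in mcp_headers for {url}; "
                    "use the dedicated authorization parameter instead."
                )
```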
Per reviewer feedback, validation should be in the openai_responses.py handler,
not the streaming.py file. Moved the validation logic to the
create_openai_response() method, which is the main entry point for response
creation.
- Added validation in create_openai_response() before processing
- Removed duplicate validation from _process_mcp_tool() in streaming.py
- Validation runs early and rejects malformed requests immediately
- Maintains same security check: rejects Authorization in headers dict
Per reviewer feedback, API models should be pure data structures without
business logic. Moved the Authorization header validation from the Pydantic
@model_validator in openai_responses.py to the handler in streaming.py.
- Removed @model_validator from OpenAIResponseInputToolMCP
- Added validation at handler level in _process_mcp_tool()
- Maintains same security check: rejects Authorization in headers dict
- Follows separation of concerns: models are data, handlers have logic
- Add Field(exclude=True) to authorization parameter to prevent token leakage in responses
- Add model validator to reject Authorization header in headers dict
- Users must use dedicated 'authorization' parameter instead of headers
- Headers field is preserved for legitimate non-auth headers (tracing, routing, etc.)
This implements the security requirement that authorization params are never
returned in responses, unlike generic headers which may be echoed back.
# What does this PR do?
Resolves #4102
1. Added `web_search_2025_08_26` to the `WebSearchToolTypes` list and
the `OpenAIResponseInputToolWebSearch.type` Literal union
2. No changes needed to tool execution logic - all `web_search` types
map to the same underlying tool
3. Backward compatibility is maintained - existing `web_search`,
`web_search_preview`, and `web_search_preview_2025_03_11` types continue
to work
4. Added an integration test case using {"type":
"web_search_2025_08_26"} to verify it works correctly
5. Updated `docs/docs/providers/openai_responses_limitations.mdx` to
reflect that `web_search_2025_08_26` is now supported.
6. Removed incorrect references to `MOD1/MOD2/MOD3` (which don't exist
in the codebase)
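For reference, the change boils down to extending a Literal union and the corresponding alias list, roughly as in this simplified sketch (the field layout is not the exact class definition):

```python
from typing import Literal

from pydantic import BaseModel

# All aliases map to the same underlying web-search tool.
WebSearchToolTypes = [
    "web_search",
    "web_search_preview",
    "web_search_preview_2025_03_11",
    "web_search_2025_08_26",  # newly accepted alias
]


class OpenAIResponseInputToolWebSearch(BaseModel):
    type: Literal[
        "web_search",
        "web_search_preview",
        "web_search_preview_2025_03_11",
        "web_search_2025_08_26",
    ] = "web_search"
```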
## Test Plan
---------
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
This dependency has been bothering folks for a long time (cc @leseb). We
really only needed it for the "library client", which is primarily used for
our tests and is not part of the Stack server. Anyone who needs to use the
library client can certainly install `llama-stack-client` in their
environment to make that work.
Updated the notebook references to additionally install `llama-stack-client`
when setting things up.
# What does this PR do?
Remove a circular dependency by moving tracing from the API protocol
definitions to the router implementation layer.
This gets us closer to having a self-contained API package with no
cross-cutting dependencies on other parts of the llama stack codebase.
To the best of our ability, llama_stack.api should contain only type and
protocol definitions.
Changes:
- Create apis/common/tracing.py with marker decorator (zero core
dependencies)
- Add the _new_ `@telemetry_traceable` marker decorator to 11 protocol
classes
- Apply actual tracing in core/resolver.py in `instantiate_provider`
based on protocol marker
- Move MetricResponseMixin from core to apis (it's an API response type)
- APIs package is now self-contained with zero core dependencies
The tracing functionality remains identical: the actual trace_protocol from
core is applied to router implementations at runtime when both telemetry is
enabled and the protocol has the `__marked_for_tracing__` marker.
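A minimal sketch of the marker-decorator pattern described above; the decorator only tags the protocol class, and the resolver applies real tracing later (import paths and helper names are illustrative):

```python
# apis/common/tracing.py -- zero core dependencies: the decorator only marks the class.
def telemetry_traceable(cls):
    cls.__marked_for_tracing__ = True
    return cls


# core/resolver.py (simplified): apply real tracing only when the protocol opted in.
def maybe_apply_tracing(protocol_cls, impl, telemetry_enabled: bool):
    if telemetry_enabled and getattr(protocol_cls, "__marked_for_tracing__", False):
        from llama_stack.core.telemetry import trace_protocol  # illustrative import path
        return trace_protocol(impl)
    return impl
```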
## Test Plan
Manual integration test confirms identical behavior to main branch:
```bash
llama stack list-deps --format uv starter | sh
export OLLAMA_URL=http://localhost:11434
llama stack run starter
curl -X POST http://localhost:8321/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "ollama/gpt-oss:20b",
"messages": [{"role": "user", "content": "Say hello"}],
"max_tokens": 10}'
```
Verified identical between main and this branch:
- trace_id present in response
- metrics array with prompt_tokens, completion_tokens, total_tokens
- Server logs show trace_protocol applied to all routers
Existing telemetry integration tests (tests/integration/telemetry/) validate
trace context propagation and span attributes.
relates to #3895
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
We'd like to remove the dependence of `llama-stack` on
`llama-stack-client`. This is a necessary step.
A few small cleanups:
- Enable `embeddings` as well
- Remove the ModelRegistryHelper dependency (unused)
- Consolidate to the auth_credential field via RemoteInferenceProviderConfig
- Implement list_models() to fetch from the downstream /v1/models
## Test Plan
Tested using this script
https://gist.github.com/ashwinb/6356463d10f989c0682ab3bff8589581
Output:
```
Listing models from downstream server...
Available models: ['passthrough/ollama/nomic-embed-text:latest', 'passthrough/ollama/all-minilm:l6-v2', 'passthrough/ollama/llama3.2-vision:11b',
  'passthrough/ollama/llama3.2-vision:latest', 'passthrough/ollama/llama-guard3:1b', 'passthrough/ollama/llama3.2:1b', 'passthrough/ollama/all-minilm:latest',
  'passthrough/ollama/llama3.2:3b', 'passthrough/ollama/llama3.2:3b-instruct-fp16', 'passthrough/bedrock/meta.llama3-1-8b-instruct-v1:0',
  'passthrough/bedrock/meta.llama3-1-70b-instruct-v1:0', 'passthrough/bedrock/meta.llama3-1-405b-instruct-v1:0',
  'passthrough/sentence-transformers/nomic-ai/nomic-embed-text-v1.5']
Using LLM model: passthrough/ollama/llama3.2-vision:11b
Making inference request...
Response: 4.
--- Testing streaming ---
Streamed response: ChatCompletionChunk(id='chatcmpl-64', choices=[Choice(delta=ChoiceDelta(content='1', reasoning_content=None, refusal=None, role='assistant', tool_calls=None), finish_reason='', index=0, logprobs=None)],
  created=1762381674, model='passthrough/ollama/llama3.2-vision:11b', object='chat.completion.chunk', usage=None)
...
5ChatCompletionChunk(id='chatcmpl-64', choices=[Choice(delta=ChoiceDelta(content='', reasoning_content=None, refusal=None, role='assistant', tool_calls=None), finish_reason='stop', index=0, logprobs=None)],
  created=1762381674, model='passthrough/ollama/llama3.2-vision:11b', object='chat.completion.chunk', usage=None)
```
# What does this PR do?
- when create vector store is called without a chunking strategy, we now
record the strategy actually used so that the value is persisted instead of
strategy='None'
## Test Plan
updated tests
## What does this PR do?
The starter distribution now comes with all the required packages to
support persistent stores—like the agent store, metadata, and
inference—using PostgreSQL. Users can enable PostgreSQL support by
setting the `ENABLE_POSTGRES_STORE=1` environment variable.
This PR consolidates the functionality from the removed `postgres-demo`
distribution into the starter distribution, reducing maintenance
overhead.
**Closes: #2619**
**Supersedes: #2851** (rebased and updated)
## Changes Made
1. **Added PostgreSQL support to starter distribution**
- New `run-with-postgres-store.yaml` configuration
- Automatic config switching via `ENABLE_POSTGRES_STORE` environment
variable
- Removed separate `postgres-demo` distribution
2. **Updated to new build system**
- Integrated postgres switching logic into Containerfile entrypoint
- Uses new `storage_backends` and `storage_stores` API
- Properly configured both PostgreSQL KV store and SQL store
3. **Updated dependencies**
- Added `psycopg2-binary` and `asyncpg` to starter distribution
- All postgres-related dependencies automatically included
## How to Use
### With Docker (PostgreSQL):
```bash
docker run \
-e ENABLE_POSTGRES_STORE=1 \
-e POSTGRES_HOST=your_postgres_host \
-e POSTGRES_PORT=5432 \
-e POSTGRES_DB=llamastack \
-e POSTGRES_USER=llamastack \
-e POSTGRES_PASSWORD=llamastack \
-e OPENAI_API_KEY=your_key \
llamastack/distribution-starter
```
### PostgreSQL environment variables:
- `POSTGRES_HOST`: Postgres host (default: `localhost`)
- `POSTGRES_PORT`: Postgres port (default: `5432`)
- `POSTGRES_DB`: Postgres database name (default: `llamastack`)
- `POSTGRES_USER`: Postgres username (default: `llamastack`)
- `POSTGRES_PASSWORD`: Postgres password (default: `llamastack`)
## Test Plan
- All pre-commit hooks pass (mypy, ruff, distro-codegen)
- `llama stack list-deps starter` confirms psycopg2-binary is included
- Storage configuration correctly uses PostgreSQL backends
- Container builds successfully with postgres support
## Credits
Original work by @leseb in #2851. Rebased and updated by @r-bit-rry to
work with latest main.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Sébastien Han @leseb
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>