# What does this PR do?
When we send model names to Google's OpenAI-compatible API, we must use the
"google" name prefix; Google does not recognize the "vertexai" model names.
Closes #4211
## Test Plan
```bash
uv venv --python python3.12
. .venv/bin/activate
llama stack list-deps starter | xargs -L1 uv pip install
llama stack run starter
```
Test that this shows the gemini models with their correct names:
```bash
curl http://127.0.0.1:8321/v1/models | jq '.data | map(select(.custom_metadata.provider_id == "vertexai"))'
```
Test that this chat completion works:
```bash
curl -X POST -H "Content-Type: application/json" "http://127.0.0.1:8321/v1/chat/completions" -d '{
"model": "vertexai/google/gemini-2.5-flash",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello! Can you tell me a joke?"
}
],
"temperature": 1.0,
"max_tokens": 256
}'
```
Rename `AWS_BEDROCK_API_KEY` to `AWS_BEARER_TOKEN_BEDROCK` to align with
the naming convention used in AWS Bedrock documentation and the AWS web
console UI. This reduces confusion when developers compare LLS docs with
AWS docs.
Closes #4147
# What does this PR do?
Completes #3732 by removing runtime URL transformations and requiring
users to provide full URLs in configuration. All providers now use
'base_url' consistently and respect the exact URL provided without
appending paths like /v1 or /openai/v1 at runtime.
BREAKING CHANGE: Users must update configs to include full URL paths
(e.g., http://localhost:11434/v1 instead of http://localhost:11434).
Closes #3732
## Test Plan
Existing tests should pass even with the URL changes, because the default
URLs have been updated to include the full paths.
Adds a unit test to enforce URL standardization across remote inference
providers (verifies that all use a 'base_url' field typed as HttpUrl | None).
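To illustrate what the unit test enforces, here is a minimal sketch of a config shaped that way (the class name is hypothetical, not the actual provider config class):
```python
from pydantic import BaseModel, HttpUrl


class ExampleRemoteProviderConfig(BaseModel):  # hypothetical name
    # The URL is used exactly as given; nothing like /v1 or /openai/v1 is
    # appended at runtime, so configs must spell out the full path.
    base_url: HttpUrl | None = None


cfg = ExampleRemoteProviderConfig(base_url="http://localhost:11434/v1")
```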
Signed-off-by: Charlie Doern <cdoern@redhat.com>
These primitives (used both by the Stack and by provider implementations) can
usefully be thought of as internal-only APIs which can themselves have
multiple implementations. We use the new `llama_stack_api.internal` namespace
for this.
In addition, the change moves the kv/sql store implementations, configs, and
dependency helpers under `core/storage`.
## Testing
`pytest tests/unit/utils/test_authorized_sqlstore.py`, other existing CI
# What does this PR do?
- Remove backward compatibility for authorization in mcp_headers
- Enforce authorization must use dedicated parameter
- Add validation error if Authorization found in provider_data headers
- Update test_mcp.py to use authorization parameter
- Update test_mcp_json_schema.py to use authorization parameter
- Update test_tools_with_schemas.py to use authorization parameter
- Update documentation to show the change in the authorization approach
Breaking Change:
- Authorization can no longer be passed via mcp_headers in provider_data
- Users must use the dedicated 'authorization' parameter instead
- A clear error message guides users to the new approach (see the sketch below)
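For illustration, a minimal sketch of the new approach via the Responses API; the server label, URL, base URL, and token format are placeholders, not the definitive shape:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="dummy")  # adjust for your deployment

response = client.responses.create(
    model="<any-configured-model>",  # placeholder
    input="List the tools this MCP server exposes.",
    tools=[
        {
            "type": "mcp",
            "server_label": "example-mcp",             # hypothetical server
            "server_url": "http://localhost:8000/mcp",
            # Credentials now go in the dedicated parameter instead of
            # provider_data mcp_headers.
            "authorization": "Bearer <MCP_SERVER_TOKEN>",
        }
    ],
)
print(response.output_text)
```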
## Test Plan
CI
---------
Co-authored-by: Omar Abdelwahab <omara@fb.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Adds a user-facing `authorization` parameter to MCP tool definitions that
lets users explicitly configure credentials per MCP server, addressing GitHub
issue #4034 securely.
## Test Plan
tests/integration/responses/test_mcp_authentication.py
---------
Co-authored-by: Omar Abdelwahab <omara@fb.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
The directory structure was `src/llama-stack-api/llama_stack_api`; it should
just be `src/llama_stack_api` to match the other packages.
Update the structure and the pyproject/linting config accordingly.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Extract API definitions and provider specifications into a standalone
llama-stack-api package that can be published to PyPI independently of
the main llama-stack server.
see: https://github.com/llamastack/llama-stack/pull/2978 and
https://github.com/llamastack/llama-stack/pull/2978#issuecomment-3145115942
## Motivation
External providers currently import from llama-stack, which overrides
the installed version and causes dependency conflicts. This separation
allows external providers to:
- Install only the type definitions they need without server
dependencies
- Avoid version conflicts with the installed llama-stack package
- Be versioned and released independently
This enables us to re-enable external provider module tests that were
previously blocked by these import conflicts.
## Changes
- Created llama-stack-api package with minimal dependencies (pydantic,
jsonschema)
- Moved APIs, providers datatypes, strong_typing, and schema_utils
- Updated all imports from llama_stack.* to llama_stack_api.*
- Configured local editable install for development workflow
- Updated linting and type-checking configuration for both packages
## Next Steps
- Publish llama-stack-api to PyPI
- Update external provider dependencies
- Re-enable external provider module tests
Precursor PRs to this one:
- #4093
- #3954
- #4064
These PRs moved key pieces _out_ of the API package, limiting the scope of
the changes here.
relates to #3237
## Test Plan
Package builds successfully and can be imported independently. All
pre-commit hooks pass with expected exclusions maintained.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Adds OCI GenAI PaaS models to the OpenAI-compatible chat completion endpoints.
## Test Plan
In an OCI tenancy with access to GenAI PaaS, perform the following
steps:
1. Ensure you have IAM policies in place to use the service (check the docs
included in this PR)
2. For local development, [setup OCI
cli](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm)
and configure the CLI with your region, tenancy, and auth
[here](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliconfigure.htm)
3. Once configured, go through the llama-stack setup and run llama-stack
(uses config-based auth) like:
```bash
OCI_AUTH_TYPE=config_file \
OCI_CLI_PROFILE=CHICAGO \
OCI_REGION=us-chicago-1 \
OCI_COMPARTMENT_OCID=ocid1.compartment.oc1..aaaaaaaa5...5a \
llama stack run oci
```
4. Once the server is running, hit the `models` endpoint to list models:
```bash
curl http://localhost:8321/v1/models | jq
...
{
"identifier": "meta.llama-4-scout-17b-16e-instruct",
"provider_resource_id": "ocid1.generativeaimodel.oc1.us-chicago-1.am...q",
"provider_id": "oci",
"type": "model",
"metadata": {
"display_name": "meta.llama-4-scout-17b-16e-instruct",
"capabilities": [
"CHAT"
],
"oci_model_id": "ocid1.generativeaimodel.oc1.us-chicago-1.a...q"
},
"model_type": "llm"
},
...
```
5. Use the `display_name` value as the model name in a `/chat/completions`
request:
```bash
# Streaming result
curl -X POST http://localhost:8321/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta.llama-4-scout-17b-16e-instruct",
"stream": true,
"temperature": 0.9,
"messages": [
{
"role": "system",
"content": "You are a funny comedian. You can be crass."
},
{
"role": "user",
"content": "Tell me a funny joke about programming."
}
]
}'
# Non-streaming result
curl -X POST http://localhost:8321/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta.llama-4-scout-17b-16e-instruct",
"stream": false,
"temperature": 0.9,
"messages": [
{
"role": "system",
"content": "You are a funny comedian. You can be crass."
},
{
"role": "user",
"content": "Tell me a funny joke about programming."
}
]
}'
```
6. Try out other models from the `/models` endpoint.
# What does this PR do?
This PR fixes a bug in LlamaStack 0.3.0 where vector stores created via
the OpenAI-compatible API (`POST /v1/vector_stores`) would fail with
`VectorStoreNotFoundError` after server restart when attempting
operations like `vector_io.insert()` or `vector_io.query()`.
The bug affected **6 vector IO providers**: `pgvector`, `sqlite_vec`,
`chroma`, `milvus`, `qdrant`, and `weaviate`.
Created with the assistance of: claude-4.5-sonnet
## Root Cause
All affected providers had a broken
`_get_and_cache_vector_store_index()` method that:
1. Did not load existing vector stores from persistent storage during
initialization
2. Attempted to use `vector_store_table` (which was either `None` or a
`KVStore` without the required `get_vector_store()` method)
3. Could not reload vector stores after server restart or cache miss
## Solution
This PR implements a consistent pattern across all 6 providers (see the sketch after this list):
1. **Load vector stores during initialization** - Pre-populate the cache
from KV store on startup
2. **Fix lazy loading** - Modified `_get_and_cache_vector_store_index()`
to load directly from KV store instead of relying on
`vector_store_table`
3. **Remove broken dependency** - Eliminated reliance on the
`vector_store_table` pattern
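A minimal, self-contained sketch of this lazy-load pattern follows; the class, method, and key names are illustrative (the real adapters deserialize provider-specific index objects rather than plain strings):
```python
from dataclasses import dataclass, field


@dataclass
class FakeKVStore:
    """Stand-in for the persistent KV store used by the providers."""

    data: dict[str, str] = field(default_factory=dict)

    async def get(self, key: str) -> str | None:
        return self.data.get(key)

    async def keys_with_prefix(self, prefix: str) -> list[str]:
        return [k for k in self.data if k.startswith(prefix)]


class VectorStoreIndexCache:
    PREFIX = "vector_store:"  # assumed key prefix

    def __init__(self, kvstore: FakeKVStore) -> None:
        self.kvstore = kvstore
        self.cache: dict[str, str] = {}

    async def initialize(self) -> None:
        # 1. Pre-populate the cache from persistent storage on startup.
        for key in await self.kvstore.keys_with_prefix(self.PREFIX):
            stored = await self.kvstore.get(key)
            if stored is not None:
                self.cache[key.removeprefix(self.PREFIX)] = stored

    async def get_index(self, vector_store_id: str) -> str:
        # 2. On a cache miss, load directly from the KV store instead of
        #    relying on the broken vector_store_table path.
        if vector_store_id not in self.cache:
            stored = await self.kvstore.get(self.PREFIX + vector_store_id)
            if stored is None:
                raise KeyError(f"Vector store '{vector_store_id}' not found")
            self.cache[vector_store_id] = stored
        return self.cache[vector_store_id]
```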
## Testing steps
### 1.1 Configure the stack
Create or use an existing configuration with a vector IO provider.
**Example `run.yaml`:**
```yaml
vector_io_store:
  - provider_id: pgvector
    provider_type: remote::pgvector
    config:
      host: localhost
      port: 5432
      db: llamastack
      user: llamastack
      password: llamastack
inference:
  - provider_id: sentence-transformers
    provider_type: inline::sentence-transformers
    config:
      model: sentence-transformers/all-MiniLM-L6-v2
```
### 1.2 Start the server
```bash
llama stack run run.yaml --port 5000
```
Wait for the server to fully start. You should see:
```
INFO: Started server process
INFO: Application startup complete
```
---
## Step 2: Create a Vector Store
### 2.1 Create via API
```bash
curl -X POST http://localhost:5000/v1/vector_stores \
-H "Content-Type: application/json" \
-d '{
"name": "test-persistence-store",
"extra_body": {
"embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
"embedding_dimension": 384,
"provider_id": "pgvector"
}
}' | jq
```
### 2.2 Expected Response
```json
{
"id": "vs_a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d",
"object": "vector_store",
"name": "test-persistence-store",
"status": "completed",
"created_at": 1730304000,
"file_counts": {
"total": 0,
"completed": 0,
"in_progress": 0,
"failed": 0,
"cancelled": 0
},
"usage_bytes": 0
}
```
**Save the `id` field** (e.g.,
`vs_a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d`) — you’ll need it for the next
steps.
---
## Step 3: Insert Data (Before Restart)
### 3.1 Insert chunks into the vector store
```bash
export VS_ID="vs_a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d"
curl -X POST http://localhost:5000/vector-io/insert \
-H "Content-Type: application/json" \
-d "{
\"vector_store_id\": \"$VS_ID\",
\"chunks\": [
{
\"content\": \"Python is a high-level programming language known for its readability.\",
\"metadata\": {\"source\": \"doc1\", \"page\": 1}
},
{
\"content\": \"Machine learning enables computers to learn from data without explicit programming.\",
\"metadata\": {\"source\": \"doc2\", \"page\": 1}
},
{
\"content\": \"Neural networks are inspired by biological neurons in the brain.\",
\"metadata\": {\"source\": \"doc3\", \"page\": 1}
}
]
}"
```
### 3.2 Expected Response
Status: **200 OK**
Response: *Empty or success confirmation*
---
## Step 4: Query Data (Before Restart – Baseline)
### 4.1 Query the vector store
```bash
curl -X POST http://localhost:5000/vector-io/query \
-H "Content-Type: application/json" \
-d "{
\"vector_store_id\": \"$VS_ID\",
\"query\": \"What is machine learning?\"
}" | jq
```
### 4.2 Expected Response
```json
{
"chunks": [
{
"content": "Machine learning enables computers to learn from data without explicit programming.",
"metadata": {"source": "doc2", "page": 1}
},
{
"content": "Neural networks are inspired by biological neurons in the brain.",
"metadata": {"source": "doc3", "page": 1}
}
],
"scores": [0.85, 0.72]
}
```
**Checkpoint:** Works correctly before restart.
---
## Step 5: Restart the Server (Critical Test)
### 5.1 Stop the server
In the terminal where it’s running:
```
Ctrl + C
```
Wait for:
```
Shutting down...
```
### 5.2 Restart the server
```bash
llama stack run run.yaml --port 5000
```
Wait for:
```
INFO: Started server process
INFO: Application startup complete
```
The vector store cache is now empty, but data should persist.
---
## Step 6: Verify Vector Store Exists (After Restart)
### 6.1 List vector stores
```bash
curl http://localhost:5000/v1/vector_stores | jq
```
### 6.2 Expected Response
```json
{
"object": "list",
"data": [
{
"id": "vs_a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d",
"name": "test-persistence-store",
"status": "completed"
}
]
}
```
**Checkpoint:** Vector store should be listed.
---
## Step 7: Insert Data (After Restart – THE BUG TEST)
### 7.1 Insert new chunks
```bash
curl -X POST http://localhost:5000/vector-io/insert \
-H "Content-Type: application/json" \
-d "{
\"vector_store_id\": \"$VS_ID\",
\"chunks\": [
{
\"content\": \"This chunk was inserted AFTER the server restart.\",
\"metadata\": {\"source\": \"post-restart\", \"test\": true}
}
]
}"
```
### 7.2 Expected Results
**With Fix (Correct):**
```
Status: 200 OK
Response: Success
```
**Without Fix (Bug):**
```json
{
"detail": "VectorStoreNotFoundError: Vector Store 'vs_a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d' not found."
}
```
**Critical Test:** If insertion succeeds, the fix works.
---
## Step 8: Query Data (After Restart – Verification)
### 8.1 Query all data
```bash
curl -X POST http://localhost:5000/vector-io/query \
-H "Content-Type: application/json" \
-d "{
\"vector_store_id\": \"$VS_ID\",
\"query\": \"restart\"
}" | jq
```
### 8.2 Expected Response
```json
{
"chunks": [
{
"content": "This chunk was inserted AFTER the server restart.",
"metadata": {"source": "post-restart", "test": true}
}
],
"scores": [0.95]
}
```
**Checkpoint:** Both old and new data are queryable.
---
## Step 9: Multiple Restart Test (Extra Verification)
### 9.1 Restart again
```bash
Ctrl + C
llama stack run run.yaml --port 5000
```
### 9.2 Query after restart
```bash
curl -X POST http://localhost:5000/vector-io/query \
-H "Content-Type: application/json" \
-d "{
\"vector_store_id\": \"$VS_ID\",
\"query\": \"programming\"
}" | jq
```
**Expected:** Works correctly across multiple restarts.
---------
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
This dependency on `llama-stack-client` has been bothering folks for a long
time (cc @leseb). We really only needed it for the "library client", which is
primarily used by our tests and is not part of the Stack server. Anyone who
needs the library client can install `llama-stack-client` in their
environment to make that work.
Updated the notebook references to additionally install `llama-stack-client`
when setting things up.
We'd like to remove the dependence of `llama-stack` on `llama-stack-client`;
this is a necessary step.
A few small cleanups:
- Also enables `embeddings`
- Removes the unused ModelRegistryHelper dependency
- Consolidates to the auth_credential field via RemoteInferenceProviderConfig
- Implements list_models() to fetch models from the downstream /v1/models endpoint (see the sketch below)
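A minimal sketch of that list_models() idea (names assumed; the real adapter reuses its own downstream client and model-entry types):
```python
from openai import AsyncOpenAI


async def list_downstream_models(base_url: str, api_key: str) -> list[str]:
    # Query the downstream server's /v1/models and re-expose each entry
    # under this passthrough provider's namespace.
    client = AsyncOpenAI(base_url=base_url, api_key=api_key)
    models = await client.models.list()
    return [f"passthrough/{m.id}" for m in models.data]
```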
## Test Plan
Tested using this script
https://gist.github.com/ashwinb/6356463d10f989c0682ab3bff8589581
Output:
```
Listing models from downstream server...
Available models: ['passthrough/ollama/nomic-embed-text:latest', 'passthrough/ollama/all-minilm:l6-v2', 'passthrough/ollama/llama3.2-vision:11b', 'passthrough/ollama/llama3.2-vision:latest', 'passthrough/ollama/llama-guard3:1b', 'passthrough/ollama/llama3.2:1b', 'passthrough/ollama/all-minilm:latest', 'passthrough/ollama/llama3.2:3b', 'passthrough/ollama/llama3.2:3b-instruct-fp16', 'passthrough/bedrock/meta.llama3-1-8b-instruct-v1:0', 'passthrough/bedrock/meta.llama3-1-70b-instruct-v1:0', 'passthrough/bedrock/meta.llama3-1-405b-instruct-v1:0', 'passthrough/sentence-transformers/nomic-ai/nomic-embed-text-v1.5']
Using LLM model: passthrough/ollama/llama3.2-vision:11b
Making inference request...
Response: 4.
--- Testing streaming ---
Streamed response: ChatCompletionChunk(id='chatcmpl-64', choices=[Choice(delta=ChoiceDelta(content='1', reasoning_content=None, refusal=None, role='assistant', tool_calls=None), finish_reason='', index=0, logprobs=None)], created=1762381674, model='passthrough/ollama/llama3.2-vision:11b', object='chat.completion.chunk', usage=None)
...
5ChatCompletionChunk(id='chatcmpl-64', choices=[Choice(delta=ChoiceDelta(content='', reasoning_content=None, refusal=None, role='assistant', tool_calls=None), finish_reason='stop', index=0, logprobs=None)], created=1762381674, model='passthrough/ollama/llama3.2-vision:11b', object='chat.completion.chunk', usage=None)
```
# What does this PR do?
Avoids a `model_limit` KeyError when fetching embedding models for watsonx.
Closes https://github.com/llamastack/llama-stack/issues/4059
## Test Plan
Start server with watsonx distro:
```bash
llama stack list-deps watsonx | xargs -L1 uv pip install
uv run llama stack run watsonx
```
Run
```python
client = LlamaStackClient(base_url=base_url)
client.models.list()
```
Check if there is any embedding model available (currently there is not
a single one)
# What does this PR do?
Add rerank API for NVIDIA Inference Provider.
<!-- If resolving an issue, uncomment and update the line below -->
Closes #3278
## Test Plan
Unit test:
```
pytest tests/unit/providers/nvidia/test_rerank_inference.py
```
Integration test:
```
pytest -s -v tests/integration/inference/test_rerank.py --stack-config="inference=nvidia" --rerank-model=nvidia/nvidia/nv-rerankqa-mistral-4b-v3 --env NVIDIA_API_KEY="" --env NVIDIA_BASE_URL="https://integrate.api.nvidia.com"
```
## Summary
When users provide API keys via `X-LlamaStack-Provider-Data` header,
`models.list()` now returns models they can access from those providers,
not just pre-registered models from the registry.
This complements the routing fix from f88416ef8 which enabled inference
calls with `provider_id/model_id` format for unregistered models. Users
can now discover which models are available to them before making
inference requests.
The implementation reuses
`NeedsRequestProviderData.get_request_provider_data()` to validate
credentials, then dynamically fetches models from providers without
caching them since they're user-specific. Registry models take
precedence to respect any pre-configured aliases.
## Test Script
```python
#!/usr/bin/env python3
import json
import os
from openai import OpenAI
# Test 1: Without provider_data header
client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="dummy")
models = client.models.list()
anthropic_without = [m.id for m in models.data if m.id and "anthropic" in m.id]
print(f"Without header: {len(models.data)} models, {len(anthropic_without)} anthropic")
# Test 2: With provider_data header containing Anthropic API key
anthropic_api_key = os.environ["ANTHROPIC_API_KEY"]
client_with_key = OpenAI(
base_url="http://localhost:8321/v1/openai/v1",
api_key="dummy",
default_headers={
"X-LlamaStack-Provider-Data": json.dumps({"anthropic_api_key": anthropic_api_key})
}
)
models_with_key = client_with_key.models.list()
anthropic_with = [m.id for m in models_with_key.data if m.id and "anthropic" in m.id]
print(f"With header: {len(models_with_key.data)} models, {len(anthropic_with)} anthropic")
print(f"Anthropic models: {anthropic_with}")
assert len(anthropic_with) > len(anthropic_without), "Should have more anthropic models with API key"
print("\n✓ Test passed!")
```
Run with a stack that has Anthropic provider configured (but without API
key in config):
```bash
ANTHROPIC_API_KEY=sk-ant-... python test_provider_data_models.py
```
# What does this PR do?
- Adds OpenAI files provider
- Note that file content retrieval is fairly limited by `purpose` (see the sketch below):
https://community.openai.com/t/file-uploads-error-why-can-t-i-download-files-with-purpose-user-data/1357013?utm_source=chatgpt.com
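For illustration, a minimal sketch of that limitation, assuming the stack's OpenAI-compatible files endpoint; the base URL, file name, and purpose value are placeholders:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="dummy")  # adjust for your deployment

uploaded = client.files.create(file=open("notes.txt", "rb"), purpose="assistants")
# Content retrieval may be rejected depending on the file's purpose.
content = client.files.content(uploaded.id)
print(content.read())
```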
## Test Plan
Modify run yaml to use openai files provider:
```yaml
files:
  - provider_id: openai
    provider_type: remote::openai
    config:
      api_key: ${env.OPENAI_API_KEY:=}
      metadata_store:
        backend: sql_default
        table_name: openai_files_metadata
```
Then run the files tests:
```bash
❯ uv run --no-sync ./scripts/integration-tests.sh --stack-config server:ci-tests --inference-mode replay --setup ollama --suite base --pattern test_files
```
Adds type stubs and fixes mypy errors for better type coverage.
Changes:
- Added type_checking dependency group with type stubs (torchtune, trl,
etc.)
- Added lm-format-enforcer to pre-commit hook
- Created an HFAutoModel Protocol for type-safe HuggingFace model handling (see the sketch below)
- Added mypy.overrides for untyped libraries (torchtune, fairscale,
etc.)
- Fixed type issues in post-training providers, databricks, and
api_recorder
Note: ~1,200 errors remain in excluded files (see pyproject.toml exclude
list).
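The Protocol bullet above is worth a concrete illustration; this is a hedged sketch of the structural-typing idea, not the actual HFAutoModel definition, which covers whichever methods the post-training code calls:
```python
from typing import Any, Protocol


class HFAutoModelLike(Protocol):  # hypothetical name
    """Structural type for objects returned by AutoModel.from_pretrained()."""

    def eval(self) -> Any: ...
    def to(self, device: Any) -> Any: ...
    def __call__(self, *args: Any, **kwargs: Any) -> Any: ...


def prepare_for_inference(model: HFAutoModelLike, device: str) -> Any:
    # mypy checks these calls structurally, without needing typed transformers stubs.
    return model.to(device).eval()
```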
---------
Co-authored-by: Claude <noreply@anthropic.com>
## Summary
- Fix OpenAI SDK NotGiven/Omit type mismatches in embeddings calls
- Fix incorrect OpenAIChatCompletionChunk import in vllm provider
- Refactor to avoid type:ignore comments by using conditional kwargs
## Changes
**openai_mixin.py (9 errors fixed):**
- Build kwargs conditionally for embeddings.create() to avoid
NotGiven/Omit mismatch
- Only include parameters when they have actual values (not None)
**gemini.py (9 errors fixed):**
- Apply same conditional kwargs pattern
- Add missing Any import
**vllm.py (2 errors fixed):**
- Use correct OpenAIChatCompletionChunk from llama_stack.apis.inference
- Remove incorrect alias from openai package
## Technical Notes
The OpenAI SDK has a type system quirk where `NOT_GIVEN` has type
`NotGiven` but parameter signatures expect `Omit`. By only passing
parameters with actual values, we avoid this mismatch entirely without
needing `# type: ignore` comments.
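A minimal sketch of the conditional-kwargs pattern (parameter names follow the OpenAI embeddings API; the surrounding mixin code is elided):
```python
from typing import Any


def build_embedding_kwargs(
    model: str,
    input_texts: list[str],
    dimensions: int | None = None,
    user: str | None = None,
) -> dict[str, Any]:
    # Only include optional parameters that actually have values, so we never
    # pass NOT_GIVEN (typed as NotGiven) where the SDK signature expects Omit.
    kwargs: dict[str, Any] = {"model": model, "input": input_texts}
    if dimensions is not None:
        kwargs["dimensions"] = dimensions
    if user is not None:
        kwargs["user"] = user
    return kwargs


# Usage: response = await client.embeddings.create(**build_embedding_kwargs(...))
```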
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude <noreply@anthropic.com>
Fixes mypy type errors in provider utilities and testing infrastructure:
- `mcp.py`: Cast incompatible client types, wrap image data properly
- `batches.py`: Rename walrus variable to avoid shadowing
- `api_recorder.py`: Use cast for Pydantic field annotation
No functional changes.
---------
Co-authored-by: Claude <noreply@anthropic.com>
# What does this PR do?
Add provider-data key passing support to Cerebras, Databricks, NVIDIA, and
RunPod.
Also add missing tests for Fireworks, Anthropic, Gemini, SambaNova, and vLLM.
Addresses #3517
## Test Plan
ci w/ new tests
---------
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
Migrates package structure to src/ layout following Python packaging
best practices.
All code moved from `llama_stack/` to `src/llama_stack/`. Public API
unchanged - imports remain `import llama_stack.*`.
Updated build configs, pre-commit hooks, scripts, and GitHub workflows
accordingly. All hooks pass, package builds cleanly.
**Developer note**: Reinstall after pulling: `pip install -e .`