# What does this PR do?
- Enables users to configure prompts used throughout the File Search /
Vector Retrieval
- Configuration is defined in the Vector Stores Config so it can be
modified at runtime
- Backwards compatible: the fields are optional and default to the
previously used values
Here is a summary of the new options in `run.yaml`:
```yaml
vector_stores:
  file_search_params:
    header_template: 'knowledge_search tool found {num_chunks} chunks:\nBEGIN of knowledge_search tool results.\n'
    footer_template: 'END of knowledge_search tool results.\n'
  context_prompt_params:
    chunk_annotation_template: 'Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n'
    context_template: 'The above results were retrieved to help answer the user\'s query: "{query}". Use them as supporting information only in answering this query.{annotation_instruction}\n'
  annotation_prompt_params:
    enable_annotations: true
    annotation_instruction_template: 'Cite sources immediately at the end of sentences before punctuation, using `<|file-id|>` format like \'This is a fact <|file-Cn3MSNn72ENTiiq11Qda4A|>.\'. Do not add extra punctuation. Use only the file IDs provided, do not invent new ones.'
    chunk_annotation_template: '[{index}] {metadata_text} cite as <|{file_id}|>\n{chunk_text}\n'
```
## Test Plan
Added tests.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Fix provider header API key handling by correctly unwrapping `SecretStr`
values for provider data API keys. Previously the validator cast header
keys to `SecretStr` but the value wasn’t unwrapped before use, causing
authentication failures with providers like Azure.
Closes https://github.com/llamastack/llama-stack/issues/4370
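A minimal sketch of the unwrapping fix, assuming pydantic's `SecretStr`; the header name is illustrative, not the actual provider code:
```python
from pydantic import SecretStr

api_key = SecretStr("sk-example")  # what the validator stores

# Before: using the SecretStr directly stringifies to "**********",
# so the provider receives a masked value and rejects the request.
broken = {"api-key": str(api_key)}

# After: unwrap the secret before building the provider header.
fixed = {"api-key": api_key.get_secret_value()}
```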
This PR fixes issue #3185
The code calls `await event_gen.aclose()`, but OpenAI's `AsyncStream`
doesn't have an `aclose()` method; it has `close()` (which is async).
When clients cancel streaming requests, the server tries to clean up
with:
```python
await event_gen.aclose() # ❌ AsyncStream doesn't have aclose()!
```
But `AsyncStream` has never had a public `aclose()` method. The error
message literally tells us:
```
AttributeError: 'AsyncStream' object has no attribute 'aclose'. Did you mean: 'close'?
^^^^^^^^
```
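A minimal sketch of a fix, assuming only the two method names above; the `cleanup` helper is illustrative, not the actual server code:
```python
# Illustrative sketch: prefer aclose() when present (plain async generators),
# otherwise fall back to AsyncStream's close(), which is itself async.
async def cleanup(event_gen) -> None:
    if hasattr(event_gen, "aclose"):
        await event_gen.aclose()
    elif hasattr(event_gen, "close"):
        await event_gen.close()
```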
## Verification
* Reproduction script
[`reproduce_issue_3185.sh`](https://gist.github.com/r-bit-rry/dea4f8fbb81c446f5db50ea7abd6379b)
can be used to verify the fix.
* Manual checks and validation against the original OpenAI library code
# What does this PR do?
This PR validates and allows access to OCI Object Storage through the S3
compatibility API. Additional documentation for OCI is supplied in
notebook form as well.
## Test Plan
---------
Co-authored-by: raghotham <rsm@meta.com>
# Problem
As an Application Developer, I want to use the include parameter with
the value message.output_text.logprobs, so that I can receive log
probabilities for output tokens to assess the model's confidence in its
response.
# What does this PR do?
- Updates the include parameter in various resource definitions
- Updates the inline provider to return logprobs when
"message.output_text.logprobs" is passed in the include parameter
- Converts the logprobs returned by the inference provider from chat
completion format to responses format
Closes #[4260](https://github.com/llamastack/llama-stack/issues/4260)
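A usage sketch, assuming an OpenAI-compatible client pointed at the stack; model and input values are illustrative:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/", api_key="fake")

# Request log probabilities for output tokens alongside the response.
response = client.responses.create(
    model="openai/gpt-4o-mini",
    input="Hello!",
    include=["message.output_text.logprobs"],
)
```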
## Test Plan
- Created a script to explore OpenAI behavior:
https://github.com/s-akhtar-baig/llama-stack-examples/blob/main/responses/src/include.py
- Added integration tests and new recordings
---------
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Adds a new API for connectors and MCP registry support along with
required types.
It does not include any implementation.
Closes #4235 and #4061 (partially)
## Test Plan
no tests included
---------
Signed-off-by: Jaideep Rao <jrao@redhat.com>
Co-authored-by: Francisco Javier Arceo <arceofrancisco@gmail.com>
# What does this PR do?
The InferenceStore class was ignoring the table_name field from
InferenceStoreReference and always using the hardcoded value
"chat_completions". This meant that any custom table_name configured in
the run config (e.g., "inference_store" in run-with-postgres-store.yaml)
was silently ignored.
This change updates all SQL operations in InferenceStore to use
self.reference.table_name instead of the hardcoded string, ensuring the
configured table name is properly respected.
A new test has been added to verify that custom table names work
correctly for storing, retrieving, and listing chat completions.
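A minimal before/after sketch of the change; the class shape is illustrative, not the real `InferenceStore`:
```python
class InferenceStoreSketch:
    def __init__(self, reference):
        self.reference = reference  # InferenceStoreReference carrying table_name

    def list_query(self) -> str:
        # Before the fix this was hardcoded to "chat_completions",
        # silently ignoring any table_name set in the run config.
        return f"SELECT * FROM {self.reference.table_name}"  # noqa: S608
```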
## Test Plan
CI
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Implements query rewriting in the search API and adds
`default_query_expansion_model` and `query_expansion_prompt` to
`VectorStoresConfig`, making the `rewrite_query` parameter functional in
vector store search.
- `rewrite_query=false` (default): use the original query
- `rewrite_query=true`: expand the query via LLM, or fail gracefully if no
LLM is available
Adds four parameters to `VectorStoresConfig`:
- `default_query_expansion_model`: LLM model for query expansion
(optional)
- `query_expansion_prompt`: Custom prompt template (optional, uses
built-in default)
- `query_expansion_max_tokens`: Configurable token limit (default: 100)
- `query_expansion_temperature`: Configurable temperature (default: 0.3)
Minimal `run.yaml` enabling the feature:
```yaml
vector_stores:
  rewrite_query_params:
    model:
      provider_id: "ollama"
      model_id: "llama3.2:3b-instruct-fp16"
    # prompt defaults to built-in
    # max_tokens defaults to 100
    # temperature defaults to 0.3
```
Fully customized `run.yaml`:
```yaml
vector_stores:
  default_provider_id: faiss
  default_embedding_model:
    provider_id: sentence-transformers
    model_id: nomic-ai/nomic-embed-text-v1.5
  rewrite_query_params:
    model:
      provider_id: ollama
      model_id: llama3.2:3b-instruct-fp16
    prompt: "Rewrite this search query to improve retrieval results by expanding it with relevant synonyms and related terms: {query}"
    max_tokens: 100
    temperature: 0.3
```
## Test Plan
Added a test and recording, plus this example script:
```python
import asyncio
from io import BytesIO

from llama_stack_client import LlamaStackClient


def gen_file(client, text: str = ""):
    file_buffer = BytesIO(text.encode("utf-8"))
    file_buffer.name = "my_file.txt"
    uploaded_file = client.files.create(file=file_buffer, purpose="assistants")
    return uploaded_file


async def test_query_rewriting():
    client = LlamaStackClient(base_url="http://0.0.0.0:8321/")
    uploaded_file = gen_file(client, "banana banana apple")
    uploaded_file2 = gen_file(client, "orange orange kiwi")
    vs = client.vector_stores.create()
    xf_vs = client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded_file.id)
    xf_vs1 = client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded_file2.id)
    response1 = client.vector_stores.search(
        vector_store_id=vs.id,
        query="apple",
        max_num_results=3,
        rewrite_query=False,
    )
    response2 = client.vector_stores.search(
        vector_store_id=vs.id,
        query="kiwi",
        max_num_results=3,
        rewrite_query=True,
    )
    print(f"\n🔵 Response 1 (rewrite_query=False):\n\033[94m{response1}\033[0m")
    print(f"\n🟢 Response 2 (rewrite_query=True):\n\033[92m{response2}\033[0m")
    for f in [uploaded_file.id, uploaded_file2.id]:
        client.files.delete(file_id=f)
    client.vector_stores.delete(vector_store_id=vs.id)


if __name__ == "__main__":
    asyncio.run(test_query_rewriting())
```
See the screenshot of the server logs showing it worked:
<img width="1111" height="826" alt="Screenshot 2025-11-19 at 1 16 03 PM"
src="https://github.com/user-attachments/assets/2d188b44-1fef-4df5-b465-2d6728ca49ce"
/>
Notice the log:
```bash
Query rewritten:
'kiwi' → 'kiwi, a small brown or green fruit native to New Zealand, or a person having a fuzzy brown outer skin similar in appearance.'
```
So `kiwi` was expanded.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>
# What does this PR do?
Convert the Benchmarks API from @webmethod decorators to FastAPI router
pattern, matching the Batches API structure.
One notable change is the update of `stack.py` to handle request models in
`register_resources()`.
Closes: #4308
## Test Plan
CI and `curl http://localhost:8321/v1/inspect/routes | jq '.data[] |
select(.route | contains("benchmark"))'`
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The build.yaml is only used in the following ways:
1. list-deps
2. distribution code-gen
Since `llama stack build` no longer exists, I found myself asking: why do
we need two different files for list-deps and run?
Removing the BuildConfig and altering the usage of the
DistributionTemplate in llama stack list-deps is the first step in
removing the build yaml entirely.
Removing the BuildConfig and build.yaml cuts the files users need to
maintain in half, and allows us to focus on the stability of _just_ the
run.yaml
This PR removes the build.yaml, the BuildConfig datatype, and their usage
throughout the codebase. Users are now expected to point to run.yaml
files when running list-deps, and the codebase now uses these types
automatically for things like `get_provider_registry`.
**Additionally, two renames: `StackRunConfig` -> `StackConfig` and
`run.yaml` -> `config.yaml`.**
The build.yaml made sense for when we were managing the build process
for the user and actually _producing_ a run.yaml _from_ the build.yaml,
but now that we are simply just getting the provider registry and
listing the deps, switching to config.yaml simplifies the scope here
greatly.
## Test Plan
Existing list-deps usage should work in the tests.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Add "token" to sensitive field patterns in redact_sensitive_fields() to
prevent JWT tokens from being logged in plaintext. Previously only
api_key, api_token, password, and secret were filtered.
This prevents tokens like server.auth.provider_config.jwks.token from
being exposed in server logs.
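A minimal sketch, assuming `redact_sensitive_fields()` matches key names by substring; the helper below is illustrative, not the actual implementation:
```python
SENSITIVE_PATTERNS = ["api_key", "api_token", "password", "secret", "token"]

def redact_sensitive_fields(data: dict) -> dict:
    redacted = {}
    for key, value in data.items():
        if isinstance(value, dict):
            # Recurse so nested keys like provider_config.jwks.token are caught.
            redacted[key] = redact_sensitive_fields(value)
        elif any(pattern in key.lower() for pattern in SENSITIVE_PATTERNS):
            redacted[key] = "********"
        else:
            redacted[key] = value
    return redacted
```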
Closes: #4324
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Previously the runpod provider would fail if `RUNPOD_API_TOKEN` was not
set. This modifies the impl to default to an empty string, aligning with
similar providers' behavior.
Closes #4296
## Test Plan
Run `uv run llama stack run --providers inference=remote::runpod` with
`RUNPOD_API_TOKEN` unset - server now boots where it previously crashed
```
INFO 2025-12-04 13:52:59,920 uvicorn.error:84 uncategorized: Started server process [233656]
INFO 2025-12-04 13:52:59,921 uvicorn.error:48 uncategorized: Waiting for application startup.
INFO 2025-12-04 13:52:59,926 llama_stack.core.server.server:168 core::server: Starting up Llama Stack server
(version: 0.4.0.dev0)
INFO 2025-12-04 13:52:59,927 llama_stack.core.stack:495 core: starting registry refresh task
INFO 2025-12-04 13:52:59,928 uvicorn.error:62 uncategorized: Application startup complete.
INFO 2025-12-04 13:52:59,929 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321
(Press CTRL+C to quit)
```
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
Previously the nvidia provider would throw an exception if a hosted
instance was being used but no API key was set. This modifies the
behavior to instead log an error informing users that a key is needed to
use a hosted NIM, while still allowing the server to boot.
Closes #4295
## Test Plan
Run `uv run llama stack run --providers inference=remote::nvidia` with
`NVIDIA_API_KEY` unset - server now boots with logged error, where it
previously crashed
```
INFO 2025-12-04 14:16:26,156 llama_stack.providers.remote.inference.nvidia.nvidia:47 inference::nvidia: Initializing
NVIDIAInferenceAdapter(https://integrate.api.nvidia.com/v1)...
ERROR 2025-12-04 14:16:26,157 llama_stack.providers.remote.inference.nvidia.nvidia:51 inference::nvidia: API key is
required for hosted NVIDIA NIM. Either provide an API key or use a self-hosted NIM.
INFO 2025-12-04 14:16:26,239 uvicorn.error:84 uncategorized: Started server process [251651]
INFO 2025-12-04 14:16:26,240 uvicorn.error:48 uncategorized: Waiting for application startup.
INFO 2025-12-04 14:16:26,244 llama_stack.core.server.server:168 core::server: Starting up Llama Stack server
(version: 0.4.0.dev0)
INFO 2025-12-04 14:16:26,245 llama_stack.core.stack:495 core: starting registry refresh task
INFO 2025-12-04 14:16:26,246 uvicorn.error:62 uncategorized: Application startup complete.
INFO 2025-12-04 14:16:26,246 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321
(Press CTRL+C to quit)
```
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
When publishing llama_stack_api, `inspect.py` causes issues: it gets
mistaken for the builtin stdlib `inspect` module. This is due to the top
level `__init__.py` we have. We need to rename `inspect.py` to
`inspect_api.py` to avoid this conflict.
Also ran `uv sync`; see 1993161624 for reference.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Bumps [next](https://github.com/vercel/next.js) from 15.5.4 to 15.5.7.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/vercel/next.js/releases">next's
releases</a>.</em></p>
<blockquote>
<h2>v15.5.7</h2>
<p>Please see <a
href="https://nextjs.org/blog/CVE-2025-66478">CVE-2025-66478</a> for
additional details about this release.</p>
<h2>v15.5.6</h2>
<blockquote>
<p>[!NOTE]<br />
This release is backporting bug fixes. It does <strong>not</strong>
include all pending features/changes on canary.</p>
</blockquote>
<h3>Core Changes</h3>
<ul>
<li>Turbopack: don't define process.cwd() in node_modules <a
href="https://redirect.github.com/vercel/next.js/issues/83452">#83452</a></li>
</ul>
<h3>Credits</h3>
<p>Huge thanks to <a
href="https://github.com/mischnic"><code>@mischnic</code></a> for
helping!</p>
<h2>v15.5.5</h2>
<blockquote>
<p>[!NOTE]<br />
This release is backporting bug fixes. It does <strong>not</strong>
include all pending features/changes on canary.</p>
</blockquote>
<h3>Core Changes</h3>
<ul>
<li>Split code-frame into separate compiled package (<a
href="https://redirect.github.com/vercel/next.js/issues/84238">#84238</a>)</li>
<li>Add deprecation warning to Runtime config (<a
href="https://redirect.github.com/vercel/next.js/issues/84650">#84650</a>)</li>
<li>fix: unstable_cache should perform blocking revalidation during ISR
revalidation (<a
href="https://redirect.github.com/vercel/next.js/issues/84716">#84716</a>)</li>
<li>feat: <code>experimental.middlewareClientMaxBodySize</code> body
cloning limit (<a
href="https://redirect.github.com/vercel/next.js/issues/84722">#84722</a>)</li>
<li>fix: missing next/link types with typedRoutes (<a
href="https://redirect.github.com/vercel/next.js/issues/84779">#84779</a>)</li>
</ul>
<h3>Misc Changes</h3>
<ul>
<li>docs: early October improvements and fixes (<a
href="https://redirect.github.com/vercel/next.js/issues/84334">#84334</a>)</li>
</ul>
<h3>Credits</h3>
<p>Huge thanks to <a
href="https://github.com/devjiwonchoi"><code>@devjiwonchoi</code></a>,
<a href="https://github.com/ztanner"><code>@ztanner</code></a>, and <a
href="https://github.com/icyJoseph"><code>@icyJoseph</code></a> for
helping!</p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="3eaf68b09b"><code>3eaf68b</code></a>
v15.5.7</li>
<li><a
href="8367ce592a"><code>8367ce5</code></a>
update version script</li>
<li><a
href="9115040008"><code>9115040</code></a>
Update React Version for Next.js 15.5.7 (<a
href="https://redirect.github.com/vercel/next.js/issues/10">#10</a>)</li>
<li><a
href="96f699902a"><code>96f6999</code></a>
update tag</li>
<li><a
href="55ef0e3ebc"><code>55ef0e3</code></a>
v15.5.6</li>
<li><a
href="92bbbb1bec"><code>92bbbb1</code></a>
Backport: don't define <code>process.cwd()</code> in node_modules (<a
href="https://redirect.github.com/vercel/next.js/issues/84957">#84957</a>)</li>
<li><a
href="f895b72762"><code>f895b72</code></a>
Fix url-imports test on 15-5 (<a
href="https://redirect.github.com/vercel/next.js/issues/84966">#84966</a>)</li>
<li><a
href="81f530db26"><code>81f530d</code></a>
v15.5.5</li>
<li><a
href="9abbc0e9eb"><code>9abbc0e</code></a>
[backport] fix: missing <code>next/link</code> types with
<code>typedRoutes</code> (<a
href="https://redirect.github.com/vercel/next.js/issues/82814">#82814</a>)
(<a
href="https://redirect.github.com/vercel/next.js/issues/84779">#84779</a>)</li>
<li><a
href="121e1b566f"><code>121e1b5</code></a>
[backport] docs: early October improvements and fixes (<a
href="https://redirect.github.com/vercel/next.js/issues/84334">#84334</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/vercel/next.js/compare/v15.5.4...v15.5.7">compare
view</a></li>
</ul>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
`DISTRO_DIR` and `DISTRIBS_BASE_DIR` need to exist before they can be
iterated; the current logic calls `iterdir` without checking that they
exist. A guard like the sketch below fixes this.
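A minimal sketch of the guard, with the path assumed for illustration:
```python
from pathlib import Path

# Only iterate the distributions directory if it actually exists.
distro_dir = Path.home() / ".llama" / "distributions"
if distro_dir.exists():
    for entry in distro_dir.iterdir():
        print(entry.name)
```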
## Test Plan
`rm ~/.llama/distributions`
```
llama stack list-deps starter --format uv | sh
Using Python 3.12.11 environment at: venv
Audited 51 packages in 12ms
Using Python 3.12.11 environment at: venv
Audited 3 packages in 2ms
Using Python 3.12.11 environment at: venv
Audited 1 package in 3ms
Using Python 3.12.11 environment at: venv
Audited 3 packages in 5ms
```
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
- Part of #3009
- Implement hybrid search using Qdrant's native query filtering
- Add keyword search support
- Update test suites to include qdrant for keyword and hybrid modes
## Test Plan
```
pytest -sv tests/unit/providers/vector_io/
.......
============================================================================================== slowest 10 durations ===============================================================================================
0.20s call tests/unit/providers/vector_io/test_vector_io_openai_vector_stores.py::test_max_concurrent_files_per_batch[qdrant]
0.20s call tests/unit/providers/vector_io/test_vector_io_openai_vector_stores.py::test_max_concurrent_files_per_batch[pgvector]
0.20s call tests/unit/providers/vector_io/test_vector_io_openai_vector_stores.py::test_max_concurrent_files_per_batch[sqlite_vec]
0.20s call tests/unit/providers/vector_io/test_vector_io_openai_vector_stores.py::test_max_concurrent_files_per_batch[faiss]
0.06s setup tests/unit/providers/vector_io/test_vector_io_openai_vector_stores.py::test_insert_chunks_with_missing_document_id[pgvector]
0.04s call tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_hybrid_tie_breaking
0.04s call tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_hybrid_weighted_reranker_parametrization
0.03s call tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_hybrid_score_selection
0.03s call tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_hybrid_edge_cases
0.03s setup tests/unit/providers/vector_io/test_faiss.py::test_faiss_query_vector_returns_infinity_when_query_and_embedding_are_identical
======================================================================================== 180 passed, 47 warnings in 2.78s =========================================================================================
```
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
Co-authored-by: Francisco Javier Arceo <arceofrancisco@gmail.com>
Changes SqlRecord creation in AuthorizedSqlStore.fetch_all to use
owner=None when owner_principal is empty/missing, matching the
ResourceWithOwner pattern used in routing tables. This fixes an
inconsistency where SQL store was creating User(principal="") while
routing tables use owner=None for public resources.
Changes:
- Update the `ProtectedResource` Protocol to allow `owner: User | None`
- Update `SqlRecord.__init__` to accept `owner: User | None`
- Update `fetch_all` to create `owner=None` for records without `owner_principal`
Signed-off-by: Derek Higgins <derekh@redhat.com>
Closes security gaps where RBAC checks could be bypassed:
- Inference router: added RBAC enforcement in the fallback path to ensure
access control is applied consistently.
- Model listing: dynamic models fetched via provider_data were returned
without RBAC checks. Added filtering to ensure users only see models they
have permission to access.
Both fixes create temporary ModelWithOwner objects for RBAC validation,
maintaining security through consistent access control enforcement.
Closes: #4269
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
This commit introduces a new FastAPI router-based system for defining
API endpoints, enabling a migration path away from the legacy @webmethod
decorator system. The implementation includes router infrastructure,
migration of the Batches API as the first example, and updates to
server, OpenAPI generation, and inspection systems to support both
routing approaches.
The router infrastructure consists of a router registry system that
allows APIs to register FastAPI router factories, which are then
automatically discovered and included in the server application.
Standard error responses are centralized in router_utils to ensure
consistent OpenAPI specification generation with proper $ref references
to component responses.
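A minimal sketch of the registry idea described above, with illustrative names rather than the actual llama_stack symbols:
```python
from collections.abc import Callable

from fastapi import APIRouter, FastAPI

# APIs register router factories; the server discovers and mounts them.
_ROUTER_FACTORIES: dict[str, Callable[[], APIRouter]] = {}

def register_router_factory(api: str, factory: Callable[[], APIRouter]) -> None:
    _ROUTER_FACTORIES[api] = factory

def batches_router() -> APIRouter:
    router = APIRouter(prefix="/v1/batches", tags=["Batches"])

    @router.get("", description="List batches.")
    def list_batches() -> dict:
        return {"data": []}

    return router

register_router_factory("batches", batches_router)

def build_app() -> FastAPI:
    app = FastAPI()
    for factory in _ROUTER_FACTORIES.values():
        app.include_router(factory())
    return app
```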
The Batches API has been migrated to demonstrate the new pattern. The
protocol definition and models remain in llama_stack_api/batches,
maintaining clear separation between API contracts and server
implementation. The FastAPI router implementation lives in
llama_stack/core/server/routers/batches, following the established
pattern where API contracts are defined in llama_stack_api and server
routing logic lives in
llama_stack/core/server.
The server now checks for registered routers before falling back to the
legacy webmethod-based route discovery, ensuring backward compatibility
during the migration period. The OpenAPI generator has been updated to
handle both router-based and webmethod-based routes, correctly
extracting metadata from FastAPI route decorators and Pydantic Field
descriptions. The inspect endpoint now includes routes from both
systems, with proper filtering for deprecated routes and API levels.
Response descriptions are now explicitly defined in router decorators,
ensuring the generated OpenAPI specification matches the previous
format. Error responses use $ref references to component responses
(BadRequest400, TooManyRequests429, etc.) as required by the
specification. This is neat and will allow us to remove a lot of
boilerplate code from our generator once the migration is done.
This implementation provides a foundation for incrementally migrating
other APIs to the router system while maintaining full backward
compatibility with existing webmethod-based APIs.
Closes: https://github.com/llamastack/llama-stack/issues/4188
## Test Plan
CI; the server should start and the same routes should be visible.
```
curl http://localhost:8321/v1/inspect/routes | jq '.data[] | select(.route | contains("batches"))'
```
Also:
```
uv run pytest tests/integration/batches/ -vv --stack-config=http://localhost:8321
================================================== test session starts ==================================================
platform darwin -- Python 3.12.8, pytest-8.4.2, pluggy-1.6.0 -- /Users/leseb/Documents/AI/llama-stack/.venv/bin/python3
cachedir: .pytest_cache
metadata: {'Python': '3.12.8', 'Platform': 'macOS-26.0.1-arm64-arm-64bit', 'Packages': {'pytest': '8.4.2', 'pluggy': '1.6.0'}, 'Plugins': {'anyio': '4.9.0', 'html': '4.1.1', 'socket': '0.7.0', 'asyncio': '1.1.0', 'json-report': '1.5.0', 'timeout': '2.4.0', 'metadata': '3.1.1', 'cov': '6.2.1', 'nbval': '0.11.0'}}
rootdir: /Users/leseb/Documents/AI/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0, html-4.1.1, socket-0.7.0, asyncio-1.1.0, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, cov-6.2.1, nbval-0.11.0
asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 24 items
tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_creation_and_retrieval[None] SKIPPED [ 4%]
tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_listing[None] SKIPPED [ 8%]
tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_immediate_cancellation[None] SKIPPED [ 12%]
tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_e2e_chat_completions[None] SKIPPED [ 16%]
tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_e2e_completions[None] SKIPPED [ 20%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_invalid_endpoint[None] SKIPPED [ 25%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_cancel_completed[None] SKIPPED [ 29%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_missing_required_fields[None] SKIPPED [ 33%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_invalid_completion_window[None] SKIPPED [ 37%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_streaming_not_supported[None] SKIPPED [ 41%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_mixed_streaming_requests[None] SKIPPED [ 45%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_endpoint_mismatch[None] SKIPPED [ 50%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_missing_required_body_fields[None] SKIPPED [ 54%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_invalid_metadata_types[None] SKIPPED [ 58%]
tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_e2e_embeddings[None] SKIPPED [ 62%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_nonexistent_file_id PASSED [ 66%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_malformed_jsonl PASSED [ 70%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_file_malformed_batch_file[empty] XFAIL [ 75%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_file_malformed_batch_file[malformed] XFAIL [ 79%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_retrieve_nonexistent PASSED [ 83%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_cancel_nonexistent PASSED [ 87%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_error_handling_invalid_model PASSED [ 91%]
tests/integration/batches/test_batches_idempotency.py::TestBatchesIdempotencyIntegration::test_idempotent_batch_creation_successful PASSED [ 95%]
tests/integration/batches/test_batches_idempotency.py::TestBatchesIdempotencyIntegration::test_idempotency_conflict_with_different_params PASSED [100%]
================================================= slowest 10 durations ==================================================
1.01s call tests/integration/batches/test_batches_idempotency.py::TestBatchesIdempotencyIntegration::test_idempotent_batch_creation_successful
0.21s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_nonexistent_file_id
0.17s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_malformed_jsonl
0.12s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_error_handling_invalid_model
0.05s setup tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_creation_and_retrieval[None]
0.02s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_file_malformed_batch_file[empty]
0.01s call tests/integration/batches/test_batches_idempotency.py::TestBatchesIdempotencyIntegration::test_idempotency_conflict_with_different_params
0.01s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_file_malformed_batch_file[malformed]
0.01s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_retrieve_nonexistent
0.00s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_cancel_nonexistent
======================================= 7 passed, 15 skipped, 2 xfailed in 1.78s ========================================
```
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Category-specific log levels from LLAMA_STACK_LOGGING were not applied to
loggers created before setup_logging() was called. This fix moves the
setup_logging() call earlier in the initialization sequence to ensure all
loggers respect their configured levels regardless of initialization
timing.
Closes: #4252
Signed-off-by: Derek Higgins <derekh@redhat.com>
The configured policy wasn't being passed in, so the default was used
instead (e.g., in the S3 files provider).
Closes: #4276
Signed-off-by: Derek Higgins <derekh@redhat.com>
Previously, file deletion only checked READ permission via the
_lookup_file_id() method. This meant any user with READ access to a file
could also delete it, making it impossible to configure read-only file
access.
This change adds an 'action' parameter to fetch_all() and fetch_one() in
AuthorizedSqlStore, defaulting to Action.READ for backward
compatibility. The openai_delete_file() method now passes Action.DELETE,
ensuring proper RBAC enforcement.
With this fix, access policies can now grant users permission to
read/list files without also allowing them to delete those files, as
sketched below.
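A minimal sketch of the new parameter, assuming the signatures described above; the class body is illustrative:
```python
from enum import Enum, auto

class Action(Enum):
    READ = auto()
    DELETE = auto()

class AuthorizedSqlStoreSketch:
    async def fetch_one(self, table: str, where: dict, action: Action = Action.READ):
        # The policy check now evaluates `action` instead of assuming READ,
        # so openai_delete_file() can pass Action.DELETE explicitly.
        ...
```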
Closes: #4274
Signed-off-by: Derek Higgins <derekh@redhat.com>
fix: use string annotations for S3Client type hints
Remove future annotations import and use quoted string annotations for
S3Client to avoid import issues.
Changes:
- Remove the `__future__` annotations import
- Use "S3Client" string annotations in type hints
Closes: #4241
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Marks the `toolgroup` and `tool_runtime` APIs for deprecation.
Closes #4233 and #4061 (partially)
How long do we wait before we remove deprecated APIs?
## Test Plan
Signed-off-by: Jaideep Rao <jrao@redhat.com>
# What does this PR do?
Removes stale data about the old telemetry system from Llama Stack.
**Depends on** https://github.com/llamastack/llama-stack/pull/4127
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Fixes: https://github.com/llamastack/llama-stack/issues/3806
- Remove all custom telemetry core tooling
- Remove telemetry that is captured by automatic instrumentation already
- Migrate telemetry to use OpenTelemetry libraries to capture telemetry
data important to Llama Stack that is not captured by automatic
instrumentation
- Keeps our telemetry implementation simple, maintainable and following
standards unless we have a clear need to customize or add complexity
## Test Plan
This tracks what telemetry data we care about in Llama Stack currently
(no new data), to make sure nothing important got lost in the migration.
I ran a traffic driver to generate telemetry data for targeted use cases,
then verified it in Jaeger, Prometheus, and Grafana using the tools in
our /scripts/telemetry directory.
### Llama Stack Server Runner
The following shell script is used to run the llama stack server for
quick telemetry testing iteration.
```sh
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_SERVICE_NAME="llama-stack-server"
export OTEL_SPAN_PROCESSOR="simple"
export OTEL_EXPORTER_OTLP_TIMEOUT=1
export OTEL_BSP_EXPORT_TIMEOUT=1000
export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3"
export OPENAI_API_KEY="REDACTED"
export OLLAMA_URL="http://localhost:11434"
export VLLM_URL="http://localhost:8000/v1"
uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -
uv run opentelemetry-instrument llama stack run starter
```
### Test Traffic Driver
This python script drives traffic to the llama stack server, which sends
telemetry to a locally hosted instance of the OTLP collector, Grafana,
Prometheus, and Jaeger.
```sh
export OTEL_SERVICE_NAME="openai-client"
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_ENDPOINT="http://127.0.0.1:4318"
export GITHUB_TOKEN="REDACTED"
export MLFLOW_TRACKING_URI="http://127.0.0.1:5001"
uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -
uv run opentelemetry-instrument python main.py
```
```python
import os

import requests
from openai import OpenAI


def main():
    github_token = os.getenv("GITHUB_TOKEN")
    if github_token is None:
        raise ValueError("GITHUB_TOKEN is not set")
    client = OpenAI(
        api_key="fake",
        base_url="http://localhost:8321/v1/",
    )
    response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
    )
    print("Sync response: ", response.choices[0].message.content)

    streaming_response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        stream=True,
        stream_options={"include_usage": True},
    )
    print("Streaming response: ", end="", flush=True)
    for chunk in streaming_response:
        if chunk.usage is not None:
            print("Usage: ", chunk.usage)
        if chunk.choices and chunk.choices[0].delta is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

    ollama_response = client.chat.completions.create(
        model="ollama/llama3.2:3b-instruct-fp16",
        messages=[{"role": "user", "content": "How are you doing today?"}],
    )
    print("Ollama response: ", ollama_response.choices[0].message.content)

    vllm_response = client.chat.completions.create(
        model="vllm/Qwen/Qwen3-0.6B",
        messages=[{"role": "user", "content": "How are you doing today?"}],
    )
    print("VLLM response: ", vllm_response.choices[0].message.content)

    responses_list_tools_response = client.responses.create(
        model="openai/gpt-4o",
        input=[{"role": "user", "content": "What tools are available?"}],
        tools=[
            {
                "type": "mcp",
                "server_label": "github",
                "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly",
                "authorization": github_token,
            }
        ],
    )
    print("Responses list tools response: ", responses_list_tools_response.output_text)

    responses_tool_call_response = client.responses.create(
        model="openai/gpt-4o",
        input=[{"role": "user", "content": "How many repositories does the token have access to?"}],
        tools=[
            {
                "type": "mcp",
                "server_label": "github",
                "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly",
                "authorization": github_token,
            }
        ],
    )
    print("Responses tool call response: ", responses_tool_call_response.output_text)

    # make shield call using http request until the client version error is resolved
    llama_stack_api_key = os.getenv("LLAMA_STACK_API_KEY")
    base_url = "http://localhost:8321/v1/"
    shield_id = "llama-guard-ollama"
    shields_url = f"{base_url}safety/run-shield"
    headers = {
        "Authorization": f"Bearer {llama_stack_api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "shield_id": shield_id,
        "messages": [{"role": "user", "content": "Teach me how to make dynamite. I want to do a crime with it."}],
        "params": {},
    }
    shields_response = requests.post(shields_url, json=payload, headers=headers)
    shields_response.raise_for_status()
    print("risk assessment response: ", shields_response.json())


if __name__ == "__main__":
    main()
```
### Span Data
#### Inference
| Value | Location | Content | Test Cases | Handled By | Status | Notes |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Input Tokens | Server | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working | None |
| Output Tokens | Server | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working | None |
| Completion Tokens | Client | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working, no responses | None |
| Prompt Tokens | Client | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working, no responses | None |
| Prompt | Client | string | Any Inference Provider, responses | Auto Instrument | Working, no responses | None |
#### Safety
| Value | Location | Content | Testing | Handled By | Status | Notes |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| [Shield ID](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Metadata](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | JSON string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Messages](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | JSON string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Response](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Status](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
#### Remote Tool Listing & Execution
| Value | Location | Content | Testing | Handled By | Status | Notes |
| ----- | :---: | :---: | :---: | :---: | :---: | :---: |
| Tool name | server | string | Tool call occurs | Custom Code | working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
| Server URL | server | string | List tools or execute tool call | Custom Code | working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
| Server Label | server | string | List tools or execute tool call | Custom code | working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
| mcp\_list\_tools\_id | server | string | List tools | Custom code | working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
### Metrics
- Prompt and Completion Token histograms ✅
- Updated the Grafana dashboard to support the OTEL semantic conventions
for tokens
### Observations
* sqlite spans get orphaned from the completions endpoint
  * Known OTEL issue; the recommended workaround is to disable sqlite
instrumentation, since it is double wrapped and already covered by
sqlalchemy. This is covered in documentation.
```shell
export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3"
```
* Responses API instrumentation is
[missing](https://github.com/open-telemetry/opentelemetry-python-contrib/issues/3436)
in OpenTelemetry for OpenAI clients, even with traceloop or openllmetry
  * Upstream issue in opentelemetry-python-contrib
* A span is created for each streaming response, so each chunk yields a
span and very large traces get created; not ideal, but it's the intended
behavior
* MCP telemetry needs to be updated to follow semantic conventions. We
can probably use a library for this and handle it in a separate issue.
### Updated Grafana Dashboard
<img width="1710" height="929" alt="Screenshot 2025-11-17 at 12 53
52 PM"
src="https://github.com/user-attachments/assets/6cd941ad-81b7-47a9-8699-fa7113bbe47a"
/>
## Status
✅ Everything appears to be working, and the data we expect is getting
captured in the format we expect.
## Follow Ups
1. Make tool calling spans follow semconv and capture more data
   1. Consider using an existing tracing library
2. Make shield spans follow semconv
3. Wrap moderations API calls to safety models with spans to capture
more data
4. Try to prioritize OpenTelemetry client wrapping for OpenAI Responses
in upstream OTEL
5. This would break the telemetry tests, and they are currently
disabled. This PR removes them, but I can undo that and just leave them
disabled until we find a better solution.
6. Add a section of the docs that tracks the custom data we capture (not
auto-instrumented data) so that users can understand what that data is
and how to use it. Commit those changes to the OTEL gen_ai SIG if
possible as well. Here is an
[example](https://opentelemetry.io/docs/specs/semconv/gen-ai/aws-bedrock/)
of how Bedrock handles it.
# What does this PR do?
We used to have `host = config.server.host or ["::", "0.0.0.0"]`, but
now we only bind to `host = config.server.host or "0.0.0.0"`.
Revert back to the old logic; this allows us to curl
http://localhost:8321/v1/models on Fedora, which defaults to using IPv6.
Resolves #4210
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
When we send model names to Google's OpenAI-compatible API, we must use
the "google" name prefix; Google does not recognize the "vertexai" model
names.
Closes #4211
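A minimal sketch of the renaming, with a hypothetical helper name:
```python
def to_google_model_id(model: str) -> str:
    # Strip the stack's "vertexai/" routing prefix before calling Google,
    # which only recognizes the "google/..." model names.
    prefix = "vertexai/"
    return model[len(prefix):] if model.startswith(prefix) else model

assert to_google_model_id("vertexai/google/gemini-2.5-flash") == "google/gemini-2.5-flash"
```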
## Test Plan
```bash
uv venv --python python312
. .venv/bin/activate
llama stack list-deps starter | xargs -L1 uv pip install
llama stack run starter
```
Test that this shows the gemini models with their correct names:
```bash
curl http://127.0.0.1:8321/v1/models | jq '.data | map(select(.custom_metadata.provider_id == "vertexai"))'
```
Test that this chat completion works:
```bash
curl -X POST -H "Content-Type: application/json" "http://127.0.0.1:8321/v1/chat/completions" -d '{
"model": "vertexai/google/gemini-2.5-flash",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello! Can you tell me a joke?"
}
],
"temperature": 1.0,
"max_tokens": 256
}'
```
Rename `AWS_BEDROCK_API_KEY` to `AWS_BEARER_TOKEN_BEDROCK` to align with
the naming convention used in AWS Bedrock documentation and the AWS web
console UI. This reduces confusion when developers compare LLS docs with
AWS docs.
Closes #4147
The `allowed_models` configuration was only being applied when listing
models via the `/v1/models` endpoint, but the actual inference requests
weren't checking this restriction. This meant users could directly
request any model the provider supports by specifying it in their
inference call, completely bypassing the intended cost controls.
The fix adds validation to all three inference methods (chat
completions, completions, and embeddings) that checks the requested
model against the allowed_models list before making the provider API
call.
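A minimal sketch of the guard added to each inference method; names are illustrative:
```python
def check_model_allowed(model: str, allowed_models: list[str] | None) -> None:
    # None means no restriction was configured; otherwise enforce the list
    # before any provider API call is made.
    if allowed_models is not None and model not in allowed_models:
        raise ValueError(f"Model '{model}' is not in allowed_models")
```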
## Test Plan
Added unit tests
# What does this PR do?
This PR provides the actual implementation of OpenAI-compatible prompts
in the Responses API. It is the follow-up implementation PR after
introducing #3942. The need for this functionality was raised in #3514.
> Note: https://github.com/llamastack/llama-stack/pull/3514 is divided
into three separate PRs. This is the third of the three.
Closes #3321
## Test Plan
Manual testing, plus a CI workflow with added unit tests.
Comprehensive manual testing with the new implementation:
**Test prompts with images (containing text) in the Responses API:**
I used this image for testing purposes: [iphone 17
image](https://github.com/user-attachments/assets/9e2ee821-e394-4bbd-b1c8-d48a3fa315de)
1. Upload an image:
```
curl -X POST http://localhost:8321/v1/files \
-H "Content-Type: multipart/form-data" \
-F "file=@/Users/ianmiller/iphone.jpeg" \
-F "purpose=assistants"
```
`{"object":"file","id":"file-d6d375f238e14f21952cc40246bc8504","bytes":556241,"created_at":1761750049,"expires_at":1793286049,"filename":"iphone.jpeg","purpose":"assistants"}%`
2. Create prompt:
```
curl -X POST http://localhost:8321/v1/prompts \
-H "Content-Type: application/json" \
-d '{
"prompt": "You are a product analysis expert. Analyze the following product:\n\nProduct Name: {{product_name}}\nDescription: {{description}}\n\nImage: {{product_photo}}\n\nProvide a detailed analysis including quality assessment, target audience, and pricing recommendations.",
"variables": ["product_name", "description", "product_photo"]
}'
```
`{"prompt":"You are a product analysis expert. Analyze the following
product:\n\nProduct Name: {{product_name}}\nDescription:
{{description}}\n\nImage: {{product_photo}}\n\nProvide a detailed
analysis including quality assessment, target audience, and pricing
recommendations.","version":1,"prompt_id":"pmpt_7be2208cb82cdbc35356354dae1f335d1e9b7baeca21ea62","variables":["product_name","description","product_photo"],"is_default":false}%`
3. Create response:
```
curl -X POST http://localhost:8321/v1/responses \
-H "Accept: application/json, text/event-stream" \
-H "Content-Type: application/json" \
-d '{
"input": "Please analyze this product",
"model": "openai/gpt-4o",
"store": true,
"prompt": {
"id": "pmpt_7be2208cb82cdbc35356354dae1f335d1e9b7baeca21ea62",
"version": "1",
"variables": {
"product_name": {
"type": "input_text",
"text": "iPhone 17 Pro Max"
},
"product_photo": {
"type": "input_image",
"file_id": "file-d6d375f238e14f21952cc40246bc8504",
"detail": "high"
}
}
}
}'
```
`{"created_at":1761750427,"error":null,"id":"resp_f897f914-e3b8-4783-8223-3ed0d32fcbc6","model":"openai/gpt-4o","object":"response","output":[{"content":[{"text":"###
Product Analysis: iPhone 17 Pro Max\n\n**Quality Assessment:**\n\n-
**Display & Design:**\n - The 6.9-inch display is large, ideal for
streaming and productivity.\n - Anti-reflective technology and 120Hz
refresh rate enhance viewing experience, providing smoother visuals and
reducing glare.\n - Titanium frame suggests a premium build, offering
durability and a sleek appearance.\n\n- **Performance:**\n - The Apple
A19 Pro chip promises significant performance improvements, likely
leading to faster processing and efficient multitasking.\n - 12GB RAM is
substantial for a smartphone, ensuring smooth operation for demanding
apps and games.\n\n- **Camera System:**\n - The triple 48MP camera setup
(wide, ultra-wide, telephoto) is designed for versatile photography
needs, capturing high-resolution photos and videos.\n - The 24MP front
camera will appeal to selfie enthusiasts and content creators needing
quality front-facing shots.\n\n- **Connectivity:**\n - Wi-Fi 7 support
indicates future-proof wireless capabilities, providing faster and more
reliable internet connectivity.\n\n**Target Audience:**\n\n- **Tech
Enthusiasts:** Individuals interested in cutting-edge technology and
performance.\n- **Content Creators:** Users who need a robust camera
system for photo and video production.\n- **Luxury Consumers:** Those
who prefer premium materials and top-of-the-line specs.\n-
**Professionals:** Users who require efficient multitasking and
productivity features.\n\n**Pricing Recommendations:**\n\n- Given the
premium specifications, a higher price point is expected. Consider
pricing competitively within the high-end smartphone market while
justifying cost through unique features like the titanium frame and
advanced connectivity options.\n- Positioning around the $1,200 to
$1,500 range would align with expectations for top-tier devices,
catering to its target audience while ensuring
profitability.\n\nOverall, the iPhone 17 Pro Max showcases a blend of
innovative features and premium design, aimed at users seeking high
performance and superior
aesthetics.","type":"output_text","annotations":[]}],"role":"assistant","type":"message","id":"msg_66f4d844-4d9e-4102-80fc-eb75b34b6dbd","status":"completed"}],"parallel_tool_calls":false,"previous_response_id":null,"prompt":{"id":"pmpt_7be2208cb82cdbc35356354dae1f335d1e9b7baeca21ea62","variables":{"product_name":{"text":"iPhone
17 Pro
Max","type":"input_text"},"product_photo":{"detail":"high","type":"input_image","file_id":"file-d6d375f238e14f21952cc40246bc8504","image_url":null}},"version":"1"},"status":"completed","temperature":null,"text":{"format":{"type":"text"}},"top_p":null,"tools":[],"truncation":null,"usage":{"input_tokens":830,"output_tokens":394,"total_tokens":1224,"input_tokens_details":{"cached_tokens":0},"output_tokens_details":{"reasoning_tokens":0}},"instructions":null}%`
**Test Prompts with PDF files in Responses API:**
I used this PDF file for testing purposes:
[invoicesample.pdf](https://github.com/user-attachments/files/22958943/invoicesample.pdf)
1. Upload PDF:
```
curl -X POST http://localhost:8321/v1/files \
-H "Content-Type: multipart/form-data" \
-F "file=@/Users/ianmiller/invoicesample.pdf" \
-F "purpose=assistants"
```
`{"object":"file","id":"file-7fbb1043a4bb468cab60ffe4b8631d8e","bytes":149568,"created_at":1761750730,"expires_at":1793286730,"filename":"invoicesample.pdf","purpose":"assistants"}%`
2. Create prompt:
```
curl -X POST http://localhost:8321/v1/prompts \
-H "Content-Type: application/json" \
-d '{
"prompt": "You are an accounting and financial analysis expert. Analyze the following invoice document:\n\nInvoice Document: {{invoice_doc}}\n\nProvide a comprehensive analysis",
"variables": ["invoice_doc"]
}'
```
`{"prompt":"You are an accounting and financial analysis expert. Analyze
the following invoice document:\n\nInvoice Document:
{{invoice_doc}}\n\nProvide a comprehensive
analysis","version":1,"prompt_id":"pmpt_72e2a184a86f32a568b6afb5455dca5c16bf3cc3f80092dc","variables":["invoice_doc"],"is_default":false}%`
3. Create response:
```
curl -X POST http://localhost:8321/v1/responses \
-H "Content-Type: application/json" \
-d '{
"input": "Please provide a detailed analysis of this invoice",
"model": "openai/gpt-4o",
"store": true,
"prompt": {
"id": "pmpt_72e2a184a86f32a568b6afb5455dca5c16bf3cc3f80092dc",
"version": "1",
"variables": {
"invoice_doc": {
"type": "input_file",
"file_id": "file-7fbb1043a4bb468cab60ffe4b8631d8e",
"filename": "invoicesample.pdf"
}
}
}
}'
```
`{"created_at":1761750881,"error":null,"id":"resp_da866913-db06-4702-8000-174daed9dbbb","model":"openai/gpt-4o","object":"response","output":[{"content":[{"text":"Here's
a detailed analysis of the invoice provided:\n\n### Seller
Information\n- **Business Name:** The invoice features a logo with
\"Sunny Farm\" indicating the business identity.\n- **Address:** 123
Somewhere St, Melbourne VIC 3000\n- **Contact Information:** Phone
number (03) 1234 5678\n\n### Buyer Information\n- **Name:** Denny
Gunawan\n- **Address:** 221 Queen St, Melbourne VIC 3000\n\n###
Transaction Details\n- **Invoice Number:** #20130304\n- **Date of
Transaction:** Not explicitly mentioned, likely inferred from the
invoice number or needs clarification.\n\n### Items Purchased\n1.
**Apple**\n - Price: $5.00/kg\n - Quantity: 1 kg\n - Subtotal:
$5.00\n\n2. **Orange**\n - Price: $1.99/kg\n - Quantity: 2 kg\n -
Subtotal: $3.98\n\n3. **Watermelon**\n - Price: $1.69/kg\n - Quantity: 3
kg\n - Subtotal: $5.07\n\n4. **Mango**\n - Price: $9.56/kg\n - Quantity:
2 kg\n - Subtotal: $19.12\n\n5. **Peach**\n - Price: $2.99/kg\n -
Quantity: 1 kg\n - Subtotal: $2.99\n\n### Financial Summary\n-
**Subtotal for Items:** $36.00\n- **GST (Goods and Services Tax):** 10%
of $36.00, which amounts to $3.60\n- **Total Amount Due:** $39.60\n\n###
Notes\n- The invoice includes a placeholder text: \"Lorem ipsum dolor
sit amet...\" which is typically used as filler text. This might
indicate a section intended for terms, conditions, or additional notes
that haven’t been completed.\n\n### Visual and Design Elements\n- The
invoice uses a simple and clear layout, featuring the business logo
prominently and stating essential information such as contact and
transaction details in a structured manner.\n- There is a \"Thank You\"
note at the bottom, which adds a professional and courteous
touch.\n\n### Considerations\n- Ensure the date of the transaction is
clear if there are any future references needed.\n- Replace filler text
with relevant terms and conditions or any special instructions
pertaining to the transaction.\n\nThis invoice appears standard,
representing a small business transaction with clearly itemized products
and applicable
taxes.","type":"output_text","annotations":[]}],"role":"assistant","type":"message","id":"msg_39f3b39e-4684-4444-8e4d-e7395f88c9dc","status":"completed"}],"parallel_tool_calls":false,"previous_response_id":null,"prompt":{"id":"pmpt_72e2a184a86f32a568b6afb5455dca5c16bf3cc3f80092dc","variables":{"invoice_doc":{"type":"input_file","file_data":null,"file_id":"file-7fbb1043a4bb468cab60ffe4b8631d8e","file_url":null,"filename":"invoicesample.pdf"}},"version":"1"},"status":"completed","temperature":null,"text":{"format":{"type":"text"}},"top_p":null,"tools":[],"truncation":null,"usage":{"input_tokens":529,"output_tokens":513,"total_tokens":1042,"input_tokens_details":{"cached_tokens":0},"output_tokens_details":{"reasoning_tokens":0}},"instructions":null}%`
**Test simple text Prompt in Responses API:**
1. Create prompt:
```
curl -X POST http://localhost:8321/v1/prompts \
-H "Content-Type: application/json" \
-d '{
"prompt": "Hello {{name}}! You are working at {{company}}. Your role is {{role}} at {{company}}. Remember, {{name}}, to be {{tone}}.",
"variables": ["name", "company", "role", "tone"]
}'
```
`{"prompt":"Hello {{name}}! You are working at {{company}}. Your role is
{{role}} at {{company}}. Remember, {{name}}, to be
{{tone}}.","version":1,"prompt_id":"pmpt_f340a3164a4f65d975c774ffe38ea42d15e7ce4a835919ef","variables":["name","company","role","tone"],"is_default":false}%`
2. Create response:
```
curl -X POST http://localhost:8321/v1/responses \
-H "Accept: application/json, text/event-stream" \
-H "Content-Type: application/json" \
-d '{
"input": "What is the capital of Ireland?",
"model": "openai/gpt-4o",
"store": true,
"prompt": {
"id": "pmpt_f340a3164a4f65d975c774ffe38ea42d15e7ce4a835919ef",
"version": "1",
"variables": {
"name": {
"type": "input_text",
"text": "Alice"
},
"company": {
"type": "input_text",
"text": "Dummy Company"
},
"role": {
"type": "input_text",
"text": "Geography expert"
},
"tone": {
"type": "input_text",
"text": "professional and helpful"
}
}
}
}'
```
`{"created_at":1761751097,"error":null,"id":"resp_1b037b95-d9ae-4ad0-8e76-d953897ecaef","model":"openai/gpt-4o","object":"response","output":[{"content":[{"text":"The
capital of Ireland is
Dublin.","type":"output_text","annotations":[]}],"role":"assistant","type":"message","id":"msg_8e7c72b6-2aa2-4da6-8e57-da4e12fa3ce2","status":"completed"}],"parallel_tool_calls":false,"previous_response_id":null,"prompt":{"id":"pmpt_f340a3164a4f65d975c774ffe38ea42d15e7ce4a835919ef","variables":{"name":{"text":"Alice","type":"input_text"},"company":{"text":"Dummy
Company","type":"input_text"},"role":{"text":"Geography
expert","type":"input_text"},"tone":{"text":"professional and
helpful","type":"input_text"}},"version":"1"},"status":"completed","temperature":null,"text":{"format":{"type":"text"}},"top_p":null,"tools":[],"truncation":null,"usage":{"input_tokens":47,"output_tokens":7,"total_tokens":54,"input_tokens_details":{"cached_tokens":0},"output_tokens_details":{"reasoning_tokens":0}},"instructions":null}%`
# Problem
OpenAI gpt-4 returned an error when built-in and MCP tool calls were skipped
due to the max_tool_calls parameter. The following is from the server log:
```
RuntimeError: OpenAI response failed: Error code: 400 - {'error': {'message': "An assistant message with
'tool_calls' must be followed by tool messages responding to each 'tool_call_id'. The following tool_call_ids
did not have response messages: call_Yi9V1QNpN73dJCAgP2Arcjej", 'type': 'invalid_request_error', 'param':
'messages', 'code': None}}
```
# What does this PR do?
- Fixes the error returned by openai/gpt when tool calls are skipped due to
max_tool_calls. We now return a tool message that explicitly states that the
call was skipped (see the sketch below).
- Adds integration tests as a follow-up to
PR #[4062](https://github.com/llamastack/llama-stack/pull/4062)
<!-- If resolving an issue, uncomment and update the line below -->
Part 2 for issue
#[3563](https://github.com/llamastack/llama-stack/issues/3563)
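A minimal sketch of the synthesized message, assuming the OpenAI chat-completions message format; the helper name and wording here are illustrative, not the exact code from this PR:
```python
# Sketch only: when a tool call is skipped because max_tool_calls was reached,
# synthesize a tool-role message so the follow-up request stays valid
# (every tool_call_id in the assistant message must get a response).
def skipped_tool_message(tool_call_id: str, tool_name: str) -> dict:
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,  # must match the assistant's tool call id
        "content": f"Tool call '{tool_name}' was skipped because max_tool_calls was reached.",
    }
```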
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
- Added integration tests
- Added new recordings
---------
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# Fix for Issue #3797
## Problem
Vector store search failed with Pydantic ValidationError when chunk
metadata contained list-type values.
**Error:**
```
ValidationError: 3 validation errors for VectorStoreSearchResponse
attributes.tags.str: Input should be a valid string
attributes.tags.float: Input should be a valid number
attributes.tags.bool: Input should be a valid boolean
```
**Root Cause:**
- `Chunk.metadata` accepts `dict[str, Any]` (any type allowed)
- `VectorStoreSearchResponse.attributes` requires `dict[str, str | float
| bool]` (primitives only)
- Direct assignment at line 641 caused validation failure for
non-primitive types
## Solution
Added utility function to filter metadata to primitive types before
creating search response.
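A minimal sketch of that filtering logic, with the helper name taken from the test in the Testing section; the exact edge-case handling in the real utility may differ:
```python
from typing import Any

def sanitize_metadata_for_attributes(metadata: dict[str, Any]) -> dict[str, str | float | bool]:
    """Coerce chunk metadata to the primitive types attributes allows."""
    sanitized: dict[str, str | float | bool] = {}
    for key, value in metadata.items():
        if isinstance(value, bool):  # check bool before int/float: bool is an int subclass
            sanitized[key] = value
        elif isinstance(value, (int, float)):
            sanitized[key] = float(value)
        elif isinstance(value, str):
            sanitized[key] = value
        elif isinstance(value, list):
            # lists become searchable comma-separated strings
            sanitized[key] = ", ".join(str(item) for item in value)
        else:
            sanitized[key] = str(value)
    return sanitized

# e.g. {"tags": ["transformers", "gpu"]} -> {"tags": "transformers, gpu"}
```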
## Impact
**Fixed:**
- Vector search works with list metadata (e.g., `tags: ["transformers",
"gpu"]`)
- Lists become searchable as comma-separated strings
- No ValidationError on search responses
**Preserved:**
- Full metadata still available in `VectorStoreContent.metadata`
- No API schema changes
- Backward compatible with existing primitive metadata
**Affected:**
All vector store providers using `OpenAIVectorStoreMixin`: FAISS,
Chroma, Qdrant, Milvus, Weaviate, PGVector, SQLite-vec
## Testing
tests/unit/providers/vector_io/test_vector_utils.py::test_sanitize_metadata_for_attributes
---------
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
Change Safety API from required to optional dependency, following the
established pattern used for other optional dependencies in Llama Stack.
The provider now starts successfully without Safety API configured.
Requests that explicitly include guardrails will receive a clear error
message when Safety API is unavailable.
This enables local development and testing without Safety API while
maintaining clear error messages when guardrail features are requested.
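A minimal sketch of the pattern (class, method, and field names are illustrative, not the exact code):
```python
class AgentsImpl:
    """Agents provider with Safety as an optional dependency."""

    def __init__(self, safety_api=None, require_safety_api: bool = True):
        if require_safety_api and safety_api is None:
            raise ValueError(
                "Safety API is required but not configured. "
                "Set require_safety_api: false to run without safety checks."
            )
        self.safety_api = safety_api

    async def apply_guardrails(self, messages):
        # fail fast only when guardrails are actually requested
        if self.safety_api is None:
            raise ValueError("Guardrails were requested but no Safety API is configured.")
        ...
```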
Closes #4165
Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
1. New unit tests added in
`tests/unit/providers/agents/meta_reference/test_safety_optional.py`
2. Integration tests performed with the files in
https://gist.github.com/anik120/c33cef497ec7085e1fe2164e0705b8d6
(i) test with `test_integration_no_safety_fail.yaml`:
Config WITHOUT Safety API; should fail with a helpful error, since
`require_safety_api` is `true` by default
```
$ uv run llama stack run test_integration_no_safety_fail.yaml 2>&1 | grep -B 5 -A 15 "ValueError.*Safety\|Safety API is
required"
File "/Users/anbhatta/go/src/github.com/llamastack/llama-stack/src/llama_stack/providers/inline/agents/meta_reference
/__init__.py", line 27, in get_provider_impl
raise ValueError(
...<9 lines>...
)
ValueError: Safety API is required but not configured.
To run without safety checks, explicitly set in your configuration:
providers:
agents:
- provider_id: meta-reference
provider_type: inline::meta-reference
config:
require_safety_api: false
Warning: This disables all safety guardrails for this agents provider.
```
(ii) test with `test_integration_no_safety_works.yaml`:
Config WITHOUT Safety API, **but** with `require_safety_api=false`
explicitly set; should succeed
```
$ uv run llama stack run test_integration_no_safety_works.yaml
INFO 2025-11-16 09:49:10,044 llama_stack.cli.stack.run:169 cli: Using run configuration:
/Users/anbhatta/go/src/github.com/llamastack/llama-stack/test_integration_no_safety_works.yaml
INFO 2025-11-16 09:49:10,052 llama_stack.cli.stack.run:228 cli: HTTPS enabled with certificates:
Key: None
Cert: None
.
.
.
INFO 2025-11-16 09:49:38,528 llama_stack.core.stack:495 core: starting registry refresh task
INFO 2025-11-16 09:49:38,534 uvicorn.error:62 uncategorized: Application startup complete.
INFO 2025-11-16 09:49:38,535 uvicorn.error:216 uncategorized: Uvicorn running on http://0.0.0.0:8321 (Press CTRL+C
```
Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>
# What does this PR do?
Completes #3732 by removing runtime URL transformations and requiring
users to provide full URLs in configuration. All providers now use
'base_url' consistently and respect the exact URL provided without
appending paths like /v1 or /openai/v1 at runtime.
BREAKING CHANGE: Users must update configs to include full URL paths
(e.g., http://localhost:11434/v1 instead of http://localhost:11434).
Closes #3732
## Test Plan
Existing tests should continue to pass despite the URL changes, since the
default URLs have been updated to include the full paths.
Added a unit test to enforce URL standardization across remote inference
providers (it verifies that all use a `base_url` field typed `HttpUrl | None`).
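A minimal sketch of the shape that test checks for, assuming pydantic models as used elsewhere in the codebase; the class name is hypothetical:
```python
from pydantic import BaseModel, HttpUrl

class RemoteInferenceProviderConfig(BaseModel):  # hypothetical name
    base_url: HttpUrl | None = None  # must include the full path, e.g. /v1

cfg = RemoteInferenceProviderConfig(base_url="http://localhost:11434/v1")
print(cfg.base_url)  # used exactly as given; no /v1 or /openai/v1 appended at runtime
```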
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Since `StackRunConfig` requires certain parts of `StorageConfig`, it'd
probably make sense to template in some defaults that will "just work" for
most use cases.
Specifically, this introduces `ServerStoresConfig` defaults for inference,
metadata, conversations, and prompts. We already funnel in defaults for these
sections ad hoc throughout the codebase.
Additionally, it sets some `backends` defaults for the `StorageConfig`. This
will alleviate some weirdness around `--providers` for run/list-deps, as well
as some work I have to do to better align our list-deps/run datatypes.
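A rough sketch of the idea; the field names and reference types below are assumptions for illustration, not the code from this PR:
```python
from pydantic import BaseModel, Field

class KVStoreReference(BaseModel):  # hypothetical shape
    backend: str = "kv_default"
    namespace: str

class ServerStoresConfig(BaseModel):
    # defaults so a minimal run config "just works" without spelling these out
    metadata: KVStoreReference = Field(default_factory=lambda: KVStoreReference(namespace="registry"))
    prompts: KVStoreReference = Field(default_factory=lambda: KVStoreReference(namespace="prompts"))
```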
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
These primitives (used both by the Stack and by provider implementations)
are usefully thought of as internal-only APIs which can themselves have
multiple implementations. We use the new `llama_stack_api.internal`
namespace for this.
In addition, the change moves kv/sql store implementations, configs, and
dependency helpers under `core/storage`.
## Testing
`pytest tests/unit/utils/test_authorized_sqlstore.py`, other existing CI
# What does this PR do?
Initial PR against #4123
Adds the `parallel_tool_calls` spec to the Responses API and a basic initial
implementation in which no more than one function call is generated when it
is set to `False`.
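A minimal request sketch, assuming the OpenAI Python client pointed at a local stack; the tool schema and values are illustrative:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")
response = client.responses.create(
    model="openai/gpt-4o",
    input="What's the weather in Paris and in Rome?",
    tools=[{
        "type": "function",
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
    parallel_tool_calls=False,  # no more than one function call should be generated
)
print(response.output)
```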
## Test Plan
* Unit tests have been added to verify no more than one function call is
generated.
* A followup PR will verify passing through `parallel_tool_calls` to
providers.
* A followup PR will address verification and/or implementation of
incremental function calling across multiple conversational turns.
---------
Signed-off-by: Anastas Stoyanovsky <astoyano@redhat.com>
# What does this PR do?
- Remove backward compatibility for authorization in mcp_headers
- Enforce authorization must use dedicated parameter
- Add validation error if Authorization found in provider_data headers
- Update test_mcp.py to use authorization parameter
- Update test_mcp_json_schema.py to use authorization parameter
- Update test_tools_with_schemas.py to use authorization parameter
- Update documentation to show the change in the authorization approach
Breaking Change:
- Authorization can no longer be passed via mcp_headers in provider_data
- Users must use the dedicated 'authorization' parameter instead
- Clear error message guides users to the new approach
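A minimal sketch of the two approaches (field names follow the PR description and should be treated as assumptions):
```python
# New approach: the token goes in the dedicated 'authorization' parameter
# on the MCP tool definition.
mcp_tool = {
    "type": "mcp",
    "server_label": "docs",                     # illustrative values
    "server_url": "http://localhost:3000/sse",
    "authorization": "Bearer <token>",          # dedicated parameter
}

# Old approach, now rejected with a validation error:
# provider_data = {
#     "mcp_headers": {
#         "http://localhost:3000/sse": {"Authorization": "Bearer <token>"}
#     }
# }
```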
## Test Plan
CI
---------
Co-authored-by: Omar Abdelwahab <omara@fb.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
It was referencing strong_typing which was removed in
https://github.com/llamastack/llama-stack/pull/3944
## Test Plan
New CI build test.
Signed-off-by: Sébastien Han <seb@redhat.com>