- simplify search result processing in demo script
- optimize demo script by using inline text instead of big external file
- improve printout clarity and user experience
---------
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
# What does this PR do?
- Add new /admin API (v1alpha) for administrative operations including
provider management, health checks, version info, and route listing
- Implement using FastAPI routers following batches pattern with proper
request/response models
- Endpoints: /admin/providers, /admin/providers/{id},
/admin/inspect/routes, /admin/health, /admin/version
- Create admin module structure: models.py, api.py, fastapi_routes.py,
`__init__.py`
- Add AdminImpl in llama_stack/core combining provider and inspect
functionality
- Deprecate standalone /providers and /inspect APIs (remain functional
for backward compatibility)
- Consolidate duplicate types: ProviderInfo, HealthInfo, RouteInfo, etc.
now defined once in admin.models
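For illustration only, the new endpoints can be exercised over plain HTTP once
a stack server is running; the base URL and the `/v1alpha` prefix below are
assumptions, while the paths mirror the list above:
```python
# Illustrative sketch: base URL and version prefix are assumptions.
import httpx

BASE = "http://localhost:8321/v1alpha"

for path in ("/admin/health", "/admin/version", "/admin/providers", "/admin/inspect/routes"):
    resp = httpx.get(f"{BASE}{path}")
    print(path, resp.status_code, resp.json())
```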
## Test Plan
New admin integration suite that uses the generated Stainless SDK and
records new tests on this PR.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Fixed issue where code was injecting `run_config.vector_stores` even
when it was `None`, which overrode the `default_factory` in
`RagToolRuntimeConfig`. Currently, _most_ providers don't have a default
implementation for vector_stores:
- nvidia
- meta-reference-gpu
- dell
- oci
- open-benchmark
- postgres-demo
- watsonx
The only ones which do are:
- ci-tests
- starter
- starter-gpu
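The underlying Pydantic behavior, as a minimal sketch (not the real
`RagToolRuntimeConfig` definition): omitting the field lets `default_factory`
run, while explicitly passing `None` bypasses it and fails validation, which is
exactly what injecting `run_config.vector_stores` was doing.
```python
# Minimal sketch of the failure mode; the real models live in llama-stack.
from pydantic import BaseModel, Field, ValidationError


class VectorStoresConfig(BaseModel):
    default_provider_id: str = "faiss"


class RagToolRuntimeConfig(BaseModel):
    vector_stores_config: VectorStoresConfig = Field(default_factory=VectorStoresConfig)


print(RagToolRuntimeConfig())  # default_factory kicks in, works fine

try:
    RagToolRuntimeConfig(vector_stores_config=None)  # explicit None overrides the factory
except ValidationError as e:
    print(e)  # "Input should be a valid dictionary or instance of VectorStoresConfig"
```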
## Test Plan
Prior to the change, I could not start llama-stack with the oci
distribution:
```
Traceback (most recent call last):
File "/home/opc/llama-stack/.venv/bin/llama", line 10, in <module>
sys.exit(main())
^^^^^^
File "/home/opc/llama-stack/src/llama_stack/cli/llama.py", line 52, in main
parser.run(args)
File "/home/opc/llama-stack/src/llama_stack/cli/llama.py", line 46, in run
args.func(args)
File "/home/opc/llama-stack/src/llama_stack/cli/stack/run.py", line 184, in _run_stack_run_cmd
self._uvicorn_run(config_file, args)
File "/home/opc/llama-stack/src/llama_stack/cli/stack/run.py", line 242, in _uvicorn_run
uvicorn.run("llama_stack.core.server.server:create_app", **uvicorn_config) # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/opc/llama-stack/.venv/lib/python3.12/site-packages/uvicorn/main.py", line 580, in run
server.run()
File "/home/opc/llama-stack/.venv/lib/python3.12/site-packages/uvicorn/server.py", line 67, in run
return asyncio.run(self.serve(sockets=sockets))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/opc/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/home/opc/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/opc/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/home/opc/llama-stack/.venv/lib/python3.12/site-packages/uvicorn/server.py", line 71, in serve
await self._serve(sockets)
File "/home/opc/llama-stack/.venv/lib/python3.12/site-packages/uvicorn/server.py", line 78, in _serve
config.load()
File "/home/opc/llama-stack/.venv/lib/python3.12/site-packages/uvicorn/config.py", line 442, in load
self.loaded_app = self.loaded_app()
^^^^^^^^^^^^^^^^^
File "/home/opc/llama-stack/src/llama_stack/core/server/server.py", line 403, in create_app
app = StackApp(
^^^^^^^^^
File "/home/opc/llama-stack/src/llama_stack/core/server/server.py", line 161, in __init__
future.result()
File "/home/opc/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/opc/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/opc/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py", line 59, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/opc/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/home/opc/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/opc/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/home/opc/llama-stack/src/llama_stack/core/stack.py", line 534, in initialize
impls = await resolve_impls(
^^^^^^^^^^^^^^^^^^^^
File "/home/opc/llama-stack/src/llama_stack/core/resolver.py", line 180, in resolve_impls
return await instantiate_providers(sorted_providers, router_apis, dist_registry, run_config, policy, internal_impls)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/opc/llama-stack/src/llama_stack/core/resolver.py", line 321, in instantiate_providers
impl = await instantiate_provider(provider, deps, inner_impls, dist_registry, run_config, policy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/opc/llama-stack/src/llama_stack/core/resolver.py", line 417, in instantiate_provider
config = config_type(**provider_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/opc/llama-stack/.venv/lib/python3.12/site-packages/pydantic/main.py", line 253, in __init__
validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for RagToolRuntimeConfig
vector_stores_config
Input should be a valid dictionary or instance of VectorStoresConfig [type=model_type, input_value=None, input_type=NoneType]
For further information visit https://errors.pydantic.dev/2.11/v/model_type
```
After tracing through and finding a simple fix, I was able to run the
distribution again. I also executed the integration tests with pytest:
```bash
OCI_COMPARTMENT_OCID="ocid1.compartment.oc1..xxx" OCI_REGION="us-chicago-1" OCI_AUTH_TYPE=instance_principal OCI_CLI_PROFILE=CHICAGO uv run pytest -sv tests/integration/inference/ --stack-config oci --text-model oci/meta.llama-3.3-70b-instruct --inference-mode live
```
Changes:
o Remove TELEMETRY_SINKS environment variable from scripts (unused)
o Replace with OTEL_EXPORTER_OTLP_PROTOCOL in install scripts
The TELEMETRY_SINKS variable is no longer used by Python code and has
been replaced with the standard OpenTelemetry environment variable
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf.
Inject `stream_options={"include_usage": True}` when streaming and
OpenTelemetry telemetry is active. Telemetry always overrides any caller
preference to ensure complete and consistent observability metrics.
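A minimal sketch of that injection logic (illustrative only; the real change
lives in the mixins listed below):
```python
# Illustrative sketch -- the actual change lives in OpenAIMixin / LiteLLMOpenAIMixin.
from opentelemetry import trace


def maybe_force_usage(params: dict) -> dict:
    """Force include_usage when streaming while an OTel span is recording."""
    if params.get("stream") and trace.get_current_span().is_recording():
        params["stream_options"] = {"include_usage": True}  # overrides caller preference
    return params
```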
Changes:
- Add conditional stream_options injection to OpenAIMixin (benefits
OpenAI, Bedrock, Runpod, Together, Fireworks providers)
- Add conditional stream_options injection to LiteLLMOpenAIMixin
(benefits WatsonX and other litellm-based providers)
- Check telemetry status using trace.get_current_span().is_recording()
- Override include_usage=False when telemetry active to prevent metric
gaps
- Unit tests for this functionality
Fixes #3981
Note: this work originated in PR #4200, which I closed after rebasing on
the telemetry changes. This PR rebases those commits, incorporates the
Bedrock feedback, and carries forward the same scope described there.
## Test Plan
#### OpenAIMixin + telemetry injection tests
PYTHONPATH=src python -m pytest tests/unit/providers/utils/inference/test_openai_mixin.py
#### LiteLLM OpenAIMixin tests
PYTHONPATH=src python -m pytest tests/unit/providers/inference/test_litellm_openai_mixin.py -v
#### Broader inference provider
PYTHONPATH=src python -m pytest tests/unit/providers/inference/ --ignore=tests/unit/providers/inference/test_inference_client_caching.py -v
The Dataset model no longer has a dataset_schema attribute; it was removed
during a refactor (5287b437a), so this validation can no longer run.
Changes:
o basic scoring: removed validate_dataset_schema call and related imports
o llm_as_judge scoring: removed validate_dataset_schema call and related imports
o braintrust scoring: removed validate_dataset_schema call and related imports
Validation is no longer needed at the dataset level since:
o Dataset model changed from having dataset_schema to purpose/source fields
o Scoring functions validate required fields when processing rows
o Invalid data will fail naturally with clear error messages
Fixes: #4419
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Various fixes to integration test recording + stainless calling of
integration tests:
1. Only the library client was being run; all clients should be
2. The git check grabs diffs like:
M tests/integration/client-typescript/package-lock.json
M tests/integration/client-typescript/package.json
when it should not
Additionally, this fixes rebase conflicts when the stainless workflow runs
integration tests with record-if-missing mode on PRs. Previously, the
workflow would:
1. Commit all files in tests/integration/ (including non-recordings)
2. Try to rebase and push to 'main' instead of the PR branch
3. Fail with merge conflicts on PR-specific changes
Changes:
- Add pr_head_ref and is_fork_pr parameters flowing through workflow
chain
- Use target-branch input instead of github.ref_name in recording
commits
- Detect and handle fork PRs by skipping push and uploading recordings
as artifacts
- Add 7-day artifact retention for fork PR recordings
- Support both workflow_call and direct pull_request trigger contexts
For same-repo PRs: recordings now commit/push to the PR branch correctly
For fork PRs: recordings upload as downloadable artifacts with
instructions
you can see a failing workflow with the rebase issues: 5846590613
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
we will typically need to record the missing json for net new APIs. use
record-if-missing so that the integration tests can re-record and commit
the files to the PR
set the stainless inference mode to record-if-missing, and properly pass
the pr_head_sha on workflow_call.
## Test Plan
see 2031824567
which uses this commit.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Enhances the Vector Stores config with a full set of appropriate
configuration options
- Add FileIngestionParams, ChunkRetrievalParams, and FileBatchParams
subconfigs
- Update RAG memory, OpenAI vector store mixin, and vector store utils
to use configuration
- Fix import organization across vector store components
- Add comprehensive vector stores configuration documentation
- Update docs navigation to include vector store configuration guide
- Delete `memory/constants.py` and move constant values directly into
Pydantic models
## Test Plan
Tests updated + CI
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Migrate the Inspect API to the FastAPI router pattern.
Changes:
- Add inspect API to FastAPI router registry
- Add PUBLIC_ROUTE_KEY support for routes that don't require auth
- Update WebMethod creation to respect route's openapi_extra for
authentication requirements
Fixes: https://github.com/llamastack/llama-stack/issues/4346
## Test Plan
CI and various curls on /v1/inspect/routes, /v1/health, /v1/version
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Convert Providers API from @webmethod decorators to FastAPI router
pattern.
Fixes: https://github.com/llamastack/llama-stack/issues/4350
## Test Plan
CI
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Consolidates provider data context handling into middleware, eliminating
duplication between FastAPI router routes and legacy @webmethod routes.
Closes #4366
## Test Plan
Added unit test suite `test_test_context_middleware`, specifically
`test_middleware_extracts_test_id_from_header` to validate the expected
behavior.
```
❯ ./scripts/unit-tests.sh tests/unit/
```
Integration of the middleware test context with the `files` FastAPI
router migration from
[pull/4339](https://github.com/llamastack/llama-stack/pull/4339).
```
❯ git switch migrate-files-api
Switched to branch 'migrate-files-api'
❯ git rebase fix-test-ctx-middleware
Successfully rebased and updated refs/heads/migrate-files-api.
❯ ./scripts/integration-tests.sh --inference-mode replay --suite base --setup ollama --stack-config server:starter --subdirs files
```
Signed-off-by: Matthew F Leader <mleader@redhat.com>
Vector store operations were bypassing ABAC checks by calling providers
directly instead of going through the routing table. This allowed
unauthorized access to vector store data and operations.
Changes:
o Route all VectorIORouter methods through routing table instead of
directly to providers
o Update routing table to enforce ABAC checks on all vector store
operations (read, update, delete)
o Add test suite verifying ABAC enforcement for all vector store
operations
o Ensure providers are never called when authorization fails
Fixes security issue where users could access vector stores they don't
have permission for.
Fixes: #4393
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Enable stainless-builds workflow to test preview SDKs by calling
integration-tests workflow with python_url parameter. Add stainless
matrix config for faster CI runs on SDK changes.
- Make integration-tests.yml reusable with workflow_call inputs
- Thread python_url through test setup actions to install preview SDK
- Add matrix_key parameter to generate_ci_matrix.py for custom matrices
- Update stainless-builds.yml to call integration tests with preview URL
This allows us to test a client on the PR introducing the new changes
before merging. Contributors can even write new tests using the
generated client which should pass on the PR, indicating that they will
pass on main upon merge
## Test Plan
see triggered action using the workflows on this branch:
5810594042
which installs the stainless SDK from the given url.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
since run.yaml is gone, update logs to say "stack config" or "stack
configuration" rather than run
## Test Plan
check logs
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Convert the Datasets API from webmethod decorators to FastAPI router
pattern.
Fixes: https://github.com/llamastack/llama-stack/issues/4344
## Test Plan
CI
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Adds support for enforcing tool usage via responses api. See
https://platform.openai.com/docs/api-reference/responses/create#responses_create-tool_choice
for details from official documentation.
Note: at present this PR only supports `file_search` and `web_search` as
options to enforce builtin tool usage
Closes #3548
## Test Plan
`./scripts/unit-tests.sh tests/unit/providers/agents/meta_reference/test_response_tool_context.py`
---------
Signed-off-by: Jaideep Rao <jrao@redhat.com>
# What does this PR do?
- Enables users to configure prompts used throughout the File Search /
Vector Retrieval
- Configuration is defined in the Vector Stores Config so they can be
modified at runtime
- Backwards compatible, which means the fields are optional and default
to the previously used values
This is the summary of the new options in the `run.yaml`
```yaml
vector_stores:
file_search_params:
header_template: 'knowledge_search tool found {num_chunks} chunks:\nBEGIN of knowledge_search tool results.\n'
footer_template: 'END of knowledge_search tool results.\n'
context_prompt_params:
chunk_annotation_template: 'Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n'
context_template: 'The above results were retrieved to help answer the user\'s query: "{query}". Use them as supporting information only in answering this query.{annotation_instruction}\n'
annotation_prompt_params:
enable_annotations: true
annotation_instruction_template: 'Cite sources immediately at the end of sentences before punctuation, using `<|file-id|>` format like \'This is a fact <|file-Cn3MSNn72ENTiiq11Qda4A|>.\'. Do not add
extra punctuation. Use only the file IDs provided, do not invent new ones.'
chunk_annotation_template: '[{index}] {metadata_text} cite as <|{file_id}|>\n{chunk_text}\n'
```
## Test Plan
Added tests.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Fix provider header API key handling by correctly unwrapping `SecretStr`
values for provider data API keys. Previously the validator cast header
keys to `SecretStr` but the value wasn’t unwrapped before use, causing
authentication failures with providers like Azure.
Closes https://github.com/llamastack/llama-stack/issues/4370
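The gist of the fix, sketched with Pydantic's `SecretStr` (names are
illustrative, not the actual validator code):
```python
from pydantic import SecretStr

api_key = SecretStr("azure-key-from-header")  # what the validator now stores

print(str(api_key))                # '**********' -- what providers saw before the fix
print(api_key.get_secret_value())  # 'azure-key-from-header' -- what they need for auth
```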
Allow users to specify the inference model through the INFERENCE_MODEL
environment variable instead of hardcoding it, with fallback to
ollama/llama3.2:3b if not set.
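In the demo script this amounts to roughly (sketch):
```python
import os

# Fall back to the previous hardcoded model when INFERENCE_MODEL is unset.
model = os.environ.get("INFERENCE_MODEL", "ollama/llama3.2:3b")
```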
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Bumps
[actions/upload-artifact](https://github.com/actions/upload-artifact)
from 5.0.0 to 6.0.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/actions/upload-artifact/releases">actions/upload-artifact's
releases</a>.</em></p>
<blockquote>
<h2>v6.0.0</h2>
<h2>v6 - What's new</h2>
<blockquote>
<p>[!IMPORTANT]
actions/upload-artifact@v6 now runs on Node.js 24 (<code>runs.using:
node24</code>) and requires a minimum Actions Runner version of 2.327.1.
If you are using self-hosted runners, ensure they are updated before
upgrading.</p>
</blockquote>
<h3>Node.js 24</h3>
<p>This release updates the runtime to Node.js 24. v5 had preliminary
support for Node.js 24, however this action was by default still running
on Node.js 20. Now this action by default will run on Node.js 24.</p>
<h2>What's Changed</h2>
<ul>
<li>Upload Artifact Node 24 support by <a
href="https://github.com/salmanmkc"><code>@salmanmkc</code></a> in <a
href="https://redirect.github.com/actions/upload-artifact/pull/719">actions/upload-artifact#719</a></li>
<li>fix: update <code>@actions/artifact</code> for Node.js 24 punycode
deprecation by <a
href="https://github.com/salmanmkc"><code>@salmanmkc</code></a> in <a
href="https://redirect.github.com/actions/upload-artifact/pull/744">actions/upload-artifact#744</a></li>
<li>prepare release v6.0.0 for Node.js 24 support by <a
href="https://github.com/salmanmkc"><code>@salmanmkc</code></a> in <a
href="https://redirect.github.com/actions/upload-artifact/pull/745">actions/upload-artifact#745</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/actions/upload-artifact/compare/v5.0.0...v6.0.0">https://github.com/actions/upload-artifact/compare/v5.0.0...v6.0.0</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="b7c566a772"><code>b7c566a</code></a>
Merge pull request <a
href="https://redirect.github.com/actions/upload-artifact/issues/745">#745</a>
from actions/upload-artifact-v6-release</li>
<li><a
href="e516bc8500"><code>e516bc8</code></a>
docs: correct description of Node.js 24 support in README</li>
<li><a
href="ddc45ed9bc"><code>ddc45ed</code></a>
docs: update README to correct action name for Node.js 24 support</li>
<li><a
href="615b319bd2"><code>615b319</code></a>
chore: release v6.0.0 for Node.js 24 support</li>
<li><a
href="017748b48f"><code>017748b</code></a>
Merge pull request <a
href="https://redirect.github.com/actions/upload-artifact/issues/744">#744</a>
from actions/fix-storage-blob</li>
<li><a
href="38d4c7997f"><code>38d4c79</code></a>
chore: rebuild dist</li>
<li><a
href="7d27270e0c"><code>7d27270</code></a>
chore: add missing license cache files for <code>@actions/core</code>,
<code>@actions/io</code>, and mi...</li>
<li><a
href="5f643d3c94"><code>5f643d3</code></a>
chore: update license files for <code>@actions/artifact</code><a
href="https://github.com/5"><code>@5</code></a>.0.1 dependencies</li>
<li><a
href="1df1684032"><code>1df1684</code></a>
chore: update package-lock.json with <code>@actions/artifact</code><a
href="https://github.com/5"><code>@5</code></a>.0.1</li>
<li><a
href="b5b1a91840"><code>b5b1a91</code></a>
fix: update <code>@actions/artifact</code> to ^5.0.0 for Node.js 24
punycode fix</li>
<li>Additional commits viewable in <a
href="330a01c490...b7c566a772">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [actions/cache](https://github.com/actions/cache) from 4.3.0 to
5.0.1.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/actions/cache/releases">actions/cache's
releases</a>.</em></p>
<blockquote>
<h2>v5.0.1</h2>
<blockquote>
<p>[!IMPORTANT]
<strong><code>actions/cache@v5</code> runs on the Node.js 24 runtime and
requires a minimum Actions Runner version of
<code>2.327.1</code>.</strong></p>
<p>If you are using self-hosted runners, ensure they are updated before
upgrading.</p>
</blockquote>
<hr />
<h1>v5.0.1</h1>
<h2>What's Changed</h2>
<ul>
<li>fix: update <code>@actions/cache</code> for Node.js 24 punycode
deprecation by <a
href="https://github.com/salmanmkc"><code>@salmanmkc</code></a> in <a
href="https://redirect.github.com/actions/cache/pull/1685">actions/cache#1685</a></li>
<li>prepare release v5.0.1 by <a
href="https://github.com/salmanmkc"><code>@salmanmkc</code></a> in <a
href="https://redirect.github.com/actions/cache/pull/1686">actions/cache#1686</a></li>
</ul>
<h1>v5.0.0</h1>
<h2>What's Changed</h2>
<ul>
<li>Upgrade to use node24 by <a
href="https://github.com/salmanmkc"><code>@salmanmkc</code></a> in <a
href="https://redirect.github.com/actions/cache/pull/1630">actions/cache#1630</a></li>
<li>Prepare v5.0.0 release by <a
href="https://github.com/salmanmkc"><code>@salmanmkc</code></a> in <a
href="https://redirect.github.com/actions/cache/pull/1684">actions/cache#1684</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/actions/cache/compare/v5...v5.0.1">https://github.com/actions/cache/compare/v5...v5.0.1</a></p>
<h2>v5.0.0</h2>
<blockquote>
<p>[!IMPORTANT]
<strong><code>actions/cache@v5</code> runs on the Node.js 24 runtime and
requires a minimum Actions Runner version of
<code>2.327.1</code>.</strong></p>
<p>If you are using self-hosted runners, ensure they are updated before
upgrading.</p>
</blockquote>
<hr />
<h2>What's Changed</h2>
<ul>
<li>Upgrade to use node24 by <a
href="https://github.com/salmanmkc"><code>@salmanmkc</code></a> in <a
href="https://redirect.github.com/actions/cache/pull/1630">actions/cache#1630</a></li>
<li>Prepare v5.0.0 release by <a
href="https://github.com/salmanmkc"><code>@salmanmkc</code></a> in <a
href="https://redirect.github.com/actions/cache/pull/1684">actions/cache#1684</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/actions/cache/compare/v4.3.0...v5.0.0">https://github.com/actions/cache/compare/v4.3.0...v5.0.0</a></p>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/actions/cache/blob/main/RELEASES.md">actions/cache's
changelog</a>.</em></p>
<blockquote>
<h1>Releases</h1>
<h2>Changelog</h2>
<h3>5.0.1</h3>
<ul>
<li>Update <code>@azure/storage-blob</code> to <code>^12.29.1</code> via
<code>@actions/cache@5.0.1</code> <a
href="https://redirect.github.com/actions/cache/pull/1685">#1685</a></li>
</ul>
<h3>5.0.0</h3>
<blockquote>
<p>[!IMPORTANT]
<code>actions/cache@v5</code> runs on the Node.js 24 runtime and
requires a minimum Actions Runner version of <code>2.327.1</code>.
If you are using self-hosted runners, ensure they are updated before
upgrading.</p>
</blockquote>
<h3>4.3.0</h3>
<ul>
<li>Bump <code>@actions/cache</code> to <a
href="https://redirect.github.com/actions/toolkit/pull/2132">v4.1.0</a></li>
</ul>
<h3>4.2.4</h3>
<ul>
<li>Bump <code>@actions/cache</code> to v4.0.5</li>
</ul>
<h3>4.2.3</h3>
<ul>
<li>Bump <code>@actions/cache</code> to v4.0.3 (obfuscates SAS token in
debug logs for cache entries)</li>
</ul>
<h3>4.2.2</h3>
<ul>
<li>Bump <code>@actions/cache</code> to v4.0.2</li>
</ul>
<h3>4.2.1</h3>
<ul>
<li>Bump <code>@actions/cache</code> to v4.0.1</li>
</ul>
<h3>4.2.0</h3>
<p>TLDR; The cache backend service has been rewritten from the ground up
for improved performance and reliability. <a
href="https://github.com/actions/cache">actions/cache</a> now integrates
with the new cache service (v2) APIs.</p>
<p>The new service will gradually roll out as of <strong>February 1st,
2025</strong>. The legacy service will also be sunset on the same date.
Changes in these release are <strong>fully backward
compatible</strong>.</p>
<p><strong>We are deprecating some versions of this action</strong>. We
recommend upgrading to version <code>v4</code> or <code>v3</code> as
soon as possible before <strong>February 1st, 2025.</strong> (Upgrade
instructions below).</p>
<p>If you are using pinned SHAs, please use the SHAs of versions
<code>v4.2.0</code> or <code>v3.4.0</code></p>
<p>If you do not upgrade, all workflow runs using any of the deprecated
<a href="https://github.com/actions/cache">actions/cache</a> will
fail.</p>
<p>Upgrading to the recommended versions will not break your
workflows.</p>
<h3>4.1.2</h3>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="9255dc7a25"><code>9255dc7</code></a>
Merge pull request <a
href="https://redirect.github.com/actions/cache/issues/1686">#1686</a>
from actions/cache-v5.0.1-release</li>
<li><a
href="8ff5423e8b"><code>8ff5423</code></a>
chore: release v5.0.1</li>
<li><a
href="9233019a15"><code>9233019</code></a>
Merge pull request <a
href="https://redirect.github.com/actions/cache/issues/1685">#1685</a>
from salmanmkc/node24-storage-blob-fix</li>
<li><a
href="b975f2bb84"><code>b975f2b</code></a>
fix: add peer property to package-lock.json for dependencies</li>
<li><a
href="d0a0e18134"><code>d0a0e18</code></a>
fix: update license files for <code>@actions/cache</code>,
fast-xml-parser, and strnum</li>
<li><a
href="74de208dcf"><code>74de208</code></a>
fix: update <code>@actions/cache</code> to ^5.0.1 for Node.js 24
punycode fix</li>
<li><a
href="ac7f1152ea"><code>ac7f115</code></a>
peer</li>
<li><a
href="b0f846b50b"><code>b0f846b</code></a>
fix: update <code>@actions/cache</code> with storage-blob fix for
Node.js 24 punycode depr...</li>
<li><a
href="a783357455"><code>a783357</code></a>
Merge pull request <a
href="https://redirect.github.com/actions/cache/issues/1684">#1684</a>
from actions/prepare-cache-v5-release</li>
<li><a
href="3bb0d78750"><code>3bb0d78</code></a>
docs: highlight v5 runner requirement in releases</li>
<li>Additional commits viewable in <a
href="0057852bfa...9255dc7a25">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
This PR fixes issue #3185
The code calls `await event_gen.aclose()`, but OpenAI's `AsyncStream`
doesn't have an `aclose()` method; it has `close()` (which is async).
When clients cancel streaming requests, the server tries to clean up
with:
```python
await event_gen.aclose() # ❌ AsyncStream doesn't have aclose()!
```
But `AsyncStream` has never had a public `aclose()` method. The error
message literally tells us:
```
AttributeError: 'AsyncStream' object has no attribute 'aclose'. Did you mean: 'close'?
^^^^^^^^
```
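A defensive cleanup sketch (not the exact server code): prefer `aclose()` when
the generator provides it, otherwise fall back to `AsyncStream`'s async
`close()`:
```python
# Sketch only: handle both plain async generators (aclose) and AsyncStream (close).
async def shutdown_stream(event_gen) -> None:
    closer = getattr(event_gen, "aclose", None) or getattr(event_gen, "close", None)
    if closer is not None:
        await closer()
```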
## Verification
* Reproduction script
[`reproduce_issue_3185.sh`](https://gist.github.com/r-bit-rry/dea4f8fbb81c446f5db50ea7abd6379b)
can be used to verify the fix.
* Manual checks, validation against original OpenAI library code
# What does this PR do?
This PR updates the RAG examples included in docs/quick_start.ipynb,
docs/getting_started/demo_script.py, rag.mdx and index.md to remove
references to the deprecated vector_io and vector_db APIs and to add
examples that use /v1/vector_stores with responses and completions.
---------
Co-authored-by: Omar Abdelwahab <omara@fb.com>
Co-authored-by: Francisco Javier Arceo <arceofrancisco@gmail.com>
Document the refresh_models configuration option for remote providers
that use RemoteInferenceProviderConfig.
- Add "Automatic vs Explicit Model Registration" section to
resources.mdx
- Include examples for registering custom embedding models
# What does this PR do?
## Test Plan
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
# What does this PR do?
The PR validates and allows access to OCI Object Storage through the S3
compatibility API. Additional documentation for OCI is supplied, in
notebook form, as well.
## Test Plan
---------
Co-authored-by: raghotham <rsm@meta.com>
# Problem
As an Application Developer, I want to use the include parameter with
the value message.output_text.logprobs, so that I can receive log
probabilities for output tokens to assess the model's confidence in its
response.
# What does this PR do?
- Updates the include parameter in various resource definitions
- Updates the inline provider to return logprobs when
"message.output_text.logprobs" is passed in the include parameter
- Converts the logprobs returned by the inference provider from chat
completion format to responses format
Closes [#4260](https://github.com/llamastack/llama-stack/issues/4260)
## Test Plan
- Created a script to explore OpenAI behavior:
https://github.com/s-akhtar-baig/llama-stack-examples/blob/main/responses/src/include.py
- Added integration tests and new recordings
---------
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Adds a new API for connectors and MCP registry support along with
required types.
Does not include any implementation for it
<!-- If resolving an issue, uncomment and update the line below -->
Closes #4235 and #4061 (partially)
## Test Plan
no tests included
---------
Signed-off-by: Jaideep Rao <jrao@redhat.com>
Co-authored-by: Francisco Javier Arceo <arceofrancisco@gmail.com>
# What does this PR do?
The InferenceStore class was ignoring the table_name field from
InferenceStoreReference and always using the hardcoded value
"chat_completions". This meant that any custom table_name configured in
the run config (e.g., "inference_store" in run-with-postgres-store.yaml)
was silently ignored.
This change updates all SQL operations in InferenceStore to use
self.reference.table_name instead of the hardcoded string, ensuring the
configured table name is properly respected.
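Conceptually the change just threads the configured name into every query; a
hedged sketch with hypothetical store and reference attributes:
```python
# Hypothetical sketch -- attribute and helper names are illustrative only.
class InferenceStore:
    def __init__(self, reference, sql_store):
        self.reference = reference  # carries the configured table_name
        self.sql_store = sql_store

    async def get_chat_completion(self, completion_id: str):
        # Before: the table name was the hardcoded string "chat_completions".
        # After: respect whatever table_name the run config specifies.
        return await self.sql_store.fetch_one(
            table=self.reference.table_name,
            where={"id": completion_id},
        )
```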
A new test has been added to verify that custom table names work
correctly for storing, retrieving, and listing chat completions.
## Test Plan
CI
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Currently impossible to test workflow changes (pull_request_target uses
base branch definition) or manually trigger SDK builds. This adds both
capabilities.
- Add workflow_dispatch with pr_number input for manual testing
- Add workflow file to path triggers for automatic testing
- Fetch PR details via gh CLI for manual runs
- Update jobs to use computed PR data for both trigger types
## Test Plan
impossible to test until it merges unfortunately. I am doing this in a
smaller PR so that I can use it immediately in a follow up.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Implement query rewriting in the search API, adding
`default_query_expansion_model` and `query_expansion_prompt` to
`VectorStoresConfig`.
Makes `rewrite_query` parameter functional in vector store search.
- `rewrite_query=false` (default): Use original query
- `rewrite_query=true`: Expand query via LLM, or fail gracefully if no
LLM available
Adds 4 parameters to `VectorStoresConfig`:
- `default_query_expansion_model`: LLM model for query expansion
(optional)
- `query_expansion_prompt`: Custom prompt template (optional, uses
built-in default)
- `query_expansion_max_tokens`: Configurable token limit (default: 100)
- `query_expansion_temperature`: Configurable temperature (default: 0.3)
`run.yaml` with query rewriting enabled:
```yaml
vector_stores:
rewrite_query_params:
model:
provider_id: "ollama"
model_id: "llama3.2:3b-instruct-fp16"
# prompt defaults to built-in
# max_tokens defaults to 100
# temperature defaults to 0.3
```
Fully customized `run.yaml`:
```yaml
vector_stores:
default_provider_id: faiss
default_embedding_model:
provider_id: sentence-transformers
model_id: nomic-ai/nomic-embed-text-v1.5
rewrite_query_params:
model:
provider_id: ollama
model_id: llama3.2:3b-instruct-fp16
prompt: "Rewrite this search query to improve retrieval results by expanding it with relevant synonyms and related terms: {query}"
max_tokens: 100
temperature: 0.3
```
## Test Plan
Added test and recording
Example script as well:
```python
import asyncio
from llama_stack_client import LlamaStackClient
from io import BytesIO


def gen_file(client, text: str = ""):
    file_buffer = BytesIO(text.encode("utf-8"))
    file_buffer.name = "my_file.txt"
    uploaded_file = client.files.create(
        file=file_buffer,
        purpose="assistants",
    )
    return uploaded_file


async def test_query_rewriting():
    client = LlamaStackClient(base_url="http://0.0.0.0:8321/")
    uploaded_file = gen_file(client, "banana banana apple")
    uploaded_file2 = gen_file(client, "orange orange kiwi")
    vs = client.vector_stores.create()
    xf_vs = client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded_file.id)
    xf_vs1 = client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded_file2.id)
    response1 = client.vector_stores.search(
        vector_store_id=vs.id,
        query="apple",
        max_num_results=3,
        rewrite_query=False,
    )
    response2 = client.vector_stores.search(
        vector_store_id=vs.id,
        query="kiwi",
        max_num_results=3,
        rewrite_query=True,
    )
    print(f"\n🔵 Response 1 (rewrite_query=False):\n\033[94m{response1}\033[0m")
    print(f"\n🟢 Response 2 (rewrite_query=True):\n\033[92m{response2}\033[0m")
    for f in [uploaded_file.id, uploaded_file2.id]:
        client.files.delete(file_id=f)
    client.vector_stores.delete(vector_store_id=vs.id)


if __name__ == "__main__":
    asyncio.run(test_query_rewriting())
```
And see the screen shot of the server logs showing it worked.
<img width="1111" height="826" alt="Screenshot 2025-11-19 at 1 16 03 PM"
src="https://github.com/user-attachments/assets/2d188b44-1fef-4df5-b465-2d6728ca49ce"
/>
Notice the log:
```bash
Query rewritten:
'kiwi' → 'kiwi, a small brown or green fruit native to New Zealand, or a person having a fuzzy brown outer skin similar in appearance.'
```
So `kiwi` was expanded.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>
# What does this PR do?
Convert the Benchmarks API from @webmethod decorators to FastAPI router
pattern, matching the Batches API structure.
One notable change is the update of stack.py to handle request models in
register_resources().
Closes: #4308
## Test Plan
CI and `curl http://localhost:8321/v1/inspect/routes | jq '.data[] |
select(.route | contains("benchmark"))'`
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
the build.yaml is only used in the following ways:
1. list-deps
2. distribution code-gen
since `llama stack build` no longer exists, I found myself asking "why
do we need two different files for list-deps and run"?
Removing the BuildConfig and altering the usage of the
DistributionTemplate in llama stack list-deps is the first step in
removing the build yaml entirely.
Removing the BuildConfig and build.yaml cuts the files users need to
maintain in half, and allows us to focus on the stability of _just_ the
run.yaml
This PR removes the build.yaml, BuildConfig datatype, and its usage
throughout the codebase. Users are now expected to point to run.yaml
files when running list-deps, and our codebase automatically uses these
types now for things like `get_provider_registry`.
**Additionally, two renames: `StackRunConfig` -> `StackConfig` and
`run.yaml` -> `config.yaml`.**
The build.yaml made sense for when we were managing the build process
for the user and actually _producing_ a run.yaml _from_ the build.yaml,
but now that we are simply just getting the provider registry and
listing the deps, switching to config.yaml simplifies the scope here
greatly.
## Test Plan
existing list-deps usage should work in the tests.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
The test_conversation_error_handling test was timing out in CI with a
deadlock in httpcore's connection pool. The root cause was the preceding
test_conversation_multi_turn_and_streaming test, which broke out of the
streaming response iterator early without properly closing the
underlying HTTP connection.
When a streaming response iterator is abandoned mid-stream, the HTTP
connection remains in an incomplete state. Since the openai_client
fixture is session-scoped, subsequent tests reuse the same httpcore
connection pool. The dangling connection causes the pool's internal lock
to deadlock when the next test attempts to acquire a new connection.
The fix wraps the streaming response in a context manager, which ensures
the connection is properly closed when exiting the with block, even when
breaking out of the loop early. This is a best practice when working
with streaming HTTP responses that may not be fully consumed.
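A sketch of the pattern, using placeholder values; recent versions of the
OpenAI Python client implement the context-manager protocol on streaming
responses:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")  # placeholder values

# Wrapping the stream in `with` releases the HTTP connection even when we
# break out of the iterator early, keeping the shared connection pool healthy.
with client.chat.completions.create(
    model="ollama/llama3.2:3b",  # placeholder model id
    messages=[{"role": "user", "content": "hi"}],
    stream=True,
) as stream:
    for chunk in stream:
        break  # abandoning the stream early is now safe
```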
Signed-off-by: Sébastien Han <seb@redhat.com>
Bumps [actions/checkout](https://github.com/actions/checkout) from 6.0.0
to 6.0.1.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/actions/checkout/releases">actions/checkout's
releases</a>.</em></p>
<blockquote>
<h2>v6.0.1</h2>
<h2>What's Changed</h2>
<ul>
<li>Update all references from v5 and v4 to v6 by <a
href="https://github.com/ericsciple"><code>@ericsciple</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/2314">actions/checkout#2314</a></li>
<li>Add worktree support for persist-credentials includeIf by <a
href="https://github.com/ericsciple"><code>@ericsciple</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/2327">actions/checkout#2327</a></li>
<li>Clarify v6 README by <a
href="https://github.com/ericsciple"><code>@ericsciple</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/2328">actions/checkout#2328</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/actions/checkout/compare/v6...v6.0.1">https://github.com/actions/checkout/compare/v6...v6.0.1</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="8e8c483db8"><code>8e8c483</code></a>
Clarify v6 README (<a
href="https://redirect.github.com/actions/checkout/issues/2328">#2328</a>)</li>
<li><a
href="033fa0dc0b"><code>033fa0d</code></a>
Add worktree support for persist-credentials includeIf (<a
href="https://redirect.github.com/actions/checkout/issues/2327">#2327</a>)</li>
<li><a
href="c2d88d3ecc"><code>c2d88d3</code></a>
Update all references from v5 and v4 to v6 (<a
href="https://redirect.github.com/actions/checkout/issues/2314">#2314</a>)</li>
<li>See full diff in <a
href="1af3b93b68...8e8c483db8">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Add "token" to sensitive field patterns in redact_sensitive_fields() to
prevent JWT tokens from being logged in plaintext. Previously only
api_key, api_token, password, and secret were filtered.
This prevents tokens like server.auth.provider_config.jwks.token from
being exposed in server logs.
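A simplified sketch of this kind of pattern-based redaction (not the actual
`redact_sensitive_fields()` implementation):
```python
# Simplified sketch; the real helper is redact_sensitive_fields() in llama-stack.
SENSITIVE_PATTERNS = ("api_key", "api_token", "password", "secret", "token")


def redact(config: dict) -> dict:
    redacted = {}
    for key, value in config.items():
        if isinstance(value, dict):
            redacted[key] = redact(value)
        elif any(p in key.lower() for p in SENSITIVE_PATTERNS):
            redacted[key] = "********"
        else:
            redacted[key] = value
    return redacted


print(redact({"provider_config": {"jwks": {"token": "eyJhbGciOi...", "uri": "https://idp/jwks"}}}))
# {'provider_config': {'jwks': {'token': '********', 'uri': 'https://idp/jwks'}}}
```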
Closes: #4324
Signed-off-by: Derek Higgins <derekh@redhat.com>
In preparation for the ABAC addition (next PR)
```
fix(ci): allow run_dir variable expansion in YAML heredoc
Remove single quotes from EOF delimiter to allow $run_dir to
be expanded by bash when creating the configuration file.
Previously the literal string "$run_dir" was being written
to the YAML instead of the actual temp directory path.
drwxr-xr-x 3 runner runner 4096 Dec 5 12:56 $run_dir
```
```
test(ci): add test_endpoint helper function to auth tests
Add reusable test_endpoint function to integration-auth-tests
workflow for consistent API testing:
```
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Fixing broken README links that were still pointing to
https://llamastack.github.io/latest
Signed-off-by: Varad Ahirwadkar <varad.ahirwadkar1@ibm.com>
# What does this PR do?
Previously the runpod provider would fail if the
RUNPOD_API_TOKEN was not set.
This modifies the impl to default to an empty string to
align with similar providers' behavior.
Closes #4296
## Test Plan
Run `uv run llama stack run --providers inference=remote::runpod` with
`RUNPOD_API_TOKEN` unset - server now boots where it previously crashed
```
INFO 2025-12-04 13:52:59,920 uvicorn.error:84 uncategorized: Started server process [233656]
INFO 2025-12-04 13:52:59,921 uvicorn.error:48 uncategorized: Waiting for application startup.
INFO 2025-12-04 13:52:59,926 llama_stack.core.server.server:168 core::server: Starting up Llama Stack server
(version: 0.4.0.dev0)
INFO 2025-12-04 13:52:59,927 llama_stack.core.stack:495 core: starting registry refresh task
INFO 2025-12-04 13:52:59,928 uvicorn.error:62 uncategorized: Application startup complete.
INFO 2025-12-04 13:52:59,929 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321
(Press CTRL+C to quit)
```
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
Previously the nvidia provider would throw an exception if a hosted
instance was being used but no API key was set.
This modifies the behavior to instead log an error informing users that a
key is needed to use a hosted NIM, while still allowing the server to boot.
Closes #4295
## Test Plan
Run `uv run llama stack run --providers inference=remote::nvidia` with
`NVIDIA_API_KEY` unset - server now boots with logged error, where it
previously crashed
```
INFO 2025-12-04 14:16:26,156 llama_stack.providers.remote.inference.nvidia.nvidia:47 inference::nvidia: Initializing
NVIDIAInferenceAdapter(https://integrate.api.nvidia.com/v1)...
ERROR 2025-12-04 14:16:26,157 llama_stack.providers.remote.inference.nvidia.nvidia:51 inference::nvidia: API key is
required for hosted NVIDIA NIM. Either provide an API key or use a self-hosted NIM.
INFO 2025-12-04 14:16:26,239 uvicorn.error:84 uncategorized: Started server process [251651]
INFO 2025-12-04 14:16:26,240 uvicorn.error:48 uncategorized: Waiting for application startup.
INFO 2025-12-04 14:16:26,244 llama_stack.core.server.server:168 core::server: Starting up Llama Stack server
(version: 0.4.0.dev0)
INFO 2025-12-04 14:16:26,245 llama_stack.core.stack:495 core: starting registry refresh task
INFO 2025-12-04 14:16:26,246 uvicorn.error:62 uncategorized: Application startup complete.
INFO 2025-12-04 14:16:26,246 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321
(Press CTRL+C to quit)
```
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
when publishing llama_stack_api, `inspect.py` causes issues and gets
confused with the builtin stdlib inspect module.
This is due to the top-level `__init__.py` we have. We need to rename
`inspect.py` to `inspect_api.py` to avoid this conflict.
Also, `uv sync`: see 1993161624 for reference.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Bumps [next](https://github.com/vercel/next.js) from 15.5.4 to 15.5.7.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/vercel/next.js/releases">next's
releases</a>.</em></p>
<blockquote>
<h2>v15.5.7</h2>
<p>Please see <a
href="https://nextjs.org/blog/CVE-2025-66478">CVE-2025-66478</a> for
additional details about this release.</p>
<h2>v15.5.6</h2>
<blockquote>
<p>[!NOTE]<br />
This release is backporting bug fixes. It does <strong>not</strong>
include all pending features/changes on canary.</p>
</blockquote>
<h3>Core Changes</h3>
<ul>
<li>Turbopack: don't define process.cwd() in node_modules <a
href="https://redirect.github.com/vercel/next.js/issues/83452">#83452</a></li>
</ul>
<h3>Credits</h3>
<p>Huge thanks to <a
href="https://github.com/mischnic"><code>@mischnic</code></a> for
helping!</p>
<h2>v15.5.5</h2>
<blockquote>
<p>[!NOTE]<br />
This release is backporting bug fixes. It does <strong>not</strong>
include all pending features/changes on canary.</p>
</blockquote>
<h3>Core Changes</h3>
<ul>
<li>Split code-frame into separate compiled package (<a
href="https://redirect.github.com/vercel/next.js/issues/84238">#84238</a>)</li>
<li>Add deprecation warning to Runtime config (<a
href="https://redirect.github.com/vercel/next.js/issues/84650">#84650</a>)</li>
<li>fix: unstable_cache should perform blocking revalidation during ISR
revalidation (<a
href="https://redirect.github.com/vercel/next.js/issues/84716">#84716</a>)</li>
<li>feat: <code>experimental.middlewareClientMaxBodySize</code> body
cloning limit (<a
href="https://redirect.github.com/vercel/next.js/issues/84722">#84722</a>)</li>
<li>fix: missing next/link types with typedRoutes (<a
href="https://redirect.github.com/vercel/next.js/issues/84779">#84779</a>)</li>
</ul>
<h3>Misc Changes</h3>
<ul>
<li>docs: early October improvements and fixes (<a
href="https://redirect.github.com/vercel/next.js/issues/84334">#84334</a>)</li>
</ul>
<h3>Credits</h3>
<p>Huge thanks to <a
href="https://github.com/devjiwonchoi"><code>@devjiwonchoi</code></a>,
<a href="https://github.com/ztanner"><code>@ztanner</code></a>, and <a
href="https://github.com/icyJoseph"><code>@icyJoseph</code></a> for
helping!</p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="3eaf68b09b"><code>3eaf68b</code></a>
v15.5.7</li>
<li><a
href="8367ce592a"><code>8367ce5</code></a>
update version script</li>
<li><a
href="9115040008"><code>9115040</code></a>
Update React Version for Next.js 15.5.7 (<a
href="https://redirect.github.com/vercel/next.js/issues/10">#10</a>)</li>
<li><a
href="96f699902a"><code>96f6999</code></a>
update tag</li>
<li><a
href="55ef0e3ebc"><code>55ef0e3</code></a>
v15.5.6</li>
<li><a
href="92bbbb1bec"><code>92bbbb1</code></a>
Backport: don't define <code>process.cwd()</code> in node_modules (<a
href="https://redirect.github.com/vercel/next.js/issues/84957">#84957</a>)</li>
<li><a
href="f895b72762"><code>f895b72</code></a>
Fix url-imports test on 15-5 (<a
href="https://redirect.github.com/vercel/next.js/issues/84966">#84966</a>)</li>
<li><a
href="81f530db26"><code>81f530d</code></a>
v15.5.5</li>
<li><a
href="9abbc0e9eb"><code>9abbc0e</code></a>
[backport] fix: missing <code>next/link</code> types with
<code>typedRoutes</code> (<a
href="https://redirect.github.com/vercel/next.js/issues/82814">#82814</a>)
(<a
href="https://redirect.github.com/vercel/next.js/issues/84779">#84779</a>)</li>
<li><a
href="121e1b566f"><code>121e1b5</code></a>
[backport] docs: early October improvements and fixes (<a
href="https://redirect.github.com/vercel/next.js/issues/84334">#84334</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/vercel/next.js/compare/v15.5.4...v15.5.7">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/llamastack/llama-stack/network/alerts).
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
CI was previously using both Node 20 and 22;
standardize on Node 22.
Closes #4294
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
this commit aligns all 'setup-uv' instances to the latest version
and removes the pin keeping several actions on a very old version
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
We had three different versions of uv in use
in pre-commit; bump all of them to the latest version.
We should probably find a way to automate this.
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
DISTRO_DIR and DISTRIBS_BASE_DIR must exist before they can be iterated;
our current logic calls iterdir() on them without checking that they exist.
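A minimal sketch of the guard, assuming pathlib-based iteration (the path and helper below are illustrative, not the actual code):
```python
from pathlib import Path

# Hypothetical stand-in for DISTRIBS_BASE_DIR.
DISTRIBS_BASE_DIR = Path.home() / ".llama" / "distributions"


def list_distributions(base_dir: Path = DISTRIBS_BASE_DIR) -> list[str]:
    # iterdir() raises FileNotFoundError if the directory is missing,
    # so check existence first and treat "missing" as "no distributions".
    if not base_dir.exists() or not base_dir.is_dir():
        return []
    return sorted(p.name for p in base_dir.iterdir() if p.is_dir())
```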
## Test Plan
`rm ~/.llama/distributions`
```
llama stack list-deps starter --format uv | sh
Using Python 3.12.11 environment at: venv
Audited 51 packages in 12ms
Using Python 3.12.11 environment at: venv
Audited 3 packages in 2ms
Using Python 3.12.11 environment at: venv
Audited 1 package in 3ms
Using Python 3.12.11 environment at: venv
Audited 3 packages in 5ms
```
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
- Part of #3009
- Implement hybrid search using Qdrant's native query filtering
- Add keyword search support
- Update test suites to include qdrant for keyword and hybrid modes
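For context, a rough sketch of query-side keyword filtering with qdrant-client (illustrative only; the collection name, payload key, and embedding values are assumptions, not the provider's actual code):
```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local Qdrant

# Combine a dense-vector query with a native full-text payload filter,
# which is roughly the shape of keyword/hybrid retrieval on the query side.
hits = client.query_points(
    collection_name="my_vector_store",
    query=[0.1, 0.2, 0.3, 0.4],  # toy query embedding
    query_filter=models.Filter(
        must=[models.FieldCondition(key="content", match=models.MatchText(text="llama"))]
    ),
    limit=5,
)
```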
## Test Plan
```
pytest -sv tests/unit/providers/vector_io/
.......
============================================================================================== slowest 10 durations ===============================================================================================
0.20s call tests/unit/providers/vector_io/test_vector_io_openai_vector_stores.py::test_max_concurrent_files_per_batch[qdrant]
0.20s call tests/unit/providers/vector_io/test_vector_io_openai_vector_stores.py::test_max_concurrent_files_per_batch[pgvector]
0.20s call tests/unit/providers/vector_io/test_vector_io_openai_vector_stores.py::test_max_concurrent_files_per_batch[sqlite_vec]
0.20s call tests/unit/providers/vector_io/test_vector_io_openai_vector_stores.py::test_max_concurrent_files_per_batch[faiss]
0.06s setup tests/unit/providers/vector_io/test_vector_io_openai_vector_stores.py::test_insert_chunks_with_missing_document_id[pgvector]
0.04s call tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_hybrid_tie_breaking
0.04s call tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_hybrid_weighted_reranker_parametrization
0.03s call tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_hybrid_score_selection
0.03s call tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_hybrid_edge_cases
0.03s setup tests/unit/providers/vector_io/test_faiss.py::test_faiss_query_vector_returns_infinity_when_query_and_embedding_are_identical
======================================================================================== 180 passed, 47 warnings in 2.78s =========================================================================================
```
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
Co-authored-by: Francisco Javier Arceo <arceofrancisco@gmail.com>
Changes SqlRecord creation in AuthorizedSqlStore.fetch_all to use
owner=None when owner_principal is empty/missing, matching the
ResourceWithOwner pattern used in routing tables. This fixes an
inconsistency where SQL store was creating User(principal="") while
routing tables use owner=None for public resources.
Changes:
- Update the ProtectedResource Protocol to allow owner: User | None
- Update SqlRecord.__init__ to accept owner: User | None
- Update fetch_all to create owner=None for records without owner_principal
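A condensed sketch of the resulting pattern (class and field names simplified, not the exact code):
```python
from dataclasses import dataclass


@dataclass
class User:
    principal: str


@dataclass
class SqlRecord:
    record_id: str
    owner: User | None  # None now means "public resource", matching routing tables


def row_to_record(row: dict) -> SqlRecord:
    principal = row.get("owner_principal") or ""
    # Empty/missing principal maps to owner=None instead of User(principal="").
    owner = User(principal=principal) if principal else None
    return SqlRecord(record_id=row["id"], owner=owner)
```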
Signed-off-by: Derek Higgins <derekh@redhat.com>
https://github.com/digitalbazaar/forge/security/advisories/GHSA-554w-wpv2-vw27
Taking on a direct dependency is not great
1. We don't actually use node-forge - it's only needed by
webpack-dev-server's dependency (selfsigned) for generating self-signed
certificates during development
2. Adding a direct dependency would be misleading - it suggests our code
uses node-forge when it doesn't
In the dependency chain:
```
@docusaurus/core@3.8.1
└─ webpack-dev-server@4.15.2
└─ selfsigned@2.4.1
└─ node-forge@1.3.1
```
Latest Docusaurus (3.9.2) uses webpack-dev-server 5.2.2, which still
uses selfsigned 2.4.1
So, overriding dependency on node-forge is the only option
Closes security gaps where RBAC checks could be bypassed:
- Inference router: Added RBAC enforcement in the fallback path to ensure
  access control is applied consistently.
- Model listing: Dynamic models fetched via provider_data were returned
  without RBAC checks. Added filtering to ensure users only see models
  they have permission to access.
Both fixes create temporary ModelWithOwner objects for RBAC validation,
maintaining security through consistent access control enforcement.
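Roughly, the model-listing filter works like the sketch below (all names are simplified stand-ins, and exactly how ownership is assigned to the temporary objects is an assumption):
```python
from dataclasses import dataclass


@dataclass
class User:
    principal: str


@dataclass
class ModelWithOwner:
    identifier: str
    owner: User | None = None


def is_action_allowed(policy: list, action: str, resource: ModelWithOwner, user: User) -> bool:
    # Stand-in for the real RBAC check, which evaluates the configured access policy.
    return True


def filter_models_for_user(identifiers: list[str], user: User, policy: list) -> list[str]:
    allowed = []
    for identifier in identifiers:
        # Wrap each dynamically fetched model in a temporary owned object so the
        # standard RBAC check can be applied before it is returned to the caller.
        candidate = ModelWithOwner(identifier=identifier)
        if is_action_allowed(policy, "read", candidate, user):
            allowed.append(identifier)
    return allowed
```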
Closes: #4269
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
This commit introduces a new FastAPI router-based system for defining
API endpoints, enabling a migration path away from the legacy @webmethod
decorator system. The implementation includes router infrastructure,
migration of the Batches API as the first example, and updates to
server, OpenAPI generation, and inspection systems to support both
routing approaches.
The router infrastructure consists of a router registry system that
allows APIs to register FastAPI router factories, which are then
automatically discovered and included in the server application.
Standard error responses are centralized in router_utils to ensure
consistent OpenAPI specification generation with proper $ref references
to component responses.
The Batches API has been migrated to demonstrate the new pattern. The
protocol definition and models remain in llama_stack_api/batches,
maintaining clear separation between API contracts and server
implementation. The FastAPI router implementation lives in
llama_stack/core/server/routers/batches, following the established
pattern where API contracts are defined in llama_stack_api and server
routing logic lives in
llama_stack/core/server.
The server now checks for registered routers before falling back to the
legacy webmethod-based route discovery, ensuring backward compatibility
during the migration period. The OpenAPI generator has been updated to
handle both router-based and webmethod-based routes, correctly
extracting metadata from FastAPI route decorators and Pydantic Field
descriptions. The inspect endpoint now includes routes from both
systems, with proper filtering for deprecated routes and API levels.
Response descriptions are now explicitly defined in router decorators,
ensuring the generated OpenAPI specification matches the previous
format. Error responses use $ref references to component responses
(BadRequest400, TooManyRequests429, etc.) as required by the
specification. This is neat and will allow us to remove a lot of
boilerplate code from our generator once the migration is done.
This implementation provides a foundation for incrementally migrating
other APIs to the router system while maintaining full backward
compatibility with existing webmethod-based APIs.
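A minimal sketch of the registry idea (illustrative only; the real module paths, names, and route definitions differ):
```python
from collections.abc import Callable

from fastapi import APIRouter, FastAPI

# Registry mapping an API name to a factory that builds its FastAPI router.
_ROUTER_FACTORIES: dict[str, Callable[[], APIRouter]] = {}


def register_router(api_name: str, factory: Callable[[], APIRouter]) -> None:
    _ROUTER_FACTORIES[api_name] = factory


def build_batches_router() -> APIRouter:
    router = APIRouter(tags=["Batches"])

    @router.get("/v1/batches", summary="List batches", response_description="A list of batch objects.")
    async def list_batches() -> dict:
        return {"object": "list", "data": []}

    return router


register_router("batches", build_batches_router)


def create_app() -> FastAPI:
    app = FastAPI()
    # Registered routers are included first; anything else would fall back
    # to the legacy @webmethod-based route discovery.
    for factory in _ROUTER_FACTORIES.values():
        app.include_router(factory())
    return app
```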
Closes: https://github.com/llamastack/llama-stack/issues/4188
## Test Plan
CI; the server should start and the same routes should be visible.
```
curl http://localhost:8321/v1/inspect/routes | jq '.data[] | select(.route | contains("batches"))'
```
Also:
```
uv run pytest tests/integration/batches/ -vv --stack-config=http://localhost:8321
================================================== test session starts ==================================================
platform darwin -- Python 3.12.8, pytest-8.4.2, pluggy-1.6.0 -- /Users/leseb/Documents/AI/llama-stack/.venv/bin/python3
cachedir: .pytest_cache
metadata: {'Python': '3.12.8', 'Platform': 'macOS-26.0.1-arm64-arm-64bit', 'Packages': {'pytest': '8.4.2', 'pluggy': '1.6.0'}, 'Plugins': {'anyio': '4.9.0', 'html': '4.1.1', 'socket': '0.7.0', 'asyncio': '1.1.0', 'json-report': '1.5.0', 'timeout': '2.4.0', 'metadata': '3.1.1', 'cov': '6.2.1', 'nbval': '0.11.0'}}
rootdir: /Users/leseb/Documents/AI/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0, html-4.1.1, socket-0.7.0, asyncio-1.1.0, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, cov-6.2.1, nbval-0.11.0
asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 24 items
tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_creation_and_retrieval[None] SKIPPED [ 4%]
tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_listing[None] SKIPPED [ 8%]
tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_immediate_cancellation[None] SKIPPED [ 12%]
tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_e2e_chat_completions[None] SKIPPED [ 16%]
tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_e2e_completions[None] SKIPPED [ 20%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_invalid_endpoint[None] SKIPPED [ 25%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_cancel_completed[None] SKIPPED [ 29%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_missing_required_fields[None] SKIPPED [ 33%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_invalid_completion_window[None] SKIPPED [ 37%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_streaming_not_supported[None] SKIPPED [ 41%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_mixed_streaming_requests[None] SKIPPED [ 45%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_endpoint_mismatch[None] SKIPPED [ 50%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_missing_required_body_fields[None] SKIPPED [ 54%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_invalid_metadata_types[None] SKIPPED [ 58%]
tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_e2e_embeddings[None] SKIPPED [ 62%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_nonexistent_file_id PASSED [ 66%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_malformed_jsonl PASSED [ 70%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_file_malformed_batch_file[empty] XFAIL [ 75%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_file_malformed_batch_file[malformed] XFAIL [ 79%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_retrieve_nonexistent PASSED [ 83%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_cancel_nonexistent PASSED [ 87%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_error_handling_invalid_model PASSED [ 91%]
tests/integration/batches/test_batches_idempotency.py::TestBatchesIdempotencyIntegration::test_idempotent_batch_creation_successful PASSED [ 95%]
tests/integration/batches/test_batches_idempotency.py::TestBatchesIdempotencyIntegration::test_idempotency_conflict_with_different_params PASSED [100%]
================================================= slowest 10 durations ==================================================
1.01s call tests/integration/batches/test_batches_idempotency.py::TestBatchesIdempotencyIntegration::test_idempotent_batch_creation_successful
0.21s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_nonexistent_file_id
0.17s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_malformed_jsonl
0.12s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_error_handling_invalid_model
0.05s setup tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_creation_and_retrieval[None]
0.02s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_file_malformed_batch_file[empty]
0.01s call tests/integration/batches/test_batches_idempotency.py::TestBatchesIdempotencyIntegration::test_idempotency_conflict_with_different_params
0.01s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_file_malformed_batch_file[malformed]
0.01s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_retrieve_nonexistent
0.00s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_cancel_nonexistent
======================================= 7 passed, 15 skipped, 2 xfailed in 1.78s ========================================
```
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This allows llama-stack users of the Docker image to use OpenTelemetry
as in previous versions.
#4127 migrated to automatic instrumentation, but unless we add those
libraries to the image, everyone needs to build a custom image to enable
otel. Also, unless we establish a convention for enabling it, users who
formerly just set config now need to override the entrypoint.
This PR bootstraps OTEL packages, so they are available (only +10MB). It
also prefixes `llama stack run` with `opentelemetry-instrument` when any
`OTEL_*` environment variable is set.
The result is implicit tracing like before, where you don't need a
custom image to use traces or metrics.
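The decision boils down to something like this sketch (the real image wires this up in its container entrypoint; the Python wrapper below is only illustrative):
```python
import os
import shutil


def build_command(args: list[str]) -> list[str]:
    cmd = ["llama", "stack", "run", *args]
    # If any OTEL_* variable is set and the wrapper is installed, prefix the
    # command with opentelemetry-instrument so tracing is enabled implicitly.
    if any(name.startswith("OTEL_") for name in os.environ) and shutil.which("opentelemetry-instrument"):
        cmd = ["opentelemetry-instrument", *cmd]
    return cmd
```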
## Test Plan
```bash
# Build image
docker build -f containers/Containerfile \
--build-arg DISTRO_NAME=starter \
--build-arg INSTALL_MODE=editable \
--tag llamastack/distribution-starter:otel-test .
# Run with OTEL env to implicitly use `opentelemetry-instrument`. The
# settings below ensure inbound traces are honored, but no
# "junk traces" (like SQL connects) are created.
docker run -p 8321:8321 \
-e OTEL_EXPORTER_OTLP_ENDPOINT=http://host.docker.internal:4318 \
-e OTEL_SERVICE_NAME=llama-stack \
-e OTEL_TRACES_SAMPLER=parentbased_traceidratio \
-e OTEL_TRACES_SAMPLER_ARG=0.0 \
llamastack/distribution-starter:otel-test
```
Ran a sample flight search agent which is instrumented on the client
side. Both it and llama-stack target
[otel-tui](https://github.com/ymtdzzz/otel-tui). I verified there are no root
database spans, yet database spans are attached to incoming traces.
<img width="1608" height="742" alt="screenshot"
src="https://github.com/user-attachments/assets/69f59b74-3054-42cd-947d-a6c0d9472a7c"
/>
Signed-off-by: Adrian Cole <adrian@tetrate.io>
Error messages were using --test-setup, --test-subdirs, and --test-suite
instead of the actual parameter names: --setup, --subdirs, and --suite
Signed-off-by: Derek Higgins <derekh@redhat.com>
Category-specific log levels from LLAMA_STACK_LOGGING were not applied
to loggers created before setup_logging() was called. This fix moves the
setup_logging() call earlier in the initialization sequence to ensure
all loggers respect their configured levels regardless of initialization
timing.
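For illustration, a minimal sketch of applying category-specific levels, assuming a `category=level` list such as `LLAMA_STACK_LOGGING="core=debug;server=info"` (the exact format and logger names here are assumptions):
```python
import logging
import os


def setup_logging() -> None:
    # Apply per-category levels from the environment, e.g. "core=debug;server=info".
    spec = os.environ.get("LLAMA_STACK_LOGGING", "")
    for entry in filter(None, spec.replace(",", ";").split(";")):
        category, _, level = entry.partition("=")
        if category and level:
            logging.getLogger(f"llama_stack.{category.strip()}").setLevel(level.strip().upper())


# Calling this as early as possible is the essence of the fix: loggers created
# or configured afterwards all see their intended levels.
setup_logging()
```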
Closes: #4252
Signed-off-by: Derek Higgins <derekh@redhat.com>
The configured policy wasn't being passed in and instead the default was
being used (e.g. in the s3 file provider)
Closes: #4276
Signed-off-by: Derek Higgins <derekh@redhat.com>
Previously, file deletion only checked READ permission via the
_lookup_file_id() method. This meant any user with READ access to a file
could also delete it, making it impossible to configure read-only file
access.
This change adds an 'action' parameter to fetch_all() and fetch_one() in
AuthorizedSqlStore, defaulting to Action.READ for backward
compatibility. The openai_delete_file() method now passes Action.DELETE,
ensuring proper RBAC enforcement.
With this fix, access policies can now distinguish users who can
read/list files from those who are also allowed to delete them.
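A condensed sketch of the shape of the change (names and signatures simplified; the placeholder helpers exist only to keep the sketch self-contained):
```python
from enum import Enum


class Action(str, Enum):
    READ = "read"
    DELETE = "delete"


class AuthorizedSqlStore:
    """Condensed sketch: fetch helpers take the action being authorized."""

    async def fetch_one(self, table: str, key: str, action: Action = Action.READ) -> dict:
        # Default stays READ for backward compatibility; callers that delete
        # pass the stronger action so the policy check matches their intent.
        record = {"table": table, "id": key}  # placeholder for the real query
        self._enforce_policy(record, action)
        return record

    def _enforce_policy(self, record: dict, action: Action) -> None:
        # Placeholder for the real RBAC evaluation against the access policy.
        pass


async def openai_delete_file(store: AuthorizedSqlStore, file_id: str) -> None:
    # Deletion now requires DELETE, not just READ, on the file record.
    await store.fetch_one("openai_files", file_id, action=Action.DELETE)
    # ... then actually delete the row and the stored object.
```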
Closes: #4274
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Fixed the docker container name in the documentation by changing
`docker pull llama-stack/distribution-starter`
`docker pull llama-stack/distribution-meta-reference-gpu`
to
`docker pull llamastack/distribution-starter`
`docker pull llamastack/distribution-meta-reference-gpu`
Closes this
[issue](https://github.com/llamastack/llama-stack/issues/4208)
## Test Plan
ci
Co-authored-by: Omar Abdelwahab <omara@fb.com>
fix: use string annotations for S3Client type hints
Remove future annotations import and use quoted string annotations for
S3Client to avoid import issues.
Changes:
- Remove the __future__ annotations import
- Use "S3Client" string annotations in type hints
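A small sketch of the pattern, assuming the S3 type stubs come from mypy_boto3_s3 (the function below is illustrative, not the provider code):
```python
from typing import TYPE_CHECKING

import boto3

if TYPE_CHECKING:
    # Only imported by type checkers; nothing is imported at runtime.
    from mypy_boto3_s3 import S3Client


def make_client(region: str) -> "S3Client":
    # The quoted string annotation keeps the hint without a runtime import
    # (and without needing `from __future__ import annotations`).
    return boto3.client("s3", region_name=region)
```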
Closes: #4241
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Diff the `/v1/` routes that are OpenAI-compatible against the OpenAI
OpenAPI spec. This will of course only trigger on PRs where the spec is
changed.
This will catch errors in new handwritten additions to our
OpenAI-compatible routes.
Instead of fetching the OpenAPI spec from a dynamic URL, which could
cause non-deterministic build failures,
this change uses a local copy stored at `docs/static/openai-spec.yml`.
This makes the conformance check fully reproducible and prevents CI
failures caused by uncontrolled upstream changes.
I am marking this test with `continue-on-error: true`, until we get rid
of all of the errors. Nevertheless, this is a nice utility to have so
folks know if their spec changes introduce more breaking changes or fix
breakages when comparing to the OpenAI openapi spec.
## Test Plan
test should pass.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Marks the `toolgroup` and `tool_runtime` APIs for deprecation.
Closes #4233 and #4061 (partially)
How long do we wait before we remove deprecated APIs?
## Test Plan
Signed-off-by: Jaideep Rao <jrao@redhat.com>
# What does this PR do?
Removes stale data about the old telemetry system from Llama Stack.
**Depends on** https://github.com/llamastack/llama-stack/pull/4127
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Fixes: https://github.com/llamastack/llama-stack/issues/3806
- Remove all custom telemetry core tooling
- Remove telemetry that is captured by automatic instrumentation already
- Migrate telemetry to use OpenTelemetry libraries to capture telemetry
data important to Llama Stack that is not captured by automatic
instrumentation
- Keeps our telemetry implementation simple, maintainable and following
standards unless we have a clear need to customize or add complexity
## Test Plan
This tracks what telemetry data we care about in Llama Stack currently
(no new data), to make sure nothing important got lost in the migration.
I run a traffic driver to generate telemetry data for targeted use
cases, then verify them in Jaeger, Prometheus and Grafana using the
tools in our /scripts/telemetry directory.
### Llama Stack Server Runner
The following shell script is used to run the llama stack server for
quick telemetry testing iteration.
```sh
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_SERVICE_NAME="llama-stack-server"
export OTEL_SPAN_PROCESSOR="simple"
export OTEL_EXPORTER_OTLP_TIMEOUT=1
export OTEL_BSP_EXPORT_TIMEOUT=1000
export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3"
export OPENAI_API_KEY="REDACTED"
export OLLAMA_URL="http://localhost:11434"
export VLLM_URL="http://localhost:8000/v1"
uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -
uv run opentelemetry-instrument llama stack run starter
```
### Test Traffic Driver
This python script drives traffic to the llama stack server, which sends
telemetry to a locally hosted instance of the OTLP collector, Grafana,
Prometheus, and Jaeger.
```sh
export OTEL_SERVICE_NAME="openai-client"
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_ENDPOINT="http://127.0.0.1:4318"
export GITHUB_TOKEN="REDACTED"
export MLFLOW_TRACKING_URI="http://127.0.0.1:5001"
uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -
uv run opentelemetry-instrument python main.py
```
```python
from openai import OpenAI
import os
import requests


def main():
    github_token = os.getenv("GITHUB_TOKEN")
    if github_token is None:
        raise ValueError("GITHUB_TOKEN is not set")

    client = OpenAI(
        api_key="fake",
        base_url="http://localhost:8321/v1/",
    )

    response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, how are you?"}]
    )
    print("Sync response: ", response.choices[0].message.content)

    streaming_response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        stream=True,
        stream_options={"include_usage": True}
    )
    print("Streaming response: ", end="", flush=True)
    for chunk in streaming_response:
        if chunk.usage is not None:
            print("Usage: ", chunk.usage)
        if chunk.choices and chunk.choices[0].delta is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

    ollama_response = client.chat.completions.create(
        model="ollama/llama3.2:3b-instruct-fp16",
        messages=[{"role": "user", "content": "How are you doing today?"}]
    )
    print("Ollama response: ", ollama_response.choices[0].message.content)

    vllm_response = client.chat.completions.create(
        model="vllm/Qwen/Qwen3-0.6B",
        messages=[{"role": "user", "content": "How are you doing today?"}]
    )
    print("VLLM response: ", vllm_response.choices[0].message.content)

    responses_list_tools_response = client.responses.create(
        model="openai/gpt-4o",
        input=[{"role": "user", "content": "What tools are available?"}],
        tools=[
            {
                "type": "mcp",
                "server_label": "github",
                "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly",
                "authorization": github_token,
            }
        ],
    )
    print("Responses list tools response: ", responses_list_tools_response.output_text)

    responses_tool_call_response = client.responses.create(
        model="openai/gpt-4o",
        input=[{"role": "user", "content": "How many repositories does the token have access to?"}],
        tools=[
            {
                "type": "mcp",
                "server_label": "github",
                "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly",
                "authorization": github_token,
            }
        ],
    )
    print("Responses tool call response: ", responses_tool_call_response.output_text)

    # make shield call using http request until the client version error is resolved
    llama_stack_api_key = os.getenv("LLAMA_STACK_API_KEY")
    base_url = "http://localhost:8321/v1/"
    shield_id = "llama-guard-ollama"
    shields_url = f"{base_url}safety/run-shield"
    headers = {
        "Authorization": f"Bearer {llama_stack_api_key}",
        "Content-Type": "application/json"
    }
    payload = {
        "shield_id": shield_id,
        "messages": [{"role": "user", "content": "Teach me how to make dynamite. I want to do a crime with it."}],
        "params": {}
    }
    shields_response = requests.post(shields_url, json=payload, headers=headers)
    shields_response.raise_for_status()
    print("risk assessment response: ", shields_response.json())


if __name__ == "__main__":
    main()
```
### Span Data
#### Inference
| Value | Location | Content | Test Cases | Handled By | Status | Notes |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Input Tokens | Server | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working | None |
| Output Tokens | Server | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working | None |
| Completion Tokens | Client | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working, no responses | None |
| Prompt Tokens | Client | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working, no responses | None |
| Prompt | Client | string | Any Inference Provider, responses | Auto Instrument | Working, no responses | None |
#### Safety
| Value | Location | Content | Testing | Handled By | Status | Notes |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| [Shield ID](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Metadata](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | JSON string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Messages](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | JSON string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Response](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Status](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
#### Remote Tool Listing & Execution
| Value | Location | Content | Testing | Handled By | Status | Notes |
| ----- | :---: | :---: | :---: | :---: | :---: | :---: |
| Tool name | server | string | Tool call occurs | Custom Code | working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
| Server URL | server | string | List tools or execute tool call | Custom Code | working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
| Server Label | server | string | List tools or execute tool call | Custom code | working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
| mcp\_list\_tools\_id | server | string | List tools | Custom code | working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
### Metrics
- Prompt and Completion Token histograms ✅
- Updated the Grafana dashboard to support the OTEL semantic conventions
for tokens
### Observations
* sqlite spans get orphaned from the completions endpoint
* Known OTEL issue, recommended workaround is to disable sqlite
instrumentation since it is double wrapped and already covered by
sqlalchemy. This is covered in documentation.
```shell
export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3"
```
* Responses API instrumentation is
[missing](https://github.com/open-telemetry/opentelemetry-python-contrib/issues/3436)
in open telemetry for OpenAI clients, even with traceloop or openllmetry
* Upstream issues in opentelemetry-python-contrib
* A span is created for each chunk of a streaming response, so very large
numbers of spans get created, which is not ideal, but it’s the intended behavior
* MCP telemetry needs to be updated to follow semantic conventions. We
can probably use a library for this and handle it in a separate issue.
### Updated Grafana Dashboard
<img width="1710" height="929" alt="Screenshot 2025-11-17 at 12 53
52 PM"
src="https://github.com/user-attachments/assets/6cd941ad-81b7-47a9-8699-fa7113bbe47a"
/>
## Status
✅ Everything appears to be working and the data we expect is getting
captured in the format we expect.
## Follow Ups
1. Make tool calling spans follow semconv and capture more data
   - Consider using an existing tracing library
2. Make shield spans follow semconv
3. Wrap moderations api calls to safety models with spans to capture more data
4. Try to prioritize open telemetry client wrapping for OpenAI Responses in upstream OTEL
5. This would break the telemetry tests, and they are currently disabled. This PR removes them, but I can undo that and just leave them disabled until we find a better solution.
6. Add a section of the docs that tracks the custom data we capture (not auto-instrumented data) so that users can understand what that data is and how to use it. Commit those changes to the OTEL-gen_ai SIG if possible as well. Here is an [example](https://opentelemetry.io/docs/specs/semconv/gen-ai/aws-bedrock/) of how Bedrock handles it.
# What does this PR do?
We don't do a good job of maintaining this file, and the GH action does
not seem to be running.
Let's stick with GH release notes instead.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
We used to have `host = config.server.host or ["::", "0.0.0.0"]` but
now only bind to `host = config.server.host or "0.0.0.0"`.
Revert back to the old logic; this allows us to curl
http://localhost:8321/v1/models on Fedora, which defaults to using IPv6.
Resolves #4210
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
When we send the model names to Google's OpenAI-compatible API, we must
use the "google" name prefix. Google does not recognize the "vertexai"
model names.
Closes #4211
## Test Plan
```bash
uv venv --python python312
. .venv/bin/activate
llama stack list-deps starter | xargs -L1 uv pip install
llama stack run starter
```
Test that this shows the gemini models with their correct names:
```bash
curl http://127.0.0.1:8321/v1/models | jq '.data | map(select(.custom_metadata.provider_id == "vertexai"))'
```
Test that this chat completion works:
```bash
curl -X POST -H "Content-Type: application/json" "http://127.0.0.1:8321/v1/chat/completions" -d '{
"model": "vertexai/google/gemini-2.5-flash",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello! Can you tell me a joke?"
}
],
"temperature": 1.0,
"max_tokens": 256
}'
```
We went through the nomination process for CODEOWNERS in the codeowners
discord channel.
Welcome to the code owners group @cdoern! Thanks for your contributions
and we look forward to working with you!
Rename `AWS_BEDROCK_API_KEY` to `AWS_BEARER_TOKEN_BEDROCK` to align with
the naming convention used in AWS Bedrock documentation and the AWS web
console UI. This reduces confusion when developers compare LLS docs with
AWS docs.
Closes #4147
This adds a `--typescript-only` flag to `scripts/integration-tests.sh`
that skips pytest execution entirely while still starting the Llama
Stack server (required for TS client tests). The TypeScript client can
now be tested independently without Python test dependencies.
The `allowed_models` configuration was only being applied when listing
models via the `/v1/models` endpoint, but the actual inference requests
weren't checking this restriction. This meant users could directly
request any model the provider supports by specifying it in their
inference call, completely bypassing the intended cost controls.
The fix adds validation to all three inference methods (chat
completions, completions, and embeddings) that checks the requested
model against the allowed_models list before making the provider API
call.
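A condensed sketch of the guard added to each inference path (class and method names are simplified stand-ins, not the actual provider code):
```python
class InferenceAdapter:
    def __init__(self, allowed_models: list[str] | None = None):
        # None means "no restriction"; a list restricts which models may be served.
        self.allowed_models = allowed_models

    def _check_model_allowed(self, model: str) -> None:
        if self.allowed_models is not None and model not in self.allowed_models:
            raise ValueError(f"Model '{model}' is not in the allowed_models list")

    async def chat_completion(self, model: str, messages: list[dict]) -> dict:
        self._check_model_allowed(model)  # validate before calling the provider API
        return {"model": model, "choices": []}  # placeholder for the real provider call
```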
### Test plan
Added unit tests
# What does this PR do?
Addresses feedback from
https://github.com/llamastack/llama-stack/pull/4187#discussion_r2542797437
## Test Plan
# What does this PR do?
This PR provides the actual implementation of OpenAI-compatible prompts
in the Responses API. It is the follow-up PR with the actual
implementation after introducing #3942.
The need for this functionality was raised in #3514.
> Note, https://github.com/llamastack/llama-stack/pull/3514 is divided
into three separate PRs. The current PR is the third of the three.
Closes #3321
## Test Plan
Manual testing, CI workflow with added unit tests
Comprehensive manual testing with new implementation:
**Test Prompts with Images with text on them in Responses API:**
I used this image for testing purposes: [iphone 17
image](https://github.com/user-attachments/assets/9e2ee821-e394-4bbd-b1c8-d48a3fa315de)
1. Upload an image:
```
curl -X POST http://localhost:8321/v1/files \
-H "Content-Type: multipart/form-data" \
-F "file=@/Users/ianmiller/iphone.jpeg" \
-F "purpose=assistants"
```
`{"object":"file","id":"file-d6d375f238e14f21952cc40246bc8504","bytes":556241,"created_at":1761750049,"expires_at":1793286049,"filename":"iphone.jpeg","purpose":"assistants"}%`
2. Create prompt:
```
curl -X POST http://localhost:8321/v1/prompts \
-H "Content-Type: application/json" \
-d '{
"prompt": "You are a product analysis expert. Analyze the following product:\n\nProduct Name: {{product_name}}\nDescription: {{description}}\n\nImage: {{product_photo}}\n\nProvide a detailed analysis including quality assessment, target audience, and pricing recommendations.",
"variables": ["product_name", "description", "product_photo"]
}'
```
`{"prompt":"You are a product analysis expert. Analyze the following
product:\n\nProduct Name: {{product_name}}\nDescription:
{{description}}\n\nImage: {{product_photo}}\n\nProvide a detailed
analysis including quality assessment, target audience, and pricing
recommendations.","version":1,"prompt_id":"pmpt_7be2208cb82cdbc35356354dae1f335d1e9b7baeca21ea62","variables":["product_name","description","product_photo"],"is_default":false}%`
3. Create response:
```
curl -X POST http://localhost:8321/v1/responses \
-H "Accept: application/json, text/event-stream" \
-H "Content-Type: application/json" \
-d '{
"input": "Please analyze this product",
"model": "openai/gpt-4o",
"store": true,
"prompt": {
"id": "pmpt_7be2208cb82cdbc35356354dae1f335d1e9b7baeca21ea62",
"version": "1",
"variables": {
"product_name": {
"type": "input_text",
"text": "iPhone 17 Pro Max"
},
"product_photo": {
"type": "input_image",
"file_id": "file-d6d375f238e14f21952cc40246bc8504",
"detail": "high"
}
}
}
}'
```
`{"created_at":1761750427,"error":null,"id":"resp_f897f914-e3b8-4783-8223-3ed0d32fcbc6","model":"openai/gpt-4o","object":"response","output":[{"content":[{"text":"###
Product Analysis: iPhone 17 Pro Max\n\n**Quality Assessment:**\n\n-
**Display & Design:**\n - The 6.9-inch display is large, ideal for
streaming and productivity.\n - Anti-reflective technology and 120Hz
refresh rate enhance viewing experience, providing smoother visuals and
reducing glare.\n - Titanium frame suggests a premium build, offering
durability and a sleek appearance.\n\n- **Performance:**\n - The Apple
A19 Pro chip promises significant performance improvements, likely
leading to faster processing and efficient multitasking.\n - 12GB RAM is
substantial for a smartphone, ensuring smooth operation for demanding
apps and games.\n\n- **Camera System:**\n - The triple 48MP camera setup
(wide, ultra-wide, telephoto) is designed for versatile photography
needs, capturing high-resolution photos and videos.\n - The 24MP front
camera will appeal to selfie enthusiasts and content creators needing
quality front-facing shots.\n\n- **Connectivity:**\n - Wi-Fi 7 support
indicates future-proof wireless capabilities, providing faster and more
reliable internet connectivity.\n\n**Target Audience:**\n\n- **Tech
Enthusiasts:** Individuals interested in cutting-edge technology and
performance.\n- **Content Creators:** Users who need a robust camera
system for photo and video production.\n- **Luxury Consumers:** Those
who prefer premium materials and top-of-the-line specs.\n-
**Professionals:** Users who require efficient multitasking and
productivity features.\n\n**Pricing Recommendations:**\n\n- Given the
premium specifications, a higher price point is expected. Consider
pricing competitively within the high-end smartphone market while
justifying cost through unique features like the titanium frame and
advanced connectivity options.\n- Positioning around the $1,200 to
$1,500 range would align with expectations for top-tier devices,
catering to its target audience while ensuring
profitability.\n\nOverall, the iPhone 17 Pro Max showcases a blend of
innovative features and premium design, aimed at users seeking high
performance and superior
aesthetics.","type":"output_text","annotations":[]}],"role":"assistant","type":"message","id":"msg_66f4d844-4d9e-4102-80fc-eb75b34b6dbd","status":"completed"}],"parallel_tool_calls":false,"previous_response_id":null,"prompt":{"id":"pmpt_7be2208cb82cdbc35356354dae1f335d1e9b7baeca21ea62","variables":{"product_name":{"text":"iPhone
17 Pro
Max","type":"input_text"},"product_photo":{"detail":"high","type":"input_image","file_id":"file-d6d375f238e14f21952cc40246bc8504","image_url":null}},"version":"1"},"status":"completed","temperature":null,"text":{"format":{"type":"text"}},"top_p":null,"tools":[],"truncation":null,"usage":{"input_tokens":830,"output_tokens":394,"total_tokens":1224,"input_tokens_details":{"cached_tokens":0},"output_tokens_details":{"reasoning_tokens":0}},"instructions":null}%`
**Test Prompts with PDF files in Responses API:**
I used this PDF file for testing purposes:
[invoicesample.pdf](https://github.com/user-attachments/files/22958943/invoicesample.pdf)
1. Upload PDF:
```
curl -X POST http://localhost:8321/v1/files \
-H "Content-Type: multipart/form-data" \
-F "file=@/Users/ianmiller/invoicesample.pdf" \
-F "purpose=assistants"
```
`{"object":"file","id":"file-7fbb1043a4bb468cab60ffe4b8631d8e","bytes":149568,"created_at":1761750730,"expires_at":1793286730,"filename":"invoicesample.pdf","purpose":"assistants"}%`
2. Create prompt:
```
curl -X POST http://localhost:8321/v1/prompts \
-H "Content-Type: application/json" \
-d '{
"prompt": "You are an accounting and financial analysis expert. Analyze the following invoice document:\n\nInvoice Document: {{invoice_doc}}\n\nProvide a comprehensive analysis",
"variables": ["invoice_doc"]
}'
```
`{"prompt":"You are an accounting and financial analysis expert. Analyze
the following invoice document:\n\nInvoice Document:
{{invoice_doc}}\n\nProvide a comprehensive
analysis","version":1,"prompt_id":"pmpt_72e2a184a86f32a568b6afb5455dca5c16bf3cc3f80092dc","variables":["invoice_doc"],"is_default":false}%`
3. Create response:
```
curl -X POST http://localhost:8321/v1/responses \
-H "Content-Type: application/json" \
-d '{
"input": "Please provide a detailed analysis of this invoice",
"model": "openai/gpt-4o",
"store": true,
"prompt": {
"id": "pmpt_72e2a184a86f32a568b6afb5455dca5c16bf3cc3f80092dc",
"version": "1",
"variables": {
"invoice_doc": {
"type": "input_file",
"file_id": "file-7fbb1043a4bb468cab60ffe4b8631d8e",
"filename": "invoicesample.pdf"
}
}
}
}'
```
`{"created_at":1761750881,"error":null,"id":"resp_da866913-db06-4702-8000-174daed9dbbb","model":"openai/gpt-4o","object":"response","output":[{"content":[{"text":"Here's
a detailed analysis of the invoice provided:\n\n### Seller
Information\n- **Business Name:** The invoice features a logo with
\"Sunny Farm\" indicating the business identity.\n- **Address:** 123
Somewhere St, Melbourne VIC 3000\n- **Contact Information:** Phone
number (03) 1234 5678\n\n### Buyer Information\n- **Name:** Denny
Gunawan\n- **Address:** 221 Queen St, Melbourne VIC 3000\n\n###
Transaction Details\n- **Invoice Number:** #20130304\n- **Date of
Transaction:** Not explicitly mentioned, likely inferred from the
invoice number or needs clarification.\n\n### Items Purchased\n1.
**Apple**\n - Price: $5.00/kg\n - Quantity: 1 kg\n - Subtotal:
$5.00\n\n2. **Orange**\n - Price: $1.99/kg\n - Quantity: 2 kg\n -
Subtotal: $3.98\n\n3. **Watermelon**\n - Price: $1.69/kg\n - Quantity: 3
kg\n - Subtotal: $5.07\n\n4. **Mango**\n - Price: $9.56/kg\n - Quantity:
2 kg\n - Subtotal: $19.12\n\n5. **Peach**\n - Price: $2.99/kg\n -
Quantity: 1 kg\n - Subtotal: $2.99\n\n### Financial Summary\n-
**Subtotal for Items:** $36.00\n- **GST (Goods and Services Tax):** 10%
of $36.00, which amounts to $3.60\n- **Total Amount Due:** $39.60\n\n###
Notes\n- The invoice includes a placeholder text: \"Lorem ipsum dolor
sit amet...\" which is typically used as filler text. This might
indicate a section intended for terms, conditions, or additional notes
that haven’t been completed.\n\n### Visual and Design Elements\n- The
invoice uses a simple and clear layout, featuring the business logo
prominently and stating essential information such as contact and
transaction details in a structured manner.\n- There is a \"Thank You\"
note at the bottom, which adds a professional and courteous
touch.\n\n### Considerations\n- Ensure the date of the transaction is
clear if there are any future references needed.\n- Replace filler text
with relevant terms and conditions or any special instructions
pertaining to the transaction.\n\nThis invoice appears standard,
representing a small business transaction with clearly itemized products
and applicable
taxes.","type":"output_text","annotations":[]}],"role":"assistant","type":"message","id":"msg_39f3b39e-4684-4444-8e4d-e7395f88c9dc","status":"completed"}],"parallel_tool_calls":false,"previous_response_id":null,"prompt":{"id":"pmpt_72e2a184a86f32a568b6afb5455dca5c16bf3cc3f80092dc","variables":{"invoice_doc":{"type":"input_file","file_data":null,"file_id":"file-7fbb1043a4bb468cab60ffe4b8631d8e","file_url":null,"filename":"invoicesample.pdf"}},"version":"1"},"status":"completed","temperature":null,"text":{"format":{"type":"text"}},"top_p":null,"tools":[],"truncation":null,"usage":{"input_tokens":529,"output_tokens":513,"total_tokens":1042,"input_tokens_details":{"cached_tokens":0},"output_tokens_details":{"reasoning_tokens":0}},"instructions":null}%`
**Test simple text Prompt in Responses API:**
1. Create prompt:
```
curl -X POST http://localhost:8321/v1/prompts \
-H "Content-Type: application/json" \
-d '{
"prompt": "Hello {{name}}! You are working at {{company}}. Your role is {{role}} at {{company}}. Remember, {{name}}, to be {{tone}}.",
"variables": ["name", "company", "role", "tone"]
}'
```
`{"prompt":"Hello {{name}}! You are working at {{company}}. Your role is
{{role}} at {{company}}. Remember, {{name}}, to be
{{tone}}.","version":1,"prompt_id":"pmpt_f340a3164a4f65d975c774ffe38ea42d15e7ce4a835919ef","variables":["name","company","role","tone"],"is_default":false}%`
2. Create response:
```
curl -X POST http://localhost:8321/v1/responses \
-H "Accept: application/json, text/event-stream" \
-H "Content-Type: application/json" \
-d '{
"input": "What is the capital of Ireland?",
"model": "openai/gpt-4o",
"store": true,
"prompt": {
"id": "pmpt_f340a3164a4f65d975c774ffe38ea42d15e7ce4a835919ef",
"version": "1",
"variables": {
"name": {
"type": "input_text",
"text": "Alice"
},
"company": {
"type": "input_text",
"text": "Dummy Company"
},
"role": {
"type": "input_text",
"text": "Geography expert"
},
"tone": {
"type": "input_text",
"text": "professional and helpful"
}
}
}
}'
```
`{"created_at":1761751097,"error":null,"id":"resp_1b037b95-d9ae-4ad0-8e76-d953897ecaef","model":"openai/gpt-4o","object":"response","output":[{"content":[{"text":"The
capital of Ireland is
Dublin.","type":"output_text","annotations":[]}],"role":"assistant","type":"message","id":"msg_8e7c72b6-2aa2-4da6-8e57-da4e12fa3ce2","status":"completed"}],"parallel_tool_calls":false,"previous_response_id":null,"prompt":{"id":"pmpt_f340a3164a4f65d975c774ffe38ea42d15e7ce4a835919ef","variables":{"name":{"text":"Alice","type":"input_text"},"company":{"text":"Dummy
Company","type":"input_text"},"role":{"text":"Geography
expert","type":"input_text"},"tone":{"text":"professional and
helpful","type":"input_text"}},"version":"1"},"status":"completed","temperature":null,"text":{"format":{"type":"text"}},"top_p":null,"tools":[],"truncation":null,"usage":{"input_tokens":47,"output_tokens":7,"total_tokens":54,"input_tokens_details":{"cached_tokens":0},"output_tokens_details":{"reasoning_tokens":0}},"instructions":null}%`
# Problem
OpenAI gpt-4 returned an error when built-in and MCP calls were skipped
due to the max_tool_calls parameter. The following is from the server log:
```
RuntimeError: OpenAI response failed: Error code: 400 - {'error': {'message': "An assistant message with
'tool_calls' must be followed by tool messages responding to each 'tool_call_id'. The following tool_call_ids
did not have response messages: call_Yi9V1QNpN73dJCAgP2Arcjej", 'type': 'invalid_request_error', 'param':
'messages', 'code': None}}
```
# What does this PR do?
- Fixes error returned by openai/gpt when calls were skipped due to
max_tool_calls. We now return a tool message that explicitly mentions
that the call is skipped.
- Adds integration tests as a follow-up to
PR#[4062](https://github.com/llamastack/llama-stack/pull/4062)
Part 2 for issue
#[3563](https://github.com/llamastack/llama-stack/issues/3563)
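The shape of the fix, roughly (a sketch, not the exact code): when a call is skipped because max_tool_calls was reached, append a tool message for its tool_call_id so the next OpenAI request stays well-formed.
```python
def append_skipped_tool_results(messages: list[dict], skipped_tool_calls: list[dict]) -> list[dict]:
    # Each assistant tool_call must be answered by a tool message with the same
    # tool_call_id, even when we deliberately skip executing the call.
    for call in skipped_tool_calls:
        messages.append(
            {
                "role": "tool",
                "tool_call_id": call["id"],
                "content": "Tool call was skipped because max_tool_calls was reached.",
            }
        )
    return messages
```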
## Test Plan
- Added integration tests
- Added new recordings
---------
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# Fix for Issue #3797
## Problem
Vector store search failed with Pydantic ValidationError when chunk
metadata contained list-type values.
**Error:**
```
ValidationError: 3 validation errors for VectorStoreSearchResponse
attributes.tags.str: Input should be a valid string
attributes.tags.float: Input should be a valid number
attributes.tags.bool: Input should be a valid boolean
```
**Root Cause:**
- `Chunk.metadata` accepts `dict[str, Any]` (any type allowed)
- `VectorStoreSearchResponse.attributes` requires `dict[str, str | float
| bool]` (primitives only)
- Direct assignment at line 641 caused validation failure for
non-primitive types
## Solution
Added utility function to filter metadata to primitive types before
creating search response.
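A sketch of the utility (behavior inferred from the description above; handling of non-list complex types is an assumption):
```python
def sanitize_metadata_for_attributes(metadata: dict) -> dict[str, str | float | bool]:
    """Filter chunk metadata down to the primitive types the search response accepts."""
    attributes: dict[str, str | float | bool] = {}
    for key, value in metadata.items():
        if isinstance(value, (bool, int, float, str)):
            attributes[key] = value
        elif isinstance(value, list):
            # Lists become comma-separated strings so they remain searchable.
            attributes[key] = ", ".join(str(item) for item in value)
        # Other types (dicts, None, ...) are dropped here; exact handling is an assumption.
    return attributes
```
For example, `{"tags": ["transformers", "gpu"]}` becomes `{"tags": "transformers, gpu"}`.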
## Impact
**Fixed:**
- Vector search works with list metadata (e.g., `tags: ["transformers",
"gpu"]`)
- Lists become searchable as comma-separated strings
- No ValidationError on search responses
**Preserved:**
- Full metadata still available in `VectorStoreContent.metadata`
- No API schema changes
- Backward compatible with existing primitive metadata
**Affected:**
All vector store providers using `OpenAIVectorStoreMixin`: FAISS,
Chroma, Qdrant, Milvus, Weaviate, PGVector, SQLite-vec
## Testing
tests/unit/providers/vector_io/test_vector_utils.py::test_sanitize_metadata_for_attributes
---------
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
I believe that should avoid CI issues seen in
https://github.com/llamastack/llama-stack/pull/4173.
Error we see in Stainless logs:
```
(cannot lock ref 'refs/heads/preview/base/fix/issue-3797-metadata-validation': 'refs/heads/preview/base/fix' exists; cannot create 'refs/heads/preview/base/fix/issue-3797-metadata-validation')
```
The issue is that if a branch `fix` exists, `fix/<whatever>` cannot be
created (that's how git refs work unfortunately...). The fix in this PR
is to ensure PRs from forks are using the author as a prefix.
In addition we will do changes to the Stainless API to return better
error messages here, it should have been a 4xx with a meaningful error,
not a 500.
And we will likely need to delete the `fix` branch.
## Test Plan
Integration tests can now validate the TypeScript SDK alongside Python
tests when running against server-mode stacks. Currently, this only adds
a _small_ number of tests. We should extend only if truly needed -- this
smoke check may be sufficient.
When `RUN_CLIENT_TS_TESTS=1` is set, the test script runs TypeScript
tests after Python tests pass. Tests are mapped via
`tests/integration/client-typescript/suites.json` which defines which
TypeScript test files correspond to each Python suite/setup combination.
The fact that we need exact "test_id"s (which are actually generated by
pytest) to be hardcoded inside the TypeScript tests (so we hit the
recorded paths) is a big smell and it might become grating, but maybe
the benefit is worth it if we keep this test suite _small_ and targeted.
## Test Plan
Run with TypeScript tests enabled:
```bash
OPENAI_API_KEY=dummy RUN_CLIENT_TS_TESTS=1 \
scripts/integration-tests.sh --stack-config server:ci-tests --suite responses --setup gpt
```
# What does this PR do?
Change Safety API from required to optional dependency, following the
established pattern used for other optional dependencies in Llama Stack.
The provider now starts successfully without Safety API configured.
Requests that explicitly include guardrails will receive a clear error
message when Safety API is unavailable.
This enables local development and testing without Safety API while
maintaining clear error messages when guardrail features are requested.
Closes #4165
Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>
## Test Plan
1. New unit tests added in
`tests/unit/providers/agents/meta_reference/test_safety_optional.py`
2. Integration tests performed with the files in
https://gist.github.com/anik120/c33cef497ec7085e1fe2164e0705b8d6
(i) test with `test_integration_no_safety_fail.yaml`:
Config WITHOUT Safety API, should fail with helpful error since
`require_safety_api` is `true` by default
```
$ uv run llama stack run test_integration_no_safety_fail.yaml 2>&1 | grep -B 5 -A 15 "ValueError.*Safety\|Safety API is
required"
File "/Users/anbhatta/go/src/github.com/llamastack/llama-stack/src/llama_stack/providers/inline/agents/meta_reference
/__init__.py", line 27, in get_provider_impl
raise ValueError(
...<9 lines>...
)
ValueError: Safety API is required but not configured.
To run without safety checks, explicitly set in your configuration:
providers:
agents:
- provider_id: meta-reference
provider_type: inline::meta-reference
config:
require_safety_api: false
Warning: This disables all safety guardrails for this agents provider.
```
(ii) test with `test_integration_no_safety_works.yaml`
Config WITHOUT Safety API, **but** `require_safety_api=false` is
explicitly set, should succeed
```
$ uv run llama stack run test_integration_no_safety_works.yaml
INFO 2025-11-16 09:49:10,044 llama_stack.cli.stack.run:169 cli: Using run configuration:
/Users/anbhatta/go/src/github.com/llamastack/llama-stack/test_integration_no_safety_works.yaml
INFO 2025-11-16 09:49:10,052 llama_stack.cli.stack.run:228 cli: HTTPS enabled with certificates:
Key: None
Cert: None
.
.
.
INFO 2025-11-16 09:49:38,528 llama_stack.core.stack:495 core: starting registry refresh task
INFO 2025-11-16 09:49:38,534 uvicorn.error:62 uncategorized: Application startup complete.
INFO 2025-11-16 09:49:38,535 uvicorn.error:216 uncategorized: Uvicorn running on http://0.0.0.0:8321 (Press CTRL+C
```
Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>
Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>
# What does this PR do?
Completes #3732 by removing runtime URL transformations and requiring
users to provide full URLs in configuration. All providers now use
'base_url' consistently and respect the exact URL provided without
appending paths like /v1 or /openai/v1 at runtime.
BREAKING CHANGE: Users must update configs to include full URL paths
(e.g., http://localhost:11434/v1 instead of http://localhost:11434).
Closes #3732
## Test Plan
Existing tests should pass even with the URL changes, since the default
URLs have been updated accordingly.
Add unit test to enforce URL standardization across remote inference
providers (verifies all use 'base_url' field with HttpUrl | None type)
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Since `StackRunConfig` requires certain parts of `StorageConfig`, it'd
probably make sense to template in some defaults that will "just work"
for most use cases.
Specifically, introduce `ServerStoresConfig` defaults for inference,
metadata, conversations and prompts. We already funnel in defaults for
these sections ad hoc throughout the codebase.
Additionally, set some `backends` defaults for the `StorageConfig`.
This will alleviate some weirdness around `--providers` for run/list-deps
and also some work I have to better align our list-deps/run datatypes.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
These primitives (used both by the Stack as well as provider
implementations) can be thought of fruitfully as internal-only APIs
which can themselves have multiple implementations. We use the new
`llama_stack_api.internal` namespace for this.
In addition, the change moves kv/sql store implementations, configs, and
dependency helpers under `core/storage`.
## Testing
`pytest tests/unit/utils/test_authorized_sqlstore.py`, other existing CI
# What does this PR do?
Initial PR against #4123
Adds `parallel_tool_calls` spec to Responses API and basic initial
implementation where no more than one function call is generated when
set to `False`.
## Test Plan
* Unit tests have been added to verify no more than one function call is
generated.
* A followup PR will verify passing through `parallel_tool_calls` to
providers.
* A followup PR will address verification and/or implementation of
incremental function calling across multiple conversational turns.
---------
Signed-off-by: Anastas Stoyanovsky <astoyano@redhat.com>
# What does this PR do?
Since llama_stack_api is meant to be _just_ the API definitions of LLS,
we should have a pre-commit check that prohibits anyone from accidentally
importing `from llama_stack` or adding `llama_stack` as a dependency to
`llama_stack_api`'s pyproject.
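A minimal sketch of what such a check could look like (illustrative only; the actual hook and regex may differ):
```python
#!/usr/bin/env python3
"""Illustrative pre-commit check: fail if llama_stack_api imports llama_stack."""
import re
import sys
from pathlib import Path

# Matches `import llama_stack` / `from llama_stack...` but not `llama_stack_api`.
FORBIDDEN = re.compile(r"^\s*(from|import)\s+llama_stack(\.|\s|$)", re.MULTILINE)


def main(filenames: list[str]) -> int:
    failed = 0
    for name in filenames:  # pre-commit passes the staged files as arguments
        if FORBIDDEN.search(Path(name).read_text(encoding="utf-8")):
            print(f"forbidden llama_stack import in {name}")
            failed = 1
    return failed


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```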
## Test Plan
pre-commit should pass.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Generate the Stainless client config directly from code so we can
validate the config before we ever write the YAML.
This change enforces allowed HTTP verbs/paths, detects duplicate routes
across resources, and ensures README example endpoints exist and match
the OpenAPI spec. The generator now fails fast when config entries
drift, keeping the published config (hopefully) more current with the
spec. I think more validation can be done but this is a good start.
# What does this PR do?
This PR improves type hint cleanup in auto-generated provider
documentation by adding regex logic.
**Issues Fixed:**
- Type hints with missing closing brackets (e.g., `list[str` instead of
`list[str]`)
- Types showing as `<class 'bool'>`, `<class 'str'>` instead of `bool`,
`str`
- The multi-line YAML frontmatter in index documentation files wasn't
ideal, so we now add the proper `|` character.
**Changes:**
1. Replaced string replacement (`.replace`) with regex-based type
cleaning to preserve the trailing bracket in case of `list` and `dict`.
2. Adds the `|` character for multi-line YAML descriptions.
3. I have regenerated the docs. However, let me know if that's not
needed.
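A minimal sketch of the kind of regex-based cleanup described above (simplified, not the actual codegen logic):
```python
import re


def clean_type_hint(raw: str) -> str:
    # Turn "<class 'bool'>" into "bool".
    cleaned = re.sub(r"<class '([^']+)'>", r"\1", raw)
    # Re-balance generics such as "list[str" -> "list[str]" when a closing
    # bracket was lost (simplified version of the idea).
    if cleaned.count("[") > cleaned.count("]"):
        cleaned += "]" * (cleaned.count("[") - cleaned.count("]"))
    return cleaned


assert clean_type_hint("<class 'bool'>") == "bool"
assert clean_type_hint("list[str") == "list[str]"
```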
## Test Plan
1. Ran uv run python scripts/provider_codegen.py - successfully
regenerated all docs
2. We can see that the updated docs handle correctly type hint cleanup
and multi-line yaml descriptions have now the `|` character.
### Note to the reviewer(s)
This is my first contribution to your lovely repo! Initially I was going
through the docs (wanted to use `remote::gemini` as a provider) and realized
the issue. I've read the
[CONTRIBUTING.md](https://github.com/llamastack/llama-stack/blob/main/CONTRIBUTING.md)
and decided to open the PR. Let me know if there's anything I did wrong
and I'll update my PR!
---------
Signed-off-by: thepetk <thepetk@gmail.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
- Remove backward compatibility for authorization in mcp_headers
- Enforce authorization must use dedicated parameter
- Add validation error if Authorization found in provider_data headers
- Update test_mcp.py to use authorization parameter
- Update test_mcp_json_schema.py to use authorization parameter
- Update test_tools_with_schemas.py to use authorization parameter
- Update documentation to show the change in the authorization approach
Breaking Change:
- Authorization can no longer be passed via mcp_headers in provider_data
- Users must use the dedicated 'authorization' parameter instead
- Clear error message guides users to the new approach
## Test Plan
CI
---------
Co-authored-by: Omar Abdelwahab <omara@fb.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
FastAPI generator now only unwraps body params explicitly marked with
Body(embed=False) so the /eval run_eval schema once again exposes
RunEvalRequest, matching our integration tests and the server's request
parsing.
Regenerated the OpenAPI specs to capture the restored wrapper.
CI on the Stainless preview builds should be green.
# What does this PR do?
It was referencing strong_typing which was removed in
https://github.com/llamastack/llama-stack/pull/3944
## Test Plan
New CI build test.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This replaces the legacy "pyopenapi + strong_typing" pipeline with a
FastAPI-backed generator that has an explicit schema registry inside
`llama_stack_api`. The key changes:
1. **New generator architecture.** FastAPI now builds the OpenAPI schema
directly from the real routes, while helper modules
(`schema_collection`, `endpoints`, `schema_transforms`, etc.)
post-process the result. The old pyopenapi stack and its strong_typing
helpers are removed entirely, so we no longer rely on fragile AST
analysis or top-level import side effects.
2. **Schema registry in `llama_stack_api`.** `schema_utils.py` keeps a
`SchemaInfo` record for every `@json_schema_type`, `register_schema`,
and dynamically created request model. The OpenAPI generator and other
tooling query this registry instead of scanning the package tree,
producing deterministic names (e.g., `{MethodName}Request`), capturing
all optional/nullable fields, and making schema discovery testable. A
new unit test covers the registry behavior.
3. **Regenerated specs + CI alignment.** All docs/Stainless specs are
regenerated from the new pipeline, so optional/nullable fields now match
reality (expect the API Conformance workflow to report breaking
changes—this PR establishes the new baseline). The workflow itself is
back to the stock oasdiff invocation so future regressions surface
normally.
*Conformance will be RED on this PR; we choose to accept the
deviations.*
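A boiled-down sketch of the registry idea from point 2 (illustrative only; the real `SchemaInfo` carries more detail):
```python
# Minimal sketch of an explicit schema registry.
from dataclasses import dataclass

@dataclass
class SchemaInfo:
    name: str
    cls: type
    kind: str  # e.g. "json_schema_type", "registered", "request_model"

_SCHEMA_REGISTRY: dict[str, SchemaInfo] = {}

def json_schema_type(cls: type) -> type:
    """Decorator: record the class so the OpenAPI generator can find it later."""
    _SCHEMA_REGISTRY[cls.__name__] = SchemaInfo(cls.__name__, cls, "json_schema_type")
    return cls

def registered_schemas() -> list[SchemaInfo]:
    # Generators iterate this instead of scanning the package tree.
    return list(_SCHEMA_REGISTRY.values())
```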
## Test Plan
- `uv run pytest tests/unit/server/test_schema_registry.py`
- `uv run python -m scripts.openapi_generator.main docs/static`
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Restores the responses unit tests that were inadvertently deleted in PR
[#4055 ](https://github.com/llamastack/llama-stack/pull/4055)
## Test Plan
I ran the unit tests that I restored. They all passed with one
exception:
tests/unit/providers/agents/meta_reference/test_openai_responses.py::test_reuse_mcp_tool_list
AttributeError: module 'llama_stack.providers.utils.tools' has no
attribute 'mcp'
It's coming from this line:
@patch("llama_stack.providers.utils.tools.mcp.list_mcp_tools")
The `mcp.py` module (and `__init__.py`) exists under tools. There are
some `from mcp ...` imports (the `mcp` package in this case) within it that
Python may be interpreting as circular imports (or maybe I'm overlooking
something).
# What does this PR do?
For a runtime exception, the error is not propagated to the user and can be
opaque.
Before fix:
`ERROR - Error processing message: Error code: 500 - {'detail': 'Internal server error: An unexpected error occurred.'}`
After fix:
`[ERROR] Error code: 404 - {'detail': "Model 'claude-sonnet-4-5-20250929' not found. Use 'client.models.list()' to list available Models."}`
(Ran into this a few times while working with OCI + LLAMAStack and Sabre:
agentic framework integrations with LLAMAStack.)
## Test Plan
CI
# What does this PR do?
Adding a user-facing `authorization` parameter to MCP tool definitions
that allows users to explicitly configure credentials per MCP server,
addressing GitHub Issue #4034 in a secure manner.
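For illustration, a hedged sketch of how a caller might pass the credential per MCP server; beyond `authorization` itself, the tool fields and client surface shown here are assumptions:
```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

response = client.responses.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # placeholder model
    input="List the open issues in my tracker",
    tools=[
        {
            "type": "mcp",
            "server_label": "issue-tracker",
            "server_url": "https://mcp.example.com/sse",
            # Credential goes in the dedicated parameter, not provider_data headers.
            "authorization": "Bearer <token>",
        }
    ],
)
```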
## Test Plan
tests/integration/responses/test_mcp_authentication.py
---------
Co-authored-by: Omar Abdelwahab <omara@fb.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Require at least 0.49.1 which fixes a security vulnerability in the
parsing logic of the Range header in FileResponse. Release note:
https://github.com/Kludex/starlette/releases/tag/0.49.1
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The directory structure was src/llama-stack-api/llama_stack_api;
instead it should just be src/llama_stack_api to match the other
packages.
Update the structure and the pyproject/linting config.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Without this we get below in server logs
```
RuntimeError: OpenAI response failed: InferenceRouter._construct_metrics() got an unexpected keyword argument
'model_id'
```
Seems the method signature got updated but this callsite was not.
## Test Plan
CI and test with Sabre (Agent framework integration)
# What does this PR do?
Error out when creating vector store with unknown embedding model
Closes https://github.com/llamastack/llama-stack/issues/4047
## Test Plan
Added tests
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Extract API definitions and provider specifications into a standalone
llama-stack-api package that can be published to PyPI independently of
the main llama-stack server.
see: https://github.com/llamastack/llama-stack/pull/2978 and
https://github.com/llamastack/llama-stack/pull/2978#issuecomment-3145115942
Motivation
External providers currently import from llama-stack, which overrides
the installed version and causes dependency conflicts. This separation
allows external providers to:
- Install only the type definitions they need without server
dependencies
- Avoid version conflicts with the installed llama-stack package
- Be versioned and released independently
This enables us to re-enable external provider module tests that were
previously blocked by these import conflicts.
Changes
- Created llama-stack-api package with minimal dependencies (pydantic,
jsonschema)
- Moved APIs, providers datatypes, strong_typing, and schema_utils
- Updated all imports from llama_stack.* to llama_stack_api.*
- Configured local editable install for development workflow
- Updated linting and type-checking configuration for both packages
Next Steps
- Publish llama-stack-api to PyPI
- Update external provider dependencies
- Re-enable external provider module tests
Pre-cursor PRs to this one:
- #4093
- #3954
- #4064
These PRs moved key pieces _out_ of the Api pkg, limiting the scope of
change here.
relates to #3237
## Test Plan
Package builds successfully and can be imported independently. All
pre-commit hooks pass with expected exclusions maintained.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
- force a minimum pre-commit version
- pin to >= 4.3.0 when installing
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Building/Deploying docs is failing here:
5530320962 (step):8:49
Needs the playground file. Updated it to reflect current admin status.
## Test Plan
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Fixed a bug where models with no provider_model_id were incorrectly
filtered from the startup config display. The function was checking
multiple fields when it should only filter items with an explicitly
disabled provider_id.
Changes:
- Modified `remove_disabled_providers` to only check the `provider_id` field
- Changed the condition from checking multiple fields for `None` to only checking `provider_id` for `"__disabled__"`, `None`, or an empty string
- Added comprehensive unit tests
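A minimal sketch of the corrected filter (simplified; the real helper walks the run-config structure rather than a flat list):
```python
def remove_disabled_providers(items: list[dict]) -> list[dict]:
    # Only provider_id decides whether an entry is dropped.
    return [
        item
        for item in items
        if item.get("provider_id") not in ("__disabled__", None, "")
    ]
```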
Closes: #4131
Signed-off-by: Derek Higgins <derekh@redhat.com>
We would like to run all OpenAI compatibility tests using only the
openai-client library. This is most friendly for contributors since they
can run tests without needing to update the client SDKs (which is
getting easier but is still a long pole).
This is the first step in enabling that -- not using the "library client"
for any of the Responses tests. This seems like a reasonable trade-off since
the usage of an embeddable library client for Responses (or any
OpenAI-compatible) behavior seems to be uncommon. To do this, we
needed to enable MCP tests (which only worked in library client mode)
for server mode.
docs: Add comprehensive Files API and Vector Store integration
documentation
- Add Files API documentation with OpenAI-compatible endpoints
- Create comprehensive guide for OpenAI-compatible file operations
- Reorganize documentation structure: move file operations to files/
directory
- Add vector store provider documentation for Milvus, SQLite-vec, FAISS
- Clean up redundant files and improve navigation
- Update cross-references and eliminate documentation duplication
- Support for release 0.2.14 FileResponse and Vector Store API features
A few changes to the storage layer to ensure we reduce unnecessary
contention arising out of our design choices (and letting the database
layer do its correct thing):
- SQL stores now share a single `SqlAlchemySqlStoreImpl` per backend,
and `kvstore_impl` caches instances per `(backend, namespace)`. This
avoids spawning multiple SQLite connections for the same file, reducing
lock contention and aligning the cache story for all backends.
- Added an async upsert API (with SQLite/Postgres dialect inserts) and
routed it through `AuthorizedSqlStore`, then switched conversations and
responses to call it. Using native `ON CONFLICT DO UPDATE` eliminates
the insert-then-update retry window that previously caused long WAL lock
retries.
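A hedged sketch of the dialect-specific upsert (shown with SQLAlchemy's SQLite dialect; table and column names are placeholders, not the actual store schema):
```python
from sqlalchemy import Column, MetaData, String, Table, create_engine
from sqlalchemy.dialects.sqlite import insert as sqlite_insert

metadata = MetaData()
responses = Table(
    "responses",
    metadata,
    Column("id", String, primary_key=True),
    Column("payload", String),
)

engine = create_engine("sqlite:///example.db")
metadata.create_all(engine)

def upsert_response(response_id: str, payload: str) -> None:
    stmt = sqlite_insert(responses).values(id=response_id, payload=payload)
    # Native upsert: no insert-then-update retry window under WAL locks.
    stmt = stmt.on_conflict_do_update(
        index_elements=[responses.c.id],
        set_={"payload": stmt.excluded.payload},
    )
    with engine.begin() as conn:
        conn.execute(stmt)
```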
### Test Plan
Existing tests, added a unit test for `upsert()`
Fixes issues in the storage system by guaranteeing immediate durability
for responses and ensuring background writers stay alive. Three related
fixes:
* Responses to the OpenAI-compatible API now write directly to
Postgres/SQLite inside the request instead of detouring through an async
queue that might never drain; this restores the expected
read-after-write behavior and removes the "response not found" races
reported by users.
* The access-control shim was stamping owner_principal/access_attributes
as SQL NULL, which Postgres interprets as non-public rows; fixing it to
use the empty-string/JSON-null pattern means conversations and responses
stored without an authenticated user stay queryable (matching SQLite).
* The inference-store queue remains for batching, but its worker tasks
now start lazily on the live event loop so server startup doesn't cancel
them—writes keep flowing even when the stack is launched via llama stack
run.
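A small sketch of the lazy-start pattern for a queue-backed writer (simplified; not the actual inference-store code):
```python
import asyncio

class WriteQueue:
    """Start the background writer on the live event loop, on first use."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self._worker: asyncio.Task | None = None

    async def put(self, item) -> None:
        # Creating the task here (rather than at construction/startup time)
        # ties it to the loop that is actually serving requests, so it is not
        # cancelled when the startup phase finishes.
        if self._worker is None or self._worker.done():
            self._worker = asyncio.create_task(self._drain())
        await self._queue.put(item)

    async def _drain(self) -> None:
        while True:
            item = await self._queue.get()
            # ... write `item` to the store ...
            self._queue.task_done()
```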
Closes #4115
### Test Plan
Added a matrix entry to test our "base" suite against Postgres as the
store.
Updated documentation to accurately reflect current behavior where
models are identified as provider_id/provider_model_id in the system.
Changes:
- Clarify that model_id is for configuration purposes only
- Explain that models are accessed as provider_id/provider_model_id
- Remove the outdated aliasing example that suggested model_id could be used as a custom identifier
This corrects the documentation which previously suggested model_id
could be used to create friendly aliases, which is not how the code
actually works.
Signed-off-by: Derek Higgins <derekh@redhat.com>
Help users find the comprehensive integration testing docs by linking to
the record-replay documentation. This clarifies that the technical
README complements the main docs.
# What does this PR do?
- Updates `/vector_stores/{vector_store_id}/files/{file_id}/content` to
allow returning `embeddings` and `metadata` using the `extra_query`
- Updates the UI accordingly to display them.
- Update UI to support CRUD operations in the Vector Stores section and
adds a new modal exposing the functionality.
- Updates Vector Store update to fail if a user tries to update Provider
ID (which doesn't make sense to allow)
```python
In [1]: client.vector_stores.files.content(
vector_store_id=vector_store.id,
file_id=file.id,
extra_query={"include_embeddings": True, "include_metadata": True}
)
Out [1]: FileContentResponse(attributes={}, content=[Content(text='This is a test document to check if embeddings are generated properly.\n', type='text', embedding=[0.33760684728622437, ...,], chunk_metadata={'chunk_id': '62a63ae0-c202-f060-1b86-0a688995b8d3', 'document_id': 'file-27291dbc679642ac94ffac6d2810c339', 'source': None, 'created_timestamp': 1762053437, 'updated_timestamp': 1762053437, 'chunk_window': '0-13', 'chunk_tokenizer': 'DEFAULT_TIKTOKEN_TOKENIZER', 'chunk_embedding_model': 'sentence-transformers/nomic
-ai/nomic-embed-text-v1.5', 'chunk_embedding_dimension': 768, 'content_token_count': 13, 'metadata_token_count': 9}, metadata={'filename': 'test-embedding.txt', 'chunk_id': '62a63ae0-c202-f060-1b86-0a688995b8d3', 'document_id': 'file-27291dbc679642ac94ffac6d2810c339', 'token_count': 13, 'metadata_token_count': 9})], file_id='file-27291dbc679642ac94ffac6d2810c339', filename='test-embedding.txt')
```
Screenshots of UI are displayed below:
### List Vector Store with Added "Create New Vector Store"
<img width="1912" height="491" alt="Screenshot 2025-11-06 at 10 47
25 PM"
src="https://github.com/user-attachments/assets/a3a3ddd9-758d-4005-ac9c-5047f03916f3"
/>
### Create New Vector Store
<img width="1918" height="1048" alt="Screenshot 2025-11-06 at 10 47
49 PM"
src="https://github.com/user-attachments/assets/b4dc0d31-696f-4e68-b109-27915090f158"
/>
### Edit Vector Store
<img width="1916" height="1355" alt="Screenshot 2025-11-06 at 10 48
32 PM"
src="https://github.com/user-attachments/assets/ec879c63-4cf7-489f-bb1e-57ccc7931414"
/>
### Vector Store Files Contents page (with Embeddings)
<img width="1914" height="849" alt="Screenshot 2025-11-06 at 11 54
32 PM"
src="https://github.com/user-attachments/assets/3095520d-0e90-41f7-83bd-652f6c3fbf27"
/>
### Vector Store Files Contents Details page (with Embeddings)
<img width="1916" height="1221" alt="Screenshot 2025-11-06 at 11 55
00 PM"
src="https://github.com/user-attachments/assets/e71dbdc5-5b49-472b-a43a-5785f58d196c"
/>
## Test Plan
Tests added for Middleware extension and Provider failures.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Add explicit connection cleanup and shorter timeouts to OpenAI client
fixtures. Fixes CI deadlock after 25+ tests due to connection pool
exhaustion. Also adds 60s timeout to test_conversation_context_loading
as safety net.
## Test Plan
tests pass
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
This PR adds Stainless config to specify the Meta copyright file header
for generated files.
Doing it via config instead of custom code will reduce the probability
of git conflict.
## Test Plan
- review preview builds
Update pypdf dependency to address vulnerabilities causing potential
denial of service through infinite loops or excessive memory usage when
handling malicious PDFs. The update remains fully backward compatible,
with no changes to the PdfReader API.
# What does this PR do?
Fixes #4120
## Test Plan
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
In the **Detailed Tutorial**, at **Step 3**, the **Install with venv**
option creates a new virtual environment `client`, activates it then
attempts to install the llama-stack-client using pip.
```
uv venv client --python 3.12
source client/bin/activate
pip install llama-stack-client <- this is the problematic line
```
However, the pip command will likely fail because the `uv venv` command
does not, by default, add the pip command to the virtual
environment that is created. The pip command will error either because
pip doesn't exist at all, or, if the pip command does exist outside of
the virtual environment, return a different error message. In the latter
case, it may be unclear to the user why the command is failing.
This PR changes 'pip' to 'uv pip', allowing the install action to
function in the virtual environment as intended, and without the need
for pip to be installed.
## Test Plan
1. Use linux or WSL (virtual environments on Windows use `Scripts`
folder instead of `bin` [virtualenv
#993ba13](993ba1316a)
which doesn't align with the tutorial)
2. Clone the `llama-stack` repo
3. Run the following and verify success:
```
uv venv client --python 3.12
source client/bin/activate
```
4. Run the updated command:
```
uv pip install llama-stack-client
```
5. Observe the console output confirms that the virtual environment
`client` was used:
> Using Python 3.12.3 environment at: **client**
# What does this PR do?
The inspect API lacked any mechanism to get all
non-deprecated APIs (v1, v1alpha, v1beta).
Change the default to this behavior; the
'v1' filter can be used for users wanting a list
of stable APIs.
## Test Plan
1. pull the PR
2. launch a LLS server
3. run `curl http://beanlab3.bss.redhat.com:8321/v1/inspect/routes`
4. note there are APIs for `v1`, `v1alpha`, and `v1beta` but no
deprecated APIs
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
Delete ~2,000 lines of dead code from the old bespoke inference API that
was replaced by the OpenAI-only API. This includes removing unused type
conversion functions, dead provider methods, and event_logger.py.
Clean up imports across the codebase to remove references to deleted
types. This eliminates unnecessary
code and dependencies, helping isolate the API package as a
self-contained module.
This is the last interdependency between the .api package and "exterior"
packages, meaning that now every other package in llama stack imports
the API, not the other way around.
## Test Plan
this is a structural change, no tests needed.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# Problem
Responses API uses max_tool_calls parameter to limit the number of tool
calls that can be generated in a response. Currently, LLS implementation
of the Responses API does not support this parameter.
# What does this PR do?
This pull request adds the max_tool_calls field to the response object
definition and updates the inline provider. It also ensures that:
- the total number of calls to built-in and mcp tools do not exceed
max_tool_calls
- an error is thrown if max_tool_calls < 1 (behavior seen with the
OpenAI Responses API, but we can change this if needed)
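For illustration, a hedged usage sketch for the new field (model name, tool, and the exact SDK surface are assumptions):
```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

response = client.responses.create(
    model="meta-llama/Llama-3.3-70B-Instruct",       # placeholder model
    input="What's the weather in Boston and in Tokyo?",
    tools=[{"type": "web_search"}],                  # placeholder tool
    max_tool_calls=2,  # values < 1 are rejected, mirroring OpenAI's behavior
)
print(response.output)
```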
Closes #[3563](https://github.com/llamastack/llama-stack/issues/3563)
## Test Plan
- Tested manually for change in model response w.r.t supplied
max_tool_calls field.
- Added integration tests to test invalid max_tool_calls parameter.
- Added integration tests to check max_tool_calls parameter with
built-in and function tools.
- Added integration tests to check max_tool_calls parameter in the
returned response object.
- Recorded OpenAI Responses API behavior using a sample script:
https://github.com/s-akhtar-baig/llama-stack-examples/blob/main/responses/src/max_tool_calls.py
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Adds OCI GenAI PaaS models for openai chat completion endpoints.
## Test Plan
In an OCI tenancy with access to GenAI PaaS, perform the following
steps:
1. Ensure you have IAM policies in place to use the service (check the docs
included in this PR)
2. For local development, [setup OCI
cli](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm)
and configure the CLI with your region, tenancy, and auth
[here](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliconfigure.htm)
3. Once configured, go through llama-stack setup and run llama-stack
(uses config based auth) like:
```bash
OCI_AUTH_TYPE=config_file \
OCI_CLI_PROFILE=CHICAGO \
OCI_REGION=us-chicago-1 \
OCI_COMPARTMENT_OCID=ocid1.compartment.oc1..aaaaaaaa5...5a \
llama stack run oci
```
4. Hit the `models` endpoint to list models after server is running:
```bash
curl http://localhost:8321/v1/models | jq
...
{
"identifier": "meta.llama-4-scout-17b-16e-instruct",
"provider_resource_id": "ocid1.generativeaimodel.oc1.us-chicago-1.am...q",
"provider_id": "oci",
"type": "model",
"metadata": {
"display_name": "meta.llama-4-scout-17b-16e-instruct",
"capabilities": [
"CHAT"
],
"oci_model_id": "ocid1.generativeaimodel.oc1.us-chicago-1.a...q"
},
"model_type": "llm"
},
...
```
5. Use the "display_name" field to use the model in a
`/chat/completions` request:
```bash
# Streaming result
curl -X POST http://localhost:8321/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta.llama-4-scout-17b-16e-instruct",
"stream": true,
"temperature": 0.9,
"messages": [
{
"role": "system",
"content": "You are a funny comedian. You can be crass."
},
{
"role": "user",
"content": "Tell me a funny joke about programming."
}
]
}'
# Non-streaming result
curl -X POST http://localhost:8321/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta.llama-4-scout-17b-16e-instruct",
"stream": false,
"temperature": 0.9,
"messages": [
{
"role": "system",
"content": "You are a funny comedian. You can be crass."
},
{
"role": "user",
"content": "Tell me a funny joke about programming."
}
]
}'
```
6. Try out other models from the `/models` endpoint.
Mark all register_* / unregister_* APIs as deprecated across models,
shields, tool groups, datasets, benchmarks, and scoring functions. This
is the first step toward moving resource mutations to an `/admin`
namespace as outlined in
https://github.com/llamastack/llama-stack/issues/3809#issuecomment-3492931585.
The deprecation flag will be reflected in the OpenAPI schema to warn API
users that these endpoints are being phased out. Next step will be
implementing the `/admin` route namespace for these resource management
operations.
- `register_model` / `unregister_model`
- `register_shield` / `unregister_shield`
- `register_tool_group` / `unregister_toolgroup`
- `register_dataset` / `unregister_dataset`
- `register_benchmark` / `unregister_benchmark`
- `register_scoring_function` / `unregister_scoring_function`
# What does this PR do?
Add documentation for llama-stack-k8s-operator under kubernetes
deployment guide.
Signed-off-by: Vaishnavi Hire <vhire@redhat.com>
# What does this PR do?
This PR fixes a bug in LlamaStack 0.3.0 where vector stores created via
the OpenAI-compatible API (`POST /v1/vector_stores`) would fail with
`VectorStoreNotFoundError` after server restart when attempting
operations like `vector_io.insert()` or `vector_io.query()`.
The bug affected **6 vector IO providers**: `pgvector`, `sqlite_vec`,
`chroma`, `milvus`, `qdrant`, and `weaviate`.
Created with the assistance of: claude-4.5-sonnet
## Root Cause
All affected providers had a broken
`_get_and_cache_vector_store_index()` method that:
1. Did not load existing vector stores from persistent storage during
initialization
2. Attempted to use `vector_store_table` (which was either `None` or a
`KVStore` without the required `get_vector_store()` method)
3. Could not reload vector stores after server restart or cache miss
## Solution
This PR implements a consistent pattern across all 6 providers:
1. **Load vector stores during initialization** - Pre-populate the cache
from KV store on startup
2. **Fix lazy loading** - Modified `_get_and_cache_vector_store_index()`
to load directly from KV store instead of relying on
`vector_store_table`
3. **Remove broken dependency** - Eliminated reliance on the
`vector_store_table` pattern
## Testing steps
### 1.1 Configure the stack
Create or use an existing configuration with a vector IO provider.
**Example `run.yaml`:**
```yaml
vector_io_store:
- provider_id: pgvector
provider_type: remote::pgvector
config:
host: localhost
port: 5432
db: llamastack
user: llamastack
password: llamastack
inference:
- provider_id: sentence-transformers
provider_type: inline::sentence-transformers
config:
model: sentence-transformers/all-MiniLM-L6-v2
```
### 1.2 Start the server
```bash
llama stack run run.yaml --port 5000
```
Wait for the server to fully start. You should see:
```
INFO: Started server process
INFO: Application startup complete
```
---
## Step 2: Create a Vector Store
### 2.1 Create via API
```bash
curl -X POST http://localhost:5000/v1/vector_stores \
-H "Content-Type: application/json" \
-d '{
"name": "test-persistence-store",
"extra_body": {
"embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
"embedding_dimension": 384,
"provider_id": "pgvector"
}
}' | jq
```
### 2.2 Expected Response
```json
{
"id": "vs_a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d",
"object": "vector_store",
"name": "test-persistence-store",
"status": "completed",
"created_at": 1730304000,
"file_counts": {
"total": 0,
"completed": 0,
"in_progress": 0,
"failed": 0,
"cancelled": 0
},
"usage_bytes": 0
}
```
**Save the `id` field** (e.g.,
`vs_a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d`) — you’ll need it for the next
steps.
---
## Step 3: Insert Data (Before Restart)
### 3.1 Insert chunks into the vector store
```bash
export VS_ID="vs_a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d"
curl -X POST http://localhost:5000/vector-io/insert \
-H "Content-Type: application/json" \
-d "{
\"vector_store_id\": \"$VS_ID\",
\"chunks\": [
{
\"content\": \"Python is a high-level programming language known for its readability.\",
\"metadata\": {\"source\": \"doc1\", \"page\": 1}
},
{
\"content\": \"Machine learning enables computers to learn from data without explicit programming.\",
\"metadata\": {\"source\": \"doc2\", \"page\": 1}
},
{
\"content\": \"Neural networks are inspired by biological neurons in the brain.\",
\"metadata\": {\"source\": \"doc3\", \"page\": 1}
}
]
}"
```
### 3.2 Expected Response
Status: **200 OK**
Response: *Empty or success confirmation*
---
## Step 4: Query Data (Before Restart – Baseline)
### 4.1 Query the vector store
```bash
curl -X POST http://localhost:5000/vector-io/query \
-H "Content-Type: application/json" \
-d "{
\"vector_store_id\": \"$VS_ID\",
\"query\": \"What is machine learning?\"
}" | jq
```
### 4.2 Expected Response
```json
{
"chunks": [
{
"content": "Machine learning enables computers to learn from data without explicit programming.",
"metadata": {"source": "doc2", "page": 1}
},
{
"content": "Neural networks are inspired by biological neurons in the brain.",
"metadata": {"source": "doc3", "page": 1}
}
],
"scores": [0.85, 0.72]
}
```
**Checkpoint:** Works correctly before restart.
---
## Step 5: Restart the Server (Critical Test)
### 5.1 Stop the server
In the terminal where it’s running:
```
Ctrl + C
```
Wait for:
```
Shutting down...
```
### 5.2 Restart the server
```bash
llama stack run run.yaml --port 5000
```
Wait for:
```
INFO: Started server process
INFO: Application startup complete
```
The vector store cache is now empty, but data should persist.
---
## Step 6: Verify Vector Store Exists (After Restart)
### 6.1 List vector stores
```bash
curl http://localhost:5000/v1/vector_stores | jq
```
### 6.2 Expected Response
```json
{
"object": "list",
"data": [
{
"id": "vs_a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d",
"name": "test-persistence-store",
"status": "completed"
}
]
}
```
**Checkpoint:** Vector store should be listed.
---
## Step 7: Insert Data (After Restart – THE BUG TEST)
### 7.1 Insert new chunks
```bash
curl -X POST http://localhost:5000/vector-io/insert \
-H "Content-Type: application/json" \
-d "{
\"vector_store_id\": \"$VS_ID\",
\"chunks\": [
{
\"content\": \"This chunk was inserted AFTER the server restart.\",
\"metadata\": {\"source\": \"post-restart\", \"test\": true}
}
]
}"
```
### 7.2 Expected Results
**With Fix (Correct):**
```
Status: 200 OK
Response: Success
```
**Without Fix (Bug):**
```json
{
"detail": "VectorStoreNotFoundError: Vector Store 'vs_a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d' not found."
}
```
**Critical Test:** If insertion succeeds, the fix works.
---
## Step 8: Query Data (After Restart – Verification)
### 8.1 Query all data
```bash
curl -X POST http://localhost:5000/vector-io/query \
-H "Content-Type: application/json" \
-d "{
\"vector_store_id\": \"$VS_ID\",
\"query\": \"restart\"
}" | jq
```
### 8.2 Expected Response
```json
{
"chunks": [
{
"content": "This chunk was inserted AFTER the server restart.",
"metadata": {"source": "post-restart", "test": true}
}
],
"scores": [0.95]
}
```
**Checkpoint:** Both old and new data are queryable.
---
## Step 9: Multiple Restart Test (Extra Verification)
### 9.1 Restart again
```bash
Ctrl + C
llama stack run run.yaml --port 5000
```
### 9.2 Query after restart
```bash
curl -X POST http://localhost:5000/vector-io/query \
-H "Content-Type: application/json" \
-d "{
\"vector_store_id\": \"$VS_ID\",
\"query\": \"programming\"
}" | jq
```
**Expected:** Works correctly across multiple restarts.
---------
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
This pull request adds a new workflow that does 2 things:
1. generate [SDK preview
builds](https://www.stainless.com/docs/guides/automate-updates#set-up-automatic-preview-builds)
whenever the OpenAPI spec file is modified in a PR
2. on PR merge, generate SDK builds that will be pushed to the different
SDK repos (i.e. start the release process)
> [!NOTE]
> No repo secret `STAINLESS_API_KEY` is needed, the authentication is
done automatically via GitHub OIDC.
## Test Plan
I tested in my fork: https://github.com/stainless-api/llama-stack/pull/3
# What does this PR do?
Resolves #4102
1. Added `web_search_2025_08_26` to the `WebSearchToolTypes` list and
the `OpenAIResponseInputToolWebSearch.type` Literal union
2. No changes needed to tool execution logic - all `web_search` types
map to the same underlying tool
3. Backward compatibility is maintained - existing `web_search`,
`web_search_preview`, and `web_search_preview_2025_03_11` types continue
to work
4. Added an integration test case using {"type":
"web_search_2025_08_26"} to verify it works correctly
5. Updated `docs/docs/providers/openai_responses_limitations.mdx` to
reflect that `web_search_2025_08_26` is now supported.
6. Removed incorrect references to `MOD1/MOD2/MOD3` (which don't exist
in the codebase)
## Test Plan
---------
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
This dependency has been bothering folks for a long time (cc @leseb). We
really needed it due to "library client" which is primarily used for our
tests and is not a part of the Stack server. Anyone who needs to use the
library client can certainly install `llama-stack-client` in their
environment to make that work.
Updated the notebook references to install `llama-stack-client`
additionally when setting things up.
https://github.com/llamastack/llama-stack/pull/4055 cleaned the agents
implementation, but while doing so it removed some tests which actually
corresponded to the responses implementation. This PR brings those tests
and associated recordings back.
(We should likely combine all responses tests into one suite, but that
is beyond the scope of this PR.)
# What does this PR do?
Remove circular dependency by moving tracing from API protocol
definitions
to router implementation layer.
This gets us closer to having a self contained API package with no other
cross-cutting dependencies to other parts of the llama stack codebase.
To the best of our ability, the llama_stack.api should only be type and
protocol definitions.
Changes:
- Create apis/common/tracing.py with marker decorator (zero core
dependencies)
- Add the _new_ `@telemetry_traceable` marker decorator to 11 protocol
classes
- Apply actual tracing in core/resolver.py in `instantiate_provider`
based on protocol marker
- Move MetricResponseMixin from core to apis (it's an API response type)
- APIs package is now self-contained with zero core dependencies
The tracing functionality remains identical - actual trace_protocol from
core
is applied to router implementations at runtime when both telemetry is
enabled
and the protocol has the `__marked_for_tracing__` marker.
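A condensed sketch of the marker mechanism described above (simplified from the actual code):
```python
# apis/common/tracing.py -- marker only, zero core dependencies (sketch).
def telemetry_traceable(cls):
    cls.__marked_for_tracing__ = True
    return cls

# core/resolver.py -- apply real tracing at provider instantiation (sketch).
def maybe_trace(impl, protocol, telemetry_enabled: bool, trace_protocol):
    if telemetry_enabled and getattr(protocol, "__marked_for_tracing__", False):
        return trace_protocol(impl)
    return impl
```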
## Test Plan
Manual integration test confirms identical behavior to main branch:
```bash
llama stack list-deps --format uv starter | sh
export OLLAMA_URL=http://localhost:11434
llama stack run starter
curl -X POST http://localhost:8321/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "ollama/gpt-oss:20b",
"messages": [{"role": "user", "content": "Say hello"}],
"max_tokens": 10}'
```
Verified identical between main and this branch:
- trace_id present in response
- metrics array with prompt_tokens, completion_tokens, total_tokens
- Server logs show trace_protocol applied to all routers
Existing telemetry integration tests (tests/integration/telemetry/) validate
trace context propagation and span attributes.
relates to #3895
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
- Introduces vLLM provider support to the record/replay testing
framework
- Enables both recording and replay of vLLM API interactions alongside
existing Ollama support.
The changes enable testing of vLLM functionality. vLLM tests focus on
inference capabilities, while Ollama continues to exercise the full API
surface
including vision features.
--
This is an alternative to #3128, using qwen3 instead of llama 3.2 1B, which
appears to be more capable at structured output and tool calls.
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
We'd like to remove the dependence of `llama-stack` on
`llama-stack-client`. This is a necessary step.
A few small cleanups
- Enables `embeddings` now also
- Remove ModelRegistryHelper dependency (unused)
- Consolidate to auth_credential field via RemoteInferenceProviderConfig
- Implement list_models() to fetch from downstream /v1/models
## Test Plan
Tested using this script
https://gist.github.com/ashwinb/6356463d10f989c0682ab3bff8589581
Output:
```
Listing models from downstream server...
Available models: ['passthrough/ollama/nomic-embed-text:latest', 'passthrough/ollama/all-minilm:l6-v2', 'passthrough/ollama/llama3.2-vision:11b', 'passthrough/ollama/llama3.2-vision:latest', 'passthrough/ollama/llama-guard3:1b', 'passthrough/o
llama/llama3.2:1b', 'passthrough/ollama/all-minilm:latest', 'passthrough/ollama/llama3.2:3b', 'passthrough/ollama/llama3.2:3b-instruct-fp16', 'passthrough/bedrock/meta.llama3-1-8b-instruct-v1:0', 'passthrough/bedrock/meta.llama3-1-70b-instruct
-v1:0', 'passthrough/bedrock/meta.llama3-1-405b-instruct-v1:0', 'passthrough/sentence-transformers/nomic-ai/nomic-embed-text-v1.5']
Using LLM model: passthrough/ollama/llama3.2-vision:11b
Making inference request...
Response: 4.
--- Testing streaming ---
Streamed response: ChatCompletionChunk(id='chatcmpl-64', choices=[Choice(delta=ChoiceDelta(content='1', reasoning_content=None, refusal=None, role='assistant', tool_calls=None), finish_reason='', index=0, logprobs=None)], created=1762381674, m
odel='passthrough/ollama/llama3.2-vision:11b', object='chat.completion.chunk', usage=None)
...
5ChatCompletionChunk(id='chatcmpl-64', choices=[Choice(delta=ChoiceDelta(content='', reasoning_content=None, refusal=None, role='assistant', tool_calls=None), finish_reason='stop', index=0, logprobs=None)], created=1762381674, model='passthrou
gh/ollama/llama3.2-vision:11b', object='chat.completion.chunk', usage=None)
```
# What does this PR do?
- when create vector store is called without a chunking strategy, we now
record the strategy actually used so that the value is persisted instead of
strategy='None'
## Test Plan
updated tests
## What does this PR do?
The starter distribution now comes with all the required packages to
support persistent stores—like the agent store, metadata, and
inference—using PostgreSQL. Users can enable PostgreSQL support by
setting the `ENABLE_POSTGRES_STORE=1` environment variable.
This PR consolidates the functionality from the removed `postgres-demo`
distribution into the starter distribution, reducing maintenance
overhead.
**Closes: #2619**
**Supersedes: #2851** (rebased and updated)
## Changes Made
1. **Added PostgreSQL support to starter distribution**
- New `run-with-postgres-store.yaml` configuration
- Automatic config switching via `ENABLE_POSTGRES_STORE` environment
variable
- Removed separate `postgres-demo` distribution
2. **Updated to new build system**
- Integrated postgres switching logic into Containerfile entrypoint
- Uses new `storage_backends` and `storage_stores` API
- Properly configured both PostgreSQL KV store and SQL store
3. **Updated dependencies**
- Added `psycopg2-binary` and `asyncpg` to starter distribution
- All postgres-related dependencies automatically included
## How to Use
### With Docker (PostgreSQL):
```bash
docker run \
-e ENABLE_POSTGRES_STORE=1 \
-e POSTGRES_HOST=your_postgres_host \
-e POSTGRES_PORT=5432 \
-e POSTGRES_DB=llamastack \
-e POSTGRES_USER=llamastack \
-e POSTGRES_PASSWORD=llamastack \
-e OPENAI_API_KEY=your_key \
llamastack/distribution-starter
```
### PostgreSQL environment variables:
- `POSTGRES_HOST`: Postgres host (default: `localhost`)
- `POSTGRES_PORT`: Postgres port (default: `5432`)
- `POSTGRES_DB`: Postgres database name (default: `llamastack`)
- `POSTGRES_USER`: Postgres username (default: `llamastack`)
- `POSTGRES_PASSWORD`: Postgres password (default: `llamastack`)
## Test Plan
- All pre-commit hooks pass (mypy, ruff, distro-codegen)
- `llama stack list-deps starter` confirms psycopg2-binary is included
- Storage configuration correctly uses PostgreSQL backends
- Container builds successfully with postgres support
## Credits
Original work by @leseb in #2851. Rebased and updated by @r-bit-rry to
work with latest main.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Sébastien Han @leseb
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
list-deps takes positional args OR flags like --providers.
The issue with this is that these args need to be optional since, by
nature, one or the other can be specified.
Add a check to list-deps that checks `if not args.providers and not
args.config`. If this is true, help is printed and we exit.
resolves #4075
## Test Plan
before:
```
╰─ llama stack list-deps
Traceback (most recent call last):
File "/Users/charliedoern/projects/Documents/llama-stack/venv/bin/llama", line 10, in <module>
sys.exit(main())
^^^^^^
File "/Users/charliedoern/projects/Documents/llama-stack/src/llama_stack/cli/llama.py", line 52, in main
parser.run(args)
File "/Users/charliedoern/projects/Documents/llama-stack/src/llama_stack/cli/llama.py", line 43, in run
args.func(args)
File "/Users/charliedoern/projects/Documents/llama-stack/src/llama_stack/cli/stack/list_deps.py", line 51, in _run_stack_list_deps_command
return run_stack_list_deps_command(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/charliedoern/projects/Documents/llama-stack/src/llama_stack/cli/stack/_list_deps.py", line 135, in run_stack_list_deps_command
normal_deps, special_deps, external_provider_dependencies = get_provider_dependencies(build_config)
^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'build_config' where it is not associated with a value
```
after:
```
╰─ llama stack list-deps
usage: llama stack list-deps [-h] [--providers PROVIDERS] [--format {uv,deps-only}] [config | distro]
list the dependencies for a llama stack distribution
positional arguments:
config | distro Path to config file to use or name of known distro (llama stack list for a list). (default: None)
options:
-h, --help show this help message and exit
--providers PROVIDERS
sync dependencies for a list of providers and only those providers. This list is formatted like: api1=provider1,api2=provider2. Where there can be multiple
providers per API. (default: None)
--format {uv,deps-only}
Output format: 'uv' shows shell commands, 'deps-only' shows just the list of dependencies without `uv` (default) (default: deps-only)
```
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
It avoids a `model_limit` KeyError while trying to get embedding models for
Watsonx.
Closes https://github.com/llamastack/llama-stack/issues/4059
## Test Plan
Start server with watsonx distro:
```bash
llama stack list-deps watsonx | xargs -L1 uv pip install
uv run llama stack run watsonx
```
Run
```python
client = LlamaStackClient(base_url=base_url)
client.models.list()
```
Check if there is any embedding model available (currently there is not
a single one)
# What does this PR do?
1. Make telemetry tests as easy as possible for users by expanding the
`SpanStub` data class and creating the `MetricStub` dataclass as a way
to consistently marshal telemetry data in test fixtures and unmarshal
and handle it in tests.
2. Structure server and client tests to always follow the same standards
for consistent testing experience by using the `SpanStub` and
`MetricStub` data class objects.
3. Enable Metrics Testing for completions endpoint
4. Correct token metrics to use histograms instead of counts to capture
tokens per request rather than a cumulative count of tokens over the
lifecycle of the server.
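A small sketch of the histogram change in item 4, using the OpenTelemetry metrics API (instrument names are illustrative):
```python
from opentelemetry import metrics

meter = metrics.get_meter("llama_stack.inference")
prompt_tokens_hist = meter.create_histogram(
    "prompt_tokens", unit="token", description="Prompt tokens per request"
)

def record_request_tokens(prompt_tokens: int, model: str) -> None:
    # Each request contributes one observation, so per-request distributions
    # are recoverable -- unlike a counter that only grows over the server's lifetime.
    prompt_tokens_hist.record(prompt_tokens, attributes={"model": model})
```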
## Test Plan
These are tests
# What does this PR do?
Fixes issue #3922 where `llama stack list` only showed distributions
after they were run. This PR makes the command show all available
distributions immediately on a fresh install.
Closes #3922
## Changes
- **Updated `_get_distribution_dirs()`** to discover both built-in and
built distributions:
- Built-in distributions from `src/llama_stack/distributions/` (e.g.,
starter, nvidia, dell)
- Built distributions from `~/.llama/distributions`
- **Added a "Source" column** to distinguish between "built-in" and
"built" distributions
- **Built distributions override built-in ones** with the same name
(expected behavior)
- **Updated config file detection logic** to handle both naming
conventions:
- Built-in: `build.yaml` and `run.yaml`
- Built: `{name}-build.yaml` and `{name}-run.yaml`
## Test Plan
### Unit Tests
Added comprehensive unit tests in
`tests/unit/distribution/test_stack_list.py`:
```bash
uv run pytest tests/unit/distribution/test_stack_list.py -v
```
**Result**: ✅ All 8 tests pass
- `test_builtin_distros_shown_without_running` - Verifies the core fix
for issue #3922
- `test_builtin_and_built_distros_shown_together` - Ensures both types
are shown
- `test_built_distribution_overrides_builtin` - Tests override behavior
- `test_empty_distributions` - Edge case handling
- `test_config_files_detection_builtin` - Config file detection for
built-in distros
- `test_config_files_detection_built` - Config file detection for built
distros
- `test_llamastack_prefix_stripped` - Name normalization
- `test_hidden_directories_ignored` - Filters hidden directories
### Manual Testing
**Before the fix** (simulated with empty `~/.llama/distributions`):
```bash
$ llama stack list
No stacks found in ~/.llama/distributions
```
**After the fix**:
```bash
$ llama stack list
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Stack Name ┃ Source ┃ Path ┃ Build Config ┃ Run Config ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ ci-tests │ built-in │ /path/to/src/... │ Yes │ Yes │
│ dell │ built-in │ /path/to/src/... │ Yes │ Yes │
│ meta-reference-g… │ built-in │ /path/to/src/... │ Yes │ Yes │
│ nvidia │ built-in │ /path/to/src/... │ Yes │ Yes │
│ open-benchmark │ built-in │ /path/to/src/... │ Yes │ Yes │
│ postgres-demo │ built-in │ /path/to/src/... │ Yes │ Yes │
│ starter │ built-in │ /path/to/src/... │ Yes │ Yes │
│ starter-gpu │ built-in │ /path/to/src/... │ Yes │ Yes │
│ watsonx │ built-in │ /path/to/src/... │ Yes │ Yes │
└───────────────────┴──────────┴───────────────────┴──────────────┴────────────┘
```
**After running a distribution**:
```bash
$ llama stack run starter # Creates ~/.llama/distributions/starter
$ llama stack list
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Stack Name ┃ Source ┃ Path ┃ Build Config ┃ Run Config ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ ... │ built-in │ ... │ Yes │ Yes │
│ starter │ built │ ~/.llama/distri… │ No │ No │
│ ... │ built-in │ ... │ Yes │ Yes │
└───────────────────┴──────────┴───────────────────┴──────────────┴────────────┘
```
Note how `starter` now shows as "built" and points to
`~/.llama/distributions`, overriding the built-in version.
## Breaking Changes
**No breaking changes** - This is a bug fix that improves user
experience with minimal risk:
- No programmatic parsing of output found in the codebase
- Table format is clearly for human consumption
- The new "Source" column helps users understand where distributions
come from
- The behavior change is exactly what users expect (seeing all available
distributions)
---------
Co-authored-by: Claude <noreply@anthropic.com>
Added a script to cleanup recordings. While doing this, moved the CI
matrix generation to a separate script so there is a single source of
truth for the matrix.
Ran the cleanup script as:
```
PYTHONPATH=. python scripts/cleanup_recordings.py
```
Also added this as part of the pre-commit workflow to ensure that the
recordings are always up to date and that no stale recordings are left
in the repo.
# What does this PR do?
These were maybe included in the webmethod?
The unit test was pointless too, since the request was never used
anywhere.
This shouldn't be in the API definition if we never consume it.
## Test Plan
CI with pre-commit on OpenAPI spec generation.
Signed-off-by: Sébastien Han <seb@redhat.com>
RAG aka file search is implemented via the Responses API by specifying
the file-search tool. The backend implementation remains unchanged. This
PR merely removes the directly exposed API surface which allowed users
to directly perform searches from the client.
This facility is now available via the `client.vector_store.search()`
OpenAI compatible API.
- Removes the deprecated agents (sessions and turns) API that was marked
alpha in 0.3.0
- Cleans up unused imports and orphaned types after the API removal
- Removes `SessionNotFoundError` and `AgentTurnInputType` which are no
longer needed
The agents API is completely superseded by the Responses + Conversations
APIs, and the client SDK Agent class already uses those implementations.
Corresponding client-side PR:
https://github.com/llamastack/llama-stack-client-python/pull/295
# What does this PR do?
This PR migrates `unittest` to `pytest` in
`tests/unit/providers/nvidia/test_eval.py`.
Part of https://github.com/llamastack/llama-stack/issues/2680
Supersedes https://github.com/llamastack/llama-stack/pull/2791
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
This PR removes all routes which we had marked deprecated for the 0.3.0
release.
This includes:
- all the `/v1/openai/v1/` routes (the corresponding /v1 routes still
exist of course)
- the /agents API (which is superseded completely by Responses +
Conversations)
- several alpha routes which had a "v1" route to aid transitioning to
"v1alpha"
This is the corresponding client-python change:
https://github.com/llamastack/llama-stack-client-python/pull/294
The llama-stack-client now uses /`v1/openai/v1/models` which returns
OpenAI-compatible model objects with 'id' and 'custom_metadata' fields
instead of the Resource-style 'identifier' field. Updated api_recorder
to handle the new endpoint and modified tests to access model metadata
appropriately. Deleted stale model recordings for re-recording.
**NOTE: CI will be red on this one since it is dependent on
https://github.com/llamastack/llama-stack-client-python/pull/291/files
landing. I verified locally that it is green.**
# What does this PR do?
This API hasn't received any traction and close to zero interest from
the community. Let's revisit in the future if things change.
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
We need to remove `/v1/openai/v1` paths shortly. There is one trouble --
our current `/v1/openai/v1/models` endpoint provides different data than
`/v1/models`. Unfortunately our tests target the latter (llama-stack
customized) behavior. We need to get to true OpenAI compatibility.
This is step 1: adding `custom_metadata` field to `OpenAIModel` that
includes all the extra stuff we add in the native `/v1/models` response.
This can be extracted on the consumer end by looking at
`__pydantic_extra__` or other similar fields.
This PR:
- Adds `custom_metadata` field to `OpenAIModel` class in
`src/llama_stack/apis/models/models.py`
- Modified `openai_list_models()` in
`src/llama_stack/core/routing_tables/models.py` to populate
custom_metadata
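Roughly, the shape being added looks like this (a sketch, not the exact class definition; defaults are assumptions):
```python
from pydantic import BaseModel

class OpenAIModel(BaseModel):
    id: str
    object: str = "model"
    created: int
    owned_by: str = "llama_stack"
    # Carries the llama-stack-specific extras (provider ids, model type,
    # metadata, ...) that the native /v1/models response exposes.
    custom_metadata: dict | None = None
```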
Next Steps
1. Update stainless client to use `/v1/openai/v1/models` instead of
`/v1/models`
2. Migrate tests to read from `custom_metadata`
3. Remove `/v1/openai/v1/` prefix entirely and consolidate to single
`/v1/models` endpoint
Fixes race condition causing "database is locked" errors during
concurrent writes to SQLite, particularly in streaming responses with
guardrails where multiple inference calls write simultaneously.
Enable Write-Ahead Logging (WAL) mode for SQLite which allows multiple
concurrent readers and one writer without blocking. Set busy_timeout to
5s so SQLite retries instead of failing immediately. Remove the logic
that disabled write queues for SQLite since WAL mode eliminates the
locking issues that prompted disabling them.
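A minimal sketch of the SQLite settings involved (shown with the stdlib `sqlite3` module; the actual store applies them through its own connection layer):
```python
import sqlite3

conn = sqlite3.connect("llama_stack.db")
# Allow concurrent readers alongside a single writer.
conn.execute("PRAGMA journal_mode=WAL;")
# Retry for up to 5 seconds instead of failing immediately with "database is locked".
conn.execute("PRAGMA busy_timeout=5000;")
```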
Fixes: test_output_safety_guardrails_safe_content[stream=True] flake
# What does this PR do?
This prevents interference from already running servers, and allows
multiple concurrent integration test runs. Unleash the AIs!
## Test Plan
Start an LS server at port 8321.
Then observe that the test run uses port 8322:
❯ uv run --no-sync ./scripts/integration-tests.sh --stack-config
server:ci-tests --inference-mode replay --setup ollama --suite base
--pattern '(telemetry or safety)'
=== Llama Stack Integration Test Runner ===
Stack Config: server:ci-tests
Setup: ollama
Inference Mode: replay
Test Suite: base
Test Subdirs:
Test Pattern: (telemetry or safety)
Checking llama packages
llama-stack 0.4.0.dev0 /Users/erichuang/projects/new_test_server
llama-stack-client 0.3.0
ollama 0.6.0
=== Applying Setup Environment Variables ===
Setting SQLITE_STORE_DIR:
/var/folders/cz/vyh7y1d11xg881lsxsshnc5c0000gn/T/tmp.bKLsaVAxyU
Setting stack config type: server
Setting up environment variables:
export OLLAMA_URL='http://0.0.0.0:11434'
export SAFETY_MODEL='ollama/llama-guard3:1b'
Will use port: 8322
=== Starting Llama Stack Server ===
Waiting for Llama Stack Server to start on port 8322...
✅ Llama Stack Server started successfully
This commit introduces Mergify, a powerful bot designed to assist with
automated merging and other CI-related tasks. As an initial step, we
enable a basic feature: automatically notifying users when a pull
request has merge conflicts.
When a conflict is detected, Mergify will add a label to the PR. This
label will be removed once the conflict is resolved.
This is foundation PR to activate the bot and start using it for
backports too.
In the future, we plan to expand Mergify’s role to include auto-merging,
as discussed in #1667, once the project is ready.
Signed-off-by: Sébastien Han <seb@redhat.com>
Without this hint Qwen3-0.6B tends to reply with the full name
and sometimes doesn't reply with the correct drafted year.
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
This seems to be an ancient artifact from when we were using readthedocs.
Now Docusaurus reads the specs directly.
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Fixes latent bug where UV_INDEX_STRATEGY was only exported to GITHUB_ENV
but not to the current shell.
While this bug doesn't currently affect main (since UV_EXTRA_INDEX_URL
is only set on release branches), it's a latent bug that could cause
issues if the logic changes in the future or if someone tests with
UV_EXTRA_INDEX_URL set.
The setup-runner action only exported UV_INDEX_STRATEGY to GITHUB_ENV
(for subsequent steps), not to the current shell environment. Since uv
sync runs in the same step, it would never see the variable if it were
set.
This fix adds `export UV_INDEX_STRATEGY=unsafe-best-match` to make the
variable available in the current shell before running uv commands.
Related: #4019 (same fix for release-0.3.x where the bug is actively
triggered)
# What does this PR do?
llama stack run --providers takes a list of providers in the format
api1=provider1,api2=provider2.
This allows users to run with a simple list of providers.
Given the architecture of `create_app`, this run config needs to be
written to disk; use ~/.llama/distribution/providers-run/run.yaml each
time for consistency.
resolves #3956
## Test Plan
New unit tests to ensure --providers works.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Fixes container builds failing with UV index strategy errors when build
args are passed with empty values.
Docker ARGs declared with empty defaults (ARG UV_INDEX_STRATEGY="")
become environment variables with empty string values in RUN commands.
UV interprets these as if --index-strategy "" was passed on the command
line, causing build failures with "error: a value is required for
'--index-strategy <UV_INDEX_STRATEGY>'".
This is a footgun because empty string ≠ unset variable, and ARGs
silently propagate to all RUN commands, only failing when declared with
empty defaults.
The fix unsets UV_EXTRA_INDEX_URL and UV_INDEX_STRATEGY at the start of
RUN blocks, saves the values early, and only restores them for editable
installs with RC dependencies. All other install modes (PyPI, test-pypi,
client) now run with a clean environment.
Backports UV index configuration fixes from `release-0.3.x` (PR #4002).
The main issue: when we created the release branch infrastructure, we
configured UV to use `test.pypi` as the PRIMARY index to resolve RC
dependencies. This caused UV to look for ALL packages there first, which
led to problems - some packages don't have binary wheels on `test.pypi`,
so UV tried building from source and failed (like the `psycopg2-binary`
issue we hit).
The fix is simple: use PyPI as primary (default) and `test.pypi` as an
EXTRA index. UV will check PyPI first for everything, and only fall back
to `test.pypi` for packages not found there (like our RC client
versions).
This PR includes:
- Fixed `install-llama-stack-client` action to output
`UV_EXTRA_INDEX_URL` instead of `UV_INDEX_URL`
- New `uv-run-with-index.sh` wrapper that auto-detects release branches
and sets UV env vars
- Updated pre-commit hooks (`uv-lock`, codegen, etc.) to use the wrapper
- Pass UV env vars as Docker build args in all locations
- Scope UV env vars properly in Containerfile (inline for llama-stack
install, explicitly unset before distribution deps)
- Export UV env vars to `GITHUB_ENV` in setup-runner for cross-step
persistence
The wrapper detects release branches automatically in both CI and local
environments, so this "just works" without manual configuration. On main
(non-release branch), the wrapper becomes a no-op.
Tested and validated on `release-0.3.x` where all CI checks pass.
# What does this PR do?
Allow filtering for v1alpha, v1beta, deprecated and v1. Backward
incompatible change since by default it only returns v1 apis now.
## Test Plan
added unit test
Fixes CI failures on release branches where uv sync can't resolve RC
dependencies.
The problem: on release branches like `release-0.3.x`, pyproject.toml
requires `llama-stack-client>=0.3.1rc1`. But RC versions only exist on
test.pypi, not PyPI. So uv sync fails before we even get a chance to
install the client from git.
The fix is simple - on release branches, pre-install the client from the
matching git branch first, then run uv sync. This satisfies the RC
requirement and lets dependency resolution succeed.
Modified setup-runner and pre-commit workflows to do this. Also cleaned
up some duplicate logic in setup-test-environment that's now handled
centrally.
Example failure:
5415478835
Replace unused `LLAMA_STACK_CLIENT_DIR` env var (from old `llama stack
build`) with direct `uv pip install` for release branch client
installation.
cc @ehhuang
# What does this PR do?
Add rerank API for NVIDIA Inference Provider.
<!-- If resolving an issue, uncomment and update the line below -->
Closes #3278
## Test Plan
Unit test:
```
pytest tests/unit/providers/nvidia/test_rerank_inference.py
```
Integration test:
```
pytest -s -v tests/integration/inference/test_rerank.py --stack-config="inference=nvidia" --rerank-model=nvidia/nvidia/nv-rerankqa-mistral-4b-v3 --env NVIDIA_API_KEY="" --env NVIDIA_BASE_URL="https://integrate.api.nvidia.com"
```
Standardize CI workflows to use `release-X.Y.x` branch pattern instead
of multiple numeric variants.
That's the pattern we are settling on. See
https://github.com/llamastack/llama-stack-ops/pull/20 for reference.
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR fixes the handling of the external_providers_dir configuration
field to align with its ongoing deprecation, in favor of the provider
`module` specification approach.
It addresses the issue in #3950, where using the default provided
run.yaml config resulted in the `external_providers_dir` parameter being
set to the literal string `None`, and crashing the llama-stack server
when starting.
<!-- If resolving an issue, uncomment and update the line below -->
Closes #3950
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
- Built a new container image from `podman build . -f
containers/Containerfile --build-arg DISTRO_NAME=starter --tag
llama-stack:starter`
- Tested it locally with `podman run -it localhost/llama-stack:starter`
- Tested it on an OpenShift 4.19 cluster, deployed via the
llama-stack-k8s-operator.
Signed-off-by: Doug Edgar <dedgar@redhat.com>
… case variations
The ollama/llama3.2:3b-instruct-fp16 model returns string values with
trailing whitespace in structured JSON output. Updated test assertions
to use case-insensitive substring matching instead of exact equality.
- Use `.lower()` for case-insensitive comparison
- Check whether the expected value is contained in the actual value (handles
whitespace)
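For reference, the assertion pattern is roughly the following (values are made up):
```python
# Tolerate case and whitespace variations in the model's structured output.
actual = "Michael Jordan \n"   # hypothetical model output with trailing whitespace
expected = "michael jordan"
assert expected.lower() in actual.lower()
```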
Closes: #3996
Signed-off-by: Derek Higgins <derekh@redhat.com>
We will be updating our release procedure to be more "normal" or "sane".
We will
- create release branches like normal people
- land cherry-picks onto those branches
- run releases off of those branches
- no more "rc" branch pollution either
Given that, this PR cleans things up a bit
- Remove `-maint` suffix from release branch patterns in CI workflows
- Update branch matching to `release-X.Y.x` format
This should be "remote::vllm". The incorrect value causes some log probs
tests to be skipped with remote vLLM (they fail if run).
Signed-off-by: Derek Higgins <derekh@redhat.com>
`mypy` became very slow for the common path, which can make local
pre-commit runs very slow. Let's restore fast local runs:
- restore fast mirrors-mypy hook for local runs
- add optional mypy-full hook and docs so devs can match CI
- run full mypy in CI with a hint when failures occur
### Test Plan
- uv run pre-commit run mypy --all-files
- uv run pre-commit run mypy-full --hook-stage manual --all-files
- uv run --group dev --group type_checking mypy
# What does this PR do?
chunk_id in the Chunk class executes actual logic to compute a chunk ID.
This sort of logic should not live in the API spec.
Instead, the providers should be in charge of calling generate_chunk_id,
and pass it to `Chunk`.
This removes the incorrect dependency between the provider impl and the API impl.
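A rough sketch of the intended split, using a stand-in helper (the real generate_chunk_id may have a different signature):
```python
import hashlib
from dataclasses import dataclass

def make_chunk_id(document_id: str, text: str) -> str:
    # Stand-in for the stack's generate_chunk_id helper.
    return hashlib.sha256(f"{document_id}:{text}".encode()).hexdigest()[:16]

@dataclass
class Chunk:  # simplified stand-in for the API model: it only stores the ID
    chunk_id: str
    content: str

# The provider computes the ID and passes it in; the data model does no work.
chunk = Chunk(chunk_id=make_chunk_id("doc-1", "hello world"), content="hello world")
```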
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Running `./scripts/integration-tests.sh --network host` on a Mac fails
regularly due to how Docker runs on macOS.
If on macOS, keep bridge network mode.
before:
=== Starting Docker Container ===
Using image: localhost/distribution-ci-tests:dev
WARNING: Published ports are discarded when using host network mode
Waiting for Docker container to start...
❌ Docker container failed to start
Container logs:
INFO 2025-10-29 18:38:32,180 llama_stack.cli.stack.run:100 cli: Using
run configuration:
/workspace/src/llama_stack/distributions/ci-tests/run.yaml
... (stack starts but is not reachable on network)
after:
=== Starting Docker Container ===
Using image: localhost/distribution-ci-tests:dev
Using bridge networking with port mapping (non-Linux)
Waiting for Docker container to start...
✅ Docker container started successfully
=== Running Integration Tests ===
## Test Plan
integration tests pass!
Signed-off-by: Charlie Doern <cdoern@redhat.com>
## Summary
When users provide API keys via `X-LlamaStack-Provider-Data` header,
`models.list()` now returns models they can access from those providers,
not just pre-registered models from the registry.
This complements the routing fix from f88416ef8 which enabled inference
calls with `provider_id/model_id` format for unregistered models. Users
can now discover which models are available to them before making
inference requests.
The implementation reuses
`NeedsRequestProviderData.get_request_provider_data()` to validate
credentials, then dynamically fetches models from providers without
caching them since they're user-specific. Registry models take
precedence to respect any pre-configured aliases.
## Test Script
```python
#!/usr/bin/env python3
import json
import os
from openai import OpenAI
# Test 1: Without provider_data header
client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="dummy")
models = client.models.list()
anthropic_without = [m.id for m in models.data if m.id and "anthropic" in m.id]
print(f"Without header: {len(models.data)} models, {len(anthropic_without)} anthropic")
# Test 2: With provider_data header containing Anthropic API key
anthropic_api_key = os.environ["ANTHROPIC_API_KEY"]
client_with_key = OpenAI(
base_url="http://localhost:8321/v1/openai/v1",
api_key="dummy",
default_headers={
"X-LlamaStack-Provider-Data": json.dumps({"anthropic_api_key": anthropic_api_key})
}
)
models_with_key = client_with_key.models.list()
anthropic_with = [m.id for m in models_with_key.data if m.id and "anthropic" in m.id]
print(f"With header: {len(models_with_key.data)} models, {len(anthropic_with)} anthropic")
print(f"Anthropic models: {anthropic_with}")
assert len(anthropic_with) > len(anthropic_without), "Should have more anthropic models with API key"
print("\n✓ Test passed!")
```
Run with a stack that has Anthropic provider configured (but without API
key in config):
```bash
ANTHROPIC_API_KEY=sk-ant-... python test_provider_data_models.py
```
## Summary
Fixes all mypy type errors in `providers/inline/agents/meta_reference/`
and removes exclusions from pyproject.toml.
## Changes
- Fix type annotations for Safety API message parameters
(OpenAIMessageParam)
- Add Action enum usage in access control checks
- Correct method signatures to match API supertype (parameter ordering)
- Handle optional return types with proper None checks
- Remove 3 meta_reference exclusions from mypy config
**Files fixed:** 25 errors across 3 files (safety.py, persistence.py,
agents.py)
## Summary
Resolves all mypy errors in meta reference agent OpenAI responses
implementation by adding proper type narrowing, None checks, and
Sequence type support.
## Changes
- Fixed streaming.py, openai_responses.py, utils.py, tool_executor.py,
agent_instance.py
- Added Sequence type support to schema generator (ensures correct JSON
schema generation)
- Applied union type narrowing and None checks throughout
## Test plan
- All modified files pass mypy type checking (0 errors)
- Schema generator produces correct `type: array` for Sequence types
---------
Co-authored-by: Claude <noreply@anthropic.com>
Error fixes in Agents implementation (`meta-reference` provider) --
adding proper type annotations and using type narrowing for optional
attributes. Essentially a bunch of `if x and (x_foo := getattr(x, "foo"))`
instead of accessing `x.foo` directly.
Part of ongoing mypy remediation effort.
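For illustration, the narrowing pattern looks roughly like this (simplified, not the actual agent code):
```python
from typing import Any

def describe(x: Any) -> str:
    # Bind the optional attribute once, then use the narrowed local name.
    if x and (foo := getattr(x, "foo", None)):
        return f"foo={foo}"
    return "no foo"
```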
---------
Co-authored-by: Claude <noreply@anthropic.com>
# What does this PR do?
This commit adds a new pre-commit hook to scan for non-FIPS-compliant
function usage within llama-stack.
Closes #3427
## Test Plan
Ran locally
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
This adds automated backward compatibility testing for `run.yaml` files.
As we evolve `StackRunConfig`, changes can inadvertently break existing
user configurations. This workflow catches those breaks before merge.
We test old run.yaml files (from main and the latest release) against
the PR's new code. If configs that worked before now fail, the PR is
blocked unless explicitly acknowledged as a breaking change.
**Two test layers:**
- Schema validation: Quick pytest checks that configs parse without
errors
- Integration tests: Full test suite execution to catch runtime semantic
issues (cross-field validations, provider initialization, etc.)
**What we test against:**
- main branch: Breaking changes here block the PR (this is the gate)
- Latest release: Informational only - shows if we've drifted from what
users have
If tests fail, the PR author must acknowledge the breaking change by
adding `!:` to the PR title (e.g., `feat!: change xyz`) or including
`BREAKING CHANGE:` in a commit message. Once acknowledged, the check
passes with a warning.
These jobs are run:
1. `check-main-compatibility` - Schema validation of all distribution
run.yaml files from main
2. `test-integration-main` - Full integration test suite using main's
ci-tests run.yaml
3. `test-integration-release` - Integration tests with latest release
config (informational)
4. `check-schema-release-compatibility` - Schema checks against release
(informational)
The integration tests catch issues that schema validation alone would
miss, like assertion failures in
`StackRunConfig.validate_server_stores()` or provider-specific runtime
logic.
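Roughly, the schema-validation layer amounts to something like the sketch below (the import path and file globbing are assumptions, not the exact workflow code):
```python
import pathlib

import pytest
import yaml

from llama_stack.core.datatypes import StackRunConfig  # assumed import path

RUN_YAMLS = sorted(pathlib.Path("src/llama_stack/distributions").glob("*/run.yaml"))

@pytest.mark.parametrize("path", RUN_YAMLS, ids=lambda p: p.parent.name)
def test_run_yaml_still_parses(path: pathlib.Path) -> None:
    # Raises if the PR's StackRunConfig no longer accepts an existing config.
    StackRunConfig(**yaml.safe_load(path.read_text()))
```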
Resolves #3311
Related to #3237
Remove unused methods that became obsolete after d266c59c:
- `_compute_and_log_token_usage`
- `_count_tokens`
- `stream_tokens_and_compute_metrics`
- `count_tokens_and_compute_metrics`
These methods are no longer referenced anywhere in the codebase
following the removal of deprecated inference.chat_completion
implementations.
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
# What does this PR do?
- Adds OpenAI files provider
- Note that file content retrieval is pretty limited by `purpose`
https://community.openai.com/t/file-uploads-error-why-can-t-i-download-files-with-purpose-user-data/1357013?utm_source=chatgpt.com
## Test Plan
Modify run yaml to use openai files provider:
```
files:
  - provider_id: openai
    provider_type: remote::openai
    config:
      api_key: ${env.OPENAI_API_KEY:=}
      metadata_store:
        backend: sql_default
        table_name: openai_files_metadata
# Then run files tests
❯ uv run --no-sync ./scripts/integration-tests.sh --stack-config server:ci-tests --inference-mode replay --setup ollama --suite base --pattern test_files
```
This PR enables routing of fully qualified model IDs of the form
`provider_id/model_id` even when the models are not registered with the
Stack.
Here's the situation: assume a remote inference provider which works
only when users provide their own API keys via
`X-LlamaStack-Provider-Data` header. By definition, we cannot list
models and hence update our routing registry. But because we _require_ a
provider ID in the models now, we can identify which provider to route
to and let that provider decide.
Note that we still try to look up our registry since it may have a
pre-registered alias. Just that we don't outright fail when we are not
able to look it up.
Also, updated inference router so that the responses have the _exact_
model that the request had.
## Test Plan
Added an integration test
Closes #3929
---------
Co-authored-by: ehhuang <ehhuang@users.noreply.github.com>
Adds type stubs and fixes mypy errors for better type coverage.
Changes:
- Added type_checking dependency group with type stubs (torchtune, trl,
etc.)
- Added lm-format-enforcer to pre-commit hook
- Created HFAutoModel Protocol for type-safe HuggingFace model handling
- Added mypy.overrides for untyped libraries (torchtune, fairscale,
etc.)
- Fixed type issues in post-training providers, databricks, and
api_recorder
Note: ~1,200 errors remain in excluded files (see pyproject.toml exclude
list).
---------
Co-authored-by: Claude <noreply@anthropic.com>
## Summary
- Fix OpenAI SDK NotGiven/Omit type mismatches in embeddings calls
- Fix incorrect OpenAIChatCompletionChunk import in vllm provider
- Refactor to avoid type:ignore comments by using conditional kwargs
## Changes
**openai_mixin.py (9 errors fixed):**
- Build kwargs conditionally for embeddings.create() to avoid
NotGiven/Omit mismatch
- Only include parameters when they have actual values (not None)
**gemini.py (9 errors fixed):**
- Apply same conditional kwargs pattern
- Add missing Any import
**vllm.py (2 errors fixed):**
- Use correct OpenAIChatCompletionChunk from llama_stack.apis.inference
- Remove incorrect alias from openai package
## Technical Notes
The OpenAI SDK has a type system quirk where `NOT_GIVEN` has type
`NotGiven` but parameter signatures expect `Omit`. By only passing
parameters with actual values, we avoid this mismatch entirely without
needing `# type: ignore` comments.
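The pattern looks roughly like this (illustrative names, not the exact mixin code):
```python
from typing import Any

def build_embedding_kwargs(
    model: str,
    input_texts: list[str],
    dimensions: int | None = None,
    user: str | None = None,
) -> dict[str, Any]:
    # Only include parameters that have real values; omitted keys let the
    # OpenAI SDK apply its own defaults, sidestepping NotGiven/Omit entirely.
    kwargs: dict[str, Any] = {"model": model, "input": input_texts}
    if dimensions is not None:
        kwargs["dimensions"] = dimensions
    if user is not None:
        kwargs["user"] = user
    return kwargs

# client.embeddings.create(**build_embedding_kwargs("text-embedding-3-small", ["hi"]))
```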
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude <noreply@anthropic.com>
Fixes mypy type errors in provider utilities and testing infrastructure:
- `mcp.py`: Cast incompatible client types, wrap image data properly
- `batches.py`: Rename walrus variable to avoid shadowing
- `api_recorder.py`: Use cast for Pydantic field annotation
No functional changes.
---------
Co-authored-by: Claude <noreply@anthropic.com>
## Summary
Fixes mypy type errors across 4 model implementation files (Phase 2d of
mypy suppression removal plan):
- `src/llama_stack/models/llama/llama3/multimodal/image_transform.py`
(10 errors fixed)
- `src/llama_stack/models/llama/checkpoint.py` (2 errors fixed)
- `src/llama_stack/models/llama/hadamard_utils.py` (1 error fixed)
- `src/llama_stack/models/llama/llama3/multimodal/encoder_utils.py` (1
error fixed)
## Changes
### image_transform.py
- Fixed return type annotation for `find_supported_resolutions` from
`Tensor` to `list[tuple[int, int]]`
- Fixed parameter and return type annotations for
`resize_without_distortion` from `Tensor` to `Image.Image`
- Resolved variable shadowing by using separate names:
`possible_resolutions_list` for the list and
`possible_resolutions_tensor` for the tensor
### checkpoint.py
- Replaced deprecated `torch.BFloat16Tensor` and
`torch.cuda.BFloat16Tensor` with
`torch.set_default_dtype(torch.bfloat16)`
- Fixed variable shadowing by renaming numpy array to `ckpt_paths_array`
to distinguish from the parameter `ckpt_paths: list[Path]`
### hadamard_utils.py
- Added `isinstance` assertion to narrow type from `nn.Module` to
`nn.Linear` before accessing `in_features` attribute
### encoder_utils.py
- Fixed variable shadowing by using `masks_list` for list accumulation
and `masks` for the final Tensor result
## Test plan
- Verified all files pass mypy type checking (only optional dependency
import warnings remain)
- No functional changes - only type annotations and variable naming
improvements
Stacks on PR #3933
Co-authored-by: Claude <noreply@anthropic.com>
Fixes mypy type errors in OpenTelemetry integration:
- Add type aliases for AttributeValue and Attributes
- Add helper to filter None values from attributes (OpenTelemetry
doesn't accept None)
- Cast metric and tracer objects to proper types
- Update imports after refactoring
No functional changes.
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR makes changes to the Responses API schema to introduce
OpenAI-compatible prompts. It changes the API only, so there is currently no
implementation. The follow-up PR with the actual implementation will be
submitted after this PR lands.
The need of this functionality was initiated in #3514.
> Note: #3514 is divided into three separate PRs. The current PR is the
second of three.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
CI
## Summary
This PR adds mypy and essential type stub packages to dev dependencies
as Phase 1 of the mypy suppression removal plan.
**Changes:**
- Add `mypy` to dev dependencies
- Add type stubs: `types-jsonschema`, `pandas-stubs`, `types-psutil`,
`types-tqdm`, `boto3-stubs`
**Impact:**
- Enables static type checking across the codebase
- Eliminates ~30 type checking errors related to missing type
information for third-party packages
- Provides foundation for subsequent PRs to remove type suppressions
**Part of:** Mypy suppression removal plan (Phase 1/4)
**Testing:**
```bash
uv sync --group dev
uv run mypy
```
# What does this PR do?
To match https://github.com/llamastack/llama-stack/pull/3847, we must not
update the lock manually but always reflect the update in pyproject.toml. The
lock is a snapshot of state at build time.
Signed-off-by: Sébastien Han <seb@redhat.com>
## Summary
- `preserve_contexts_async_generator` left `PROVIDER_DATA_VAR` (and
other context vars) populated after a streaming generator completed on
HEAD~1, so the asyncio context for request N+1 started with request N's
provider payload.
- FastAPI dependencies and middleware execute before
`request_provider_data_context` rebinds the header data, meaning
auth/logging hooks could observe a prior tenant's credentials or treat
them as authenticated. Traces and any background work that inspects the
context outside the `with` block leak as well—this is a real security
regression, not just a CLI artifact.
- The wrapper now restores each tracked `ContextVar` to the value it
held before the iteration (falling back to clearing when necessary)
after every yield and when the generator terminates, so provider data is
wiped while callers that set their own defaults keep them.
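A much-simplified sketch of the restore behavior (the real wrapper also restores around every yield, not only at termination):
```python
from contextvars import ContextVar
from typing import Any, AsyncGenerator

async def preserve_contexts(
    gen: AsyncGenerator[Any, None], tracked: list[ContextVar]
) -> AsyncGenerator[Any, None]:
    # Capture the values each tracked ContextVar held before iteration starts.
    saved = [(var, var.get(None)) for var in tracked]
    try:
        async for item in gen:
            yield item
    finally:
        # Restore the pre-iteration values so request N's provider data
        # never leaks into request N+1.
        for var, value in saved:
            var.set(value)
```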
## Test Plan
- `uv run pytest tests/unit/core/test_provider_data_context.py -q`
- `uv run pytest tests/unit/distribution/test_context.py -q`
Both suites fail on HEAD~1 and pass with this change.
# What does this PR do?
add provider-data key passing support to Cerebras, Databricks, NVIDIA
and RunPod
also, added missing tests for Fireworks, Anthropic, Gemini, SambaNova,
and vLLM
addresses #3517
## Test Plan
ci w/ new tests
---------
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
Migrates package structure to src/ layout following Python packaging
best practices.
All code moved from `llama_stack/` to `src/llama_stack/`. Public API
unchanged - imports remain `import llama_stack.*`.
Updated build configs, pre-commit hooks, scripts, and GitHub workflows
accordingly. All hooks pass, package builds cleanly.
**Developer note**: Reinstall after pulling: `pip install -e .`
# What does this PR do?
Introduces two main fixes to enhance the stability of Responses API when
dealing with tool calling responses and structured outputs.
### Changes Made
1. It added OpenAIResponseOutputMessageMCPCall and ListTools to
OpenAIResponseInput, but
https://github.com/llamastack/llama-stack/pull/3810 got merged and did the
same in a different way. Still, this PR does it in a way that keeps
OpenAIResponsesOutput and the allowed objects in OpenAIResponseInput in sync.
2. Add protection in case self.ctx.response_format does not have a type
attribute.
BREAKING CHANGE: OpenAIResponseInput now uses OpenAIResponseOutput union
type.
This is semantically equivalent - all previously accepted types are
still supported
via the OpenAIResponseOutput union. This improves type consistency and
maintainability.
This patch ensures that if max_tokens is not defined, it is set to None
instead of 0 when calling openai_chat_completion. This way, providers (like
Gemini) that cannot handle `max_tokens = 0` will not fail.
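The guard amounts to something like this (names assumed):
```python
def normalize_max_tokens(max_tokens: int | None) -> int | None:
    # 0 or unset means "no explicit limit": send None so providers such as
    # Gemini don't reject max_tokens=0.
    return max_tokens if max_tokens else None

assert normalize_max_tokens(0) is None
assert normalize_max_tokens(None) is None
assert normalize_max_tokens(256) == 256
```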
Issue: #3666
The vector_provider_wrapper was only limiting providers to
faiss/sqlite-vec for replay mode, but CI tests also run in record mode
with the same limited set of providers. This caused test failures when
trying to test against milvus, chromadb, pgvector, weaviate, and qdrant
which aren't configured in the record job.
Bumps
[@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node)
from 24.8.1 to 24.9.1.
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps
[@types/react-dom](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react-dom)
from 19.2.1 to 19.2.2.
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react-dom">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
Add static file import system for docs
- Use `remark-code-import` plugin to embed code at build time
- Support importing Python code with syntax highlighting using
`raw-loader` + `ReactMarkdown`
One caveat is that currently when embedding markdown with code used the
syntax highlighting isn't behaving but I'll investigate that in a follow
up.
## Test Plan
Python Example:
<img width="1372" height="995" alt="Screenshot 2025-10-23 at 9 22 18 PM"
src="https://github.com/user-attachments/assets/656d2c78-4d9b-45a4-bd5e-3f8490352b85"
/>
Markdown example:
<img width="1496" height="1070" alt="Screenshot 2025-10-23 at 9 22
38 PM"
src="https://github.com/user-attachments/assets/6c0a07ec-ff7c-45aa-b05f-8c46acd4445c"
/>
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Clean up telemetry code since the telemetry API has been removed.
- moved telemetry files out of providers to core
- removed from Api
## Test Plan
❯ OTEL_SERVICE_NAME=llama_stack
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 uv run llama stack run
starter
❯ curl http://localhost:8321/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4o-mini",
"messages": [
{
"role": "user",
"content": "Hello!"
}
]
}'
-> verify traces in Grafana
CI
# What does this PR do?
https://platform.openai.com/docs/api-reference/moderations supports an
optional model parameter.
This PR adds support for using moderations API with model=None if a
default shield id is provided via safety config.
## Test Plan
added tests
manual test:
```
> SAFETY_MODEL='together/meta-llama/Llama-Guard-4-12B' uv run llama stack run starter
> curl http://localhost:8321/v1/moderations \
-H "Content-Type: application/json" \
-d '{
"input": [
"hello"
]
}'
```
Migrates k8s run configs to match the updated run configs
- Replace storage.references with storage.stores
- Wrap resources under registered_resources section
- Update provider configs to use persistence with namespace/backend
- Add telemetry and vector_stores top-level sections
- Simplify agent/files metadata store configuration
Let us enable responses suite in CI now.
Also a minor fix: MCP tool tests intentionally trigger authentication
failures to verify error handling, but the resulting error logs clutter
test output.
Our unit test outputs are filled with all kinds of obscene logs. This
makes it really hard to spot real issues quickly. The problem is that
these logs are necessary to output at the given logging level when the
server is operating normally. It's just that we don't want to see some
of them (especially the noisy ones) during tests.
This PR begins the cleanup. We use pytest's caplog fixture for suppression.
# What does this PR do?
## Test Plan
```
~/projects/lst3 remotes/origin/HEAD*
.venv ❯ curl http://localhost:8321/v1/moderations \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"input": [
"hello"
]
}'
{"detail":"Invalid value: No shield associated with provider_resource id gpt-4o-mini: choose from ['together/meta-llama/Llama-Guard-4-12B']"}
```
Move the conversation sync logic before the yield to ensure it executes even
when streaming consumers break early after receiving the response.completed
event.
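A simplified sketch of the ordering (illustrative names, not the actual implementation):
```python
from typing import Any, AsyncIterator, Awaitable, Callable

async def stream_response(
    events: AsyncIterator[dict[str, Any]],
    sync_conversation: Callable[[dict[str, Any]], Awaitable[None]],
) -> AsyncIterator[dict[str, Any]]:
    async for event in events:
        if event.get("type") == "response.completed":
            # Sync before yielding: this runs even if the consumer stops
            # iterating immediately after seeing response.completed.
            await sync_conversation(event)
        yield event
```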
## Test Plan
```
OLLAMA_URL=http://localhost:11434 \
pytest -sv tests/integration/responses/ \
--stack-config server:ci-tests \
--text-model ollama/llama3.2:3b-instruct-fp16 \
--inference-mode live \
-k conversation_multi
```
This test now passes.
Bumps [openai](https://github.com/openai/openai-python) from 1.107.0 to
2.5.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/openai/openai-python/releases">openai's
releases</a>.</em></p>
<blockquote>
<h2>v2.5.0</h2>
<h2>2.5.0 (2025-10-17)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v2.4.0...v2.5.0">v2.4.0...v2.5.0</a></p>
<h3>Features</h3>
<ul>
<li><strong>api:</strong> api update (<a
href="8b280d57d6">8b280d5</a>)</li>
</ul>
<h3>Chores</h3>
<ul>
<li>bump <code>httpx-aiohttp</code> version to 0.1.9 (<a
href="67f2f0afe5">67f2f0a</a>)</li>
</ul>
<h2>v2.4.0</h2>
<h2>2.4.0 (2025-10-16)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v2.3.0...v2.4.0">v2.3.0...v2.4.0</a></p>
<h3>Features</h3>
<ul>
<li><strong>api:</strong> Add support for gpt-4o-transcribe-diarize on
audio/transcriptions endpoint (<a
href="bdbe9b8f44">bdbe9b8</a>)</li>
</ul>
<h3>Chores</h3>
<ul>
<li>fix dangling comment (<a
href="da14e99606">da14e99</a>)</li>
<li><strong>internal:</strong> detect missing future annotations with
ruff (<a
href="2672b8f072">2672b8f</a>)</li>
</ul>
<h2>v2.3.0</h2>
<h2>2.3.0 (2025-10-10)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v2.2.0...v2.3.0">v2.2.0...v2.3.0</a></p>
<h3>Features</h3>
<ul>
<li><strong>api:</strong> comparison filter in/not in (<a
href="aa49f626a6">aa49f62</a>)</li>
</ul>
<h3>Chores</h3>
<ul>
<li><strong>package:</strong> bump jiter to >=0.10.0 to support
Python 3.14 (<a
href="https://redirect.github.com/openai/openai-python/issues/2618">#2618</a>)
(<a
href="aa445cab5c">aa445ca</a>)</li>
</ul>
<h2>v2.2.0</h2>
<h2>2.2.0 (2025-10-06)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v2.1.0...v2.2.0">v2.1.0...v2.2.0</a></p>
<h3>Features</h3>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/openai/openai-python/blob/main/CHANGELOG.md">openai's
changelog</a>.</em></p>
<blockquote>
<h2>2.5.0 (2025-10-17)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v2.4.0...v2.5.0">v2.4.0...v2.5.0</a></p>
<h3>Features</h3>
<ul>
<li><strong>api:</strong> api update (<a
href="8b280d57d6">8b280d5</a>)</li>
</ul>
<h3>Chores</h3>
<ul>
<li>bump <code>httpx-aiohttp</code> version to 0.1.9 (<a
href="67f2f0afe5">67f2f0a</a>)</li>
</ul>
<h2>2.4.0 (2025-10-16)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v2.3.0...v2.4.0">v2.3.0...v2.4.0</a></p>
<h3>Features</h3>
<ul>
<li><strong>api:</strong> Add support for gpt-4o-transcribe-diarize on
audio/transcriptions endpoint (<a
href="bdbe9b8f44">bdbe9b8</a>)</li>
</ul>
<h3>Chores</h3>
<ul>
<li>fix dangling comment (<a
href="da14e99606">da14e99</a>)</li>
<li><strong>internal:</strong> detect missing future annotations with
ruff (<a
href="2672b8f072">2672b8f</a>)</li>
</ul>
<h2>2.3.0 (2025-10-10)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v2.2.0...v2.3.0">v2.2.0...v2.3.0</a></p>
<h3>Features</h3>
<ul>
<li><strong>api:</strong> comparison filter in/not in (<a
href="aa49f626a6">aa49f62</a>)</li>
</ul>
<h3>Chores</h3>
<ul>
<li><strong>package:</strong> bump jiter to >=0.10.0 to support
Python 3.14 (<a
href="https://redirect.github.com/openai/openai-python/issues/2618">#2618</a>)
(<a
href="aa445cab5c">aa445ca</a>)</li>
</ul>
<h2>2.2.0 (2025-10-06)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v2.1.0...v2.2.0">v2.1.0...v2.2.0</a></p>
<h3>Features</h3>
<ul>
<li><strong>api:</strong> dev day 2025 launches (<a
href="38ac0093eb">38ac009</a>)</li>
</ul>
<h3>Bug Fixes</h3>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="513ae76253"><code>513ae76</code></a>
release: 2.5.0 (<a
href="https://redirect.github.com/openai/openai-python/issues/2694">#2694</a>)</li>
<li><a
href="ebf32212f7"><code>ebf3221</code></a>
release: 2.4.0</li>
<li><a
href="e043d7b164"><code>e043d7b</code></a>
chore: fix dangling comment</li>
<li><a
href="25cbb74f83"><code>25cbb74</code></a>
feat(api): Add support for gpt-4o-transcribe-diarize on
audio/transcriptions ...</li>
<li><a
href="8cdfd0650e"><code>8cdfd06</code></a>
codegen metadata</li>
<li><a
href="d5c64434b7"><code>d5c6443</code></a>
codegen metadata</li>
<li><a
href="b20a9e7b81"><code>b20a9e7</code></a>
chore(internal): detect missing future annotations with ruff</li>
<li><a
href="e5f93f5dae"><code>e5f93f5</code></a>
release: 2.3.0</li>
<li><a
href="044878859c"><code>0448788</code></a>
feat(api): comparison filter in/not in</li>
<li><a
href="85a91ade61"><code>85a91ad</code></a>
chore(package): bump jiter to >=0.10.0 to support Python 3.14 (<a
href="https://redirect.github.com/openai/openai-python/issues/2618">#2618</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/openai/openai-python/compare/v1.107.0...v2.5.0">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
- Extend the model type to include rerank models.
- Implement `rerank()` method in inference router.
- Add `rerank_model_list` to `OpenAIMixin` to enable providers to
register and identify rerank models
- Update documentation.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
```
pytest tests/unit/providers/utils/inference/test_openai_mixin.py
```
# What does this PR do?
Updated quickstart `demo_script.py` to use OpenAI APIs, which is simply:
```python
import io, requests
from openai import OpenAI
url="https://www.paulgraham.com/greatwork.html"
client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")
vs = client.vector_stores.create()
response = requests.get(url)
pseudo_file = io.BytesIO(str(response.content).encode('utf-8'))
uploaded_file = client.files.create(file=(url, pseudo_file, "text/html"), purpose="assistants")
client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded_file.id)
resp = client.responses.create(
    model="openai/gpt-4o",
    input="How do you do great work? Use the existing knowledge_search tool.",
    tools=[{"type": "file_search", "vector_store_ids": [vs.id]}],
    include=["file_search_call.results"],
)
print(resp)
```
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Metadata conflicts with the default embedding model set on the server side
via extra_body. This removes the check and simply lets metadata take
precedence over extra_body:
`ValueError: Embedding model inconsistent between metadata
('text-embedding-3-small') and extra_body
('sentence-transformers/nomic-ai/nomic-embed-text-v1.5')`
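In effect, the resolution becomes something like this (illustrative names):
```python
def resolve_embedding_model(metadata: dict, extra_body: dict) -> str | None:
    # Prefer the metadata value; fall back to extra_body instead of raising
    # when the two disagree.
    return metadata.get("embedding_model") or extra_body.get("embedding_model")
```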
## Test Plan
CI
# What does this PR do?
Fix a segfault when loading the model.
The cc-vec integration failed with segfault when used with default
embedding model on macOS
`model_id: nomic-ai/nomic-embed-text-v1.5` and `provider_id:
sentence-transformers`
Checked the crash report and saw this is due to torch OpenMP settings.
Constraining to 1 thread works without crashes.
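The mitigation is roughly the following; the exact placement in the provider may differ:
```python
import os

# Constrain OpenMP before torch spins up its thread pool.
os.environ.setdefault("OMP_NUM_THREADS", "1")

import torch  # noqa: E402

torch.set_num_threads(1)
```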
## Test Plan
Tested with cc-vec integration
1. Start the server: `llama stack run starter`
2. Do the setup in https://github.com/raghotham/cc-vec to set env
variables and try
`uv run cc-vec index --url-patterns "%.github.io" --vector-store-name
"ml-research" --limit 50 --chunk-size 800 --overlap 400`
- Moved environment variable parsing and `setup_logging()` call from
module level to proper initialization points
- Added explicit `setup_logging()` calls in `server.py::create_app()`
and `library_client.py::AsyncLlamaStackAsLibraryClient.__init__()`
Module-level side effects are bad practice and can cause issues with
import order, testing, and circular dependencies. The previous
implementation ran logging setup on every import of the log module,
which is unpredictable and difficult to control.
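A sketch of the pattern with simplified names:
```python
import logging

def setup_logging(level: str = "INFO") -> None:
    logging.basicConfig(level=level)

def create_app():
    # Explicit call at an initialization point instead of a module-level
    # side effect that runs on every import of the log module.
    setup_logging()
    # ... build and return the app here
```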
---------
Co-authored-by: Claude <noreply@anthropic.com>
Kill the `builtin::rag` tool group completely since it is no longer
targeted. We use the Responses implementation for knowledge_search which
uses the `openai_vector_stores` pathway.
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
## Summary
- Link pre-commit bot comment to workflow run instead of PR for better
debugging
- Dump docker container logs before removal to ensure logs are actually
captured
## Changes
1. **Pre-commit bot**: Changed the initial bot comment to link
"pre-commit hooks" text to the actual workflow run URL instead of just
having the PR number auto-link
2. **Docker logs**: Moved docker container log dumping from GitHub
Actions to the integration-tests.sh script's stop_container() function,
ensuring logs are captured before container removal
## Test plan
- Pre-commit bot comment will now have a clickable link to the workflow
run
- Docker container logs will be successfully captured in CI runs
# What does this PR do?
Similarly to `alpha:`, move `v1beta` routes under a `beta` group so the
client will have `client.beta`.
From what I can tell, the openapi.stainless.yml file is hand-written while
the openapi.yml file is generated and copied using the shell script, so I did
this by hand.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Bumps
[@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node)
from 24.3.0 to 24.8.1.
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [fastapi](https://github.com/fastapi/fastapi) from 0.116.1 to
0.119.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/fastapi/fastapi/releases">fastapi's
releases</a>.</em></p>
<blockquote>
<h2>0.119.0</h2>
<p>FastAPI now (temporarily) supports both Pydantic v2 models and
<code>pydantic.v1</code> models at the same time in the same app, to
make it easier for any FastAPI apps still using Pydantic v1 to gradually
but quickly <strong>migrate to Pydantic v2</strong>.</p>
<pre lang="Python"><code>from fastapi import FastAPI
from pydantic import BaseModel as BaseModelV2
from pydantic.v1 import BaseModel

class Item(BaseModel):
    name: str
    description: str | None = None

class ItemV2(BaseModelV2):
    title: str
    summary: str | None = None

app = FastAPI()

@app.post("/items/", response_model=ItemV2)
def create_item(item: Item):
    return {"title": item.name, "summary": item.description}
</code></pre>
<p>Adding this feature was a big effort with the main objective of
making it easier for the few applications still stuck in Pydantic v1 to
migrate to Pydantic v2.</p>
<p>And with this, support for <strong>Pydantic v1 is now
deprecated</strong> and will be <strong>removed</strong> from FastAPI in
a future version soon.</p>
<p><strong>Note</strong>: have in mind that the Pydantic team already
stopped supporting Pydantic v1 for recent versions of Python, starting
with Python 3.14.</p>
<p>You can read in the docs more about how to <a
href="https://fastapi.tiangolo.com/how-to/migrate-from-pydantic-v1-to-pydantic-v2/">Migrate
from Pydantic v1 to Pydantic v2</a>.</p>
<h3>Features</h3>
<ul>
<li>✨ Add support for <code>from pydantic.v1 import BaseModel</code>,
mixed Pydantic v1 and v2 models in the same app. PR <a
href="https://redirect.github.com/fastapi/fastapi/pull/14168">#14168</a>
by <a
href="https://github.com/tiangolo"><code>@tiangolo</code></a>.</li>
</ul>
<h2>0.118.3</h2>
<h3>Upgrades</h3>
<ul>
<li>⬆️ Add support for Python 3.14. PR <a
href="https://redirect.github.com/fastapi/fastapi/pull/14165">#14165</a>
by <a
href="https://github.com/svlandeg"><code>@svlandeg</code></a>.</li>
</ul>
<h2>0.118.2</h2>
<h3>Fixes</h3>
<ul>
<li>🐛 Fix tagged discriminated union not recognized as body field. PR <a
href="https://redirect.github.com/fastapi/fastapi/pull/12942">#12942</a>
by <a
href="https://github.com/frankie567"><code>@frankie567</code></a>.</li>
</ul>
<h3>Internal</h3>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="2e721e1b02"><code>2e721e1</code></a>
🔖 Release version 0.119.0</li>
<li><a
href="fc7a0686af"><code>fc7a068</code></a>
📝 Update release notes</li>
<li><a
href="3a3879b2c3"><code>3a3879b</code></a>
📝 Update release notes</li>
<li><a
href="d34918abf0"><code>d34918a</code></a>
✨ Add support for <code>from pydantic.v1 import BaseModel</code>, mixed
Pydantic v1 and ...</li>
<li><a
href="352dbefc63"><code>352dbef</code></a>
🔖 Release version 0.118.3</li>
<li><a
href="96e7d6eaa4"><code>96e7d6e</code></a>
📝 Update release notes</li>
<li><a
href="3611c3fc5b"><code>3611c3f</code></a>
⬆️ Add support for Python 3.14 (<a
href="https://redirect.github.com/fastapi/fastapi/issues/14165">#14165</a>)</li>
<li><a
href="942fce394b"><code>942fce3</code></a>
🔖 Release version 0.118.2</li>
<li><a
href="13b067c9b6"><code>13b067c</code></a>
📝 Update release notes</li>
<li><a
href="185cecd891"><code>185cecd</code></a>
🐛 Fix tagged discriminated union not recognized as body field (<a
href="https://redirect.github.com/fastapi/fastapi/issues/12942">#12942</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/fastapi/fastapi/compare/0.116.1...0.119.0">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
Mirrors build_container.sh; trying to resolve:
0.105 + [ editable = editable ]
0.105 + [ ! -d /workspace/llama-stack ]
0.105 + uv pip install --no-cache-dir -e /workspace/llama-stack
0.261 Using Python 3.12.12 environment at: /usr/local
0.479 × No solution found when resolving dependencies:
0.479 ╰─▶ Because only llama-stack-client<=0.2.23 is available and
0.479 llama-stack==0.3.0rc4 depends on llama-stack-client>=0.3.0rc4, we
can
0.479 conclude that llama-stack==0.3.0rc4 cannot be used.
0.479 And because only llama-stack==0.3.0rc4 is available and you
require
0.479 llama-stack, we can conclude that your requirements are
unsatisfiable.
------
## Test Plan
**NOTE: this is a backwards incompatible change to the run-configs.**
A small QOL update, but this will prove useful when I rename "vector_dbs"
to "vector_stores" next.
Moves all the `models, shields, ...` keys in run-config under a
`registered_resources` sub-key.
# What does this PR do?
Refactor setting default vector store provider and embedding model to
use an optional `vector_stores` config in the `StackRunConfig` and clean
up code to do so (had to add back in some pieces of VectorDB). Also
added remote Qdrant and Weaviate to starter distro (based on other PR
where inference providers were added for UX).
New config is simply (default for Starter distro):
```yaml
vector_stores:
default_provider_id: faiss
default_embedding_model:
provider_id: sentence-transformers
model_id: nomic-ai/nomic-embed-text-v1.5
```
## Test Plan
CI and Unit tests.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
**This PR changes configurations in a backward incompatible way.**
Run configs today repeat full SQLite/Postgres snippets everywhere a
store is needed, which means duplicated credentials, extra connection
pools, and lots of drift between files. This PR introduces named storage
backends so the stack and providers can share a single catalog and
reference those backends by name.
## Key Changes
- Add `storage.backends` to `StackRunConfig`, register each KV/SQL
backend once at startup, and validate that references point to the right
family.
- Move server stores under `storage.stores` with lightweight references
(backend + namespace/table) instead of full configs.
- Update every provider/config/doc to use the new reference style;
docs/codegen now surface the simplified YAML.
## Migration
Before:
```yaml
metadata_store:
type: sqlite
db_path: ~/.llama/distributions/foo/registry.db
inference_store:
type: postgres
host: ${env.POSTGRES_HOST}
port: ${env.POSTGRES_PORT}
db: ${env.POSTGRES_DB}
user: ${env.POSTGRES_USER}
password: ${env.POSTGRES_PASSWORD}
conversations_store:
type: postgres
host: ${env.POSTGRES_HOST}
port: ${env.POSTGRES_PORT}
db: ${env.POSTGRES_DB}
user: ${env.POSTGRES_USER}
password: ${env.POSTGRES_PASSWORD}
```
After:
```yaml
storage:
backends:
kv_default:
type: kv_sqlite
db_path: ~/.llama/distributions/foo/kvstore.db
sql_default:
type: sql_postgres
host: ${env.POSTGRES_HOST}
port: ${env.POSTGRES_PORT}
db: ${env.POSTGRES_DB}
user: ${env.POSTGRES_USER}
password: ${env.POSTGRES_PASSWORD}
stores:
metadata:
backend: kv_default
namespace: registry
inference:
backend: sql_default
table_name: inference_store
max_write_queue_size: 10000
num_writers: 4
conversations:
backend: sql_default
table_name: openai_conversations
```
Provider configs follow the same pattern—for example, a Chroma vector
adapter switches from:
```yaml
providers:
vector_io:
- provider_id: chromadb
provider_type: remote::chromadb
config:
url: ${env.CHROMADB_URL}
kvstore:
type: sqlite
db_path: ~/.llama/distributions/foo/chroma.db
```
to:
```yaml
providers:
vector_io:
- provider_id: chromadb
provider_type: remote::chromadb
config:
url: ${env.CHROMADB_URL}
persistence:
backend: kv_default
namespace: vector_io::chroma_remote
```
Once the backends are declared, everything else just points at them, so
rotating credentials or swapping to Postgres happens in one place and
the stack reuses a single connection pool.
# Problem
The current inline provider appends the user provided instructions to
messages as a system prompt, but the returned response object does not
contain the instructions field (as specified in the OpenAI responses
spec).
# What does this PR do?
This pull request adds the instruction field to the response object
definition and updates the inline provider. It also ensures that
instructions from the previous response are not carried over to the next
response (as specified in the OpenAI spec).
Closes #[3566](https://github.com/llamastack/llama-stack/issues/3566)
## Test Plan
- Tested manually for change in model response w.r.t supplied
instructions field.
- Added a unit test to check that the instructions from the previous
response are not carried over to the next response.
- Added integration tests to check instructions parameter in the
returned response object.
- Added new recordings for the integration tests.
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
fix: nested claims mapping in OAuth2 token validation
The get_attributes_from_claims function was only checking for top-level
claim keys, causing token validation to fail when using nested claims
like "resource_access.llamastack.roles" (common in Keycloak JWT tokens).
Updated the function to support dot notation for traversing nested claim
structures, giving precedence to dot notation over literal keys with dots in
the claims mapping.
Added test coverage.
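The lookup behaves roughly like this sketch (not the exact helper):
```python
from typing import Any

def get_claim(claims: dict[str, Any], key: str) -> Any:
    # Dot notation wins: walk nested dicts first, then fall back to a literal
    # key that happens to contain dots.
    node: Any = claims
    for part in key.split("."):
        if isinstance(node, dict) and part in node:
            node = node[part]
        else:
            node = None
            break
    return node if node is not None else claims.get(key)

claims = {"resource_access": {"llamastack": {"roles": ["admin"]}}}
assert get_claim(claims, "resource_access.llamastack.roles") == ["admin"]
```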
Closes: #3812
Signed-off-by: Derek Higgins <derekh@redhat.com>
Bumps [sqlalchemy](https://github.com/sqlalchemy/sqlalchemy) from 2.0.41
to 2.0.44.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/sqlalchemy/sqlalchemy/releases">sqlalchemy's
releases</a>.</em></p>
<blockquote>
<h1>2.0.44</h1>
<p>Released: October 10, 2025</p>
<h2>platform</h2>
<ul>
<li><strong>[platform] [bug]</strong> Unblocked automatic greenlet
installation for Python 3.14 now that
there are greenlet wheels on pypi for python 3.14.</li>
</ul>
<h2>orm</h2>
<ul>
<li>
<p><strong>[orm] [usecase]</strong> The way ORM Annotated Declarative
interprets Python <a href="https://peps.python.org/pep-0695">PEP 695</a>
type aliases
in <code>Mapped[]</code> annotations has been refined to expand the
lookup scheme. A
<a href="https://peps.python.org/pep-0695">PEP 695</a> type can now be
resolved based on either its direct presence in
<code>_orm.registry.type_annotation_map</code> or its immediate resolved
value, as long as a recursive lookup across multiple <a
href="https://peps.python.org/pep-0695">PEP 695</a> types is
not required for it to resolve. This change reverses part of the
restrictions introduced in 2.0.37 as part of <a
href="https://www.sqlalchemy.org/trac/ticket/11955">#11955</a>, which
deprecated (and disallowed in 2.1) the ability to resolve any <a
href="https://peps.python.org/pep-0695">PEP 695</a>
type that was not explicitly present in
<code>_orm.registry.type_annotation_map</code>. Recursive lookups of
<a href="https://peps.python.org/pep-0695">PEP 695</a> types remains
deprecated in 2.0 and disallowed in version 2.1,
as do implicit lookups of <code>NewType</code> types without an entry in
<code>_orm.registry.type_annotation_map</code>.</p>
<p>Additionally, new support has been added for generic <a
href="https://peps.python.org/pep-0695">PEP 695</a> aliases that
refer to <a href="https://peps.python.org/pep-0593">PEP 593</a>
<code>Annotated</code> constructs containing
<code>_orm.mapped_column()</code> configurations. See the sections below
for
examples.</p>
<p>References: <a
href="https://www.sqlalchemy.org/trac/ticket/12829">#12829</a></p>
</li>
<li>
<p><strong>[orm] [bug]</strong> Fixed a caching issue where
<code>_orm.with_loader_criteria()</code> would
incorrectly reuse cached bound parameter values when used with
<code>_sql.CompoundSelect</code> constructs such as
<code>_sql.union()</code>. The
issue was caused by the cache key for compound selects not including the
execution options that are part of the <code>_sql.Executable</code> base
class,
which <code>_orm.with_loader_criteria()</code> uses to apply its
criteria
dynamically. The fix ensures that compound selects and other executable
constructs properly include execution options in their cache key
traversal.</p>
<p>References: <a
href="https://www.sqlalchemy.org/trac/ticket/12905">#12905</a></p>
</li>
</ul>
<h2>engine</h2>
<ul>
<li><strong>[engine] [bug]</strong> Implemented initial support for
free-threaded Python by adding new tests
and reworking the test harness to include Python 3.13t and Python 3.14t
in</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/sqlalchemy/sqlalchemy/commits">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
removes error:
ConnectionError: HTTPConnectionPool(host='localhost', port=4318): Max
retries exceeded with url: /v1/traces
(Caused by NewConnectionError('<urllib3.connection.HTTPConnection object
at 0x10fd98e60>: Failed to establish a
new connection: [Errno 61] Connection refused'))
## Test Plan
uv run llama stack run starter
curl http://localhost:8321/v1/models
observe no error in server logs
# What does this PR do?
relates to #2878
We introduce a Containerfile which is used to replace the `llama stack
build` command (removal in a separate PR).
```
llama stack build --distro starter --image-type venv --run
```
is replaced by
```
llama stack list-deps starter | xargs -L1 uv pip install
llama stack run starter
```
- See the updated workflow files for e2e workflow.
## Test Plan
CI
```
❯ docker build . -f docker/Dockerfile --build-arg DISTRO_NAME=starter --build-arg INSTALL_MODE=editable --tag test_starter
❯ docker run -p 8321:8321 test_starter
❯ curl http://localhost:8321/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"messages": [
{
"role": "user",
"content": "Hello!"
}
]
}'
```
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/llamastack/llama-stack/pull/3839).
* #3855
* __->__ #3839
# What does this PR do?
the sidebar currently has an extra `ii. Run the Script` because it's
incorrectly put into the doc as an H3 not an H4 (like the other ones)
<img width="239" height="218" alt="Screenshot 2025-10-20 at 1 04 54 PM"
src="https://github.com/user-attachments/assets/eb8cb26e-7ea9-4b61-9101-d64965b39647"
/>
Fix this which will update the sidebar
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
- Fix examples in the NVIDIA inference documentation to align with
current API requirements.
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
N/A
Bumps [jest](https://github.com/jestjs/jest/tree/HEAD/packages/jest) and
[@types/jest](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/jest).
These dependencies needed to be updated together.
Updates `jest` from 29.7.0 to 30.2.0
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/jestjs/jest/releases">jest's
releases</a>.</em></p>
<blockquote>
<h2>30.2.0</h2>
<h3>Chore & Maintenance</h3>
<ul>
<li><code>[*]</code> Update example repo for testing React Native
projects (<a
href="https://redirect.github.com/jestjs/jest/pull/15832">#15832</a>)</li>
<li><code>[*]</code> Update <code>jest-watch-typeahead</code> to v3 (<a
href="https://redirect.github.com/jestjs/jest/pull/15830">#15830</a>)</li>
</ul>
<h2>Features</h2>
<ul>
<li><code>[jest-environment-jsdom-abstract]</code> Add support for JSDOM
v27 (<a
href="https://redirect.github.com/jestjs/jest/pull/15834">#15834</a>)</li>
</ul>
<h3>Fixes</h3>
<ul>
<li><code>[babel-jest]</code> Export the <code>TransformerConfig</code>
interface (<a
href="https://redirect.github.com/jestjs/jest/pull/15820">#15820</a>)</li>
<li><code>[jest-config]</code> Fix <code>jest.config.ts</code> with TS
loader specified in docblock pragma (<a
href="https://redirect.github.com/jestjs/jest/pull/15839">#15839</a>)</li>
</ul>
<h2>30.1.3</h2>
<h3>Fixes</h3>
<ul>
<li>Fix <code>unstable_mockModule</code> with <code>node:</code>
prefixed core modules.</li>
</ul>
<h2>30.1.2</h2>
<h3>Fixes</h3>
<ul>
<li><code>[jest-snapshot-utils]</code> Correct snapshot header regexp to
work with newline across OSes (<a
href="https://redirect.github.com/jestjs/jest/pull/15803">#15803</a>)</li>
</ul>
<h2>30.1.1</h2>
<h3>Fixes</h3>
<ul>
<li><code>[jest-snapshot-utils]</code> Fix deprecated goo.gl snapshot
warning not handling Windows end-of-line sequences (<a
href="https://redirect.github.com/jestjs/jest/pull/15800">#15800</a>)</li>
</ul>
<h2>30.1.0</h2>
<h2>Features</h2>
<ul>
<li><code>[jest-leak-detector]</code> Configurable GC aggressiveness
regarding to V8 heap snapshot generation (<a
href="https://redirect.github.com/jestjs/jest/pull/15793/">#15793</a>)</li>
<li><code>[jest-runtime]</code> Reduce redundant ReferenceError
messages</li>
<li><code>[jest-core]</code> Include test modules that failed to load
when --onlyFailures is active</li>
</ul>
<h3>Fixes</h3>
<ul>
<li>`[jest-snapshot-utils] Fix deprecated goo.gl snapshot guide link not
getting replaced with fully canonical URL (<a
href="https://redirect.github.com/jestjs/jest/pull/15787">#15787</a>)</li>
<li><code>[jest-circus]</code> Fix <code>it.concurrent</code> not
working with <code>describe.skip</code> (<a
href="https://redirect.github.com/jestjs/jest/pull/15765">#15765</a>)</li>
<li><code>[jest-snapshot]</code> Fix mangled inline snapshot updates
when used with Prettier 3 and CRLF line endings</li>
<li><code>[jest-runtime]</code> Importing from
<code>@jest/globals</code> in more than one file no longer breaks
relative paths (<a
href="https://redirect.github.com/jestjs/jest/issues/15772">#15772</a>)</li>
</ul>
<h1>Chore</h1>
<ul>
<li><code>[expect]</code> Update docblock for <code>toContain()</code>
to display info on substring check (<a
href="https://redirect.github.com/jestjs/jest/pull/15789">#15789</a>)</li>
</ul>
<h2>30.0.2</h2>
<h2>What's Changed</h2>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/jestjs/jest/blob/main/CHANGELOG.md">jest's
changelog</a>.</em></p>
<blockquote>
<h2>30.2.0</h2>
<h3>Chore & Maintenance</h3>
<ul>
<li><code>[*]</code> Update example repo for testing React Native
projects (<a
href="https://redirect.github.com/jestjs/jest/pull/15832">#15832</a>)</li>
<li><code>[*]</code> Update <code>jest-watch-typeahead</code> to v3 (<a
href="https://redirect.github.com/jestjs/jest/pull/15830">#15830</a>)</li>
</ul>
<h2>Features</h2>
<ul>
<li><code>[jest-environment-jsdom-abstract]</code> Add support for JSDOM
v27 (<a
href="https://redirect.github.com/jestjs/jest/pull/15834">#15834</a>)</li>
</ul>
<h3>Fixes</h3>
<ul>
<li><code>[jest-matcher-utils]</code> Fix infinite recursion with
self-referential getters in <code>deepCyclicCopyReplaceable</code> (<a
href="https://redirect.github.com/jestjs/jest/pull/15831">#15831</a>)</li>
<li><code>[babel-jest]</code> Export the <code>TransformerConfig</code>
interface (<a
href="https://redirect.github.com/jestjs/jest/pull/15820">#15820</a>)</li>
<li><code>[jest-config]</code> Fix <code>jest.config.ts</code> with TS
loader specified in docblock pragma (<a
href="https://redirect.github.com/jestjs/jest/pull/15839">#15839</a>)</li>
</ul>
<h2>30.1.3</h2>
<h3>Fixes</h3>
<ul>
<li>Fix <code>unstable_mockModule</code> with <code>node:</code>
prefixed core modules.</li>
</ul>
<h2>30.1.2</h2>
<h3>Fixes</h3>
<ul>
<li><code>[jest-snapshot-utils]</code> Correct snapshot header regexp to
work with newline across OSes (<a
href="https://redirect.github.com/jestjs/jest/pull/15803">#15803</a>)</li>
</ul>
<h2>30.1.1</h2>
<h3>Fixes</h3>
<ul>
<li><code>[jest-snapshot-utils]</code> Fix deprecated goo.gl snapshot
warning not handling Windows end-of-line sequences (<a
href="https://redirect.github.com/jestjs/jest/pull/15800">#15800</a>)</li>
<li><code>[jest-snapshot-utils]</code> Improve messaging about goo.gl
snapshot link change (<a
href="https://redirect.github.com/jestjs/jest/pull/15821">#15821</a>)</li>
</ul>
<h2>30.1.0</h2>
<h2>Features</h2>
<ul>
<li><code>[jest-leak-detector]</code> Configurable GC aggressiveness
regarding to V8 heap snapshot generation (<a
href="https://redirect.github.com/jestjs/jest/pull/15793/">#15793</a>)</li>
<li><code>[jest-runtime]</code> Reduce redundant ReferenceError
messages</li>
<li><code>[jest-core]</code> Include test modules that failed to load
when --onlyFailures is active</li>
</ul>
<h3>Fixes</h3>
<ul>
<li><code>[jest-snapshot-utils]</code> Fix deprecated goo.gl snapshot
guide link not getting replaced with fully canonical URL (<a
href="https://redirect.github.com/jestjs/jest/pull/15787">#15787</a>)</li>
<li><code>[jest-circus]</code> Fix <code>it.concurrent</code> not
working with <code>describe.skip</code> (<a
href="https://redirect.github.com/jestjs/jest/pull/15765">#15765</a>)</li>
<li><code>[jest-snapshot]</code> Fix mangled inline snapshot updates
when used with Prettier 3 and CRLF line endings</li>
<li><code>[jest-runtime]</code> Importing from
<code>@jest/globals</code> in more than one file no longer breaks
relative paths (<a
href="https://redirect.github.com/jestjs/jest/issues/15772">#15772</a>)</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="855864e3f9"><code>855864e</code></a>
v30.2.0</li>
<li><a
href="da9b532f04"><code>da9b532</code></a>
v30.1.3</li>
<li><a
href="ebfa31cc97"><code>ebfa31c</code></a>
v30.1.2</li>
<li><a
href="d347c0f3f8"><code>d347c0f</code></a>
v30.1.1</li>
<li><a
href="4d5f41d088"><code>4d5f41d</code></a>
v30.1.0</li>
<li><a
href="22236cf58b"><code>22236cf</code></a>
v30.0.5</li>
<li><a
href="f4296d2bc8"><code>f4296d2</code></a>
v30.0.4</li>
<li><a
href="d4a6c94daf"><code>d4a6c94</code></a>
v30.0.3</li>
<li><a
href="393acbfac3"><code>393acbf</code></a>
v30.0.2</li>
<li><a
href="5ce865b406"><code>5ce865b</code></a>
v30.0.1</li>
<li>Additional commits viewable in <a
href="https://github.com/jestjs/jest/commits/v30.2.0/packages/jest">compare
view</a></li>
</ul>
</details>
<br />
Updates `@types/jest` from 29.5.14 to 30.0.0
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/jest">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps
[jest-environment-jsdom](https://github.com/jestjs/jest/tree/HEAD/packages/jest-environment-jsdom)
from 30.1.2 to 30.2.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/jestjs/jest/releases">jest-environment-jsdom's
releases</a>.</em></p>
<blockquote>
<h2>30.2.0</h2>
<h3>Chore & Maintenance</h3>
<ul>
<li><code>[*]</code> Update example repo for testing React Native
projects (<a
href="https://redirect.github.com/jestjs/jest/pull/15832">#15832</a>)</li>
<li><code>[*]</code> Update <code>jest-watch-typeahead</code> to v3 (<a
href="https://redirect.github.com/jestjs/jest/pull/15830">#15830</a>)</li>
</ul>
<h2>Features</h2>
<ul>
<li><code>[jest-environment-jsdom-abstract]</code> Add support for JSDOM
v27 (<a
href="https://redirect.github.com/jestjs/jest/pull/15834">#15834</a>)</li>
</ul>
<h3>Fixes</h3>
<ul>
<li><code>[babel-jest]</code> Export the <code>TransformerConfig</code>
interface (<a
href="https://redirect.github.com/jestjs/jest/pull/15820">#15820</a>)</li>
<li><code>[jest-config]</code> Fix <code>jest.config.ts</code> with TS
loader specified in docblock pragma (<a
href="https://redirect.github.com/jestjs/jest/pull/15839">#15839</a>)</li>
</ul>
<h2>30.1.3</h2>
<h3>Fixes</h3>
<ul>
<li>Fix <code>unstable_mockModule</code> with <code>node:</code>
prefixed core modules.</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/jestjs/jest/blob/main/CHANGELOG.md">jest-environment-jsdom's
changelog</a>.</em></p>
<blockquote>
<h2>30.2.0</h2>
<h3>Chore & Maintenance</h3>
<ul>
<li><code>[*]</code> Update example repo for testing React Native
projects (<a
href="https://redirect.github.com/jestjs/jest/pull/15832">#15832</a>)</li>
<li><code>[*]</code> Update <code>jest-watch-typeahead</code> to v3 (<a
href="https://redirect.github.com/jestjs/jest/pull/15830">#15830</a>)</li>
</ul>
<h2>Features</h2>
<ul>
<li><code>[jest-environment-jsdom-abstract]</code> Add support for JSDOM
v27 (<a
href="https://redirect.github.com/jestjs/jest/pull/15834">#15834</a>)</li>
</ul>
<h3>Fixes</h3>
<ul>
<li><code>[jest-matcher-utils]</code> Fix infinite recursion with
self-referential getters in <code>deepCyclicCopyReplaceable</code> (<a
href="https://redirect.github.com/jestjs/jest/pull/15831">#15831</a>)</li>
<li><code>[babel-jest]</code> Export the <code>TransformerConfig</code>
interface (<a
href="https://redirect.github.com/jestjs/jest/pull/15820">#15820</a>)</li>
<li><code>[jest-config]</code> Fix <code>jest.config.ts</code> with TS
loader specified in docblock pragma (<a
href="https://redirect.github.com/jestjs/jest/pull/15839">#15839</a>)</li>
</ul>
<h2>30.1.3</h2>
<h3>Fixes</h3>
<ul>
<li>Fix <code>unstable_mockModule</code> with <code>node:</code>
prefixed core modules.</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="855864e3f9"><code>855864e</code></a>
v30.2.0</li>
<li>See full diff in <a
href="https://github.com/jestjs/jest/commits/v30.2.0/packages/jest-environment-jsdom">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
## Test Plan
.venv ❯ sh ./scripts/install.sh
⚠️ Found existing container(s) for 'ollama-server', removing...
⚠️ Found existing container(s) for 'llama-stack', removing...
⚠️ Found existing container(s) for 'jaeger', removing...
⚠️ Found existing container(s) for 'otel-collector', removing...
⚠️ Found existing container(s) for 'prometheus', removing...
⚠️ Found existing container(s) for 'grafana', removing...
📡 Starting telemetry stack...
🦙 Starting Ollama...
⏳ Waiting for Ollama daemon...
📦 Ensuring model is pulled: llama3.2:3b...
🦙 Starting Llama Stack...
⏳ Waiting for Llama Stack API...
..
🎉 Llama Stack is ready!
👉 API endpoint: http://localhost:8321
📖 Documentation: https://llamastack.github.io/latest/references/api_reference/index.html
💻 To access the llama stack CLI, exec into the container:
docker exec -ti llama-stack bash
📡 Telemetry dashboards:
Jaeger UI: http://localhost:16686
Prometheus UI: http://localhost:9090
Grafana UI: http://localhost:3000 (admin/admin)
OTEL Collector: http://localhost:4318
🐛 Report an issue @ https://github.com/llamastack/llama-stack/issues if you think it's a bug
# What does this PR do?
Adds a test and a standardized way to build future tests out for
telemetry in llama stack.
Contributes to https://github.com/llamastack/llama-stack/issues/3806
## Test Plan
This is the test plan 😎
In replay mode, inference is instantaneous. We don't need to wait 15
seconds for the batch to be done. Fixing the polling to use exponential
backoff makes things work super fast.
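A minimal sketch of the exponential-backoff polling pattern; `wait_for_batch` and its parameters are hypothetical, not the actual test helper:
```python
import time


def wait_for_batch(get_status, timeout=60.0, initial_delay=0.1, max_delay=5.0):
    """Poll get_status() with exponential backoff until a terminal state.

    In replay mode the first poll usually succeeds immediately; against a
    real backend the delay grows 0.1s, 0.2s, 0.4s, ... up to max_delay.
    """
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("completed", "failed", "cancelled", "expired"):
            return status
        time.sleep(delay)
        delay = min(delay * 2, max_delay)
    raise TimeoutError("batch did not reach a terminal state in time")
```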
# What does this PR do?
remove telemetry as a providable API from the codebase. This includes
removing it from generated distributions but also the provider registry,
the router, etc
since `setup_logger` is tied pretty strictly to `Api.telemetry` being in
impls we still need an "instantiated provider" in our implementations.
However it should not be auto-routed or provided. So in
validate_and_prepare_providers (called from resolve_impls) I made it so
that if run_config.telemetry.enabled, we set up the meta-reference
"provider" internally to be used so that log_event will work when
called.
This is the neatest way I think we can remove telemetry from the
provider configs but also not need to rip apart the whole "telemetry is
a provider" logic just yet, but we can do it internally later without
disrupting users.
so telemetry is removed from the registry such that if a user puts
`telemetry:` as an API in their build/run config it will err out, but
can still be used by us internally as we go through this transition.
relates to #3806
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
When the stack config is set to server in docker
(STACK_CONFIG_ARG=--stack-config=http://localhost:8321), the env variable
was not getting set correctly and the test id was not set, causing the
error below. This is needed for test-and-cut to work.
E openai.BadRequestError: Error code: 400 - {'detail': 'Invalid value:
Test ID is required for file ID allocation'}
5286461406
## Test Plan
CI
# What does this PR do?
Adds a subpage of the OpenAI compatibility page in the documentation.
This subpage documents known limitations of the Responses API.
<!-- If resolving an issue, uncomment and update the line below -->
Closes #3575
---------
Signed-off-by: Bill Murdock <bmurdock@redhat.com>
As indicated in the title. Our `starter` distribution enables all remote
providers _very intentionally_ because we believe it creates an easier,
more welcoming experience to new folks using the software. If we do
that, and then slam the logs with errors making them question their life
choices, it is not so good :)
Note that this fix is limited in scope. If you ever try to actually
instantiate the OpenAI client from a code path without an API key being
present, you deserve to fail hard.
## Test Plan
Run `llama stack run starter` with `OPENAI_API_KEY` set. No more wall of
text, just one message saying "listed 96 models".
a bunch of logger.info()s are good for server code to help debug in
production, but we don't want them killing our unit test output :)
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
**!!BREAKING CHANGE!!**
The lookup is also straightforward -- we always look for this identifier
and don't try to find a match for something without the provider_id
prefix.
Note that this ideally means we need to update the `register_model()`
API also (we should kill "identifier" from there) but I am not doing
that as part of this PR.
## Test Plan
Existing unit tests
Wanted to re-enable Responses CI but it seems to hang for some reason
due to some interactions with conversations_store or responses_store.
## Test Plan
```
# library client
./scripts/integration-tests.sh --stack-config ci-tests --suite responses
# server
./scripts/integration-tests.sh --stack-config server:ci-tests --suite responses
```
# What does this PR do?
Have closed the previous PR due to merge conflicts with multiple PRs
Addressed all comments from
https://github.com/llamastack/llama-stack/pull/3768 (sorry for carrying
over to this one)
## Test Plan
Added UTs and integration tests
Handle a base case when no stored messages exist because no Response
call has been made.
## Test Plan
```
./scripts/integration-tests.sh --stack-config server:ci-tests \
--suite responses --inference-mode record-if-missing --pattern test_conversation_responses
```
Fixed KeyError when chunks don't have document_id in metadata or
chunk_metadata. Updated logging to safely extract document_id using
getattr and RAG memory to handle different document_id locations. Added
test for missing document_id scenarios.
Fixes issue #3494 where /v1/vector-io/insert would crash with KeyError.
# What does this PR do?
Fixes a KeyError crash in `/v1/vector-io/insert` when chunks are missing
`document_id` fields. The API
was failing even though `document_id` is optional according to the
schema.
Closes #3494
## Test Plan
**Before fix:**
- POST to `/v1/vector-io/insert` with chunks → 500 KeyError
- Happened regardless of where `document_id` was placed
**After fix:**
- Same request works fine → 200 OK
- Tested with Postman using FAISS backend
- Added unit test covering missing `document_id` scenarios
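A minimal sketch of the safe-extraction pattern described above; the helper name and chunk attributes are assumptions modeled on the description, not the actual RAG memory code:
```python
def extract_document_id(chunk):
    """Best-effort document_id lookup across the places it may live.

    Checks chunk.metadata first, then chunk.chunk_metadata, using getattr
    and dict lookups that never raise, and returns None (instead of a
    KeyError) when the id is absent everywhere.
    """
    metadata = getattr(chunk, "metadata", None) or {}
    if "document_id" in metadata:
        return metadata["document_id"]
    chunk_metadata = getattr(chunk, "chunk_metadata", None)
    if chunk_metadata is not None:
        return getattr(chunk_metadata, "document_id", None)
    return None
```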
This PR updates the Conversation item related types and improves a
couple of critical parts of the implementation:
- it creates a streaming output item for the final assistant message
output by the model (see the sketch after this list). Until now we only
added content parts and included that message in the final response.
- rewrites the conversation update code completely to account for items
other than messages (tool calls, outputs, etc.)
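A hedged sketch of what a client consuming the stream might observe after this change, using the OpenAI Python client; the base URL, API key, and model id are illustrative assumptions:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

stream = client.responses.create(
    model="gpt-4o-mini",
    input="Say hello.",
    stream=True,
)
for event in stream:
    # The final assistant message is now surfaced as its own output item
    # (added/done events), not only as content parts.
    if event.type in ("response.output_item.added", "response.output_item.done"):
        print(event.type, getattr(event.item, "type", None))
    elif event.type == "response.completed":
        print("response complete")
```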
## Test Plan
Used the test script from
https://github.com/llamastack/llama-stack-client-python/pull/281 for
this
```
TEST_API_BASE_URL=http://localhost:8321/v1 \
pytest tests/integration/test_agent_turn_step_events.py::test_client_side_function_tool -xvs
```
# Add support for the Google Gemini `gemini-embedding-001` embedding model
and correctly register the model type
MR message created with the assistance of Claude-4.5-sonnet
This resolves https://github.com/llamastack/llama-stack/issues/3755
## What does this PR do?
This PR adds support for the `gemini-embedding-001` Google embedding
model to the llama-stack Gemini provider. This model provides
high-dimensional embeddings (3072 dimensions) compared to the existing
`text-embedding-004` model (768 dimensions). Old embeddings models (such
as text-embedding-004) will be deprecated soon according to Google
([Link](https://developers.googleblog.com/en/gemini-embedding-available-gemini-api/))
## Problem
The Gemini provider only supported the `text-embedding-004` embedding
model. The newer `gemini-embedding-001` model, which provides
higher-dimensional embeddings for improved semantic representation, was
not available through llama-stack.
## Solution
This PR consists of three commits that add the model, fix the model
registration, and enable embedding generation:
### Commit 1: Initial addition of gemini-embedding-001
Added metadata for `gemini-embedding-001` to the
`embedding_model_metadata` dictionary:
```python
embedding_model_metadata: dict[str, dict[str, int]] = {
"text-embedding-004": {"embedding_dimension": 768, "context_length": 2048},
"gemini-embedding-001": {"embedding_dimension": 3072, "context_length": 2048}, # NEW
}
```
**Issue discovered:** The model was not being registered correctly
because the dictionary keys didn't match the model IDs returned by
Gemini's API.
### Commit 2: Fix model ID matching with `models/` prefix
Updated both dictionary keys to include the `models/` prefix to match
Gemini's OpenAI-compatible API response format:
```python
embedding_model_metadata: dict[str, dict[str, int]] = {
"models/text-embedding-004": {"embedding_dimension": 768, "context_length": 2048}, # UPDATED
"models/gemini-embedding-001": {"embedding_dimension": 3072, "context_length": 2048}, # UPDATED
}
```
**Root cause:** Gemini's OpenAI-compatible API returns model IDs with
the `models/` prefix (e.g., `models/text-embedding-004`). The
`OpenAIMixin.list_models()` method directly matches these IDs against
the `embedding_model_metadata` dictionary keys. Without the prefix, the
models were being registered as LLMs instead of embedding models.
### Commit 3: Fix embedding generation for providers without usage stats
Fixed a bug in `OpenAIMixin.openai_embeddings()` that prevented
embedding generation for providers (like Gemini) that don't return usage
statistics:
```python
# Before (Line 351-354):
usage = OpenAIEmbeddingUsage(
prompt_tokens=response.usage.prompt_tokens, # ← Crashed with AttributeError
total_tokens=response.usage.total_tokens,
)
# After (Lines 351-362):
if response.usage:
usage = OpenAIEmbeddingUsage(
prompt_tokens=response.usage.prompt_tokens,
total_tokens=response.usage.total_tokens,
)
else:
usage = OpenAIEmbeddingUsage(
prompt_tokens=0, # Default when not provided
total_tokens=0, # Default when not provided
)
```
**Impact:** This fix enables embedding generation for **all** Gemini
embedding models, not just the newly added one.
## Changes
### Modified Files
**`llama_stack/providers/remote/inference/gemini/gemini.py`**
- Line 17: Updated `text-embedding-004` key to
`models/text-embedding-004`
- Line 18: Added `models/gemini-embedding-001` with correct metadata
**`llama_stack/providers/utils/inference/openai_mixin.py`**
- Lines 351-362: Added null check for `response.usage` to handle
providers without usage statistics
## Key Technical Details
### Model ID Matching Flow
1. `list_provider_model_ids()` calls Gemini's `/v1/models` endpoint
2. API returns model IDs like: `models/text-embedding-004`,
`models/gemini-embedding-001`
3. `OpenAIMixin.list_models()` (line 410) checks: `if metadata :=
self.embedding_model_metadata.get(provider_model_id)`
4. If matched, registers as `model_type: "embedding"` with metadata;
otherwise registers as `model_type: "llm"`
### Why Both Keys Needed the Prefix
The `text-embedding-004` model was already working because there was
likely separate configuration or manual registration handling it. For
auto-discovery to work correctly for **both** models, both keys must
match the API's model ID format exactly.
## How to test this PR
Verified the changes by:
1. **Model Auto-Discovery**: Started llama-stack server and confirmed
models are auto-discovered from Gemini API
2. **Model Registration**: Confirmed both embedding models are correctly
registered and visible
```bash
curl http://localhost:8325/v1/models | jq '.data[] | select(.provider_id == "gemini" and .model_type == "embedding")'
```
**Results:**
- ✅ `gemini/models/text-embedding-004` - 768 dimensions - `model_type:
"embedding"`
- ✅ `gemini/models/gemini-embedding-001` - 3072 dimensions -
`model_type: "embedding"`
3. **Before Fix (Commit 1)**: Models appeared as `model_type: "llm"`
without embedding metadata
4. **After Fix (Commit 2)**: Models correctly identified as `model_type:
"embedding"` with proper metadata
5. **Generate Embeddings**: Verified embedding generation works
```bash
curl -X POST http://localhost:8325/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"model": "gemini/models/gemini-embedding-001", "input": "test"}' | \
jq '.data[0].embedding | length'
```
# What does this PR do?
Enables automatic embedding model detection for vector stores by using
a `default_configured` boolean that can be defined in the `run.yaml`.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
- Unit tests
- Integration tests
- Simple example below:
Spin up the stack:
```bash
uv run llama stack build --distro starter --image-type venv --run
```
Then test with OpenAI's client:
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")
vs = client.vector_stores.create()
```
Previously you needed:
```python
vs = client.vector_stores.create(
extra_body={
"embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
"embedding_dimension": 384,
}
)
```
The `extra_body` is now unnecessary.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
Previously, the NVIDIA inference provider implemented a custom
`openai_embeddings` method with a hardcoded `input_type="query"`
parameter, which is required by NVIDIA asymmetric embedding
models ([https://github.com/llamastack/llama-stack/pull/3205](https://github.com/llamastack/llama-stack/pull/3205)).
Recently, an `extra_body` parameter was added to the embeddings API
([https://github.com/llamastack/llama-stack/pull/3794](https://github.com/llamastack/llama-stack/pull/3794)).
So, this PR updates the NVIDIA inference provider to use the base
`OpenAIMixin.openai_embeddings` method instead and pass the `input_type`
through the `extra_body` parameter for asymmetric embedding models.
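A hedged client-side example of what this enables, using the OpenAI Python client; the base URL, API key, and exact model id are illustrative assumptions:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

resp = client.embeddings.create(
    model="nvidia/llama-3.2-nv-embedqa-1b-v2",
    input="What is the capital of France?",
    # Forwarded to the provider; NVIDIA asymmetric models expect "query"
    # for search queries and "passage" for documents being indexed.
    extra_body={"input_type": "query"},
)
print(len(resp.data[0].embedding))
```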
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Run the following command for each `embedding_model`:
`nvidia/llama-3.2-nv-embedqa-1b-v2`, `nvidia/nv-embedqa-e5-v5`,
`nvidia/nv-embedqa-mistral-7b-v2`, and `snowflake/arctic-embed-l`.
```
pytest -s -v tests/integration/inference/test_openai_embeddings.py --stack-config="inference=nvidia" --embedding-model={embedding_model} --env NVIDIA_API_KEY={nvidia_api_key} --env NVIDIA_BASE_URL="https://integrate.api.nvidia.com" --inference-mode=record
```
# What does this PR do?
As discussed on discord, we do not need to reinvent the wheel for
telemetry. Instead we'll lean into the canonical OTEL stack.
Logs/traces/metrics will still be sent via OTEL - they just won't be
stored on, or queried through, the Stack.
This is the first of many PRs to remove telemetry API from Stack.
1) removed webmethod decorators so these endpoints are removed from the API spec
2) removed tests as @iamemilio is adding them on otel directly.
## Test Plan
# What does this PR do?
Updates CONTRIBUTING.md with the following changes:
- Use Python 3.12 (and why)
- Use pre-commit==4.3.0
- Recommend using -v with pre-commit to get detailed info about why it
is failing if it fails.
- Instructs users to go to the docs/ directory before rebuilding the
docs (it doesn't work unless you do that).
Signed-off-by: Bill Murdock <bmurdock@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
The purpose of this PR is to replace Llama Stack's default embedding
model with nomic-embed-text-v1.5.
These are the key reasons why Llama Stack community decided to switch
from all-MiniLM-L6-v2 to nomic-embed-text-v1.5:
1. The training data for
[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2#training-data)
includes a lot of data sets with various licensing terms, so it is
tricky to know when/whether it is appropriate to use this model for
commercial applications.
2. The model is not particularly competitive on major benchmarks. For
example, if you look at the [MTEB
Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) and click
on Miscellaneous/BEIR to see English information retrieval accuracy, you
see that the top of the leaderboard is dominated by enormous models but
also that there are many, many models of relatively modest size with
much higher Retrieval scores. If you want to look closely at the data, I
recommend clicking "Download Table" because it is easier to browse that
way.
More discussion can be found
[here](https://github.com/llamastack/llama-stack/issues/2418)
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes #2418
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
1. Run `./scripts/unit-tests.sh`
2. Integration tests via the CI workflow
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This PR fixes issues with the WatsonX provider so it works correctly
with LiteLLM.
The main problem was that WatsonX requests failed because the provider
data validator didn’t properly handle the API key and project ID. This
was fixed by updating the WatsonXProviderDataValidator and ensuring the
provider data is loaded correctly.
The openai_chat_completion method was also updated to match the behavior
of other providers while adding WatsonX-specific fields like project_id.
It still calls `await super().openai_chat_completion.__func__(self, params)`
to keep the existing setup and tracing logic.
After these changes, WatsonX requests now run correctly.
## Test Plan
The changes were tested by running chat completion requests and
confirming that credentials and project parameters are passed correctly.
I have tested with my WatsonX credentials, by using the cli with `uv run
llama-stack-client inference chat-completion --session`
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This commit migrates the authentication system from python-jose to PyJWT
to eliminate the dependency on the archived rsa package. The migration
includes:
- Refactored OAuth2TokenAuthProvider to use PyJWT's PyJWKClient for
clean JWKS handling
- Removed manual JWKS fetching, caching and key extraction logic in
favor of PyJWT's built-in functionality
The new implementation is cleaner, more maintainable, and follows PyJWT
best practices while maintaining full backward compatibility.
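A minimal sketch of the PyJWT pattern described above; the JWKS URL, audience, and issuer are placeholder assumptions:
```python
import jwt
from jwt import PyJWKClient

# PyJWKClient fetches and caches the JWKS and selects the signing key
# matching the token's "kid" header.
jwks_client = PyJWKClient("https://idp.example.com/realms/demo/protocol/openid-connect/certs")


def validate_token(token: str) -> dict:
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience="llama-stack",
        issuer="https://idp.example.com/realms/demo",
    )
```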
## Test Plan
Unit tests. Auth CI.
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
2 main changes:
1. Remove `provider_id` requirement in call to vector stores and
2. Removes "register first embedding model" logic
- Now forces embedding model id as required on Vector Store creation
Simplifies the UX for OpenAI to:
```python
vs = client.vector_stores.create(
name="my_citations_db",
extra_body={
"embedding_model": "ollama/nomic-embed-text:latest",
}
)
```
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Fixed the CI job to check the correct directory for file changes.
Artifacts are now stored in multiple directories, not just
./tests/integration/recordings.
Signed-off-by: Derek Higgins <derekh@redhat.com>
Applies the same pattern from
https://github.com/llamastack/llama-stack/pull/3777 to embeddings and
vector_stores.create() endpoints.
This should _not_ be a breaking change since (a) our tests were already
using the `extra_body` parameter when passing in to the backend (b) but
the backend probably wasn't extracting the parameters correctly. This PR
will fix that.
Updated APIs: `openai_embeddings(), openai_create_vector_store(),
openai_create_vector_store_file_batch()`
Bumps [framer-motion](https://github.com/motiondivision/motion) from
12.23.12 to 12.23.24.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/motiondivision/motion/blob/main/CHANGELOG.md">framer-motion's
changelog</a>.</em></p>
<blockquote>
<h2>[12.23.24] 2025-10-10</h2>
<h3>Fixed</h3>
<ul>
<li>Ensure that when a component remounts, it continues to fire
animations even when <code>initial={false}</code>.</li>
</ul>
<h2>[12.23.23] 2025-10-10</h2>
<h3>Added</h3>
<ul>
<li>Exporting <code>PresenceChild</code> and <code>PopChild</code> type
for internal use.</li>
</ul>
<h2>[12.23.22] 2025-09-25</h2>
<h3>Added</h3>
<ul>
<li>Exporting <code>HTMLElements</code> and <code>useComposedRefs</code>
type for internal use.</li>
</ul>
<h2>[12.23.21] 2025-09-24</h2>
<h3>Fixed</h3>
<ul>
<li>Fixing main-thread <code>scroll</code> with animations that contain
<code>delay</code>.</li>
</ul>
<h2>[12.23.20] 2025-09-24</h2>
<h3>Fixed</h3>
<ul>
<li>Suppress non-animatable value warning for instant animations.</li>
</ul>
<h2>[12.23.19] 2025-09-23</h2>
<h3>Fixed</h3>
<ul>
<li>Remove support for changing <code>ref</code> prop.</li>
</ul>
<h2>[12.23.18] 2025-09-19</h2>
<h3>Fixed</h3>
<ul>
<li><code><motion /></code> components now support changing
<code>ref</code> prop.</li>
</ul>
<h2>[12.23.17] 2025-09-19</h2>
<h3>Fixed</h3>
<ul>
<li>Ensure <code>animate()</code> <code>onComplete</code> only fires
once, when all values are complete.</li>
</ul>
<h2>[12.23.16] 2025-09-19</h2>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="b5df740a46"><code>b5df740</code></a>
v12.23.24</li>
<li><a
href="808ebce630"><code>808ebce</code></a>
Updating changelog</li>
<li><a
href="237eee2246"><code>237eee2</code></a>
v12.23.23</li>
<li><a
href="834965c803"><code>834965c</code></a>
Updating changelog</li>
<li><a
href="40690864e9"><code>4069086</code></a>
Update README.md</li>
<li><a
href="6da6b61e94"><code>6da6b61</code></a>
Update README.md with new sponsor links</li>
<li><a
href="e36683149d"><code>e366831</code></a>
Update README.md</li>
<li><a
href="7796f4f1e0"><code>7796f4f</code></a>
Update Gold section with new links and images</li>
<li><a
href="d1bb93757c"><code>d1bb937</code></a>
Update sponsor section in README.md</li>
<li><a
href="97fba16059"><code>97fba16</code></a>
Update sponsorship logos in README</li>
<li>Additional commits viewable in <a
href="https://github.com/motiondivision/motion/compare/v12.23.12...v12.23.24">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps
[@types/react-dom](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react-dom)
from 19.2.0 to 19.2.1.
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react-dom">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps
[@types/react](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react)
from 19.2.0 to 19.2.2.
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Implements missing streaming events from OpenAI Responses API spec:
- reasoning text/summary events for o1/o3 models,
- refusal events for safety moderation
- annotation events for citations,
- and file search streaming events.
Added optional reasoning_content field to chat completion chunks to
support non-standard provider extensions.
**NOTE:** OpenAI does _not_ fill reasoning_content when users use the
chat_completion APIs. This means there is no way for us to implement
Responses (with reasoning) by using OpenAI chat completions! We'd need
to transparently punt to OpenAI's responses endpoints if we wish to do
that. For others though (vLLM, etc.) we can use it.
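A hedged sketch of reading the non-standard field from streamed chat-completion chunks; the base URL, API key, and model id are illustrative assumptions, and only providers that surface reasoning tokens (e.g. vLLM-served reasoning models) populate it:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

stream = client.chat.completions.create(
    model="vllm/gpt-oss-20b",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Non-standard provider extension; absent on OpenAI itself, so guard
    # with getattr instead of attribute access.
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print("[reasoning]", reasoning, end="")
    if delta.content:
        print(delta.content, end="")
```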
## Test Plan
File search streaming test passes:
```
./scripts/integration-tests.sh --stack-config server:ci-tests \
--suite responses --setup gpt --inference-mode replay --pattern test_response_file_search_streaming_events
```
Need more complex setup and validation for reasoning tests (need a
vLLM-powered OSS model, maybe gpt-oss, which can return
reasoning_content). I will do that in a follow-up PR.
Bumps [psycopg2-binary](https://github.com/psycopg/psycopg2) from 2.9.10
to 2.9.11.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/psycopg/psycopg2/blob/master/NEWS">psycopg2-binary's
changelog</a>.</em></p>
<blockquote>
<h2>Current release</h2>
<p>What's new in psycopg 2.9.11
^^^^^^^^^^^^^^^^^^^^^^^^^^^^</p>
<ul>
<li>Add support for Python 3.14.</li>
<li>Avoid a segfault passing more arguments than placeholders if Python
is built
with assertions enabled
(🎫<code>[#1791](https://github.com/psycopg/psycopg2/issues/1791)</code>).</li>
<li><code>~psycopg2.errorcodes</code> map and
<code>~psycopg2.errors</code> classes updated to
PostgreSQL 18.</li>
<li>Drop support for Python 3.8.</li>
</ul>
<p>What's new in psycopg 2.9.10
^^^^^^^^^^^^^^^^^^^^^^^^^^^^</p>
<ul>
<li>Add support for Python 3.13.</li>
<li>Receive notifications on commit
(🎫<code>[#1728](https://github.com/psycopg/psycopg2/issues/1728)</code>).</li>
<li><code>~psycopg2.errorcodes</code> map and
<code>~psycopg2.errors</code> classes updated to
PostgreSQL 17.</li>
<li>Drop support for Python 3.7.</li>
</ul>
<p>What's new in psycopg 2.9.9
^^^^^^^^^^^^^^^^^^^^^^^^^^^</p>
<ul>
<li>Add support for Python 3.12.</li>
<li>Drop support for Python 3.6.</li>
</ul>
<p>What's new in psycopg 2.9.8
^^^^^^^^^^^^^^^^^^^^^^^^^^^</p>
<ul>
<li>Wheel package bundled with PostgreSQL 16 libpq in order to add
support for
recent features, such as <code>sslcertmode</code>.</li>
</ul>
<p>What's new in psycopg 2.9.7
^^^^^^^^^^^^^^^^^^^^^^^^^^^</p>
<ul>
<li>Fix propagation of exceptions raised during module initialization
(🎫<code>[#1598](https://github.com/psycopg/psycopg2/issues/1598)</code>).</li>
<li>Fix building when pg_config returns an empty string
(🎫<code>[#1599](https://github.com/psycopg/psycopg2/issues/1599)</code>).</li>
<li>Wheel package bundled with OpenSSL 1.1.1v.</li>
</ul>
<p>What's new in psycopg 2.9.6
^^^^^^^^^^^^^^^^^^^^^^^^^^^</p>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="fd9ae8cad2"><code>fd9ae8c</code></a>
chore: bump to version 2.9.11</li>
<li><a
href="d923840546"><code>d923840</code></a>
chore: update docs requirements</li>
<li><a
href="d42dc7169d"><code>d42dc71</code></a>
Merge branch 'fix-1791'</li>
<li><a
href="4fde6560c3"><code>4fde656</code></a>
fix: avoid failed assert passing more arguments than placeholders</li>
<li><a
href="8308c19d6a"><code>8308c19</code></a>
fix: drop warning about the use of deprecated PyWeakref_GetObject
function</li>
<li><a
href="1a1eabf098"><code>1a1eabf</code></a>
build(deps): bump actions/github-script from 7 to 8</li>
<li><a
href="897af8b38b"><code>897af8b</code></a>
build(deps): bump peter-evans/repository-dispatch from 3 to 4</li>
<li><a
href="ceefd30511"><code>ceefd30</code></a>
build(deps): bump actions/checkout from 4 to 5</li>
<li><a
href="4dc585430c"><code>4dc5854</code></a>
build(deps): bump actions/setup-python from 5 to 6</li>
<li><a
href="1945788dcf"><code>1945788</code></a>
Merge pull request <a
href="https://redirect.github.com/psycopg/psycopg2/issues/1802">#1802</a>
from edgarrmondragon/cp314-wheels</li>
<li>Additional commits viewable in <a
href="https://github.com/psycopg/psycopg2/compare/2.9.10...2.9.11">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from
6.8.0 to 7.0.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/astral-sh/setup-uv/releases">astral-sh/setup-uv's
releases</a>.</em></p>
<blockquote>
<h2>v7.0.0 🌈 node24 and a lot of bugfixes</h2>
<h2>Changes</h2>
<p>This release comes with a load of bug fixes and a speed up. Because
of switching from node20 to node24 it is also a breaking change. If you
are running on GitHub hosted runners this will just work, if you are
using self-hosted runners make sure, that your runners are up to date.
If you followed the normal installation instructions your self-hosted
runner will keep itself updated.</p>
<p>This release also removes the deprecated input
<code>server-url</code> which was used to download uv releases from a
different server.
The <a
href="https://github.com/astral-sh/setup-uv?tab=readme-ov-file#manifest-file">manifest-file</a>
input supersedes that functionality by adding a flexible way to define
available versions and where they should be downloaded from.</p>
<h3>Fixes</h3>
<ul>
<li>The action now respects when the environment variable
<code>UV_CACHE_DIR</code> is already set and does not overwrite it. It
now also finds <a
href="https://docs.astral.sh/uv/reference/settings/#cache-dir">cache-dir</a>
settings in config files if you set them.</li>
<li>Some users encountered problems that <a
href="https://github.com/astral-sh/setup-uv?tab=readme-ov-file#disable-cache-pruning">cache
pruning</a> took forever because they had some <code>uv</code> processes
running in the background. Starting with uv version <code>0.8.24</code>
this action uses <code>uv cache prune --ci --force</code> to ignore the
running processes</li>
<li>If you just want to install uv but not have it available in path,
this action now respects <code>UV_NO_MODIFY_PATH</code></li>
<li>Some other actions also set the env var <code>UV_CACHE_DIR</code>.
This action can now deal with that but as this could lead to unwanted
behavior in some edgecases a warning is now displayed.</li>
</ul>
<h3>Improvements</h3>
<p>If you are using minimum version specifiers for the version of uv to
install for example</p>
<pre lang="toml"><code>[tool.uv]
required-version = ">=0.8.17"
</code></pre>
<p>This action now detects that and directly uses the latest version.
Previously it would download all available releases from the uv repo
to determine the highest matching candidate for the version specifier,
which took much more time.</p>
<p>If you are using other specifiers like <code>0.8.x</code> this action
still needs to download all available releases because the specifier
defines an upper bound (not 0.9.0 or later) and "latest" would
possibly not satisfy that.</p>
<h2>🚨 Breaking changes</h2>
<ul>
<li>Use node24 instead of node20 <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/608">#608</a>)</li>
<li>Remove deprecated input server-url <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/607">#607</a>)</li>
</ul>
<h2>🐛 Bug fixes</h2>
<ul>
<li>Respect UV_CACHE_DIR and cache-dir <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/612">#612</a>)</li>
<li>Use --force when pruning cache <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/611">#611</a>)</li>
<li>Respect UV_NO_MODIFY_PATH <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/603">#603</a>)</li>
<li>Warn when <code>UV_CACHE_DIR</code> has changed <a
href="https://github.com/jamesbraza"><code>@jamesbraza</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/601">#601</a>)</li>
</ul>
<h2>🚀 Enhancements</h2>
<ul>
<li>Shortcut to latest version for minimum version specifier <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/598">#598</a>)</li>
</ul>
<h2>🧰 Maintenance</h2>
<ul>
<li>Bump dependencies <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/613">#613</a>)</li>
<li>Fix test-uv-no-modify-path <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/604">#604</a>)</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="eb1897b8dc"><code>eb1897b</code></a>
Bump dependencies (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/613">#613</a>)</li>
<li><a
href="d78d791822"><code>d78d791</code></a>
Bump github/codeql-action from 3.30.5 to 3.30.6 (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/605">#605</a>)</li>
<li><a
href="535dc2664c"><code>535dc26</code></a>
Respect UV_CACHE_DIR and cache-dir (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/612">#612</a>)</li>
<li><a
href="f610be5ff9"><code>f610be5</code></a>
Use --force when pruning cache (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/611">#611</a>)</li>
<li><a
href="3deccc0075"><code>3deccc0</code></a>
Use node24 instead of node20 (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/608">#608</a>)</li>
<li><a
href="d9ee7e2f26"><code>d9ee7e2</code></a>
Remove deprecated input server-url (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/607">#607</a>)</li>
<li><a
href="59a0868fea"><code>59a0868</code></a>
Bump github/codeql-action from 3.30.3 to 3.30.5 (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/594">#594</a>)</li>
<li><a
href="c952556164"><code>c952556</code></a>
Bump <code>@renovatebot/pep440</code> from 4.2.0 to 4.2.1 (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/581">#581</a>)</li>
<li><a
href="51c3328db2"><code>51c3328</code></a>
Fix test-uv-no-modify-path (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/604">#604</a>)</li>
<li><a
href="f2859da213"><code>f2859da</code></a>
Respect UV_NO_MODIFY_PATH (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/603">#603</a>)</li>
<li>Additional commits viewable in <a
href="d0cc045d04...eb1897b8dc">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
Removes VectorDBs from API surface and our tests.
Moves tests to Vector Stores.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Allows passing through extra_body parameters to inference providers.
With this, we removed the 2 vllm-specific parameters from completions
API into `extra_body`.
Before/After
<img width="1883" height="324" alt="image"
src="https://github.com/user-attachments/assets/acb27c08-c748-46c9-b1da-0de64e9908a1"
/>
Closes #2720
## Test Plan
CI and added new test
```
❯ uv run pytest -s -v tests/integration/ --stack-config=server:starter --inference-mode=record -k 'not( builtin_tool or safety_with_image or code_interpreter or test_rag ) and test_openai_completion_guided_choice' --setup=vllm --suite=base --color=yes
Uninstalled 3 packages in 125ms
Installed 3 packages in 19ms
INFO 2025-10-10 14:29:54,317 tests.integration.conftest:118 tests: Applying setup 'vllm' for suite base
INFO 2025-10-10 14:29:54,331 tests.integration.conftest:47 tests: Test stack config type: server
(stack_config=server:starter)
============================================================================================================== test session starts ==============================================================================================================
platform darwin -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0 -- /Users/erichuang/projects/llama-stack-1/.venv/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.12.11', 'Platform': 'macOS-15.6.1-arm64-arm-64bit', 'Packages': {'pytest': '8.4.2', 'pluggy': '1.6.0'}, 'Plugins': {'anyio': '4.9.0', 'html': '4.1.1', 'socket': '0.7.0', 'asyncio': '1.1.0', 'json-report': '1.5.0', 'timeout': '2.4.0', 'metadata': '3.1.1', 'cov': '6.2.1', 'nbval': '0.11.0'}}
rootdir: /Users/erichuang/projects/llama-stack-1
configfile: pyproject.toml
plugins: anyio-4.9.0, html-4.1.1, socket-0.7.0, asyncio-1.1.0, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, cov-6.2.1, nbval-0.11.0
asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 285 items / 284 deselected / 1 selected
tests/integration/inference/test_openai_completion.py::test_openai_completion_guided_choice[txt=vllm/Qwen/Qwen3-0.6B]
instantiating llama_stack_client
Starting llama stack server with config 'starter' on port 8321...
Waiting for server at http://localhost:8321... (0.0s elapsed)
Waiting for server at http://localhost:8321... (0.5s elapsed)
Waiting for server at http://localhost:8321... (5.1s elapsed)
Waiting for server at http://localhost:8321... (5.6s elapsed)
Waiting for server at http://localhost:8321... (10.1s elapsed)
Waiting for server at http://localhost:8321... (10.6s elapsed)
Server is ready at http://localhost:8321
llama_stack_client instantiated in 11.773s
PASSEDTerminating llama stack server process...
Terminating process 98444 and its group...
Server process and children terminated gracefully
============================================================================================================= slowest 10 durations ==============================================================================================================
11.88s setup tests/integration/inference/test_openai_completion.py::test_openai_completion_guided_choice[txt=vllm/Qwen/Qwen3-0.6B]
3.02s call tests/integration/inference/test_openai_completion.py::test_openai_completion_guided_choice[txt=vllm/Qwen/Qwen3-0.6B]
0.01s teardown tests/integration/inference/test_openai_completion.py::test_openai_completion_guided_choice[txt=vllm/Qwen/Qwen3-0.6B]
================================================================================================ 1 passed, 284 deselected, 3 warnings in 16.21s =================================================================================================
```
# What does this PR do?
Converts openai(_chat)_completions params to pydantic BaseModel to
reduce code duplication across all providers.
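A hedged sketch of the idea (class and field names here are illustrative, not
the actual models in the codebase):
```python
# Sketch only: collect the (chat) completion parameters into one pydantic
# model instead of repeating long argument lists in every provider.
from pydantic import BaseModel


class OpenAIChatCompletionRequest(BaseModel):
    model: str
    messages: list[dict]
    temperature: float | None = None
    max_tokens: int | None = None
    stream: bool | None = None


def openai_chat_completion(params: OpenAIChatCompletionRequest) -> dict:
    # Providers receive one validated object and forward only the fields that are set.
    payload = params.model_dump(exclude_none=True)
    return payload  # forwarded to the provider-specific client
```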
## Test Plan
CI
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/llamastack/llama-stack/pull/3761).
* #3777
* __->__ #3761
The AuthenticationMiddleware was blocking all requests without an
Authorization header, including health and version endpoints that are
needed by monitoring tools, load balancers, and Kubernetes probes.
This commit allows endpoints ending in /health or /version to bypass
authentication, enabling operational tooling to function properly
without requiring credentials.
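A minimal sketch of the idea as an ASGI-style middleware (names and details
are illustrative, not the actual implementation):
```python
# Hypothetical sketch; the real AuthenticationMiddleware differs in detail.
class AuthenticationMiddleware:
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http":
            path = scope["path"]
            # /health and /version bypass auth so probes, load balancers, and
            # monitoring tools can function without credentials.
            if not (path.endswith("/health") or path.endswith("/version")):
                headers = dict(scope.get("headers", []))
                if b"authorization" not in headers:
                    await send({"type": "http.response.start", "status": 401,
                                "headers": [(b"content-type", b"text/plain")]})
                    await send({"type": "http.response.body", "body": b"Unauthorized"})
                    return
        await self.app(scope, receive, send)
```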
Closes: #3735
Signed-off-by: Derek Higgins <derekh@redhat.com>
Implements usage accumulation in StreamingResponseOrchestrator.
The most important part was to pass `stream_options = { "include_usage":
true }` to the chat_completion call. This means I will have to record
all responses tests again because request hash will change :)
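For reference, a minimal sketch of requesting usage on a streamed completion
with an OpenAI-compatible client (endpoint and model name are illustrative):
```python
# Sketch only: ask for usage on a streaming chat completion.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

stream = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "hello"}],
    stream=True,
    stream_options={"include_usage": True},  # the final chunk carries the usage block
)

usage = None
for chunk in stream:
    if chunk.usage is not None:  # only present on the last chunk
        usage = chunk.usage
print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)
```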
Test changes:
- Add usage assertions to streaming and non-streaming tests
- Update test recordings with actual usage data from OpenAI
There are many changes to responses which are landing. They are
introducing fundamental new types. This means re-recordings even from
the inference calls. Let's avoid that for now.
Once everything lands I will re-record everything, make things pass and
re-enable.
# What does this PR do?
This PR checks whether, if a previous response is linked, there are
mcp_list_tools objects that can be reused instead of listing the tools
explicitly every time.
Closes #3106
## Test Plan
Tested manually.
Added unit tests to cover new behaviour.
---------
Signed-off-by: Gordon Sim <gsim@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Use SecretStr for OpenAIMixin providers (a sketch of the resulting config
shape follows the list below):
- RemoteInferenceProviderConfig now has auth_credential: SecretStr
- the default alias is api_key (most common name)
- some providers override to use api_token (RunPod, vLLM, Databricks)
- some providers exclude it (Ollama, TGI, Vertex AI)
addresses #3517
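A hedged sketch of the config shape described above (defaults, aliases, and
the provider subclass name are simplified/illustrative):
```python
# Sketch only: a SecretStr credential with an "api_key" alias by default,
# which individual providers can override (e.g. "api_token").
from pydantic import BaseModel, Field, SecretStr


class RemoteInferenceProviderConfig(BaseModel):
    # Accepts/serializes the field as "api_key", the most common name.
    auth_credential: SecretStr | None = Field(default=None, alias="api_key")

    model_config = {"populate_by_name": True}


class VLLMProviderConfig(RemoteInferenceProviderConfig):
    # Illustrative override: some providers use "api_token" instead.
    auth_credential: SecretStr | None = Field(default=None, alias="api_token")


cfg = RemoteInferenceProviderConfig(api_key="sk-example")
print(cfg.auth_credential)                     # masked: **********
print(cfg.auth_credential.get_secret_value())  # the real value, only on request
```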
## Test Plan
ci w/ new tests
Updated scripts/normalize_recordings.py to dynamically find and process
all 'recordings' directories under tests/ using pathlib.rglob() instead
of hardcoding a single path.
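A minimal sketch of the discovery logic, assuming recordings live in
directories literally named `recordings`:
```python
# Sketch only: find every recordings directory under tests/ dynamically.
from pathlib import Path

tests_root = Path("tests")
recording_dirs = sorted(p for p in tests_root.rglob("recordings") if p.is_dir())
for d in recording_dirs:
    print(f"normalizing recordings under {d}")
    # ... normalization of the recording files in `d` happens here ...
```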
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
Allows model check to fail gracefully instead of crashing on startup.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
set VLLM_URL to your VLLM server
```
(base) akram@Mac llama-stack % LAMA_STACK_LOGGING="all=debug" VLLM_ENABLE_MODEL_DISCOVERY=false MILVUS_DB_PATH=./milvus.db INFERENCE_MODEL=vllm uv run --with llama-stack llama stack build --distro starter --image-type venv --run
```
```
INFO 2025-10-08 20:11:24,637 llama_stack.providers.utils.inference.inference_store:74 inference: Write queue disabled for SQLite to avoid concurrency issues
INFO 2025-10-08 20:11:24,866 llama_stack.providers.utils.responses.responses_store:96 openai_responses: Write queue disabled for SQLite to avoid concurrency issues
ERROR 2025-10-08 20:11:26,160 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: VLLMInferenceAdapter.list_provider_model_ids() failed with: <a
href="https://oauth.akram.a1ey.p3.openshiftapps.com:443/oauth/authorize?approval_prompt=force&client_id=system%3Aserviceaccount%3Arhoai-30-genai%3Adefault&redirect_uri=ht
tps%3A%2F%2Fvllm-rhoai-30-genai.apps.rosa.akram.a1ey.p3.openshiftapps.com%2Foauth%2Fcallback&response_type=code&scope=user%3Ainfo+user%3Acheck-access&state=9fba207425
5851c718aca717a5887d76%3A%2Fmodels">Found</a>.
[...]
INFO 2025-10-08 20:11:26,295 uvicorn.error:84 uncategorized: Started server process [83144]
INFO 2025-10-08 20:11:26,296 uvicorn.error:48 uncategorized: Waiting for application startup.
INFO 2025-10-08 20:11:26,297 llama_stack.core.server.server:170 core::server: Starting up
INFO 2025-10-08 20:11:26,297 llama_stack.core.stack:399 core: starting registry refresh task
INFO 2025-10-08 20:11:26,311 uvicorn.error:62 uncategorized: Application startup complete.
INFO 2025-10-08 20:11:26,312 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
ERROR 2025-10-08 20:11:26,791 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: VLLMInferenceAdapter.list_provider_model_ids() failed with: <a
href="https://oauth.akram.a1ey.p3.openshiftapps.com:443/oauth/authorize?approval_prompt=force&client_id=system%3Aserviceaccount%3Arhoai-30-genai%3Adefault&redirect_uri=ht
tps%3A%2F%2Fvllm-rhoai-30-genai.apps.rosa.akram.a1ey.p3.openshiftapps.com%2Foauth%2Fcallback&response_type=code&scope=user%3Ainfo+user%3Acheck-access&state=8ef0cba3e1
71a4f8b04cb445cfb91a4c%3A%2Fmodels">Found</a>.
```
## Summary
Adds OpenAI-compatible usage tracking types to enable reporting token
consumption for both streaming and non-streaming responses.
## Type Definitions
**Chat Completion Usage** (inference API):
```python
class OpenAIChatCompletionUsage(BaseModel):
prompt_tokens: int
completion_tokens: int
total_tokens: int
prompt_tokens_details: OpenAIChatCompletionUsagePromptTokensDetails | None
completion_tokens_details: OpenAIChatCompletionUsageCompletionTokensDetails | None
```
**Response Usage** (responses API):
```python
class OpenAIResponseUsage(BaseModel):
input_tokens: int
output_tokens: int
total_tokens: int
input_tokens_details: OpenAIResponseUsageInputTokensDetails | None
output_tokens_details: OpenAIResponseUsageOutputTokensDetails | None
```
This matches OpenAI's usage reporting format and enables PR #3766 to
implement usage tracking in streaming responses.
Co-authored-by: Claude <noreply@anthropic.com>
## Summary
After removing model management CLI in #3700, this PR updates remaining
references to the old `llama download` command to use `huggingface-cli
download` instead.
## Changes
- Updated error messages in `meta_reference/common.py` to recommend
`huggingface-cli download`
- Updated error messages in
`torchtune/recipes/lora_finetuning_single_device.py` to use
`huggingface-cli download`
- Updated post-training notebook to use `huggingface-cli download`
instead of `llama download`
- Fixed typo: "you model" -> "your model"
## Test Plan
- Verified error messages provide correct guidance for users
- Checked that notebook instructions are up-to-date with current tooling
## Summary
Fixes #2990
Remote provider authentication errors (401/403) were being converted to
500 Internal Server Error, preventing users from understanding why their
requests failed.
## The Problem
When a request with an invalid API key was sent to a remote provider:
- Provider correctly returns 401 with error details
- Llama Stack's `translate_exception()` didn't recognize provider SDK
exceptions
- Fell through to generic 500 error handler
- User received: "Internal server error: An unexpected error occurred."
## The Fix
Added handler in `translate_exception()` that checks for exceptions with
a `status_code` attribute and preserves the original HTTP status code
and error message.
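A hedged sketch of the handler (the real `translate_exception()` covers many
more cases):
```python
# Sketch only: preserve provider HTTP errors instead of collapsing them to 500.
from fastapi import HTTPException


def translate_exception(exc: Exception) -> HTTPException:
    # Provider SDK exceptions (OpenAI-style clients) carry a status_code
    # attribute; pass the original status and message through.
    status_code = getattr(exc, "status_code", None)
    if isinstance(status_code, int):
        return HTTPException(status_code=status_code, detail=str(exc))
    return HTTPException(
        status_code=500,
        detail="Internal server error: An unexpected error occurred.",
    )
```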
**Before:**
```json
HTTP 500
{"detail": "Internal server error: An unexpected error occurred."}
```
**After:**
```json
HTTP 401
{"detail": "Error code: 401 - {'error': {'message': 'Invalid API Key', 'type': 'invalid_request_error', 'code': 'invalid_api_key'}}"}
```
## Tested With
- ✅ groq: 401 "Invalid API Key"
- ✅ openai: 401 "Incorrect API key provided"
- ✅ together: 401 "Invalid API key provided"
- ✅ fireworks: 403 "unauthorized"
## Test Plan
**Automated test script:**
https://gist.github.com/ashwinb/1199dd7585ffa3f4be67b111cc65f2f3
The test script:
1. Builds separate stacks for each provider
2. Registers models (with validation temporarily disabled for testing)
3. Sends requests with invalid API keys via `x-llamastack-provider-data`
header
4. Verifies HTTP status codes are 401/403 (not 500)
**Results before fix:** All providers returned 500
**Results after fix:** All providers correctly return 401/403
**Manual verification:**
```bash
# 1. Build stack
llama stack build --image-type venv --providers inference=remote::groq
# 2. Start stack
llama stack run
# 3. Send request with invalid API key
curl http://localhost:8321/v1/chat/completions \
-H "Content-Type: application/json" \
-H 'x-llamastack-provider-data: {"groq_api_key": "invalid-key"}' \
-d '{"model": "groq/llama3-70b-8192", "messages": [{"role": "user", "content": "test"}]}'
# Expected: HTTP 401 with provider error message (not 500)
```
## Impact
- Works with all remote providers using OpenAI SDK (groq, openai,
together, fireworks, etc.)
- Works with any provider SDK that follows the pattern of exceptions
with `status_code` attribute
- No breaking changes - only affects error responses
# What does this PR do?
Objects (vector DBs, models, scoring functions, etc.) have an identifier and
associated object values.
We allow exact duplicate registrations.
We reject registrations when the identifier already exists and the associated
object values differ.
Note: models are namespaced, i.e. {provider_id}/{identifier}, while other
object types are not.
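A minimal sketch of the registration rule, using a plain dict in place of the
real registry types:
```python
# Sketch only: allow exact duplicates, reject conflicting re-registrations.
class DuplicateRegistrationError(ValueError):
    pass


def register(registry: dict[str, dict], identifier: str, values: dict) -> None:
    existing = registry.get(identifier)
    if existing is None:
        registry[identifier] = values  # new object: register it
    elif existing == values:
        return  # exact duplicate: allowed, no-op
    else:
        raise DuplicateRegistrationError(
            f"{identifier} is already registered with different values"
        )
```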
## Test Plan
ci w/ new tests
This change removes the `llama model` and `llama download` subcommands
from the CLI, replacing them with recommendations to use the Hugging
Face CLI instead.
Rationale for this change:
- The model management functionality was largely duplicating what
Hugging Face CLI already provides, leading to unnecessary maintenance
overhead (except the download source from Meta?)
- Maintaining our own implementation required fixing bugs and keeping up
with changes in model repositories and download mechanisms
- The Hugging Face CLI is more mature, widely adopted, and better
maintained
- This allows us to focus on the core Llama Stack functionality rather
than reimplementing model management tools
Changes made:
- Removed all model-related CLI commands and their implementations
- Updated documentation to recommend using `huggingface-cli` for model
downloads
- Removed Meta-specific download logic and statements
- Simplified the CLI to focus solely on stack management operations
Users should now use:
- `huggingface-cli download` for downloading models
- `huggingface-cli scan-cache` for listing downloaded models
This is a breaking change as it removes previously available CLI
commands.
Signed-off-by: Sébastien Han <seb@redhat.com>
Replaces opaque error messages when recordings are not found with
somewhat better guidance
Before:
```
No recorded response found for request hash: abc123...
To record this response, run with LLAMA_STACK_TEST_INFERENCE_MODE=record
```
After:
```
Recording not found for request hash: abc123
Model: gpt-4 | Request: POST https://api.openai.com/v1/chat/completions
Run './scripts/integration-tests.sh --inference-mode record-if-missing' with required API keys to generate.
```
These vector databases are already thoroughly tested in integration
tests.
Unit tests now focus on sqlite_vec, faiss, and pgvector with mocked
dependencies, removing the need for external service dependencies.
## Changes:
- Deleted test_qdrant.py unit test file
- Removed chroma/qdrant fixtures and parametrization from conftest.py
- Fixed SqliteKVStoreConfig import to use correct location
- Removed chromadb, qdrant-client, pymilvus, milvus-lite, and
weaviate-client from unit test dependencies in pyproject.toml
Renames `inference_recorder.py` to `api_recorder.py` and extends it to
support recording/replaying tool invocations in addition to inference
calls.
This allows us to record web-search, etc. tool calls and thereafter
apply recordings for `tests/integration/responses`
## Test Plan
```
export OPENAI_API_KEY=...
export TAVILY_SEARCH_API_KEY=...
./scripts/integration-tests.sh --stack-config ci-tests \
--suite responses --inference-mode record-if-missing
```
# What does this PR do?
Adds traces around tool execution and mcp tool listing for better
observability.
Closes #3108
## Test Plan
Manually examined traces in jaeger to verify the added information was
available.
Signed-off-by: Gordon Sim <gsim@redhat.com>
Adds --collect-only flag to scripts/integration-tests.sh that skips
server startup and passes the flag to pytest for test collection only.
When specified, minimal flags are required (no --stack-config or --setup
needed).
## Changes
- Added `--collect-only` flag that skips server startup
- Made `--stack-config` and `--setup` optional when using
`--collect-only`
- Skip `llama` command check when collecting tests only
## Usage
```bash
# Collect tests without starting server
./scripts/integration-tests.sh --subdirs inference --collect-only
```
# What does this PR do?
Removing Weaviate, Postgres, and Milvus unit tests.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Propagate test IDs from client to server via HTTP headers to maintain proper
test isolation when running with server-based stack configs. Without this,
recorded/replayed inference requests in server mode would leak across tests.
Changes:
- Patch client _prepare_request to inject test ID into provider data header
(see the sketch after this list)
- Sync test context from provider data on server side before storage
operations
- Set LLAMA_STACK_TEST_STACK_CONFIG_TYPE env var based on stack config
- Configure console width for cleaner log output in CI
- Add SQLITE_STORE_DIR temp directory for test data isolation
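A hedged sketch of the client-side patch from the list above; the header key
and test-ID field name are illustrative, not the actual ones used:
```python
# Sketch only: wrap _prepare_request so outgoing requests carry the test ID.
import json
from contextlib import contextmanager


@contextmanager
def propagate_test_id(client, test_id: str):
    original = client._prepare_request

    def patched(request):
        provider_data = json.loads(request.headers.get("X-LlamaStack-Provider-Data", "{}"))
        provider_data["__test_id"] = test_id  # illustrative key name
        request.headers["X-LlamaStack-Provider-Data"] = json.dumps(provider_data)
        return original(request)

    client._prepare_request = patched
    try:
        yield client
    finally:
        client._prepare_request = original
```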
# What does this PR do?
It prevents a tool call message from being added to the chat completion
messages without a corresponding tool call result, which is needed when an
approval is required first or when the approval request is denied. In both
cases the tool call message is popped off the next turn's messages.
Closes #3728
## Test Plan
Ran the integration tests
Manual check of both approval and denial against gpt-4o
Signed-off-by: Gordon Sim <gsim@redhat.com>
# What does this PR do?
- The watsonx.ai provider now uses the LiteLLM mixin instead of using
IBM's library, which does not seem to be working (see #3165 for
context).
- The watsonx.ai provider now lists all the models available by calling
the watsonx.ai server instead of having a hard coded list of known
models. (That list gets out of date quickly)
- An edge case in
[llama_stack/core/routers/inference.py](https://github.com/llamastack/llama-stack/pull/3674/files#diff-a34bc966ed9befd9f13d4883c23705dff49be0ad6211c850438cdda6113f3455)
is addressed that was causing my manual tests to fail.
- Fixes `b64_encode_openai_embeddings_response`, which was trying to
enumerate over a dictionary and then reference elements of the
dictionary using .field instead of ["field"] (a sketch of the corrected
pattern follows this list). That method is called by the LiteLLM mixin for
embedding models, so it is needed to get the watsonx.ai embedding models to
work.
- A unit test along the lines of the one in #3348 is added. A more
comprehensive plan for automatically testing the end-to-end
functionality for inference providers would be a good idea, but is out
of scope for this PR.
- Updates to the watsonx distribution. Some were in response to the
switch to LiteLLM (e.g., updating the Python packages needed). Others
seem to be things that were already broken that I found along the way
(e.g., a reference to a watsonx specific doc template that doesn't seem
to exist).
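A hedged sketch of the corrected pattern referenced in the
`b64_encode_openai_embeddings_response` bullet above; function and field
names here are illustrative, not the exact ones in the codebase:
```python
# Sketch only: iterate the embeddings as dictionaries (["field"], not .field)
# and base64-encode the float values as little-endian float32.
import base64
import struct


def b64_encode_embeddings(embeddings: list[dict]) -> list[dict]:
    encoded = []
    for item in embeddings:  # each item is a dict, so index with ["..."]
        floats = item["embedding"]
        raw = struct.pack(f"<{len(floats)}f", *floats)
        encoded.append(
            {"index": item["index"], "embedding": base64.b64encode(raw).decode("ascii")}
        )
    return encoded
```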
Closes #3165
Also it is related to a line-item in #3387 but doesn't really address
that goal (because it uses the LiteLLM mixin, not the OpenAI one). I
tried the OpenAI one and it doesn't work with watsonx.ai, presumably
because the watsonx.ai service is not OpenAI compatible. It works with
LiteLLM because LiteLLM has a provider implementation for watsonx.ai.
## Test Plan
The test script below goes back and forth between the OpenAI and watsonx
providers. The idea is that the OpenAI provider shows how it should work
and then the watsonx provider output shows that it is also working with
watsonx. Note that the result from the MCP test is not as good (the
Llama 3.3 70b model does not choose tools as wisely as gpt-4o), but it
is still working and providing a valid response. For more details on
setup and the MCP server being used for testing, see [the AI Alliance
sample
notebook](https://github.com/The-AI-Alliance/llama-stack-examples/blob/main/notebooks/01-responses/)
that these examples are drawn from.
```python
#!/usr/bin/env python3
import json
from llama_stack_client import LlamaStackClient
from litellm import completion
import http.client
def print_response(response):
"""Print response in a nicely formatted way"""
print(f"ID: {response.id}")
print(f"Status: {response.status}")
print(f"Model: {response.model}")
print(f"Created at: {response.created_at}")
print(f"Output items: {len(response.output)}")
for i, output_item in enumerate(response.output):
if len(response.output) > 1:
print(f"\n--- Output Item {i+1} ---")
print(f"Output type: {output_item.type}")
if output_item.type in ("text", "message"):
print(f"Response content: {output_item.content[0].text}")
elif output_item.type == "file_search_call":
print(f" Tool Call ID: {output_item.id}")
print(f" Tool Status: {output_item.status}")
# 'queries' is a list, so we join it for clean printing
print(f" Queries: {', '.join(output_item.queries)}")
# Display results if they exist, otherwise note they are empty
print(f" Results: {output_item.results if output_item.results else 'None'}")
elif output_item.type == "mcp_list_tools":
print_mcp_list_tools(output_item)
elif output_item.type == "mcp_call":
print_mcp_call(output_item)
else:
print(f"Response content: {output_item.content}")
def print_mcp_call(mcp_call):
"""Print MCP call in a nicely formatted way"""
print(f"\n🛠️ MCP Tool Call: {mcp_call.name}")
print(f" Server: {mcp_call.server_label}")
print(f" ID: {mcp_call.id}")
print(f" Arguments: {mcp_call.arguments}")
if mcp_call.error:
print("Error: {mcp_call.error}")
elif mcp_call.output:
print("Output:")
# Try to format JSON output nicely
try:
parsed_output = json.loads(mcp_call.output)
print(json.dumps(parsed_output, indent=4))
except (json.JSONDecodeError, TypeError):
# If not valid JSON, print as-is
print(f" {mcp_call.output}")
else:
print(" ⏳ No output yet")
def print_mcp_list_tools(mcp_list_tools):
"""Print MCP list tools in a nicely formatted way"""
print(f"\n🔧 MCP Server: {mcp_list_tools.server_label}")
print(f" ID: {mcp_list_tools.id}")
print(f" Available Tools: {len(mcp_list_tools.tools)}")
print("=" * 80)
for i, tool in enumerate(mcp_list_tools.tools, 1):
print(f"\n{i}. {tool.name}")
print(f" Description: {tool.description}")
# Parse and display input schema
schema = tool.input_schema
if schema and 'properties' in schema:
properties = schema['properties']
required = schema.get('required', [])
print(" Parameters:")
for param_name, param_info in properties.items():
param_type = param_info.get('type', 'unknown')
param_desc = param_info.get('description', 'No description')
required_marker = " (required)" if param_name in required else " (optional)"
print(f" • {param_name} ({param_type}){required_marker}")
if param_desc:
print(f" {param_desc}")
if i < len(mcp_list_tools.tools):
print("-" * 40)
def main():
"""Main function to run all the tests"""
# Configuration
LLAMA_STACK_URL = "http://localhost:8321/"
LLAMA_STACK_MODEL_IDS = [
"openai/gpt-3.5-turbo",
"openai/gpt-4o",
"llama-openai-compat/Llama-3.3-70B-Instruct",
"watsonx/meta-llama/llama-3-3-70b-instruct"
]
# Using gpt-4o for this demo, but feel free to try one of the others or add more to run.yaml.
OPENAI_MODEL_ID = LLAMA_STACK_MODEL_IDS[1]
WATSONX_MODEL_ID = LLAMA_STACK_MODEL_IDS[-1]
NPS_MCP_URL = "http://localhost:3005/sse/"
print("=== Llama Stack Testing Script ===")
print(f"Using OpenAI model: {OPENAI_MODEL_ID}")
print(f"Using WatsonX model: {WATSONX_MODEL_ID}")
print(f"MCP URL: {NPS_MCP_URL}")
print()
# Initialize client
print("Initializing LlamaStackClient...")
client = LlamaStackClient(base_url="http://localhost:8321")
# Test 1: List models
print("\n=== Test 1: List Models ===")
try:
models = client.models.list()
print(f"Found {len(models)} models")
except Exception as e:
print(f"Error listing models: {e}")
raise e
# Test 2: Basic chat completion with OpenAI
print("\n=== Test 2: Basic Chat Completion (OpenAI) ===")
try:
chat_completion_response = client.chat.completions.create(
model=OPENAI_MODEL_ID,
messages=[{"role": "user", "content": "What is the capital of France?"}]
)
print("OpenAI Response:")
for chunk in chat_completion_response.choices[0].message.content:
print(chunk, end="", flush=True)
print()
except Exception as e:
print(f"Error with OpenAI chat completion: {e}")
raise e
# Test 3: Basic chat completion with WatsonX
print("\n=== Test 3: Basic Chat Completion (WatsonX) ===")
try:
chat_completion_response_wxai = client.chat.completions.create(
model=WATSONX_MODEL_ID,
messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print("WatsonX Response:")
for chunk in chat_completion_response_wxai.choices[0].message.content:
print(chunk, end="", flush=True)
print()
except Exception as e:
print(f"Error with WatsonX chat completion: {e}")
raise e
# Test 4: Tool calling with OpenAI
print("\n=== Test 4: Tool Calling (OpenAI) ===")
tools = [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather for a specific location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g., San Francisco, CA",
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
},
},
"required": ["location"],
},
},
}
]
messages = [
{"role": "user", "content": "What's the weather like in Boston, MA?"}
]
try:
print("--- Initial API Call ---")
response = client.chat.completions.create(
model=OPENAI_MODEL_ID,
messages=messages,
tools=tools,
tool_choice="auto", # "auto" is the default
)
print("OpenAI tool calling response received")
except Exception as e:
print(f"Error with OpenAI tool calling: {e}")
raise e
# Test 5: Tool calling with WatsonX
print("\n=== Test 5: Tool Calling (WatsonX) ===")
try:
wxai_response = client.chat.completions.create(
model=WATSONX_MODEL_ID,
messages=messages,
tools=tools,
tool_choice="auto", # "auto" is the default
)
print("WatsonX tool calling response received")
except Exception as e:
print(f"Error with WatsonX tool calling: {e}")
raise e
# Test 6: Streaming with WatsonX
print("\n=== Test 6: Streaming Response (WatsonX) ===")
try:
chat_completion_response_wxai_stream = client.chat.completions.create(
model=WATSONX_MODEL_ID,
messages=[{"role": "user", "content": "What is the capital of France?"}],
stream=True
)
print("Model response: ", end="")
for chunk in chat_completion_response_wxai_stream:
# Each 'chunk' is a ChatCompletionChunk object.
# We want the content from the 'delta' attribute.
if hasattr(chunk, 'choices') and chunk.choices is not None:
content = chunk.choices[0].delta.content
# The first few chunks might have None content, so we check for it.
if content is not None:
print(content, end="", flush=True)
print()
except Exception as e:
print(f"Error with streaming: {e}")
raise e
# Test 7: MCP with OpenAI
print("\n=== Test 7: MCP Integration (OpenAI) ===")
try:
mcp_llama_stack_client_response = client.responses.create(
model=OPENAI_MODEL_ID,
input="Tell me about some parks in Rhode Island, and let me know if there are any upcoming events at them.",
tools=[
{
"type": "mcp",
"server_url": NPS_MCP_URL,
"server_label": "National Parks Service tools",
"allowed_tools": ["search_parks", "get_park_events"],
}
]
)
print_response(mcp_llama_stack_client_response)
except Exception as e:
print(f"Error with MCP (OpenAI): {e}")
raise e
# Test 8: MCP with WatsonX
print("\n=== Test 8: MCP Integration (WatsonX) ===")
try:
mcp_llama_stack_client_response = client.responses.create(
model=WATSONX_MODEL_ID,
input="What is the capital of France?"
)
print_response(mcp_llama_stack_client_response)
except Exception as e:
print(f"Error with MCP (WatsonX): {e}")
raise e
# Test 9: MCP with Llama 3.3
print("\n=== Test 9: MCP Integration (Llama 3.3) ===")
try:
mcp_llama_stack_client_response = client.responses.create(
model=WATSONX_MODEL_ID,
input="Tell me about some parks in Rhode Island, and let me know if there are any upcoming events at them.",
tools=[
{
"type": "mcp",
"server_url": NPS_MCP_URL,
"server_label": "National Parks Service tools",
"allowed_tools": ["search_parks", "get_park_events"],
}
]
)
print_response(mcp_llama_stack_client_response)
except Exception as e:
print(f"Error with MCP (Llama 3.3): {e}")
raise e
# Test 10: Embeddings
print("\n=== Test 10: Embeddings ===")
try:
conn = http.client.HTTPConnection("localhost:8321")
payload = json.dumps({
"model": "watsonx/ibm/granite-embedding-278m-multilingual",
"input": "Hello, world!",
})
headers = {
'Content-Type': 'application/json',
'Accept': 'application/json'
}
conn.request("POST", "/v1/openai/v1/embeddings", payload, headers)
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8"))
except Exception as e:
print(f"Error with Embeddings: {e}")
raise e
print("\n=== Testing Complete ===")
if __name__ == "__main__":
main()
```
---------
Signed-off-by: Bill Murdock <bmurdock@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Bumps [actions/stale](https://github.com/actions/stale) from 10.0.0 to
10.1.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/actions/stale/releases">actions/stale's
releases</a>.</em></p>
<blockquote>
<h2>v10.1.0</h2>
<h2>What's Changed</h2>
<ul>
<li>Add <code>only-issue-types</code> option to filter issues by type by
<a href="https://github.com/Bibo-Joshi"><code>@Bibo-Joshi</code></a> in
<a
href="https://redirect.github.com/actions/stale/pull/1255">actions/stale#1255</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a
href="https://github.com/Bibo-Joshi"><code>@Bibo-Joshi</code></a> made
their first contribution in <a
href="https://redirect.github.com/actions/stale/pull/1255">actions/stale#1255</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/actions/stale/compare/v10...v10.1.0">https://github.com/actions/stale/compare/v10...v10.1.0</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="5f858e3efb"><code>5f858e3</code></a>
Add <code>only-issue-types</code> option to filter issues by type (<a
href="https://redirect.github.com/actions/stale/issues/1255">#1255</a>)</li>
<li>See full diff in <a
href="3a9db7e6a4...5f858e3efb">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
When the user wants to change the attributes (which could include model name,
dimensions, etc.) of an already registered provider, they will get an error
message asking that they first unregister the provider before registering a
new one.
# What does this PR do?
This PR updates the register function to raise an error when a user attempts
to register a provider that is already registered, asking them to un-register
the existing provider first.
<!-- If resolving an issue, uncomment and update the line below -->
#2313
## Test Plan
Tested the change with /tests/unit/registry/test_registry.py
---------
Co-authored-by: Omar Abdelwahab <omara@fb.com>
# What does this PR do?
Users can simply set env vars at the beginning of the command: `FOO=BAR
llama stack run ...`
## Test Plan
Run
TELEMETRY_SINKS=coneol uv run --with llama-stack llama stack build
--distro=starter --image-type=venv --run
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/llamastack/llama-stack/pull/3711).
* #3714
* __->__ #3711
# What does this PR do?
Removes some dead code found by vulture; checked with Claude that there are
no remaining references or imports for it.
## Test Plan
CI
# What does this PR do?
(Used claude to solve #3715, coded with claude but tested by me)
## From claude summary:
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
**Problem**: The `NVIDIAInferenceAdapter` class was missing the
`alias_to_provider_id_map` attribute, which caused the error:
`ERROR 'NVIDIAInferenceAdapter' object has no attribute
'alias_to_provider_id_map'`
**Root Cause**: The `NVIDIAInferenceAdapter` only inherited from
`OpenAIMixin`, but some parts of the system expected it to have the
`alias_to_provider_id_map` attribute, which is provided by the
`ModelRegistryHelper` class.
**Solution**:
1. **Added ModelRegistryHelper import**: Imported the
`ModelRegistryHelper` class from
`llama_stack.providers.utils.inference.model_registry`
2. **Updated inheritance**: Changed the class declaration to inherit
from both `OpenAIMixin` and `ModelRegistryHelper`
3. **Added proper initialization**: Added an `__init__` method that
properly initializes the `ModelRegistryHelper` with empty model entries
(since NVIDIA uses dynamic model discovery) and the allowed models from
the configuration
**Key Changes**:
* Added `from llama_stack.providers.utils.inference.model_registry
import ModelRegistryHelper`
* Changed class declaration from `class
NVIDIAInferenceAdapter(OpenAIMixin):` to `class
NVIDIAInferenceAdapter(OpenAIMixin, ModelRegistryHelper):`
* Added `__init__` method that calls `ModelRegistryHelper.__init__(self,
model_entries=[], allowed_models=config.allowed_models)`
The inheritance order is important - `OpenAIMixin` comes first to ensure
its `check_model_availability()` method takes precedence over the
`ModelRegistryHelper` version, as mentioned in the class documentation.
This fix ensures that the `NVIDIAInferenceAdapter` has the required
`alias_to_provider_id_map` attribute while maintaining all existing
functionality.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
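A hedged sketch of the resulting class shape (initialization simplified; the
real adapter does more):
```python
# Sketch only: inherit from both mixins, with OpenAIMixin first so its
# check_model_availability() takes precedence in the MRO.
from llama_stack.providers.utils.inference.model_registry import ModelRegistryHelper
from llama_stack.providers.utils.inference.openai_mixin import OpenAIMixin


class NVIDIAInferenceAdapter(OpenAIMixin, ModelRegistryHelper):
    def __init__(self, config) -> None:
        # No static entries: NVIDIA models are discovered dynamically.
        ModelRegistryHelper.__init__(self, model_entries=[], allowed_models=config.allowed_models)
        self.config = config  # illustrative; the real adapter keeps more state
```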
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Launched the llama-stack server successfully; see logs:
```
NVIDIA_API_KEY=dummy NVIDIA_BASE_URL=http://localhost:8912 llama stack run /home/nvidia/.llama/distributions/starter/starter-run.yaml --image-type venv &
[2] 3753042
(venv) nvidia@nv-meta-H100-testing-gpu01:~/kai/llama-stack$ WARNING 2025-10-07 00:29:09,848 root:266 uncategorized: Unknown logging category:
openai::conversations. Falling back to default 'root' level: 20
WARNING 2025-10-07 00:29:09,932 root:266 uncategorized: Unknown logging category: cli.
Falling back to default 'root' level: 20
INFO 2025-10-07 00:29:09,937 llama_stack.core.utils.config_resolution:45 core:
Using file path: /home/nvidia/.llama/distributions/starter/starter-run.yaml
INFO 2025-10-07 00:29:09,937 llama_stack.cli.stack.run:136 cli: Using run
configuration: /home/nvidia/.llama/distributions/starter/starter-run.yaml
Using virtual environment: /home/nvidia/kai/venv
Virtual environment already activated
+ '[' -n /home/nvidia/.llama/distributions/starter/starter-run.yaml ']'
+ yaml_config_arg=/home/nvidia/.llama/distributions/starter/starter-run.yaml
+ llama stack run /home/nvidia/.llama/distributions/starter/starter-run.yaml --port 8321
WARNING 2025-10-07 00:29:11,432 root:266 uncategorized: Unknown logging category:
openai::conversations. Falling back to default 'root' level: 20
WARNING 2025-10-07 00:29:11,593 root:266 uncategorized: Unknown logging category: cli.
Falling back to default 'root' level: 20
INFO 2025-10-07 00:29:11,603 llama_stack.core.utils.config_resolution:45 core:
Using file path: /home/nvidia/.llama/distributions/starter/starter-run.yaml
INFO 2025-10-07 00:29:11,604 llama_stack.cli.stack.run:136 cli: Using run
configuration: /home/nvidia/.llama/distributions/starter/starter-run.yaml
INFO 2025-10-07 00:29:11,624 llama_stack.cli.stack.run:155 cli: No image type or
image name provided. Assuming environment packages.
INFO 2025-10-07 00:29:11,625 llama_stack.core.utils.config_resolution:45 core:
Using file path: /home/nvidia/.llama/distributions/starter/starter-run.yaml
INFO 2025-10-07 00:29:11,644 llama_stack.cli.stack.run:230 cli: HTTPS enabled with
certificates:
Key: None
Cert: None
INFO 2025-10-07 00:29:11,645 llama_stack.cli.stack.run:232 cli: Listening on ['::',
'0.0.0.0']:8321
INFO 2025-10-07 00:29:11,816 llama_stack.core.utils.config_resolution:45 core:
Using file path: /home/nvidia/.llama/distributions/starter/starter-run.yaml
INFO 2025-10-07 00:29:11,836 llama_stack.core.server.server:480 core::server: Run
configuration:
INFO 2025-10-07 00:29:11,845 llama_stack.core.server.server:483 core::server: apis:
- agents
- batches
- datasetio
- eval
- files
- inference
- post_training
- safety
- scoring
- telemetry
- tool_runtime
- vector_io
benchmarks: []
datasets: []
image_name: starter
inference_store:
db_path: /home/nvidia/.llama/distributions/starter/inference_store.db
type: sqlite
metadata_store:
db_path: /home/nvidia/.llama/distributions/starter/registry.db
type: sqlite
models: []
providers:
agents:
- config:
persistence_store:
db_path: /home/nvidia/.llama/distributions/starter/agents_store.db
type: sqlite
responses_store:
db_path: /home/nvidia/.llama/distributions/starter/responses_store.db
type: sqlite
provider_id: meta-reference
provider_type: inline::meta-reference
batches:
- config:
kvstore:
db_path: /home/nvidia/.llama/distributions/starter/batches.db
type: sqlite
provider_id: reference
provider_type: inline::reference
datasetio:
- config:
kvstore:
db_path:
/home/nvidia/.llama/distributions/starter/huggingface_datasetio.db
type: sqlite
provider_id: huggingface
provider_type: remote::huggingface
- config:
kvstore:
db_path:
/home/nvidia/.llama/distributions/starter/localfs_datasetio.db
type: sqlite
provider_id: localfs
provider_type: inline::localfs
eval:
- config:
kvstore:
db_path:
/home/nvidia/.llama/distributions/starter/meta_reference_eval.db
type: sqlite
provider_id: meta-reference
provider_type: inline::meta-reference
files:
- config:
metadata_store:
db_path: /home/nvidia/.llama/distributions/starter/files_metadata.db
type: sqlite
storage_dir: /home/nvidia/.llama/distributions/starter/files
provider_id: meta-reference-files
provider_type: inline::localfs
inference:
- config:
api_key: '********'
url: https://api.fireworks.ai/inference/v1
provider_id: fireworks
provider_type: remote::fireworks
- config:
api_key: '********'
url: https://api.together.xyz/v1
provider_id: together
provider_type: remote::together
- config: {}
provider_id: bedrock
provider_type: remote::bedrock
- config:
api_key: '********'
append_api_version: true
url: http://localhost:8912
provider_id: nvidia
provider_type: remote::nvidia
- config:
api_key: '********'
base_url: https://api.openai.com/v1
provider_id: openai
provider_type: remote::openai
- config:
api_key: '********'
provider_id: anthropic
provider_type: remote::anthropic
- config:
api_key: '********'
provider_id: gemini
provider_type: remote::gemini
- config:
api_key: '********'
url: https://api.groq.com
provider_id: groq
provider_type: remote::groq
- config:
api_key: '********'
url: https://api.sambanova.ai/v1
provider_id: sambanova
provider_type: remote::sambanova
- config: {}
provider_id: sentence-transformers
provider_type: inline::sentence-transformers
post_training:
- config:
checkpoint_format: meta
provider_id: torchtune-cpu
provider_type: inline::torchtune-cpu
safety:
- config:
excluded_categories: []
provider_id: llama-guard
provider_type: inline::llama-guard
- config: {}
provider_id: code-scanner
provider_type: inline::code-scanner
scoring:
- config: {}
provider_id: basic
provider_type: inline::basic
- config: {}
provider_id: llm-as-judge
provider_type: inline::llm-as-judge
- config:
openai_api_key: '********'
provider_id: braintrust
provider_type: inline::braintrust
telemetry:
- config:
service_name: "\u200B"
sinks: sqlite
sqlite_db_path: /home/nvidia/.llama/distributions/starter/trace_store.db
provider_id: meta-reference
provider_type: inline::meta-reference
tool_runtime:
- config:
api_key: '********'
max_results: 3
provider_id: brave-search
provider_type: remote::brave-search
- config:
api_key: '********'
max_results: 3
provider_id: tavily-search
provider_type: remote::tavily-search
- config: {}
provider_id: rag-runtime
provider_type: inline::rag-runtime
- config: {}
provider_id: model-context-protocol
provider_type: remote::model-context-protocol
vector_io:
- config:
kvstore:
db_path: /home/nvidia/.llama/distributions/starter/faiss_store.db
type: sqlite
provider_id: faiss
provider_type: inline::faiss
- config:
db_path: /home/nvidia/.llama/distributions/starter/sqlite_vec.db
kvstore:
db_path:
/home/nvidia/.llama/distributions/starter/sqlite_vec_registry.db
type: sqlite
provider_id: sqlite-vec
provider_type: inline::sqlite-vec
scoring_fns: []
server:
port: 8321
shields: []
tool_groups:
- provider_id: tavily-search
toolgroup_id: builtin::websearch
- provider_id: rag-runtime
toolgroup_id: builtin::rag
vector_dbs: []
version: 2
INFO 2025-10-07 00:29:12,138
llama_stack.providers.remote.inference.nvidia.nvidia:49 inference::nvidia:
Initializing NVIDIAInferenceAdapter(http://localhost:8912)...
INFO 2025-10-07 00:29:12,921
llama_stack.providers.utils.inference.inference_store:74 inference: Write
queue disabled for SQLite to avoid concurrency issues
INFO 2025-10-07 00:29:13,524
llama_stack.providers.utils.responses.responses_store:96 openai_responses:
Write queue disabled for SQLite to avoid concurrency issues
ERROR 2025-10-07 00:29:13,679 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: FireworksInferenceAdapter.list_provider_model_ids() failed
with: API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"fireworks_api_key": "<API_KEY>"},
or in the provider config.
WARNING 2025-10-07 00:29:13,681 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider fireworks: API key is
not set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"fireworks_api_key": "<API_KEY>"}, or in the
provider config.
ERROR 2025-10-07 00:29:13,682 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: TogetherInferenceAdapter.list_provider_model_ids() failed
with: Pass Together API Key in the header X-LlamaStack-Provider-Data as {
"together_api_key": <your api key>}
WARNING 2025-10-07 00:29:13,684 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider together: Pass
Together API Key in the header X-LlamaStack-Provider-Data as {
"together_api_key": <your api key>}
Handling connection for 8912
INFO 2025-10-07 00:29:14,047 llama_stack.providers.utils.inference.openai_mixin:448
providers::utils: NVIDIAInferenceAdapter.list_provider_model_ids() returned 3
models
ERROR 2025-10-07 00:29:14,062 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: OpenAIInferenceAdapter.list_provider_model_ids() failed
with: API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"openai_api_key": "<API_KEY>"}, or
in the provider config.
WARNING 2025-10-07 00:29:14,063 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider openai: API key is not
set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"openai_api_key": "<API_KEY>"}, or in the
provider config.
ERROR 2025-10-07 00:29:14,099 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: AnthropicInferenceAdapter.list_provider_model_ids() failed
with: "Could not resolve authentication method. Expected either api_key or
auth_token to be set. Or for one of the `X-Api-Key` or `Authorization` headers
to be explicitly omitted"
WARNING 2025-10-07 00:29:14,100 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider anthropic: "Could not
resolve authentication method. Expected either api_key or auth_token to be
set. Or for one of the `X-Api-Key` or `Authorization` headers to be explicitly
omitted"
ERROR 2025-10-07 00:29:14,102 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: GeminiInferenceAdapter.list_provider_model_ids() failed
with: API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"gemini_api_key": "<API_KEY>"}, or
in the provider config.
WARNING 2025-10-07 00:29:14,103 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider gemini: API key is not
set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"gemini_api_key": "<API_KEY>"}, or in the
provider config.
ERROR 2025-10-07 00:29:14,105 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: GroqInferenceAdapter.list_provider_model_ids() failed with:
API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"groq_api_key": "<API_KEY>"}, or in
the provider config.
WARNING 2025-10-07 00:29:14,106 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider groq: API key is not
set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"groq_api_key": "<API_KEY>"}, or in the provider
config.
ERROR 2025-10-07 00:29:14,107 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: SambaNovaInferenceAdapter.list_provider_model_ids() failed
with: API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"sambanova_api_key": "<API_KEY>"},
or in the provider config.
WARNING 2025-10-07 00:29:14,109 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider sambanova: API key is
not set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"sambanova_api_key": "<API_KEY>"}, or in the
provider config.
INFO 2025-10-07 00:29:14,454 uvicorn.error:84 uncategorized: Started server process
[3753046]
INFO 2025-10-07 00:29:14,455 uvicorn.error:48 uncategorized: Waiting for
application startup.
INFO 2025-10-07 00:29:14,457 llama_stack.core.server.server:170 core::server:
Starting up
INFO 2025-10-07 00:29:14,458 llama_stack.core.stack:415 core: starting registry
refresh task
ERROR 2025-10-07 00:29:14,459 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: FireworksInferenceAdapter.list_provider_model_ids() failed
with: API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"fireworks_api_key": "<API_KEY>"},
or in the provider config.
WARNING 2025-10-07 00:29:14,461 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider fireworks: API key is
not set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"fireworks_api_key": "<API_KEY>"}, or in the
provider config.
ERROR 2025-10-07 00:29:14,462 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: TogetherInferenceAdapter.list_provider_model_ids() failed
with: Pass Together API Key in the header X-LlamaStack-Provider-Data as {
"together_api_key": <your api key>}
WARNING 2025-10-07 00:29:14,463 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider together: Pass
Together API Key in the header X-LlamaStack-Provider-Data as {
"together_api_key": <your api key>}
ERROR 2025-10-07 00:29:14,465 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: OpenAIInferenceAdapter.list_provider_model_ids() failed
with: API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"openai_api_key": "<API_KEY>"}, or
in the provider config.
WARNING 2025-10-07 00:29:14,466 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider openai: API key is not
set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"openai_api_key": "<API_KEY>"}, or in the
provider config.
INFO 2025-10-07 00:29:14,500 uvicorn.error:62 uncategorized: Application startup
complete.
ERROR 2025-10-07 00:29:14,502 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: AnthropicInferenceAdapter.list_provider_model_ids() failed
with: "Could not resolve authentication method. Expected either api_key or
auth_token to be set. Or for one of the `X-Api-Key` or `Authorization` headers
to be explicitly omitted"
WARNING 2025-10-07 00:29:14,503 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider anthropic: "Could not
resolve authentication method. Expected either api_key or auth_token to be
set. Or for one of the `X-Api-Key` or `Authorization` headers to be explicitly
omitted"
ERROR 2025-10-07 00:29:14,504 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: GeminiInferenceAdapter.list_provider_model_ids() failed
with: API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"gemini_api_key": "<API_KEY>"}, or
in the provider config.
WARNING 2025-10-07 00:29:14,506 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider gemini: API key is not
set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"gemini_api_key": "<API_KEY>"}, or in the
provider config.
ERROR 2025-10-07 00:29:14,507 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: GroqInferenceAdapter.list_provider_model_ids() failed with:
API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"groq_api_key": "<API_KEY>"}, or in
the provider config.
WARNING 2025-10-07 00:29:14,508 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider groq: API key is not
set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"groq_api_key": "<API_KEY>"}, or in the provider
config.
ERROR 2025-10-07 00:29:14,510 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: SambaNovaInferenceAdapter.list_provider_model_ids() failed
with: API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"sambanova_api_key": "<API_KEY>"},
or in the provider config.
WARNING 2025-10-07 00:29:14,511 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider sambanova: API key is
not set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"sambanova_api_key": "<API_KEY>"}, or in the
provider config.
INFO 2025-10-07 00:29:14,513 uvicorn.error:216 uncategorized: Uvicorn running on
http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
```
Tested the models endpoint with curl; it also works:
```
curl http://localhost:8321/v1/models
{"data":[{"identifier":"bedrock/meta.llama3-1-8b-instruct-v1:0","provider_resource_id":"meta.llama3-1-8b-instruct-v1:0","provider_id":"bedrock","type":"model","metadata":{},"model_type":"llm"},{"identifier":"bedrock/meta.llama3-1-70b-instruct-v1:0","provider_resource_id":"meta.llama3-1-70b-instruct-v1:0","provider_id":"bedrock","type":"model","metadata":{},"model_type":"llm"},{"identifier":"bedrock/meta.llama3-1-405b-instruct-v1:0","provider_resource_id":"meta.llama3-1-405b-instruct-v1:0","provider_id":"bedrock","type":"model","metadata":{},"model_type":"llm"},{"identifier":"nvidia/bigcode/starcoder2-7b","provider_resource_id":"bigcode/starcoder2-7b","provider_id":"nvidia","type":"model","metadata":{},"model_type":"llm"},{"identifier":"nvidia/meta/llama-3.3-70b-instruct","provider_resource_id":"meta/llama-3.3-70b-instruct","provider_id":"nvidia","type":"model","metadata":{},"model_type":"llm"},{"identifier":"nvidia/nvidia/llama-3.2-nv-embedqa-1b-v2","provider_resource_id":"nvidia/llama-3.2-nv-embedqa-1b-v2","provider_id":"nvidia","type":"model","metadata":{"embedding_dimension":2048,"context_length":8192},"model_type":"embedding"},{"identifier":"sentence-transformers/all-MiniLM-L6-v2","provider_resource_id":"all-MiniLM-L6-v2","provider_id":"sentence-transformers","type":"model","metadata":{"embedding_dimension":384},"model_type":"embedding"}]}%
```
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
# What does this PR do?
Have been running into flaky unit test failures:
5217035494
Fixing below:
1. Shutting down properly by cancelling any stale file batch tasks
running in the background (see the sketch after this list).
2. Also, using unique_kvstore_config so the tests don't use the same db path,
maintaining test isolation.
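As a rough illustration of the shutdown fix, here is a minimal sketch; the class, attribute, and method names (`FileBatchesMixin`, `_file_batch_tasks`, `shutdown`) are assumptions for illustration, not the actual implementation.
```python
import asyncio


class FileBatchesMixin:
    """Sketch of the shutdown fix; real class/attribute names are assumptions."""

    def __init__(self) -> None:
        self._file_batch_tasks: dict[str, asyncio.Task] = {}

    async def shutdown(self) -> None:
        tasks = list(self._file_batch_tasks.values())
        for task in tasks:
            if not task.done():
                task.cancel()
        # Let the cancellations settle, ignoring the resulting CancelledErrors.
        await asyncio.gather(*tasks, return_exceptions=True)
        self._file_batch_tasks.clear()
```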
## Test Plan
Ran unit test locally and CI
- Allow use of unavailable models on startup
- Add has_model method to ModelsRoutingTable for checking pre-registered
models
- Update check_model_availability to check model_store before provider
APIs (see the sketch below)
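A minimal sketch of the lookup order described above; the names follow the bullets (`check_model_availability`, `model_store`, `has_model`) plus an illustrative `list_provider_model_ids` fallback, and the body is not the actual implementation.
```python
# Sketch only: consult pre-registered models before asking the provider API.
async def check_model_availability(self, model_id: str) -> bool:
    # Models already registered in the model store count as available,
    # even if the provider is unreachable at startup.
    if await self.model_store.has_model(model_id):
        return True
    # Otherwise fall back to the provider's live model listing.
    provider_model_ids = await self.list_provider_model_ids()
    return model_id in provider_model_ids
```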
# What does this PR do?
## Test Plan
Start Llama Stack and point it at an unavailable vLLM:
```
VLLM_URL=https://my-unavailable-vllm/v1 MILVUS_DB_PATH=./milvus.db INFERENCE_MODEL=vllm uv run --with llama-stack llama stack build --distro starter --image-type venv --run
```
Llama Stack starts without crashing, only logging the error:
```
- provider_id: rag-runtime
toolgroup_id: builtin::rag
vector_dbs: []
version: 2
INFO 2025-10-07 06:40:41,804 llama_stack.providers.utils.inference.inference_store:74 inference: Write queue disabled for SQLite to avoid concurrency issues
INFO 2025-10-07 06:40:42,066 llama_stack.providers.utils.responses.responses_store:96 openai_responses: Write queue disabled for SQLite to avoid concurrency issues
ERROR 2025-10-07 06:40:58,882 llama_stack.providers.utils.inference.openai_mixin:436 providers::utils: VLLMInferenceAdapter.list_provider_model_ids() failed with: Request timed out.
WARNING 2025-10-07 06:40:58,883 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider vllm: Request timed out.
[...]
INFO 2025-10-07 06:40:59,036 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
INFO 2025-10-07 06:41:04,064 openai._base_client:1618 uncategorized: Retrying request to /models in 0.398814 seconds
INFO 2025-10-07 06:41:09,497 openai._base_client:1618 uncategorized: Retrying request to /models in 0.781908 seconds
ERROR 2025-10-07 06:41:15,282 llama_stack.providers.utils.inference.openai_mixin:436 providers::utils: VLLMInferenceAdapter.list_provider_model_ids() failed with: Request timed out.
WARNING 2025-10-07 06:41:15,283 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider vllm: Request timed out.
```
# What does this PR do?
Implements annotations for the `file_search` tool.
Also adds some logs and tests.
## How does this work?
1. **Citation Markers**: Models insert `<|file-id|>` tokens during
generation with instructions from search results
2. **Post-Processing**: Extract markers using regex to calculate
character positions and create `AnnotationFileCitation` objects (see the
sketch after this list)
3. **File Mapping**: Store filename metadata during vector store
operations for proper citation display
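To illustrate step 2, here is a minimal sketch of marker extraction and index calculation; the marker regex, helper name, and returned dictionary shape are assumptions for illustration, not the code from this PR.
```python
import re

# Assumed marker format: <|file-id|> tokens embedded in the generated text.
MARKER = re.compile(r"<\|(?P<file_id>[^|>]+)\|>")


def extract_citations(text: str) -> tuple[str, list[dict]]:
    """Strip citation markers and record where each one sat in the cleaned text."""
    annotations: list[dict] = []
    cleaned: list[str] = []
    cursor = 0   # position in the original text
    removed = 0  # characters removed so far
    for match in MARKER.finditer(text):
        cleaned.append(text[cursor:match.start()])
        annotations.append(
            {
                "type": "file_citation",
                "file_id": match.group("file_id"),
                # index in the cleaned text where the marker sat; the real
                # implementation may anchor it to the preceding character
                # (e.g. the sentence-ending period).
                "index": match.start() - removed,
            }
        )
        removed += match.end() - match.start()
        cursor = match.end()
    cleaned.append(text[cursor:])
    return "".join(cleaned), annotations
```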
## Example
This is the updated `quickstart.py` script, which uses the `extra_body`
to register the embedding model.
```python
import io

import requests
from openai import OpenAI

url = "https://www.paulgraham.com/greatwork.html"
model = "gpt-4o-mini"
client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

vs = client.vector_stores.create(
    name="my_citations_db",
    extra_body={
        "embedding_model": "ollama/nomic-embed-text:latest",
        "embedding_dimension": 768,
    },
)

response = requests.get(url)
pseudo_file = io.BytesIO(str(response.content).encode("utf-8"))
file_id = client.files.create(file=(url, pseudo_file, "text/html"), purpose="assistants").id
client.vector_stores.files.create(vector_store_id=vs.id, file_id=file_id)

resp = client.responses.create(
    model=model,
    input="How do you do great work? Use our existing knowledge_search tool.",
    tools=[{"type": "file_search", "vector_store_ids": [vs.id]}],
    include=["file_search_call.results"],
)
print(resp)
```
<details>
<summary> Example of the full response </summary>
```python
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/openai/v1/vector_stores "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/openai/v1/files "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/openai/v1/vector_stores/vs_0f6f7e35-f48b-4850-8604-8117d9a50e0a/files "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/openai/v1/responses "HTTP/1.1 200 OK"
Response(id='resp-28f5793d-3272-4de3-81f6-8cbf107d5bcd', created_at=1759797954.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='gpt-4o-mini', object='response', output=[ResponseFileSearchToolCall(id='call_xWtvEQETN5GNiRLLiBIDKntg', queries=['how to do great work tips'], status='completed', type='file_search_call', results=[Result(attributes={}, file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='file-a98ada68681c4fbeba2201e9c7213fc3', score=1.3722624322210302, text='\\\'re looking where few have looked before.<br /><br />One sign that you\\\'re suited for some kind of work is when you like\\neven the parts that other people find tedious or frightening.<br /><br />But fields aren\\\'t people; you don\\\'t owe them any loyalty. If in the\\ncourse of working on one thing you discover another that\\\'s more\\nexciting, don\\\'t be afraid to switch.<br /><br />If you\\\'re making something for people, make sure it\\\'s something\\nthey actually want. The best way to do this is to make something\\nyou yourself want. Write the story you want to read; build the tool\\nyou want to use. Since your friends probably have similar interests,\\nthis will also get you your initial audience.<br /><br />This <i>should</i> follow from the excitingness rule. Obviously the most\\nexciting story to write will be the one you want to read. The reason\\nI mention this case explicitly is that so many people get it wrong.\\nInstead of making what they want, they try to make what some\\nimaginary, more sophisticated audience wants. And once you go down\\nthat route, you\\\'re lost.\\n<font color=#dddddd>[<a href="#f6n"><font color=#dddddd>6</font></a>]</font><br /><br />There are a lot of forces that will lead you astray when you\\\'re\\ntrying to figure out what to work on. Pretentiousness, fashion,\\nfear, money, politics, other people\\\'s wishes, eminent frauds. But\\nif you stick to what you find genuinely interesting, you\\\'ll be proof\\nagainst all of them. If you\\\'re interested, you\\\'re not astray.<br /><br /><br /><br /><br /><br />\\nFollowing your interests may sound like a rather passive strategy,\\nbut in practice it usually means following them past all sorts of\\nobstacles. You usually have to risk rejection and failure. So it\\ndoes take a good deal of boldness.<br /><br />But while you need boldness, you don\\\'t usually need much planning.\\nIn most cases the recipe for doing great work is simply: work hard\\non excitingly ambitious projects, and something good will come of\\nit. Instead of making a plan and then executing it, you just try\\nto preserve certain invariants.<br /><br />The trouble with planning is that it only works for achievements\\nyou can describe in advance. You can win a gold medal or get rich\\nby deciding to as a child and then tenaciously pursuing that goal,\\nbut you can\\\'t discover natural selection that way.<br /><br />I think for most people who want to do great work, the right strategy\\nis not to plan too much. At each stage do whatever seems most\\ninteresting and gives you the best options for the future. I call\\nthis approach "staying upwind." This is how most people who\\\'ve done\\ngreat work seem to have done it.<br /><br /><br /><br /><br /><br />\\nEven when you\\\'ve found something exciting to work on, working on\\nit is not always straightforward. There will be times when some new\\nidea makes you leap out of bed in the morning and get straight to\\nwork. 
But there will also be plenty of times when things aren\\\'t\\nlike that.<br /><br />You don\\\'t just put out your sail and get blown forward by inspiration.\\nThere are headwinds and currents and hidden shoals. So there\\\'s a\\ntechnique to working, just as there is to sailing.<br /><br />For example, while you must work hard, it\\\'s possible to work too\\nhard, and if'), Result(attributes={}, file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='file-a98ada68681c4fbeba2201e9c7213fc3', score=1.2532794607643494, text=' with anyone who\\\'s genuinely interested. If they\\\'re\\nreally good at their work, then they probably have a hobbyist\\\'s\\ninterest in it, and hobbyists always want to talk about their\\nhobbies.<br /><br />It may take some effort to find the people who are really good,\\nthough. Doing great work has such prestige that in some places,\\nparticularly universities, there\\\'s a polite fiction that everyone\\nis engaged in it. And that is far from true. People within universities\\ncan\\\'t say so openly, but the quality of the work being done in\\ndifferent departments varies immensely. Some departments have people\\ndoing great work; others have in the past; others never have.<br /><br /><br /><br /><br /><br />\\nSeek out the best colleagues. There are a lot of projects that can\\\'t\\nbe done alone, and even if you\\\'re working on one that can be, it\\\'s\\ngood to have other people to encourage you and to bounce ideas off.<br /><br />Colleagues don\\\'t just affect your work, though; they also affect\\nyou. So work with people you want to become like, because you will.<br /><br />Quality is more important than quantity in colleagues. It\\\'s better\\nto have one or two great ones than a building full of pretty good\\nones. In fact it\\\'s not merely better, but necessary, judging from\\nhistory: the degree to which great work happens in clusters suggests\\nthat one\\\'s colleagues often make the difference between doing great\\nwork and not.<br /><br />How do you know when you have sufficiently good colleagues? In my\\nexperience, when you do, you know. Which means if you\\\'re unsure,\\nyou probably don\\\'t. But it may be possible to give a more concrete\\nanswer than that. Here\\\'s an attempt: sufficiently good colleagues\\noffer <i>surprising</i> insights. They can see and do things that you\\ncan\\\'t. So if you have a handful of colleagues good enough to keep\\nyou on your toes in this sense, you\\\'re probably over the threshold.<br /><br />Most of us can benefit from collaborating with colleagues, but some\\nprojects require people on a larger scale, and starting one of those\\nis not for everyone. If you want to run a project like that, you\\\'ll\\nhave to become a manager, and managing well takes aptitude and\\ninterest like any other kind of work. If you don\\\'t have them, there\\nis no middle path: you must either force yourself to learn management\\nas a second language, or avoid such projects.\\n<font color=#dddddd>[<a href="#f27n"><font color=#dddddd>27</font></a>]</font><br /><br /><br /><br /><br /><br />\\nHusband your morale. It\\\'s the basis of everything when you\\\'re working\\non ambitious projects. You have to nurture and protect it like a\\nliving organism.<br /><br />Morale starts with your view of life. 
You\\\'re more likely to do great\\nwork if you\\\'re an optimist, and more likely to if you think of\\nyourself as lucky than if you think of yourself as a victim.<br /><br />Indeed, work can to some extent protect you from your problems. If\\nyou choose work that\\\'s pure, its very difficulties will serve as a\\nrefuge from the difficulties of everyday life. If this is escapism,\\nit\\\'s a very productive form of it, and one that has been used by\\nsome of the greatest minds in history.<br /><br />Morale compounds via work: high morale helps you do good work, which\\nincreases your morale and helps you do even'), Result(attributes={}, file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='file-a98ada68681c4fbeba2201e9c7213fc3', score=1.1973485818164222, text=' your\\nability and interest can take you. And you can only answer that by\\ntrying.<br /><br />Many more people could try to do great work than do. What holds\\nthem back is a combination of modesty and fear. It seems presumptuous\\nto try to be Newton or Shakespeare. It also seems hard; surely if\\nyou tried something like that, you\\\'d fail. Presumably the calculation\\nis rarely explicit. Few people consciously decide not to try to do\\ngreat work. But that\\\'s what\\\'s going on subconsciously; they shy\\naway from the question.<br /><br />So I\\\'m going to pull a sneaky trick on you. Do you want to do great\\nwork, or not? Now you have to decide consciously. Sorry about that.\\nI wouldn\\\'t have done it to a general audience. But we already know\\nyou\\\'re interested.<br /><br />Don\\\'t worry about being presumptuous. You don\\\'t have to tell anyone.\\nAnd if it\\\'s too hard and you fail, so what? Lots of people have\\nworse problems than that. In fact you\\\'ll be lucky if it\\\'s the worst\\nproblem you have.<br /><br />Yes, you\\\'ll have to work hard. But again, lots of people have to\\nwork hard. And if you\\\'re working on something you find very\\ninteresting, which you necessarily will if you\\\'re on the right path,\\nthe work will probably feel less burdensome than a lot of your\\npeers\\\'.<br /><br />The discoveries are out there, waiting to be made. Why not by you?<br /><br /><br /><br /><br /><br /><br /><br /><br /><br />\\n<b>Notes</b><br /><br />[<a name="f1n"><font color=#000000>1</font></a>]\\nI don\\\'t think you could give a precise definition of what\\ncounts as great work. Doing great work means doing something important\\nso well that you expand people\\\'s ideas of what\\\'s possible. But\\nthere\\\'s no threshold for importance. It\\\'s a matter of degree, and\\noften hard to judge at the time anyway. So I\\\'d rather people focused\\non developing their interests rather than worrying about whether\\nthey\\\'re important or not. Just try to do something amazing, and\\nleave it to future generations to say if you succeeded.<br /><br />[<a name="f2n"><font color=#000000>2</font></a>]\\nA lot of standup comedy is based on noticing anomalies in\\neveryday life. "Did you ever notice...?" New ideas come from doing\\nthis about nontrivial things. Which may help explain why people\\\'s\\nreaction to a new idea is often the first half of laughing: Ha!<br /><br />[<a name="f3n"><font color=#000000>3</font></a>]\\nThat second qualifier is critical. 
If you\\\'re excited about\\nsomething most authorities discount, but you can\\\'t give a more\\nprecise explanation than "they don\\\'t get it," then you\\\'re starting\\nto drift into the territory of cranks.<br /><br />[<a name="f4n"><font color=#000000>4</font></a>]\\nFinding something to work on is not simply a matter of finding\\na match between the current version of you and a list of known\\nproblems. You\\\'ll often have to coevolve with the problem. That\\\'s\\nwhy it can sometimes be so hard to figure out what to work on. The\\nsearch space is huge. It\\\'s the cartesian product of all possible\\nt'), Result(attributes={}, file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='file-a98ada68681c4fbeba2201e9c7213fc3', score=1.1764591706535943, text='\\noptimistic, and even though one of the sources of their optimism\\nis ignorance, in this case ignorance can sometimes beat knowledge.<br /><br />Try to finish what you start, though, even if it turns out to be\\nmore work than you expected. Finishing things is not just an exercise\\nin tidiness or self-discipline. In many projects a lot of the best\\nwork happens in what was meant to be the final stage.<br /><br />Another permissible lie is to exaggerate the importance of what\\nyou\\\'re working on, at least in your own mind. If that helps you\\ndiscover something new, it may turn out not to have been a lie after\\nall.\\n<font color=#dddddd>[<a href="#f7n"><font color=#dddddd>7</font></a>]</font><br /><br /><br /><br /><br /><br />\\nSince there are two senses of starting work — per day and per\\nproject — there are also two forms of procrastination. Per-project\\nprocrastination is far the more dangerous. You put off starting\\nthat ambitious project from year to year because the time isn\\\'t\\nquite right. When you\\\'re procrastinating in units of years, you can\\nget a lot not done.\\n<font color=#dddddd>[<a href="#f8n"><font color=#dddddd>8</font></a>]</font><br /><br />One reason per-project procrastination is so dangerous is that it\\nusually camouflages itself as work. You\\\'re not just sitting around\\ndoing nothing; you\\\'re working industriously on something else. So\\nper-project procrastination doesn\\\'t set off the alarms that per-day\\nprocrastination does. You\\\'re too busy to notice it.<br /><br />The way to beat it is to stop occasionally and ask yourself: Am I\\nworking on what I most want to work on? When you\\\'re young it\\\'s ok\\nif the answer is sometimes no, but this gets increasingly dangerous\\nas you get older.\\n<font color=#dddddd>[<a href="#f9n"><font color=#dddddd>9</font></a>]</font><br /><br /><br /><br /><br /><br />\\nGreat work usually entails spending what would seem to most people\\nan unreasonable amount of time on a problem. You can\\\'t think of\\nthis time as a cost, or it will seem too high. You have to find the\\nwork sufficiently engaging as it\\\'s happening.<br /><br />There may be some jobs where you have to work diligently for years\\nat things you hate before you get to the good part, but this is not\\nhow great work happens. Great work happens by focusing consistently\\non something you\\\'re genuinely interested in. When you pause to take\\nstock, you\\\'re surprised how far you\\\'ve come.<br /><br />The reason we\\\'re surprised is that we underestimate the cumulative\\neffect of work. Writing a page a day doesn\\\'t sound like much, but\\nif you do it every day you\\\'ll write a book a year. That\\\'s the key:\\nconsistency. 
People who do great things don\\\'t get a lot done every\\nday. They get something done, rather than nothing.<br /><br />If you do work that compounds, you\\\'ll get exponential growth. Most\\npeople who do this do it unconsciously, but it\\\'s worth stopping to\\nthink about. Learning, for example, is an instance of this phenomenon:\\nthe more you learn about something, the easier it is to learn more.\\nGrowing an audience is another: the more fans you have, the more\\nnew fans they\\\'ll bring you.<br /><br />'), Result(attributes={}, file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='file-a98ada68681c4fbeba2201e9c7213fc3', score=1.174069664815369, text='\\ninside.<br /><br /><br /><br /><br /><br />Let\\\'s talk a little more about the complicated business of figuring\\nout what to work on. The main reason it\\\'s hard is that you can\\\'t\\ntell what most kinds of work are like except by doing them. Which\\nmeans the four steps overlap: you may have to work at something for\\nyears before you know how much you like it or how good you are at\\nit. And in the meantime you\\\'re not doing, and thus not learning\\nabout, most other kinds of work. So in the worst case you choose\\nlate based on very incomplete information.\\n<font color=#dddddd>[<a href="#f4n"><font color=#dddddd>4</font></a>]</font><br /><br />The nature of ambition exacerbates this problem. Ambition comes in\\ntwo forms, one that precedes interest in the subject and one that\\ngrows out of it. Most people who do great work have a mix, and the\\nmore you have of the former, the harder it will be to decide what\\nto do.<br /><br />The educational systems in most countries pretend it\\\'s easy. They\\nexpect you to commit to a field long before you could know what\\nit\\\'s really like. And as a result an ambitious person on an optimal\\ntrajectory will often read to the system as an instance of breakage.<br /><br />It would be better if they at least admitted it — if they admitted\\nthat the system not only can\\\'t do much to help you figure out what\\nto work on, but is designed on the assumption that you\\\'ll somehow\\nmagically guess as a teenager. They don\\\'t tell you, but I will:\\nwhen it comes to figuring out what to work on, you\\\'re on your own.\\nSome people get lucky and do guess correctly, but the rest will\\nfind themselves scrambling diagonally across tracks laid down on\\nthe assumption that everyone does.<br /><br />What should you do if you\\\'re young and ambitious but don\\\'t know\\nwhat to work on? What you should <i>not</i> do is drift along passively,\\nassuming the problem will solve itself. You need to take action.\\nBut there is no systematic procedure you can follow. When you read\\nbiographies of people who\\\'ve done great work, it\\\'s remarkable how\\nmuch luck is involved. They discover what to work on as a result\\nof a chance meeting, or by reading a book they happen to pick up.\\nSo you need to make yourself a big target for luck, and the way to\\ndo that is to be curious. Try lots of things, meet lots of people,\\nread lots of books, ask lots of questions.\\n<font color=#dddddd>[<a href="#f5n"><font color=#dddddd>5</font></a>]</font><br /><br />When in doubt, optimize for interestingness. Fields change as you\\nlearn more about them. What mathematicians do, for example, is very\\ndifferent from what you do in high school math classes. So you need\\nto give different types of work a chance to show you what they\\\'re\\nlike. 
But a field should become <i>increasingly</i> interesting as you\\nlearn more about it. If it doesn\\\'t, it\\\'s probably not for you.<br /><br />Don\\\'t worry if you find you\\\'re interested in different things than\\nother people. The stranger your tastes in interestingness, the\\nbetter. Strange tastes are often strong ones, and a strong taste\\nfor work means you\\\'ll be productive. And you\\\'re more likely to find\\nnew things if you'), Result(attributes={}, file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='file-a98ada68681c4fbeba2201e9c7213fc3', score=1.158095578895721, text='. Don\\\'t copy the manner of\\nan eminent 50 year old professor if you\\\'re 18, for example, or the\\nidiom of a Renaissance poem hundreds of years later.<br /><br />Some of the features of things you admire are flaws they succeeded\\ndespite. Indeed, the features that are easiest to imitate are the\\nmost likely to be the flaws.<br /><br />This is particularly true for behavior. Some talented people are\\njerks, and this sometimes makes it seem to the inexperienced that\\nbeing a jerk is part of being talented. It isn\\\'t; being talented\\nis merely how they get away with it.<br /><br />One of the most powerful kinds of copying is to copy something from\\none field into another. History is so full of chance discoveries\\nof this type that it\\\'s probably worth giving chance a hand by\\ndeliberately learning about other kinds of work. You can take ideas\\nfrom quite distant fields if you let them be metaphors.<br /><br />Negative examples can be as inspiring as positive ones. In fact you\\ncan sometimes learn more from things done badly than from things\\ndone well; sometimes it only becomes clear what\\\'s needed when it\\\'s\\nmissing.<br /><br /><br /><br /><br /><br />\\nIf a lot of the best people in your field are collected in one\\nplace, it\\\'s usually a good idea to visit for a while. It will\\nincrease your ambition, and also, by showing you that these people\\nare human, increase your self-confidence.\\n<font color=#dddddd>[<a href="#f26n"><font color=#dddddd>26</font></a>]</font><br /><br />If you\\\'re earnest you\\\'ll probably get a warmer welcome than you\\nmight expect. Most people who are very good at something are happy\\nto talk about it with anyone who\\\'s genuinely interested. If they\\\'re\\nreally good at their work, then they probably have a hobbyist\\\'s\\ninterest in it, and hobbyists always want to talk about their\\nhobbies.<br /><br />It may take some effort to find the people who are really good,\\nthough. Doing great work has such prestige that in some places,\\nparticularly universities, there\\\'s a polite fiction that everyone\\nis engaged in it. And that is far from true. People within universities\\ncan\\\'t say so openly, but the quality of the work being done in\\ndifferent departments varies immensely. Some departments have people\\ndoing great work; others have in the past; others never have.<br /><br /><br /><br /><br /><br />\\nSeek out the best colleagues. There are a lot of projects that can\\\'t\\nbe done alone, and even if you\\\'re working on one that can be, it\\\'s\\ngood to have other people to encourage you and to bounce ideas off.<br /><br />Colleagues don\\\'t just affect your work, though; they also affect\\nyou. So work with people you want to become like, because you will.<br /><br />Quality is more important than quantity in colleagues. It\\\'s better\\nto have one or two great ones than a building full of pretty good\\nones. 
In fact it\\\'s not merely better, but necessary, judging from\\nhistory: the degree to which great work happens in clusters suggests\\nthat one\\\'s colleagues often make the difference between doing great\\nwork and not.<br /><br />How do you know when you have sufficiently good colleagues? In my\\nexperience, when you do, you know. Which means if you\\\'re unsure,\\nyou probably don\\\'t. But it may be possible to give a more concrete\\nanswer than that. Here\\\'s an attempt: sufficiently good'), Result(attributes={}, file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='file-a98ada68681c4fbeba2201e9c7213fc3', score=1.1566747762241967, text=',\\nbut in practice it usually means following them past all sorts of\\nobstacles. You usually have to risk rejection and failure. So it\\ndoes take a good deal of boldness.<br /><br />But while you need boldness, you don\\\'t usually need much planning.\\nIn most cases the recipe for doing great work is simply: work hard\\non excitingly ambitious projects, and something good will come of\\nit. Instead of making a plan and then executing it, you just try\\nto preserve certain invariants.<br /><br />The trouble with planning is that it only works for achievements\\nyou can describe in advance. You can win a gold medal or get rich\\nby deciding to as a child and then tenaciously pursuing that goal,\\nbut you can\\\'t discover natural selection that way.<br /><br />I think for most people who want to do great work, the right strategy\\nis not to plan too much. At each stage do whatever seems most\\ninteresting and gives you the best options for the future. I call\\nthis approach "staying upwind." This is how most people who\\\'ve done\\ngreat work seem to have done it.<br /><br /><br /><br /><br /><br />\\nEven when you\\\'ve found something exciting to work on, working on\\nit is not always straightforward. There will be times when some new\\nidea makes you leap out of bed in the morning and get straight to\\nwork. But there will also be plenty of times when things aren\\\'t\\nlike that.<br /><br />You don\\\'t just put out your sail and get blown forward by inspiration.\\nThere are headwinds and currents and hidden shoals. So there\\\'s a\\ntechnique to working, just as there is to sailing.<br /><br />For example, while you must work hard, it\\\'s possible to work too\\nhard, and if you do that you\\\'ll find you get diminishing returns:\\nfatigue will make you stupid, and eventually even damage your health.\\nThe point at which work yields diminishing returns depends on the\\ntype. Some of the hardest types you might only be able to do for\\nfour or five hours a day.<br /><br />Ideally those hours will be contiguous. To the extent you can, try\\nto arrange your life so you have big blocks of time to work in.\\nYou\\\'ll shy away from hard tasks if you know you might be interrupted.<br /><br />It will probably be harder to start working than to keep working.\\nYou\\\'ll often have to trick yourself to get over that initial\\nthreshold. Don\\\'t worry about this; it\\\'s the nature of work, not a\\nflaw in your character. Work has a sort of activation energy, both\\nper day and per project. And since this threshold is fake in the\\nsense that it\\\'s higher than the energy required to keep going, it\\\'s\\nok to tell yourself a lie of corresponding magnitude to get over\\nit.<br /><br />It\\\'s usually a mistake to lie to yourself if you want to do great\\nwork, but this is one of the rare cases where it isn\\\'t. 
When I\\\'m\\nreluctant to start work in the morning, I often trick myself by\\nsaying "I\\\'ll just read over what I\\\'ve got so far." Five minutes\\nlater I\\\'ve found something that seems mistaken or incomplete, and\\nI\\\'m off.<br /><br />Similar techniques work for starting new projects. It\\\'s ok to lie\\nto yourself about how much work a project will entail, for example.\\nLots of great things began with someone saying "How hard could it\\nbe?"<br /><br />This is one case where the young have an advantage. They\\\'re more'), Result(attributes={}, file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='file-a98ada68681c4fbeba2201e9c7213fc3', score=1.1349744395573516, text=' audience\\nin the traditional sense. Either way it doesn\\\'t need to be big.\\nThe value of an audience doesn\\\'t grow anything like linearly with\\nits size. Which is bad news if you\\\'re famous, but good news if\\nyou\\\'re just starting out, because it means a small but dedicated\\naudience can be enough to sustain you. If a handful of people\\ngenuinely love what you\\\'re doing, that\\\'s enough.<br /><br />To the extent you can, avoid letting intermediaries come between\\nyou and your audience. In some types of work this is inevitable,\\nbut it\\\'s so liberating to escape it that you might be better off\\nswitching to an adjacent type if that will let you go direct.\\n<font color=#dddddd>[<a href="#f28n"><font color=#dddddd>28</font></a>]</font><br /><br />The people you spend time with will also have a big effect on your\\nmorale. You\\\'ll find there are some who increase your energy and\\nothers who decrease it, and the effect someone has is not always\\nwhat you\\\'d expect. Seek out the people who increase your energy and\\navoid those who decrease it. Though of course if there\\\'s someone\\nyou need to take care of, that takes precedence.<br /><br />Don\\\'t marry someone who doesn\\\'t understand that you need to work,\\nor sees your work as competition for your attention. If you\\\'re\\nambitious, you need to work; it\\\'s almost like a medical condition;\\nso someone who won\\\'t let you work either doesn\\\'t understand you,\\nor does and doesn\\\'t care.<br /><br />Ultimately morale is physical. You think with your body, so it\\\'s\\nimportant to take care of it. That means exercising regularly,\\neating and sleeping well, and avoiding the more dangerous kinds of\\ndrugs. Running and walking are particularly good forms of exercise\\nbecause they\\\'re good for thinking.\\n<font color=#dddddd>[<a href="#f29n"><font color=#dddddd>29</font></a>]</font><br /><br />People who do great work are not necessarily happier than everyone\\nelse, but they\\\'re happier than they\\\'d be if they didn\\\'t. In fact,\\nif you\\\'re smart and ambitious, it\\\'s dangerous <i>not</i> to be productive.\\nPeople who are smart and ambitious but don\\\'t achieve much tend to\\nbecome bitter.<br /><br /><br /><br /><br /><br />\\nIt\\\'s ok to want to impress other people, but choose the right people.\\nThe opinion of people you respect is signal. Fame, which is the\\nopinion of a much larger group you might or might not respect, just\\nadds noise.<br /><br />The prestige of a type of work is at best a trailing indicator and\\nsometimes completely mistaken. If you do anything well enough,\\nyou\\\'ll make it prestigious. 
So the question to ask about a type of\\nwork is not how much prestige it has, but how well it could be done.<br /><br />Competition can be an effective motivator, but don\\\'t let it choose\\nthe problem for you; don\\\'t let yourself get drawn into chasing\\nsomething just because others are. In fact, don\\\'t let competitors\\nmake you do anything much more specific than work harder.<br /><br />Curiosity is the best guide. Your curiosity never lies, and it knows\\nmore than you do about what\\\'s worth paying attention to.<br /><br /><br /><br /><br /><br />\\nNotice how often that word has come up. If you asked an oracle the\\nsecret to doing great work and the oracle replied'), Result(attributes={}, file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='file-a98ada68681c4fbeba2201e9c7213fc3', score=1.123214818076958, text='b\'<html><head><meta name="Keywords" content="" /><title>How to Do Great Work</title><!-- <META NAME="ROBOTS" CONTENT="NOODP"> -->\\n<link rel="shortcut icon" href="http://ycombinator.com/arc/arc.png">\\n</head><body bgcolor="#ffffff" background="https://s.turbifycdn.com/aah/paulgraham/bel-6.gif" text="#000000" link="#000099" vlink="#464646"><table border="0" cellspacing="0" cellpadding="0"><tr valign="top"><td><map name=118ab66adb24b4f><area shape=rect coords="0,0,67,21" href="index.html"><area shape=rect coords="0,21,67,42" href="articles.html"><area shape=rect coords="0,42,67,63" href="http://www.amazon.com/gp/product/0596006624"><area shape=rect coords="0,63,67,84" href="books.html"><area shape=rect coords="0,84,67,105" href="http://ycombinator.com"><area shape=rect coords="0,105,67,126" href="arc.html"><area shape=rect coords="0,126,67,147" href="bel.html"><area shape=rect coords="0,147,67,168" href="lisp.html"><area shape=rect coords="0,168,67,189" href="antispam.html"><area shape=rect coords="0,189,67,210" href="kedrosky.html"><area shape=rect coords="0,210,67,231" href="faq.html"><area shape=rect coords="0,231,67,252" href="raq.html"><area shape=rect coords="0,252,67,273" href="quo.html"><area shape=rect coords="0,273,67,294" href="rss.html"><area shape=rect coords="0,294,67,315" href="bio.html"><area shape=rect coords="0,315,67,336" href="https://twitter.com/paulg"><area shape=rect coords="0,336,67,357" href="https://mas.to/@paulg"></map><img src="https://s.turbifycdn.com/aah/paulgraham/bel-7.gif" width="69" height="357" usemap=#118ab66adb24b4f border="0" hspace="0" vspace="0" ismap /></td><td><img src="https://sep.turbifycdn.com/ca/Img/trans_1x1.gif" height="1" width="26" border="0" /></td><td><a href="index.html"><img src="https://s.turbifycdn.com/aah/paulgraham/bel-8.gif" width="410" height="45" border="0" hspace="0" vspace="0" /></a><br /><br /><table border="0" cellspacing="0" cellpadding="0" width="435"><tr valign="top"><td width="435"><img src="https://s.turbifycdn.com/aah/paulgraham/how-to-do-great-work-2.gif" width="185" height="18" border="0" hspace="0" vspace="0" alt="How to Do Great Work" /><br /><br /><font size="2" face="verdana">July 2023<br /><br />If you collected lists of techniques for doing great work in a lot\\nof different fields, what would the intersection look like? I decided\\nto find out'), Result(attributes={}, file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='file-a98ada68681c4fbeba2201e9c7213fc3', score=1.1193194369249235, text=' dangerous kinds of\\ndrugs. 
Running and walking are particularly good forms of exercise\\nbecause they\\\'re good for thinking.\\n<font color=#dddddd>[<a href="#f29n"><font color=#dddddd>29</font></a>]</font><br /><br />People who do great work are not necessarily happier than everyone\\nelse, but they\\\'re happier than they\\\'d be if they didn\\\'t. In fact,\\nif you\\\'re smart and ambitious, it\\\'s dangerous <i>not</i> to be productive.\\nPeople who are smart and ambitious but don\\\'t achieve much tend to\\nbecome bitter.<br /><br /><br /><br /><br /><br />\\nIt\\\'s ok to want to impress other people, but choose the right people.\\nThe opinion of people you respect is signal. Fame, which is the\\nopinion of a much larger group you might or might not respect, just\\nadds noise.<br /><br />The prestige of a type of work is at best a trailing indicator and\\nsometimes completely mistaken. If you do anything well enough,\\nyou\\\'ll make it prestigious. So the question to ask about a type of\\nwork is not how much prestige it has, but how well it could be done.<br /><br />Competition can be an effective motivator, but don\\\'t let it choose\\nthe problem for you; don\\\'t let yourself get drawn into chasing\\nsomething just because others are. In fact, don\\\'t let competitors\\nmake you do anything much more specific than work harder.<br /><br />Curiosity is the best guide. Your curiosity never lies, and it knows\\nmore than you do about what\\\'s worth paying attention to.<br /><br /><br /><br /><br /><br />\\nNotice how often that word has come up. If you asked an oracle the\\nsecret to doing great work and the oracle replied with a single\\nword, my bet would be on "curiosity."<br /><br />That doesn\\\'t translate directly to advice. It\\\'s not enough just to\\nbe curious, and you can\\\'t command curiosity anyway. But you can\\nnurture it and let it drive you.<br /><br />Curiosity is the key to all four steps in doing great work: it will\\nchoose the field for you, get you to the frontier, cause you to\\nnotice the gaps in it, and drive you to explore them. The whole\\nprocess is a kind of dance with curiosity.<br /><br /><br /><br /><br /><br />\\nBelieve it or not, I tried to make this essay as short as I could.\\nBut its length at least means it acts as a filter. If you made it\\nthis far, you must be interested in doing great work. And if so\\nyou\\\'re already further along than you might realize, because the\\nset of people willing to want to is small.<br /><br />The factors in doing great work are factors in the literal,\\nmathematical sense, and they are: ability, interest, effort, and\\nluck. Luck by definition you can\\\'t do anything about, so we can\\nignore that. And we can assume effort, if you do in fact want to\\ndo great work. So the problem boils down to ability and interest.\\nCan you find a kind of work where your ability and interest will\\ncombine to yield an explosion of new ideas?<br /><br />Here there are grounds for optimism. There are so many different\\nways to do great work, and even more that are still undiscovered.\\nOut of all those different types of work, the one you\\\'re most suited\\nfor is probably a pretty close match. Probably a comically close\\nmatch. 
It\\\'s just a question of finding it, and how far into it')]), ResponseOutputMessage(id='msg_3591ea71-8b35-4efd-a5ad-c1c250801971', content=[ResponseOutputText(annotations=[AnnotationFileCitation(file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='https://www.paulgraham.com/greatwork.html', index=361, type='file_citation'), AnnotationFileCitation(file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='https://www.paulgraham.com/greatwork.html', index=676, type='file_citation'), AnnotationFileCitation(file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='https://www.paulgraham.com/greatwork.html', index=948, type='file_citation'), AnnotationFileCitation(file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='https://www.paulgraham.com/greatwork.html', index=1259, type='file_citation'), AnnotationFileCitation(file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='https://www.paulgraham.com/greatwork.html', index=1520, type='file_citation'), AnnotationFileCitation(file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='https://www.paulgraham.com/greatwork.html', index=1747, type='file_citation')], text='To do great work, consider the following principles:\n\n1. **Follow Your Interests**: Engage in work that genuinely excites you. If you find an area intriguing, pursue it without being overly concerned about external pressures or norms. You should create things that you would want for yourself, as this often aligns with what others in your circle might want too.\n\n2. **Work Hard on Ambitious Projects**: Ambition is vital, but it should be tempered by genuine interest. Instead of detailed planning for the future, focus on exciting projects that keep your options open. This approach, known as "staying upwind," allows for adaptability and can lead to unforeseen achievements.\n\n3. **Choose Quality Colleagues**: Collaborating with talented colleagues can significantly affect your own work. Seek out individuals who offer surprising insights and whom you admire. The presence of good colleagues can elevate the quality of your work and inspire you.\n\n4. **Maintain High Morale**: Your attitude towards work and life affects your performance. Cultivating optimism and viewing yourself as lucky rather than victimized can boost your productivity. It’s essential to care for your physical health as well since it directly impacts your mental faculties and morale.\n\n5. **Be Consistent**: Great work often comes from cumulative effort. Daily progress, even in small amounts, can result in substantial achievements over time. Emphasize consistency and make the work engaging, as this reduces the perceived burden of hard labor.\n\n6. **Embrace Curiosity**: Curiosity is a driving force that can guide you in selecting fields of interest, pushing you to explore uncharted territories. 
Allow it to shape your work and continually seek knowledge and insights.\n\nBy focusing on these aspects, you can create an environment conducive to great work and personal fulfillment.', type='output_text', logprobs=None)], role='assistant', status='completed', type='message')], parallel_tool_calls=False, temperature=None, tool_choice=None, tools=None, top_p=None, background=None, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier=None, status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity=None), top_logprobs=None, truncation=None, usage=None, user=None)
In [34]: resp.output[1].content[0].text
Out[34]: 'To do great work, consider the following principles:\n\n1. **Follow Your Interests**: Engage in work that genuinely excites you. If you find an area intriguing, pursue it without being overly concerned about external pressures or norms. You should create things that you would want for yourself, as this often aligns with what others in your circle might want too.\n\n2. **Work Hard on Ambitious Projects**: Ambition is vital, but it should be tempered by genuine interest. Instead of detailed planning for the future, focus on exciting projects that keep your options open. This approach, known as "staying upwind," allows for adaptability and can lead to unforeseen achievements.\n\n3. **Choose Quality Colleagues**: Collaborating with talented colleagues can significantly affect your own work. Seek out individuals who offer surprising insights and whom you admire. The presence of good colleagues can elevate the quality of your work and inspire you.\n\n4. **Maintain High Morale**: Your attitude towards work and life affects your performance. Cultivating optimism and viewing yourself as lucky rather than victimized can boost your productivity. It’s essential to care for your physical health as well since it directly impacts your mental faculties and morale.\n\n5. **Be Consistent**: Great work often comes from cumulative effort. Daily progress, even in small amounts, can result in substantial achievements over time. Emphasize consistency and make the work engaging, as this reduces the perceived burden of hard labor.\n\n6. **Embrace Curiosity**: Curiosity is a driving force that can guide you in selecting fields of interest, pushing you to explore uncharted territories. Allow it to shape your work and continually seek knowledge and insights.\n\nBy focusing on these aspects, you can create an environment conducive to great work and personal fulfillment.'
```
</details>
The relevant output looks like this:
```python
>resp.output[1].content[0].annotations
[AnnotationFileCitation(file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='https://www.paulgraham.com/greatwork.html', index=361, type='file_citation'),
AnnotationFileCitation(file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='https://www.paulgraham.com/greatwork.html', index=676, type='file_citation'),
AnnotationFileCitation(file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='https://www.paulgraham.com/greatwork.html', index=948, type='file_citation'),
AnnotationFileCitation(file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='https://www.paulgraham.com/greatwork.html', index=1259, type='file_citation'),
AnnotationFileCitation(file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='https://www.paulgraham.com/greatwork.html', index=1520, type='file_citation'),
AnnotationFileCitation(file_id='file-a98ada68681c4fbeba2201e9c7213fc3', filename='https://www.paulgraham.com/greatwork.html', index=1747, type='file_citation')]
```
And
```python
In [144]: print(resp.output[1].content[0].text)
To do great work, consider the following principles:
1. **Follow Your Interests**: Engage in work that genuinely excites you.
If you find an area intriguing, pursue it without being overly concerned
about external pressures or norms. You should create things that you
would want for yourself, as this often aligns with what others in your
circle might want too.
2. **Work Hard on Ambitious Projects**: Ambition is vital, but it should
be tempered by genuine interest. Instead of detailed planning for the
future, focus on exciting projects that keep your options open. This
approach, known as "staying upwind," allows for adaptability and can
lead to unforeseen achievements.
3. **Choose Quality Colleagues**: Collaborating with talented colleagues
can significantly affect your own work. Seek out individuals who offer
surprising insights and whom you admire. The presence of good colleagues
can elevate the quality of your work and inspire you.
4. **Maintain High Morale**: Your attitude towards work and life affects
your performance. Cultivating optimism and viewing yourself as lucky
rather than victimized can boost your productivity. It’s essential to
care for your physical health as well since it directly impacts your
mental faculties and morale.
5. **Be Consistent**: Great work often comes from cumulative effort.
Daily progress, even in small amounts, can result in substantial
achievements over time. Emphasize consistency and make the work
engaging, as this reduces the perceived burden of hard labor.
6. **Embrace Curiosity**: Curiosity is a driving force that can guide
you in selecting fields of interest, pushing you to explore uncharted
territories. Allow it to shape your work and continually seek knowledge
and insights.
By focusing on these aspects, you can create an environment conducive to
great work and personal fulfillment.
```
And the code below outputs only periods, showing that the position/index behaves as expected: each annotation lands at the end of a sentence.
```python
print([resp.output[1].content[0].text[j.index] for j in
resp.output[1].content[0].annotations])
Out[41]: ['.', '.', '.', '.', '.', '.']
```
## Test Plan
Unit tests added.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
There is a lot of code in the agents API using the telemetry API and its
helpers without checking whether that API is even enabled.
This is the only API besides inference actively using telemetry code, so
after this change telemetry can be optional for the entire stack.
Resolves #3665
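A minimal sketch of the kind of guard being added; the attribute name `telemetry_api` and the surrounding method are assumptions for illustration, not the actual agents code.
```python
# Sketch only: agents code should no-op when the Telemetry API is not enabled
# in the stack, instead of assuming it is always present.
async def maybe_log_event(self, event) -> None:
    if self.telemetry_api is None:
        return  # telemetry disabled for this stack; skip logging entirely
    await self.telemetry_api.log_event(event)
```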
## Test Plan
existing agent tests.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Inference adapters can now configure `refresh_models: bool` to control
periodic model listing from their providers.
BREAKING CHANGE: the together inference adapter default changed. It
previously always refreshed; it now follows the config.
Addresses the "models: refresh" item in #3517
## Test Plan
ci w/ new tests
# What does this PR do?
When using a distro like starter, where a bunch of providers are disabled,
I should not see logs like the following emitted
```
in the provider data header, e.g. x-llamastack-provider-data: {"groq_api_key": "<API_KEY>"}, or in the provider config.
WARNING 2025-10-07 08:38:52,117 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider sambanova: API key is not set. Please provide a valid
API key in the provider data header, e.g. x-llamastack-provider-data: {"sambanova_api_key": "<API_KEY>"}, or in the provider config.
WARNING 2025-10-07 08:43:52,123 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider fireworks: Pass Fireworks API Key in the header
X-LlamaStack-Provider-Data as { "fireworks_api_key": <your api key>}
WARNING 2025-10-07 08:43:52,126 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider together: Pass Together API Key in the header
X-LlamaStack-Provider-Data as { "together_api_key": <your api key>}
WARNING 2025-10-07 08:43:52,129 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider openai: API key is not set. Please provide a valid API
key in the provider data header, e.g. x-llamastack-provider-data: {"openai_api_key": "<API_KEY>"}, or in the provider config.
WARNING 2025-10-07 08:43:52,132 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider anthropic: API key is not set. Please provide a valid
API key in the provider data header, e.g. x-llamastack-provider-data: {"anthropic_api_key": "<API_KEY>"}, or in the provider config.
WARNING 2025-10-07 08:43:52,136 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider gemini: API key is not set. Please provide a valid API
key in the provider data header, e.g. x-llamastack-provider-data: {"gemini_api_key": "<API_KEY>"}, or in the provider config.
WARNING 2025-10-07 08:43:52,139 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider groq: API key is not set. Please provide a valid API key
in the provider data header, e.g. x-llamastack-provider-data: {"groq_api_key": "<API_KEY>"}, or in the provider config.
WARNING 2025-10-07 08:43:52,142 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider sambanova: API key is not set. Please provide a valid
API key in the provider data header, e.g. x-llamastack-provider-data: {"sambanova_api_key": "<API_KEY>"}, or in the provider config.
^CINFO 2025-10-07 08:46:11,996 llama_stack.core.utils.exec:75 core:
```
as WARNING. Switch them to DEBUG.
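For illustration only (the logger name is taken from the log lines above and the message is abbreviated), the change amounts to emitting these refresh failures at DEBUG:
```python
import logging

logger = logging.getLogger("llama_stack.core.routing_tables.models")

# previously a warning; now the refresh failure only shows up when debug logging is enabled
logger.debug("Model refresh failed for provider %s: %s", "groq", "API key is not set")
```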
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Updates pyproject.toml dependencies to fix vector processing
compatibility issues.
Closes: #3495
## Test Plan
Tested the llama stack server with the faiss vector database:
1. Built and ran the server: `llama stack build --distro starter --image-type venv --image-name llamastack-faiss`
2. Tested file upload: successfully uploaded a PDF via /v1/openai/v1/files
3. Tested vector operations:
- Created vector store with faiss backend
- Added PDF to vector store
- Performed semantic search queries
# What does this PR do?
Sorry @mattf, I thought I could close the other PR and reopen it, but I didn't have the option to reopen it. I just didn't want it to keep notifying maintainers while I made other commits for testing.
Continuation of: https://github.com/llamastack/llama-stack/pull/3641
PR fixes Runpod Adapter
https://github.com/llamastack/llama-stack/issues/3517
## What I fixed from before:
1. Made it all OpenAI-compatible.
2. Fixed up the class, since OpenAIMixin had a couple of changes with the pydantic base model.
3. Tested that we can dynamically find models and use the resulting identifier to make requests:
```bash
curl -X GET \
-H "Content-Type: application/json" \
"http://localhost:8321/v1/models"
```
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# RunPod Provider Quick Start
## Prerequisites
- Python 3.10+
- Git
- RunPod API token
## Setup for Development
```bash
# 1. Clone and enter the repository
cd <path-to-llama-stack-repo>
# 2. Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate
# 3. Remove any existing llama-stack installation
pip uninstall llama-stack llama-stack-client -y
# 4. Install llama-stack in development mode
pip install -e .
# 5. Build using local development code (found this through the Discord)
LLAMA_STACK_DIR=. llama stack build
# When prompted during build:
# - Name: runpod-dev
# - Image type: venv
# - Inference provider: remote::runpod
# - Safety provider: "llama-guard"
# - Other providers: first defaults
```
## Configure the Stack
The RunPod adapter automatically discovers models from your endpoint via the `/v1/models` API.
No manual model configuration is required - just set your environment variables.
## Run the Server
### Important: Use the Build-Created Virtual Environment
```bash
# Exit the development venv if you're in it
deactivate
# Activate the build-created venv (NOT .venv)
cd <path-to-llama-stack-repo>
source llamastack-runpod-dev/bin/activate
```
### For Qwen3-32B-AWQ Public Endpoint (Recommended)
```bash
# Set environment variables
export RUNPOD_URL="https://api.runpod.ai/v2/qwen3-32b-awq/openai/v1"
export RUNPOD_API_TOKEN="your_runpod_api_key"
# Start server
llama stack run
~/.llama/distributions/llamastack-runpod-dev/llamastack-runpod-dev-run.yaml
```
## Quick Test
### 1. List Available Models (Dynamic Discovery)
First, check which models are available on your RunPod endpoint:
```bash
curl -X GET \
-H "Content-Type: application/json" \
"http://localhost:8321/v1/models"
```
**Example Response:**
```json
{
"data": [
{
"identifier": "qwen3-32b-awq",
"provider_resource_id": "Qwen/Qwen3-32B-AWQ",
"provider_id": "runpod",
"type": "model",
"metadata": {},
"model_type": "llm"
}
]
}
```
**Note:** Use the `identifier` value from the response above in your requests below.
### 2. Chat Completion (Non-streaming)
Replace `qwen3-32b-awq` with your model identifier from step 1:
```bash
curl -X POST http://localhost:8321/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-32b-awq",
"messages": [{"role": "user", "content": "Hello, count to 3"}],
"stream": false
}'
```
### 3. Chat Completion (Streaming)
```bash
curl -X POST http://localhost:8321/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-32b-awq",
"messages": [{"role": "user", "content": "Count to 5"}],
"stream": true
}'
```
**Clean streaming output:**
```bash
curl -N -X POST http://localhost:8321/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen3-32b-awq", "messages": [{"role": "user", "content":
"Count to 5"}], "stream": true}' \
2>/dev/null | while read -r line; do
  echo "$line" | grep "^data: " | sed 's/^data: //' | jq -r '.choices[0].delta.content // empty' 2>/dev/null
done
```
**Expected Output:**
```
1
2
3
4
5
```
# What does this PR do?
```
WARNING 2025-10-06 12:01:45,137 root:266 uncategorized: Unknown logging
category: tokenizer_utils. Falling back to default 'root' level: 20
```
## Test Plan
# What does this PR do?
* Cleans up API docstrings for better documentation rendering
<img width="2346" height="1126" alt="image"
src="https://github.com/user-attachments/assets/516b09a1-2d5b-4614-a3a9-13431fc21fc1"
/>
## Test Plan
* Manual testing
---------
Signed-off-by: Doug Edgar <dedgar@redhat.com>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: ehhuang <ehhuang@users.noreply.github.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>
Co-authored-by: Doug Edgar <dedgar@redhat.com>
Co-authored-by: Christian Zaccaria <73656840+ChristianZaccaria@users.noreply.github.com>
Co-authored-by: Anastas Stoyanovsky <contact@anastas.eu>
Co-authored-by: Charlie Doern <cdoern@redhat.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Young Han <110819238+seyeong-han@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
- implement get_api_key instead of relying on
LiteLLMOpenAIMixin.get_api_key (a sketch of this pattern follows the list)
- remove use of LiteLLMOpenAIMixin
- add default initialize/shutdown methods to OpenAIMixin
- remove __init__s to allow proper pydantic construction
- remove dead code from vllm adapter and associated / duplicate unit
tests
- update vllm adapter to use openaimixin for model registration
- remove ModelRegistryHelper from fireworks & together adapters
- remove Inference from nvidia adapter
- complete type hints on embedding_model_metadata
- allow extra fields on OpenAIMixin, for model_store, __provider_id__,
etc
- new recordings for ollama
- enhance the list models error handling
- update cerebras (remove cerebras-cloud-sdk) and anthropic (custom
model listing) inference adapters
- parametrized test_inference_client_caching
- remove cerebras, databricks, fireworks, together from blanket mypy
exclude
- removed unnecessary litellm deps
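A minimal sketch of the first item above, using stand-in classes rather than the real `OpenAIMixin` and adapter config (all names and fields here are illustrative):
```python
from pydantic import BaseModel, SecretStr


class ExampleAdapterConfig(BaseModel):
    api_key: SecretStr = SecretStr("sk-example")  # placeholder key


class ExampleAdapter(BaseModel):
    """Stand-in for an adapter that would mix in OpenAIMixin."""

    config: ExampleAdapterConfig = ExampleAdapterConfig()

    def get_api_key(self) -> str:
        # the adapter supplies its own key lookup instead of relying on LiteLLMOpenAIMixin
        return self.config.api_key.get_secret_value()


print(ExampleAdapter().get_api_key())
```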
## Test Plan
ci
# What does this PR do?
When using the providers.d method of installation, users could hand-craft their AdapterSpecs to share overlapping code, meaning one repo could contain both an inline and a remote impl. Installing a provider via module currently does not allow that, since each repo may only have one `get_provider_spec` method returning a single spec.
This adds an optional way for `get_provider_spec` to return a list of `ProviderSpec`, where each entry can be either an inline or a remote impl.
Note: the `adapter_type` in `get_provider_spec` MUST match the
`provider_type` in the build/run yaml for this to work.
Resolves #3226
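A rough sketch of the new shape, using stand-in dataclasses because the real `InlineProviderSpec`/`RemoteProviderSpec` constructors take more arguments than shown here:
```python
from dataclasses import dataclass


# stand-ins for llama_stack's InlineProviderSpec / RemoteProviderSpec
@dataclass
class FakeInlineSpec:
    provider_type: str


@dataclass
class FakeRemoteSpec:
    adapter_type: str


def get_provider_spec() -> list:
    # a single external repo can now expose both an inline and a remote impl;
    # adapter_type must match the provider_type used in the build/run yaml
    return [
        FakeInlineSpec(provider_type="inline::my-provider"),
        FakeRemoteSpec(adapter_type="my-provider"),
    ]
```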
## Test Plan
once this merges we need to re-enable the external provider test and
account for this functionality. Work needs to be done in the external
provider repos to support this functionality.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
https://github.com/llamastack/llama-stack/pull/3462 allows using uvicorn
to start the llama stack server, which supports spawning multiple workers.
This PR enables launching more than one worker from `llama stack run` (the
parameter will be added in a follow-up PR; this one focuses on simplifying)
by removing the old way of launching the stack server and consolidating on
launching via uvicorn.run only.
## Test Plan
ran `llama stack run starter`
CI
Bumps [pandas](https://github.com/pandas-dev/pandas) from 2.3.1 to
2.3.3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/pandas-dev/pandas/releases">pandas's
releases</a>.</em></p>
<blockquote>
<h2>Pandas 2.3.3</h2>
<p>We are pleased to announce the release of pandas 2.3.3.
This release includes some improvements and fixes to the future string
data type (preview feature for the upcoming pandas 3.0). We recommend
that all users upgrade to this version.</p>
<p>See the <a
href="https://pandas.pydata.org/pandas-docs/version/2.3/whatsnew/v2.3.3.html">full
whatsnew</a> for a list of all the changes.
Pandas 2.3.3 supports Python 3.9 and higher, and is the first release to
support Python 3.14.</p>
<p>The release will be available on the conda-forge channel:</p>
<pre><code>conda install pandas --channel conda-forge
</code></pre>
<p>Or via PyPI:</p>
<pre><code>python3 -m pip install --upgrade pandas
</code></pre>
<p>Please report any issues with the release on the <a
href="https://github.com/pandas-dev/pandas/issues">pandas issue
tracker</a>.</p>
<p>Thanks to all the contributors who made this release possible.</p>
<h2>Pandas 2.3.2</h2>
<p>We are pleased to announce the release of pandas 2.3.2.
This release includes some improvements and fixes to the future string
data type (preview feature for the upcoming pandas 3.0). We recommend
that all users upgrade to this version.</p>
<p>See the <a
href="https://pandas.pydata.org/pandas-docs/version/2.3/whatsnew/v2.3.2.html">full
whatsnew</a> for a list of all the changes.
Pandas 2.3.2 supports Python 3.9 and higher.</p>
<p>The release will be available on the conda-forge channel:</p>
<pre><code>conda install pandas --channel conda-forge
</code></pre>
<p>Or via PyPI:</p>
<pre><code>python3 -m pip install --upgrade pandas
</code></pre>
<p>Please report any issues with the release on the <a
href="https://github.com/pandas-dev/pandas/issues">pandas issue
tracker</a>.</p>
<p>Thanks to all the contributors who made this release possible.</p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="9c8bc3e551"><code>9c8bc3e</code></a>
RLS: 2.3.3</li>
<li><a
href="6aa788a00b"><code>6aa788a</code></a>
[backport 2.3.x] DOC: prepare 2.3.3 whatsnew notes for release (<a
href="https://redirect.github.com/pandas-dev/pandas/issues/62499">#62499</a>)
(<a
href="https://redirect.github.com/pandas-dev/pandas/issues/62508">#62508</a>)</li>
<li><a
href="b64f0df403"><code>b64f0df</code></a>
[backport 2.3.x] BUG: avoid validation error for ufunc with
string[python] ar...</li>
<li><a
href="058eb2b0ed"><code>058eb2b</code></a>
[backport 2.3.x] BUG: String[pyarrow] comparison with mixed object (<a
href="https://redirect.github.com/pandas-dev/pandas/issues/62424">#62424</a>)
(...</li>
<li><a
href="2ca088daef"><code>2ca088d</code></a>
[backport 2.3.x] DEPR: remove the Period resampling deprecation (<a
href="https://redirect.github.com/pandas-dev/pandas/issues/62480">#62480</a>)
(<a
href="https://redirect.github.com/pandas-dev/pandas/issues/62">#62</a>...</li>
<li><a
href="92bf98f623"><code>92bf98f</code></a>
[backport 2.3.x] BUG: fix .str.isdigit to honor unicode superscript for
older...</li>
<li><a
href="e57c7d6a22"><code>e57c7d6</code></a>
Backport PR <a
href="https://redirect.github.com/pandas-dev/pandas/issues/62452">#62452</a>
on branch 2.3.x (TST: Adjust tests for numexpr 2.13) (<a
href="https://redirect.github.com/pandas-dev/pandas/issues/62454">#62454</a>)</li>
<li><a
href="e0fe9a03c9"><code>e0fe9a0</code></a>
Backport to 2.3.x: REGR: from_records not initializing subclasses
properly (#...</li>
<li><a
href="23a1085e64"><code>23a1085</code></a>
BUG: improve future warning for boolean operations with missaligned
indexes (...</li>
<li><a
href="61136969fb"><code>6113696</code></a>
Backport PR <a
href="https://redirect.github.com/pandas-dev/pandas/issues/62396">#62396</a>
on branch 2.3.x (PKG/DOC: indicate Python 3.14 support in ...</li>
<li>Additional commits viewable in <a
href="https://github.com/pandas-dev/pandas/compare/v2.3.1...v2.3.3">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from
6.7.0 to 6.8.0.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="d0cc045d04"><code>d0cc045</code></a>
Always show prune cache output (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/597">#597</a>)</li>
<li><a
href="2841f9f5c1"><code>2841f9f</code></a>
Bump zizmorcore/zizmor-action from 0.1.2 to 0.2.0 (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/571">#571</a>)</li>
<li><a
href="e554b93b80"><code>e554b93</code></a>
Add **/*.py.lock to cache-dependency-glob (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/590">#590</a>)</li>
<li><a
href="c7d85d9988"><code>c7d85d9</code></a>
chore: update known versions for 0.8.20</li>
<li><a
href="07f2cb5db9"><code>07f2cb5</code></a>
persist credentials for version update (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/584">#584</a>)</li>
<li><a
href="208b0c0ee4"><code>208b0c0</code></a>
README.md: Fix Python versions and update checkout action (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/572">#572</a>)</li>
<li>See full diff in <a
href="b75a909f75...d0cc045d04">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [requests](https://github.com/psf/requests) from 2.32.4 to 2.32.5.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/psf/requests/releases">requests's
releases</a>.</em></p>
<blockquote>
<h2>v2.32.5</h2>
<h2>2.32.5 (2025-08-18)</h2>
<p><strong>Bugfixes</strong></p>
<ul>
<li>The SSLContext caching feature originally introduced in 2.32.0 has
created
a new class of issues in Requests that have had negative impact across a
number
of use cases. The Requests team has decided to revert this feature as
long term
maintenance of it is proving to be unsustainable in its current
iteration.</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Added support for Python 3.14.</li>
<li>Dropped support for Python 3.8 following its end of support.</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/psf/requests/blob/main/HISTORY.md">requests's
changelog</a>.</em></p>
<blockquote>
<h2>2.32.5 (2025-08-18)</h2>
<p><strong>Bugfixes</strong></p>
<ul>
<li>The SSLContext caching feature originally introduced in 2.32.0 has
created
a new class of issues in Requests that have had negative impact across a
number
of use cases. The Requests team has decided to revert this feature as
long term
maintenance of it is proving to be unsustainable in its current
iteration.</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Added support for Python 3.14.</li>
<li>Dropped support for Python 3.8 following its end of support.</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="b25c87d7cb"><code>b25c87d</code></a>
v2.32.5</li>
<li><a
href="131e506079"><code>131e506</code></a>
Merge pull request <a
href="https://redirect.github.com/psf/requests/issues/7010">#7010</a>
from psf/dependabot/github_actions/actions/checkout-...</li>
<li><a
href="b336cb2bc6"><code>b336cb2</code></a>
Bump actions/checkout from 4.2.0 to 5.0.0</li>
<li><a
href="46e939b552"><code>46e939b</code></a>
Update publish workflow to use <code>artifact-id</code> instead of
<code>name</code></li>
<li><a
href="4b9c546aa3"><code>4b9c546</code></a>
Merge pull request <a
href="https://redirect.github.com/psf/requests/issues/6999">#6999</a>
from psf/dependabot/github_actions/step-security/har...</li>
<li><a
href="7618dbef01"><code>7618dbe</code></a>
Bump step-security/harden-runner from 2.12.0 to 2.13.0</li>
<li><a
href="2edca11103"><code>2edca11</code></a>
Add support for Python 3.14 and drop support for Python 3.8 (<a
href="https://redirect.github.com/psf/requests/issues/6993">#6993</a>)</li>
<li><a
href="fec96cd597"><code>fec96cd</code></a>
Update Makefile rules (<a
href="https://redirect.github.com/psf/requests/issues/6996">#6996</a>)</li>
<li><a
href="d58d8aa2f4"><code>d58d8aa</code></a>
docs: clarify timeout parameter uses seconds in Session.request (<a
href="https://redirect.github.com/psf/requests/issues/6994">#6994</a>)</li>
<li><a
href="91a3eabd3d"><code>91a3eab</code></a>
Bump github/codeql-action from 3.28.5 to 3.29.0</li>
<li>Additional commits viewable in <a
href="https://github.com/psf/requests/compare/v2.32.4...v2.32.5">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Uses test_id in request hashes and test-scoped subdirectories to prevent
cross-test contamination. Model list endpoints exclude test_id to enable
merging recordings from different servers.
Additionally, this PR adds a `record-if-missing` mode (which we will use
instead of `record`, which records everything); this is very useful.
🤖 Co-authored with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude <noreply@anthropic.com>
# What does this PR do?
Added missing configuration files
## Test Plan
run ./scripts/telemetry/setup_telemetry.sh
```
OTEL_SERVICE_NAME=llama_stack OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 TELEMETRY_SINKS=otel_trace,otel_metric uv run --with llama-stack llama stack build --distro=starter --image-type=venv --run
```
Navigate to grafana localhost:3000, query metrics and traces
IDs are now deterministic hashes based on request content, and
timestamps are normalized to constants, eliminating spurious changes
when re-recording tests.
## Changes
- Updated `inference_recorder.py` to normalize IDs and timestamps during
recording
- Added `scripts/normalize_recordings.py` utility to re-normalize
existing recordings
- Created documentation in `tests/integration/recordings/README.md`
- Normalized 350 existing recording files
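A minimal sketch of the normalization idea described above (not the recorder's actual code): hash the request body so that re-recording the same request always produces the same ID.
```python
import hashlib
import json


def deterministic_id(request_body: dict, prefix: str = "rec") -> str:
    digest = hashlib.sha256(json.dumps(request_body, sort_keys=True).encode()).hexdigest()[:12]
    return f"{prefix}-{digest}"


# the same request always hashes to the same ID, so re-recording stays diff-free
print(deterministic_id({"model": "llama3", "messages": [{"role": "user", "content": "hi"}]}))
```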
# What does this PR do?
`uv add "weaviate-client>=4.16.4" --group unit`
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
## Summary
Introduce `ExtraBodyField` annotation to enable parameters that arrive
via extra_body in client SDKs but are accessible server-side with full
typing.
These parameters are documented in OpenAPI specs under
**`x-llama-stack-extra-body-params`** but excluded from generated SDK
signatures.
Add `shields` parameter to `create_openai_response` as the first
implementation using this pattern.
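From the client side this looks roughly like the following (endpoint, model, and shield ID are placeholders); the OpenAI SDK forwards unknown parameters through `extra_body`:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# "shields" is not part of the generated SDK signature, so it travels via extra_body
response = client.responses.create(
    model="openai/gpt-4o",
    input="Say hello.",
    extra_body={"shields": ["llama-guard"]},
)
print(response.output_text)
```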
## Test Plan
- added an integration test which checks that shields parameter passed
via extra_body reaches server implementation
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
## Summary
This PR adds a comment-triggered GitHub Actions workflow that allows
running pre-commit hooks on-demand for any pull request. When someone
comments `@github-actions run precommit` on a PR, the bot automatically
runs all pre-commit hooks and commits any formatting or linting fixes
directly to the PR branch.
The implementation uses a secure two-workflow approach: a trigger
workflow validates permissions and dispatches to an execution workflow
that runs pre-commit in a privileged context. This works safely for both
same-repo and fork PRs, with permission checks ensuring only PR authors
or repository collaborators can trigger the bot.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude <noreply@anthropic.com>
# What does this PR do?
on the path to maintainable impls of inference providers. make all
configs instances of RemoteInferenceProviderConfig.
## Test Plan
ci
# What does this PR do?
Initial implementation for `Conversations` and `ConversationItems` using
`AuthorizedSqlStore` with endpoints to:
- CREATE
- UPDATE
- GET/RETRIEVE/LIST
- DELETE
Set `level=LLAMA_STACK_API_V1`.
NOTE: This does not currently incorporate changes for Responses, that'll
be done in a subsequent PR.
Closes https://github.com/llamastack/llama-stack/issues/3235
## Test Plan
- Unit tests
- Integration tests
Also comparison of [OpenAPI spec for OpenAI
API](https://github.com/openai/openai-openapi/tree/manual_spec)
```bash
oasdiff breaking --fail-on ERR docs/static/llama-stack-spec.yaml https://raw.githubusercontent.com/openai/openai-openapi/refs/heads/manual_spec/openapi.yaml --strip-prefix-base "/v1/openai/v1" \
--match-path '(^/v1/openai/v1/conversations.*|^/conversations.*)'
```
Note: I still have some uncertainty about this. I borrowed this info from
@cdoern on https://github.com/llamastack/llama-stack/pull/3514 but need
to spend more time to confirm it's working; at the moment it suggests it is.
UPDATE on `oasdiff`: I investigated the OpenAI spec further and it looks
like the spec currently does not list Conversations, so that analysis is
useless. Noting for future reference.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
now that we consolidated the providerspec types and got rid of
`AdapterSpec`, adjust external.md
BREAKING CHANGE: external providers must update their
`get_provider_spec` function to use `RemoteProviderSpec` properly
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
remove unused chat_completion implementations
vllm features ported:
- requires max_tokens be set, use config value
- set tool_choice to none if no tools provided
## Test Plan
ci
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
- This PR implements keyword and hybrid search for Weaviate DB based on
its inbuilt functions.
- Added fixtures to conftest.py for Weaviate.
- Enabled integration tests for remote Weaviate on all 3 search modes.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes #3010
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Unit tests and integration tests should pass on this PR.
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
Addresses Issue #3271 - "Starting LLS server locally on a terminal with
120 chars width results in an output with empty lines".
This removes the specific 150-character width limit specified for the
Console; the terminal width is now auto-detected instead. The formatting
of Console output is now consistent across different sizes of terminal
windows.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes #3271
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Launching the server with several different sizes of terminal windows
results in Console output without unexpected spacing. e.g. `python -m
llama_stack.core.server.server /tmp/run.yaml --port 8321`
---------
Signed-off-by: Doug Edgar <dedgar@redhat.com>
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>
# What does this PR do?
add ModelsProtocolPrivate methods to OpenAIMixin
this will allow providers using OpenAIMixin to use a common interface
## Test Plan
ci w/ new tests
# What does this PR do?
Closes #3268, closes #3498
When resuming from a previous response ID, we currently attempt to convert
the stored responses input back into chat completion messages, which is
not always possible, e.g. for tool calls, where some data is lost once
converted from chat completion message to responses input format.
This PR stores the chat completion messages that correspond to the
_last_ call to chat completion, which is sufficient to be resumed from
in the next responses API call, where we load these saved messages and
skip conversion entirely.
Separate issue to optimize storage:
https://github.com/llamastack/llama-stack/issues/3646
## Test Plan
existing CI tests
This is a sweeping change to clean up some gunk around our "Tool"
definitions.
First, we had two types, `Tool` and `ToolDef`. The first of these was a
"Resource" type for the registry, but we stopped registering tools
inside the Registry long ago (and only registered ToolGroups). The
latter was for specifying tools for the Agents API. This PR removes the
former and adds an optional `toolgroup_id` field to the latter.
Secondly, as pointed out by @bbrowning in
https://github.com/llamastack/llama-stack/pull/3003#issuecomment-3245270132,
we were doing a lossy conversion from a full JSON schema from the MCP
tool specification into our ToolDefinition to send it to the model.
There is no necessity to do this -- we ourselves aren't doing any
execution at all but merely passing it to the chat completions API which
supports this. By doing this (and by doing it poorly), we encountered
limitations like not supporting array items, or not resolving $refs,
etc.
To fix this, we replaced the `parameters` field by `{ input_schema,
output_schema }` which can be full blown JSON schemas.
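Illustrative shape only (field names other than `input_schema`/`output_schema` and all values are made up), showing a tool definition that carries full JSON Schemas instead of the old `parameters` list:
```python
tool_def = {
    "name": "list_files",
    "description": "List files under the given paths",
    "input_schema": {
        "type": "object",
        "properties": {
            # array items and other full-JSON-Schema constructs can now pass through untouched
            "paths": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["paths"],
    },
    "output_schema": {
        "type": "object",
        "properties": {"files": {"type": "array", "items": {"type": "string"}}},
    },
}
```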
Finally, there were some types in our llama-related chat format
conversion which needed some cleanup. We are taking this opportunity to
clean those up.
This PR is a substantial breaking change to the API. However, given our
window for introducing breaking changes, this suits us just fine. I will
be landing a concurrent `llama-stack-client` change as well since API
shapes are changing.
# What does this PR do?
* Adds stainless-llama-stack-spec.yaml for Stainless client generation,
which comprises stable + experimental APIs
## Test Plan
* Manual generation
## Description
Currently, the docs page has the home book opened by default. This PR
updates the .ts so that the sidebar books are collapsed when you first
open the webpage
# What does this PR do?
this was broken by #3631, re-enable this ability by only using oasdiff
when .skip != 'true'
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
* Updates code snippets for Dell distribution, fixing specific user home
directory in code (replacing with $HOME) and updates docker instructions
to use `docker` instead of `podman`.
## Test Plan
N.A.
Co-authored-by: Connor Hack <connorhack@fb.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Spammy
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
n/a
# What does this PR do?
The LiteLLMOpenAIMixin provides support for reading the API key from provider
data (headers users send).
This adds the same functionality to the OpenAIMixin.
This is infrastructure for migrating providers.
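For illustration (using httpx and a made-up key value), provider data arrives as a JSON-encoded header, matching the `x-llamastack-provider-data` format shown in the warning logs earlier in this document:
```python
import json

import httpx

headers = {"x-llamastack-provider-data": json.dumps({"groq_api_key": "gsk-example"})}
resp = httpx.get("http://localhost:8321/v1/models", headers=headers)
print(resp.status_code)
```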
## Test Plan
ci w/ new tests
# What does this PR do?
Adds supplementary static content to root API spec pages. This is useful for giving context behind a specific API group, adding information on supported features or work in progress, etc.
This PR introduces supplementary information for Agents (experimental, deprecated) and Responses (stable) APIs.
<!-- Provide a short summary of what this PR does and why. Link to relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
Documentation server renders rich static content for the Agents API group:

<!-- Describe the tests you ran to verify your changes with result summaries. *Provide clear instructions so the plan can be easily re-executed.* -->
# What does this PR do?
First step towards cleaning up the API reference section of the docs.
- Separates API reference into 3 sections: stable (`v1`), experimental (`v1alpha` and `v1beta`), and deprecated (`deprecated=True`)
- Each section is accessible via the dropdown menu and `docs/api-overview`
<img width="1237" height="321" alt="Screenshot 2025-09-30 at 5 47 30 PM" src="https://github.com/user-attachments/assets/fe0e498c-b066-46ed-a48e-4739d3b6724c" />
<img width="860" height="510" alt="Screenshot 2025-09-30 at 5 47 49 PM" src="https://github.com/user-attachments/assets/a92a8d8c-94bf-42d5-9f5b-b47bb2b14f9c" />
- Deprecated APIs: Added styling to the sidebar, and a notice on the endpoint pages
<img width="867" height="428" alt="Screenshot 2025-09-30 at 5 47 43 PM" src="https://github.com/user-attachments/assets/9e6e050d-c782-461b-8084-5ff6496d7bd9" />
Closes#3628
TODO in follow-up PRs:
- Add the ability to annotate API groups with supplementary content (so we can have longer descriptions of complex APIs like Responses)
- Clean up docstrings to show API endpoints (or short semantic titles) in the sidebar
## Test Plan
- Local testing
- Made sure API conformance test still passes
# What does this PR do?
Given the rapidly changing nature of Llama Stack's APIs and the need to have clean, user-friendly API documentation, we want to split the API reference into 3 main buckets; stable, experimental and deprecated. The most straightforward way to do it is to have several automatically generated doctrees, which introduces some complexity in testing APIs for backwards compatibility.
This PR updates the API conformance test to handle cases where the API schema is split into several files; it does not change the testing criteria.
<!-- Provide a short summary of what this PR does and why. Link to relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
No developer-facing changes (all existing tests should pass)
<!-- Describe the tests you ran to verify your changes with result summaries. *Provide clear instructions so the plan can be easily re-executed.* -->
# What does this PR do?
- categories like "core::server" is not recognized so it's level is not
set by 'all=debug'
- removed spammy telemetry debug logging
## Test Plan
test server launched with LLAMA_STACK_LOGGING='all=debug'
# What does this PR do?
if the PR title has `!` or the footer of the commit has `BREAKING
CHANGE:`, skip conformance. This is documented in the API leveling
proposal
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
level the following APIs, keeping their old routes around as well until
0.4.0
1. datasetio to v1beta: used primarily by eval and training. Given that
training is v1alpha and eval is v1alpha, datasetio is likely to change
in structure as real usages of the API spin up. Register, unregister, and
iter dataset are sparsely implemented, meaning the shape of those routes is
likely to change.
2. telemetry to v1alpha: telemetry has been going through many changes.
For example, query_metrics was not even implemented until recently and
had to change its shape to work. Leveling it this way will allow us to
fix functionality like OTEL, sqlite, etc. The routes themselves are set,
but the structure might change a bit.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
When a model decides to use an MCP tool call that requires no arguments,
it sets the `arguments` field to `None`. This causes the user to see a
`400 bad request` error due to validation errors down the stack, because
this field gets removed when being parsed by an OpenAI-compatible
inference provider like vLLM.
This PR ensures that, as soon as the tool call args are accumulated
while streaming, we check that no tool call function arguments are
set to None; if they are, we replace them with "{}".
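A minimal sketch of that check (a hypothetical helper, not the exact code path in this PR):
```python
def normalize_tool_call_arguments(tool_calls: list[dict]) -> list[dict]:
    """Replace None function arguments with "{}" so downstream validation (e.g. vLLM) passes."""
    for call in tool_calls:
        function = call.get("function") or {}
        if function.get("arguments") is None:
            function["arguments"] = "{}"
        call["function"] = function
    return tool_calls


print(normalize_tool_call_arguments([{"function": {"name": "namespaces_list", "arguments": None}}]))
```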
<!-- If resolving an issue, uncomment and update the line below -->
Closes #3456
## Test Plan
Added new unit test to verify that any tool calls with function
arguments set to `None` get handled correctly
---------
Signed-off-by: Jaideep Rao <jrao@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Fireworks doesn't allow response_format with tool use. The default
response format is 'text' anyway, so we can safely omit it.
## Test Plan
Below script failed without the change, runs after.
```
#!/usr/bin/env python3
"""
Script to test Responses API with kubernetes-mcp-server.
This script:
1. Connects to the llama stack server
2. Uses the Responses API with MCP tools
3. Asks for the list of Kubernetes namespaces using the kubernetes-mcp-server
"""
import json
from openai import OpenAI
# Connect to the llama stack server
base_url = "http://localhost:8321/v1"
client = OpenAI(base_url=base_url, api_key="fake")
# Define the MCP tool pointing to the kubernetes-mcp-server
# The kubernetes-mcp-server is running on port 3000 with SSE endpoint at /sse
mcp_server_url = "http://localhost:3000/sse"
tools = [
{
"type": "mcp",
"server_label": "k8s",
"server_url": mcp_server_url,
}
]
# Create a response request asking for k8s namespaces
print("Sending request to list Kubernetes namespaces...")
print(f"Using MCP server at: {mcp_server_url}")
print("Available tools will be listed automatically by the MCP server.")
print()
response = client.responses.create(
# model="meta-llama/Llama-3.2-3B-Instruct", # Using the vllm model
model="fireworks/accounts/fireworks/models/llama4-scout-instruct-basic",
# model="openai/gpt-4o",
input="what are all the Kubernetes namespaces? Use tool call to `namespaces_list`. make sure to adhere to the tool calling format UNDER ALL CIRCUMSTANCES.",
tools=tools,
stream=False,
)
print("\n" + "=" * 80)
print("RESPONSE OUTPUT:")
print("=" * 80)
# Print the output
for i, output in enumerate(response.output):
print(f"\n[Output {i + 1}] Type: {output.type}")
if output.type == "mcp_list_tools":
print(f" Server: {output.server_label}")
print(f" Tools available: {[t.name for t in output.tools]}")
elif output.type == "mcp_call":
print(f" Tool called: {output.name}")
print(f" Arguments: {output.arguments}")
print(f" Result: {output.output}")
if output.error:
print(f" Error: {output.error}")
elif output.type == "message":
print(f" Role: {output.role}")
print(f" Content: {output.content}")
print("\n" + "=" * 80)
print("FINAL RESPONSE TEXT:")
print("=" * 80)
print(response.output_text)
```
# What does this PR do?
This PR adds support for the require_approval on an mcp tool definition
passed to create response in the Responses API. This allows the caller
to indicate whether they want to approve calls to that server, or let
them be called without approval.
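For example (server URL is a placeholder, and the values shown mirror the Responses API convention of `"always"`/`"never"`), the caller sets `require_approval` on the MCP tool definition:
```python
tools = [
    {
        "type": "mcp",
        "server_label": "k8s",
        "server_url": "http://localhost:3000/sse",
        "require_approval": "never",  # or "always" to require approval before each call
    }
]
```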
Closes #3443
## Test Plan
Tested both approval and denial.
Added automated integration test for both cases.
---------
Signed-off-by: Gordon Sim <gsim@redhat.com>
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>
# What does this PR do?
* Updates the safety guide in Zero to Hero series to use Moderations API
and the latest safety models
* Fixes an image link
Closes #2557
## Test Plan
* Manual testing
# What does this PR do?
* Adds canonical project information and links to client SDK / k8s
operator / app examples repos to the front page
* Fixes some button rendering errors
Closes #3618
## Test Plan
Local rebuild of the documentation server
https://github.com/llamastack/llama-stack/pull/3604 broke multipart form
data field parsing for the Files API since it changed its shape -- so as
to match the API exactly to the OpenAI spec even in the generated client
code.
The underlying reason is that multipart/form-data cannot transport
structured nested fields. Each field must be str-serialized. The client
(specifically the OpenAI client whose behavior we must match),
transports sub-fields as `expires_after[anchor]` and
`expires_after[seconds]`, etc. We must be able to handle these fields
somehow on the server without compromising the shape of the YAML spec.
This PR "fixes" this by adding a dependency to convert the data. The
main trade-off here is that we must add this `Depends()` annotation on
every provider implementation for Files. This is a headache, but a much
more reasonable one (in my opinion) given the alternatives.
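A rough sketch of the idea using FastAPI's `Form`/`Depends` machinery (the helper and route below are hypothetical, not the dependency added in this PR):
```python
from fastapi import Depends, FastAPI, Form

app = FastAPI()


def parse_expires_after(
    anchor: str | None = Form(None, alias="expires_after[anchor]"),
    seconds: int | None = Form(None, alias="expires_after[seconds]"),
) -> dict | None:
    # multipart/form-data flattens nested fields, so reassemble them here
    if anchor is None and seconds is None:
        return None
    return {"anchor": anchor, "seconds": seconds}


@app.post("/v1/files")
async def upload_file(expires_after: dict | None = Depends(parse_expires_after)):
    return {"expires_after": expires_after}
```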
## Test Plan
Tests as shown in
https://github.com/llamastack/llama-stack/pull/3604#issuecomment-3351090653
pass.
# What does this PR do?
Agents is likely to be deprecated in favor of responses. Let's level it
as alpha to indicate the lack of long-term support.
Keep the v1 route for backwards compat.
Closes #3611
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
migrate safety api implementation from /inference/chat-completion to
/v1/chat/completions
## Test Plan
ci w/ recordings
---------
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Add llamastack + CrewAI integration example notebook
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Tested in a local Jupyter notebook and it works.
# What does this PR do?
Fixes error:
```
[ERROR] Error executing endpoint route='/v1/openai/v1/responses'
method='post': Error code: 400 - {'error': {'message': "Invalid schema for function 'pods_exec': In context=('properties', 'command'), array
schema missing items.", 'type': 'invalid_request_error', 'param': 'tools[7].function.parameters', 'code': 'invalid_function_parameters'}}
```
From script:
```
#!/usr/bin/env python3
"""
Script to test Responses API with kubernetes-mcp-server.
This script:
1. Connects to the llama stack server
2. Uses the Responses API with MCP tools
3. Asks for the list of Kubernetes namespaces using the kubernetes-mcp-server
"""
import json
from openai import OpenAI
# Connect to the llama stack server
base_url = "http://localhost:8321/v1/openai/v1"
client = OpenAI(base_url=base_url, api_key="fake")
# Define the MCP tool pointing to the kubernetes-mcp-server
# The kubernetes-mcp-server is running on port 3000 with SSE endpoint at /sse
mcp_server_url = "http://localhost:3000/sse"
tools = [
{
"type": "mcp",
"server_label": "k8s",
"server_url": mcp_server_url,
}
]
# Create a response request asking for k8s namespaces
print("Sending request to list Kubernetes namespaces...")
print(f"Using MCP server at: {mcp_server_url}")
print("Available tools will be listed automatically by the MCP server.")
print()
response = client.responses.create(
# model="meta-llama/Llama-3.2-3B-Instruct", # Using the vllm model
model="openai/gpt-4o",
input="what are all the Kubernetes namespaces? Use tool call to `namespaces_list`. make sure to adhere to the tool calling format.",
tools=tools,
stream=False,
)
print("\n" + "=" * 80)
print("RESPONSE OUTPUT:")
print("=" * 80)
# Print the output
for i, output in enumerate(response.output):
print(f"\n[Output {i + 1}] Type: {output.type}")
if output.type == "mcp_list_tools":
print(f" Server: {output.server_label}")
print(f" Tools available: {[t.name for t in output.tools]}")
elif output.type == "mcp_call":
print(f" Tool called: {output.name}")
print(f" Arguments: {output.arguments}")
print(f" Result: {output.output}")
if output.error:
print(f" Error: {output.error}")
elif output.type == "message":
print(f" Role: {output.role}")
print(f" Content: {output.content}")
print("\n" + "=" * 80)
print("FINAL RESPONSE TEXT:")
print("=" * 80)
print(response.output_text)
```
## Test Plan
new unit tests
script now runs successfully
The `/v1/openai/v1` prefix is annoying and now unnecessary given our
clearer focus on how to think about the API surface.
Let's kill it for the 0.3.0 update.
To make client-side changes feasible, we will do this in two parts. This
part adds a new route (sans `/openai/v1`) so the existing client
continues to work since the server supports both.
The next PR will be client-side (Stainless) changes which I will be
making shortly.
The final PR will remove the `/openai/v1` routes.
Note that all these changes will happen rapidly within this release
cycle. The entire set _will be backwards incompatible_.
# What does this PR do?
Refs: https://github.com/llamastack/llama-stack/issues/3420
When telemetry is enabled, the router unconditionally expects the usage
attribute to be available and fails if it is not present.
Usage is not currently being requested by litellm_openai_mixin.py for
streaming requests when using the responses API, which means that
providers like vertexai fail if telemetry is enabled and streaming is
used.
This is part of the required fix. The other part is in liteLLM; I plan to
submit a PR for that soon.
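For reference, with OpenAI-compatible clients usage on streamed responses is typically requested via `stream_options`; a minimal illustration (endpoint and model are placeholders):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

stream = client.chat.completions.create(
    model="gemini/gemini-1.5-flash",
    messages=[{"role": "user", "content": "hi"}],
    stream=True,
    stream_options={"include_usage": True},  # the final chunk then carries a usage block
)
for chunk in stream:
    if chunk.usage is not None:
        print(chunk.usage)
```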
## Test Plan
I applied this change along with the change for litellm in a llama stack
deployment and validated that I could make streaming requests through
the responses API to a gemini model and they would succeed instead of
failing due to the missing usage attribute when telemetry is enabled.
Signed-off-by: Michael Dawson <midawson@redhat.com>
# What does this PR do?
now that /v1/inference/completion has been removed, no docs should refer
to it
this cleans up remaining references
## Test Plan
ci
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
* Updates image paths for images in docs/resources/ to proper static
image locations
## Test Plan
* `npm run build` builds documentation properly
# What does this PR do?
move the eval=inline::meta-reference implementation to use
openai_completion/openai_chat_completion
note: this breaks backward compatibility if eval setup used sampling
params' repetition_penalty or strategy
## Test Plan
ci w/ new recordings
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
we skip embedding tests when the embedding_model_id isn't provided. same
for completion / chat tests when text_model_id isn't given.
instead of failing safety tests when a shield_id isn't provided, we'll
skip them too.
## Test Plan
ci
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
inference/rerank is the one route in the API intended to not be
deprecated. Level it as v1alpha.
Additionally, remove `experimental` and opt to instead use `v1alpha`
which itself implies an experimental state based on the original
proposal
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR fixes #3300 by adding support for the application/json mime type in
[agent_instance.py](4a59961a6c/llama_stack/providers/inline/agents/meta_reference/agent_instance.py (L923))
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[3300] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
all related pytest passed, see log:
```
./scripts/unit-tests.sh tests/unit/providers/agent/test_get_raw_document_text.py -vvv
/Users/kaiwu/work/kaiwu/llama-stack/.venv/bin/python3
Uninstalled 22 packages in 5.65s
Installed 47 packages in 1.24s
================= test session starts =================
platform darwin -- Python 3.12.9, pytest-8.4.2, pluggy-1.6.0 -- /Users/kaiwu/work/kaiwu/llama-stack/.venv/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.12.9', 'Platform': 'macOS-15.6.1-arm64-arm-64bit', 'Packages': {'pytest': '8.4.2', 'pluggy': '1.6.0'}, 'Plugins': {'anyio': '4.9.0', 'html': '4.1.1', 'socket': '0.7.0', 'asyncio': '1.1.0', 'json-report': '1.5.0', 'timeout': '2.4.0', 'metadata': '3.1.1', 'cov': '6.2.1', 'nbval': '0.11.0'}}
rootdir: /Users/kaiwu/work/kaiwu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0, html-4.1.1, socket-0.7.0, asyncio-1.1.0, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, cov-6.2.1, nbval-0.11.0
asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 14 items
tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_supports_text_mime_types PASSED
tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_supports_yaml_mime_type PASSED
tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_supports_deprecated_text_yaml_with_warning PASSED
tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_deprecated_text_yaml_with_url PASSED
tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_deprecated_text_yaml_with_text_content_item PASSED
tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_supports_json_mime_type PASSED
tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_with_json_url PASSED
tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_with_json_text_content_item PASSED
tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_rejects_unsupported_mime_types PASSED
tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_with_url_content PASSED
tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_with_yaml_url PASSED
tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_with_text_content_item PASSED
tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_with_yaml_text_content_item PASSED
tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_rejects_unexpected_content_type PASSED
================ slowest 10 durations =================
0.00s call tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_deprecated_text_yaml_with_url
0.00s call tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_rejects_unsupported_mime_types
0.00s call tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_rejects_unexpected_content_type
0.00s setup tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_supports_text_mime_types
0.00s teardown tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_supports_text_mime_types
0.00s call tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_with_yaml_url
0.00s call tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_with_url_content
0.00s teardown tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_rejects_unsupported_mime_types
0.00s call tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_with_json_url
0.00s call tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_supports_text_mime_types
================= 14 passed in 0.14s ==================
Generating coverage report...
Wrote HTML report to htmlcov-3.12/index.html
```
Reverts llamastack/llama-stack#3576
When I edit Stainless and codegen succeeds, the `next` branch is updated
directly. It provides us no chance to see if there might be something
unideal going on. If something is wrong, all CI will start breaking
immediately. This is not ideal. I will likely create another staging
branch `next-release` or something to accommodate the special workflow
that Stainless requires.
# What does this PR do?
Mirroring the same changes that was used for inference_store:
https://github.com/llamastack/llama-stack/pull/3383
Will follow up with a shared internal API for managing these write
queues.
## Test Plan
existing tests
# What does this PR do?
the nvidia datastore tests were running when the datastore was not
configured. they would always fail.
this introduces a skip when the nvidia datastore is not configured.
## Test Plan
ci
Bumps [actions/cache](https://github.com/actions/cache) from 4.2.4 to
4.3.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/actions/cache/releases">actions/cache's
releases</a>.</em></p>
<blockquote>
<h2>v4.3.0</h2>
<h2>What's Changed</h2>
<ul>
<li>Add note on runner versions by <a
href="https://github.com/GhadimiR"><code>@GhadimiR</code></a> in <a
href="https://redirect.github.com/actions/cache/pull/1642">actions/cache#1642</a></li>
<li>Prepare <code>v4.3.0</code> release by <a
href="https://github.com/Link"><code>@Link</code></a>- in <a
href="https://redirect.github.com/actions/cache/pull/1655">actions/cache#1655</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a href="https://github.com/GhadimiR"><code>@GhadimiR</code></a>
made their first contribution in <a
href="https://redirect.github.com/actions/cache/pull/1642">actions/cache#1642</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/actions/cache/compare/v4...v4.3.0">https://github.com/actions/cache/compare/v4...v4.3.0</a></p>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/actions/cache/blob/main/RELEASES.md">actions/cache's
changelog</a>.</em></p>
<blockquote>
<h1>Releases</h1>
<h3>4.3.0</h3>
<ul>
<li>Bump <code>@actions/cache</code> to <a
href="https://redirect.github.com/actions/toolkit/pull/2132">v4.1.0</a></li>
</ul>
<h3>4.2.4</h3>
<ul>
<li>Bump <code>@actions/cache</code> to v4.0.5</li>
</ul>
<h3>4.2.3</h3>
<ul>
<li>Bump <code>@actions/cache</code> to v4.0.3 (obfuscates SAS token in
debug logs for cache entries)</li>
</ul>
<h3>4.2.2</h3>
<ul>
<li>Bump <code>@actions/cache</code> to v4.0.2</li>
</ul>
<h3>4.2.1</h3>
<ul>
<li>Bump <code>@actions/cache</code> to v4.0.1</li>
</ul>
<h3>4.2.0</h3>
<p>TLDR; The cache backend service has been rewritten from the ground up
for improved performance and reliability. <a
href="https://github.com/actions/cache">actions/cache</a> now integrates
with the new cache service (v2) APIs.</p>
<p>The new service will gradually roll out as of <strong>February 1st,
2025</strong>. The legacy service will also be sunset on the same date.
Changes in these release are <strong>fully backward
compatible</strong>.</p>
<p><strong>We are deprecating some versions of this action</strong>. We
recommend upgrading to version <code>v4</code> or <code>v3</code> as
soon as possible before <strong>February 1st, 2025.</strong> (Upgrade
instructions below).</p>
<p>If you are using pinned SHAs, please use the SHAs of versions
<code>v4.2.0</code> or <code>v3.4.0</code></p>
<p>If you do not upgrade, all workflow runs using any of the deprecated
<a href="https://github.com/actions/cache">actions/cache</a> will
fail.</p>
<p>Upgrading to the recommended versions will not break your
workflows.</p>
<h3>4.1.2</h3>
<ul>
<li>Add GitHub Enterprise Cloud instances hostname filters to inform API
endpoint choices - <a
href="https://redirect.github.com/actions/cache/pull/1474">#1474</a></li>
<li>Security fix: Bump braces from 3.0.2 to 3.0.3 - <a
href="https://redirect.github.com/actions/cache/pull/1475">#1475</a></li>
</ul>
<h3>4.1.1</h3>
<ul>
<li>Restore original behavior of <code>cache-hit</code> output - <a
href="https://redirect.github.com/actions/cache/pull/1467">#1467</a></li>
</ul>
<h3>4.1.0</h3>
<ul>
<li>Ensure <code>cache-hit</code> output is set when a cache is missed -
<a
href="https://redirect.github.com/actions/cache/pull/1404">#1404</a></li>
<li>Deprecate <code>save-always</code> input - <a
href="https://redirect.github.com/actions/cache/pull/1452">#1452</a></li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="0057852bfa"><code>0057852</code></a>
Merge pull request <a
href="https://redirect.github.com/actions/cache/issues/1655">#1655</a>
from actions/Link-/prepare-4.3.0</li>
<li><a
href="4f5ea67f1c"><code>4f5ea67</code></a>
Update licensed cache</li>
<li><a
href="9fcad95d03"><code>9fcad95</code></a>
Upgrade actions/cache to 4.1.0 and prepare 4.3.0 release</li>
<li><a
href="638ed79f9d"><code>638ed79</code></a>
Merge pull request <a
href="https://redirect.github.com/actions/cache/issues/1642">#1642</a>
from actions/GhadimiR-patch-1</li>
<li><a
href="3862dccb17"><code>3862dcc</code></a>
Add note on runner versions</li>
<li>See full diff in <a
href="0400d5f644...0057852bfa">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [tw-animate-css](https://github.com/Wombosvideo/tw-animate-css)
from 1.2.9 to 1.4.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/Wombosvideo/tw-animate-css/releases">tw-animate-css's
releases</a>.</em></p>
<blockquote>
<h2>v1.4.0</h2>
<h2>Changelog</h2>
<p>902e37a019ffd165ba078e0b3c02634526c54bf0: fix: remove support for
prefix, add new export for prefixed version. Closes <a
href="https://redirect.github.com/Wombosvideo/tw-animate-css/issues/58">#58</a>.
fab2a5bf817605be1976e159976718a83489fc1c: chore: bump version to 1.4.0
and update dependencies
c20dc32e2b532a8e74546879b4ce7d9ce89ba710: fix(build): make transform.ts
accept two arguments</p>
<h2>⚠️ BREAKING CHANGE ⚠️</h2>
<p>Support for Tailwind CSS's prefix option was moved to
<code>tw-animate-css/prefix</code> because it was breaking the
<code>--spacing</code> function. Users requiring prefixes should replace
their import:</p>
<pre lang="diff"><code>- import "tw-animate-css";
+ import "tw-animate-css/prefix";
</code></pre>
<p><em>I do not plan to introduce breaking changes like this to
non-major releases in the future. But because more people use spacing
rather than prefixes, reverting the previous version's (obviously
breaking) change seems reasonable.</em></p>
<h2>v1.3.8</h2>
<h2>Changelog</h2>
<ul>
<li>b5ff23a: fix: add support for global CSS variable prefix. Closes <a
href="https://redirect.github.com/Wombosvideo/tw-animate-css/issues/48">#48</a></li>
<li>03e5f12: feat: add support for ng-primitives height variables <a
href="https://redirect.github.com/Wombosvideo/tw-animate-css/issues/56">#56</a>
(thanks <a
href="https://github.com/immohammadjaved"><code>@immohammadjaved</code></a>)</li>
<li>b076cfb: docs: fix various issues in accordion and collapsible
docs</li>
<li>9485e33: chore: bump version to 1.3.8 and update dependencies</li>
</ul>
<h2>⚠️ BREAKING CHANGE ⚠️</h2>
<p>Adding support for prefixes broke custom spacing. It is recommended
that you skip this version if you do not use Tailwind CSS's prefix
option, and use v1.4.0 instead. If you are actually using prefixes, you
can use a special version supporting prefixes:</p>
<pre lang="diff"><code>- import "tw-animate-css"; /* Version
with spacing support */
+ import "tw-animate-css/prefix"; /* Version with prefix
support */
</code></pre>
<p><em>I do not plan to fix the incompatibility between the spacing and
prefix versions due to time constraints. Feel free to investigate and
open a pull request if you manage to fix it.</em></p>
<h2>v1.3.7</h2>
<h2>Changelog</h2>
<ul>
<li>80dbfcc: feat: add utilities for blur transitions <a
href="https://redirect.github.com/Wombosvideo/tw-animate-css/issues/54">#54</a>
(thanks <a
href="https://github.com/coffeeispower"><code>@coffeeispower</code></a>)</li>
<li>dc294f9: docs: add upcoming changes warning</li>
<li>c640bb8: chore: update dependencies and package manager version</li>
<li>9e63e34: chore: bump version to 1.3.7</li>
</ul>
<h2>v1.3.6</h2>
<h2>Changelog</h2>
<ul>
<li>58f3396: fix: allow changing animation parameters for ready-to-use
animations</li>
<li>8313476: chore: update dependencies nd package manager version</li>
<li>f81346c: chore: bump version to 1.3.6</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="c20dc32e2b"><code>c20dc32</code></a>
fix(build): make transform.ts accept two arguments</li>
<li><a
href="fab2a5bf81"><code>fab2a5b</code></a>
chore: bump version to 1.4.0 and update dependencies</li>
<li><a
href="902e37a019"><code>902e37a</code></a>
fix: remove support for prefix, add new export for prefixed version</li>
<li><a
href="9485e33d99"><code>9485e33</code></a>
chore: bump version to 1.3.8 and update dependencies</li>
<li><a
href="b076cfb04a"><code>b076cfb</code></a>
docs: fix various issues in accordion and collapsible docs</li>
<li><a
href="03e5f12418"><code>03e5f12</code></a>
feat: add support for ng-primitives height variables (<a
href="https://redirect.github.com/Wombosvideo/tw-animate-css/issues/56">#56</a>)</li>
<li><a
href="b5ff23a0d5"><code>b5ff23a</code></a>
fix: add support for global CSS variable prefix. Closes <a
href="https://redirect.github.com/Wombosvideo/tw-animate-css/issues/48">#48</a></li>
<li><a
href="9e63e34286"><code>9e63e34</code></a>
chore: bump version to 1.3.7</li>
<li><a
href="c640bb8933"><code>c640bb8</code></a>
chore: update dependencies and package manager version</li>
<li><a
href="dc294f990a"><code>dc294f9</code></a>
docs: add upcoming changes warning</li>
<li>Additional commits viewable in <a
href="https://github.com/Wombosvideo/tw-animate-css/compare/v1.2.9...v1.4.0">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [shiki](https://github.com/shikijs/shiki/tree/HEAD/packages/shiki)
from 1.29.2 to 3.13.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/shikijs/shiki/releases">shiki's
releases</a>.</em></p>
<blockquote>
<h2>v3.13.0</h2>
<h3> 🚀 Features</h3>
<ul>
<li><strong>transformers</strong>: Render indent guides - by <a
href="https://github.com/KazariEX"><code>@KazariEX</code></a> and <a
href="https://github.com/antfu"><code>@antfu</code></a> in <a
href="https://redirect.github.com/shikijs/shiki/issues/1060">shikijs/shiki#1060</a>
<a href="aecd1617"><!-- raw HTML
omitted -->(aecd1)<!-- raw HTML omitted --></a></li>
</ul>
<h5> <a
href="https://github.com/shikijs/shiki/compare/v3.12.3...v3.13.0">View
changes on GitHub</a></h5>
<h2>v3.12.3</h2>
<h3> 🐞 Bug Fixes</h3>
<ul>
<li><code>@shikijs/twoslash</code> version specifier - by <a
href="https://github.com/9romise"><code>@9romise</code></a> in <a
href="https://redirect.github.com/shikijs/shiki/issues/1078">shikijs/shiki#1078</a>
<a href="a1cdea41"><!-- raw HTML
omitted -->(a1cde)<!-- raw HTML omitted --></a></li>
</ul>
<h5> <a
href="https://github.com/shikijs/shiki/compare/v3.12.2...v3.12.3">View
changes on GitHub</a></h5>
<h2>v3.12.2</h2>
<h3> 🐞 Bug Fixes</h3>
<ul>
<li><strong>twoslash</strong>: Fix <code>onTwoslashError</code> return
value handling - by <a
href="https://github.com/Karibash"><code>@Karibash</code></a> in <a
href="https://redirect.github.com/shikijs/shiki/issues/1070">shikijs/shiki#1070</a>
<a href="e86b0a7c"><!-- raw HTML
omitted -->(e86b0)<!-- raw HTML omitted --></a></li>
</ul>
<h5> <a
href="https://github.com/shikijs/shiki/compare/v3.12.1...v3.12.2">View
changes on GitHub</a></h5>
<h2>v3.12.1</h2>
<p><em>No significant changes</em></p>
<h5> <a
href="https://github.com/shikijs/shiki/compare/v3.12.0...v3.12.1">View
changes on GitHub</a></h5>
<h2>v3.12.0</h2>
<h3> 🚀 Features</h3>
<ul>
<li><strong>vitepress-twoslash</strong>:
<ul>
<li>Improve UX for option customization - by <a
href="https://github.com/9romise"><code>@9romise</code></a> in <a
href="https://redirect.github.com/shikijs/shiki/issues/1066">shikijs/shiki#1066</a>
<a href="e3cfdeca"><!-- raw HTML
omitted -->(e3cfd)<!-- raw HTML omitted --></a></li>
<li>Twoslash inline type cache for markdown - by <a
href="https://github.com/serkodev"><code>@serkodev</code></a> and <a
href="https://github.com/antfu"><code>@antfu</code></a> in <a
href="https://redirect.github.com/shikijs/shiki/issues/1063">shikijs/shiki#1063</a>
<a href="dc7fbc70"><!-- raw HTML
omitted -->(dc7fb)<!-- raw HTML omitted --></a></li>
</ul>
</li>
</ul>
<h3> 🐞 Bug Fixes</h3>
<ul>
<li><strong>remove-notation-escape</strong>: Correct escape sequence -
by <a href="https://github.com/sor4chi"><code>@sor4chi</code></a> in <a
href="https://redirect.github.com/shikijs/shiki/issues/1065">shikijs/shiki#1065</a>
<a href="22d0c780"><!-- raw HTML
omitted -->(22d0c)<!-- raw HTML omitted --></a></li>
</ul>
<h5> <a
href="https://github.com/shikijs/shiki/compare/v3.11.0...v3.12.0">View
changes on GitHub</a></h5>
<h2>v3.11.0</h2>
<h3> 🚀 Features</h3>
<ul>
<li><strong>core</strong>: Add <code>enforce</code> options to
<code>ShikiTransformer</code> - by <a
href="https://github.com/serkodev"><code>@serkodev</code></a> and <a
href="https://github.com/antfu"><code>@antfu</code></a> in <a
href="https://redirect.github.com/shikijs/shiki/issues/1062">shikijs/shiki#1062</a>
<a href="8ad05bd8"><!-- raw HTML
omitted -->(8ad05)<!-- raw HTML omitted --></a></li>
</ul>
<h5> <a
href="https://github.com/shikijs/shiki/compare/v3.10.0...v3.11.0">View
changes on GitHub</a></h5>
<h2>v3.10.0</h2>
<h3> 🚀 Features</h3>
<ul>
<li>Add funding links to playground - by <a
href="https://github.com/jtbandes"><code>@jtbandes</code></a> in <a
href="https://redirect.github.com/shikijs/shiki/issues/1054">shikijs/shiki#1054</a>
<a href="e36eb4d8"><!-- raw HTML
omitted -->(e36eb)<!-- raw HTML omitted --></a></li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="fd7326a82f"><code>fd7326a</code></a>
chore: release v3.13.0</li>
<li><a
href="5cbb05219e"><code>5cbb052</code></a>
chore: release v3.12.3</li>
<li><a
href="e462618190"><code>e462618</code></a>
chore: release v3.12.2</li>
<li><a
href="793d71e68f"><code>793d71e</code></a>
chore: release v3.12.1</li>
<li><a
href="9260f3fd10"><code>9260f3f</code></a>
chore: release v3.12.0</li>
<li><a
href="d05f39b1e8"><code>d05f39b</code></a>
chore: release v3.11.0</li>
<li><a
href="bda1a76743"><code>bda1a76</code></a>
chore: release v3.10.0</li>
<li><a
href="09921f1cb8"><code>09921f1</code></a>
chore: release v3.9.2</li>
<li><a
href="854eddf2ed"><code>854eddf</code></a>
chore: release v3.9.1</li>
<li><a
href="950ede5ae5"><code>950ede5</code></a>
chore: release v3.9.0</li>
<li>Additional commits viewable in <a
href="https://github.com/shikijs/shiki/commits/v3.13.0/packages/shiki">compare
view</a></li>
</ul>
</details>
<details>
<summary>Maintainer changes</summary>
<p>This version was pushed to npm by [GitHub Actions](<a
href="https://www.npmjs.com/~GitHub">https://www.npmjs.com/~GitHub</a>
Actions), a new releaser for shiki since your current version.</p>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
When we update Stainless (editor changes), the `next` branch gets
updated. Eventually, when you decide on a release, you land the changes into
`main`. This is the Stainless workflow.
This PR makes sure we follow that workflow by pulling from the `next`
branch for our integration tests.
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
Add items and title to ToolParameter/ToolParamDefinition. Adding items
will resolve the issue that occurs with Gemini LLM when an MCP tool has
array-type properties.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
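A hedged sketch of what an array-typed tool parameter looks like once `items` and `title` are available; the field layout here is illustrative and may not match the exact `ToolParamDefinition` model:
```python
from pydantic import BaseModel


class ToolParamDefinition(BaseModel):
    # Illustrative shape only; the real model lives in llama_stack.
    param_type: str
    description: str | None = None
    required: bool = True
    title: str | None = None
    # JSON-schema style element type, needed by providers like Gemini
    # to accept array-typed MCP tool properties.
    items: dict | None = None


tags_param = ToolParamDefinition(
    param_type="array",
    description="Tags to attach to the record",
    title="Tags",
    items={"type": "string"},
)
```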
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Unit test cases will be added.
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Kai Wu <kaiwu@meta.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
recorded for: ./scripts/integration-tests.sh --stack-config
server:ci-tests --suite base --setup fireworks --subdirs inference
--pattern openai
## Test Plan
./scripts/integration-tests.sh --stack-config server:ci-tests --suite
base --setup fireworks --subdirs inference --pattern openai
# What does this PR do?
unpublish (make unavailable to users) the following apis -
- `/v1/inference/completion`, replaced by `/v1/openai/v1/completions`
- `/v1/inference/chat-completion`, replaced by
`/v1/openai/v1/chat/completions`
- `/v1/inference/embeddings`, replaced by `/v1/openai/v1/embeddings`
- `/v1/inference/batch-completion`, replaced by `/v1/openai/v1/batches`
- `/v1/inference/batch-chat-completion`, replaced by
`/v1/openai/v1/batches`
note: the implementations are still available for internal use, e.g. the
agents API still uses chat-completion.
# What does this PR do?
APIs removed:
- POST /v1/batch-inference/completion
- POST /v1/batch-inference/chat-completion
- POST /v1/inference/batch-completion
- POST /v1/inference/batch-chat-completion
note -
- batch-completion & batch-chat-completion were only implemented for
inference=inline::meta-reference
- the batch-inference endpoints were not implemented
# What does this PR do?
simplify Ollama inference adapter by -
- moving image_url download code to OpenAIMixin
- being a ModelRegistryHelper instead of having one (mypy blocks
check_model_availability method assignment)
## Test Plan
- add unit tests for new download feature
- add integration tests for openai_chat_completion w/ image_url (close
test gap)
The commit being reverted is a public API behavior change that we should not
support.
Instead of allowing silent updates (the caller cannot see the log
messages), we should send an error to the caller telling them they must
first unregister the model before reusing the same name w/ a different
backend.
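A minimal sketch of the intended behavior (error instead of silent update); the registry helpers and exception type here are illustrative:
```python
async def register_model(registry, model_id: str, provider_id: str):
    existing = await registry.get(model_id)  # hypothetical lookup
    if existing is not None and existing.provider_id != provider_id:
        # Surface the conflict to the caller instead of silently re-pointing
        # the identifier at a different backend.
        raise ValueError(
            f"Model '{model_id}' is already registered with provider "
            f"'{existing.provider_id}'. Unregister it before registering it "
            f"with provider '{provider_id}'."
        )
    return await registry.register(model_id=model_id, provider_id=provider_id)
```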
# What does this PR do?
address -
```
ERROR 2025-09-26 10:44:29,450 main:527 core::server: Error creating app: 'FireworksInferenceAdapter' object has no attribute
'alias_to_provider_id_map'
```
## Test Plan
manual startup w/ valid together & fireworks api keys
# What does this PR do?
When listing (and lazily indexing) tools, it's possible for an error to
get thrown by individual toolgroups if for example an MCP toolgroup is
unable to connect to its `mcp_endpoint`.
This logs a warning in the server when that happens, logs a full stack
trace of the error if debug logging is enabled, and just returns the
list of tools from all working toolgroups instead of throwing an error
to the client when a single toolgroup is temporarily or permanently
misbehaving.
The exception to the above is authentication errors, which we
specifically send all the way back to the client as that's how we
indicate to the client that it needs to provide authentication data for
the remote MCP servers.
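A hedged sketch of this listing behavior; the helper names and the `AuthenticationRequiredError` type are illustrative stand-ins for the actual code:
```python
import logging

logger = logging.getLogger(__name__)


class AuthenticationRequiredError(Exception):
    """Illustrative stand-in for the auth error that must reach the client."""


async def list_all_tools(toolgroups, list_tools_for_group):
    tools = []
    for group in toolgroups:
        try:
            tools.extend(await list_tools_for_group(group))
        except AuthenticationRequiredError:
            # The client must see this so it can supply MCP credentials.
            raise
        except Exception as exc:
            # A single misbehaving toolgroup should not break the whole listing.
            logger.warning("Failed to list tools for toolgroup %s: %s", group, exc)
            logger.debug("Full error for toolgroup %s", group, exc_info=True)
    return tools
```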
Closes #2540
## Test Plan
A new unit test was added to test this exception handling, which is run
as part of our regular test suite but also manually run to specifically
verify this fix via:
```
uv run pytest -sv --asyncio-mode=auto \
tests/unit/distribution/routers/test_routing_tables.py
```
To verify the additional debug logging is printing properly:
```
LLAMA_STACK_LOGGING=core=debug \
uv run pytest -sv --asyncio-mode=auto \
tests/unit/distribution/routers/test_routing_tables.py
```
The mcp integration tests were run as below (and by CI):
```
ollama run llama3.2:3b
ENABLE_OLLAMA="ollama" \
OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=starter \
uv run pytest -sv tests/integration/tool_runtime/test_mcp.py \
--text-model meta-llama/Llama-3.2-3B-Instruct
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Rather than have a single `LLAMA_STACK_VERSION`, we need to have a
`_V1`, `_V1ALPHA`, and `_V1BETA` constant.
This also necessitated adding `level` to the `WebMethod` so that
routing can be handled properly.
For backwards compat, the `v1` routes are being kept around and marked
as `deprecated`. When used, the server will log a deprecation warning.
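A rough sketch of what the leveling looks like; the constant values and the decorator are illustrative stand-ins rather than the exact names in the codebase:
```python
from collections.abc import Callable

# Illustrative constants; the real ones live alongside the API definitions.
LLAMA_STACK_API_V1 = "v1"
LLAMA_STACK_API_V1ALPHA = "v1alpha"
LLAMA_STACK_API_V1BETA = "v1beta"


def webmethod(route: str, method: str, level: str, deprecated: bool = False) -> Callable:
    """Hypothetical decorator: records the prefixed route used for registration."""

    def wrap(fn: Callable) -> Callable:
        fn.__webmethod__ = {
            "path": f"/{level}{route}",
            "method": method,
            "deprecated": deprecated,
        }
        return fn

    return wrap


@webmethod(route="/post-training/jobs", method="POST", level=LLAMA_STACK_API_V1ALPHA)
async def create_post_training_job(job_config: dict) -> dict:
    return {"status": "scheduled", "config": job_config}

# The same handler can also keep a deprecated /v1 alias for backward compatibility.
```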
Deprecation log:
<img width="1224" height="134" alt="Screenshot 2025-09-25 at 2 43 36 PM"
src="https://github.com/user-attachments/assets/0cc7c245-dafc-48f0-be99-269fb9a686f9"
/>
move:
1. post_training to `v1alpha` as it is under heavy development and not
near its final state
2. eval: job scheduling is not implemented, and the API relies heavily on the
datasetio API, which is under development and missing implementations of
specific routes, indicating that the structure of those routes might change.
Additionally, eval depends on the `inference` API, which is going to be
deprecated, so eval will likely need a major API surface change to conform
to using completions properly.
implements leveling in #3317
note: integration tests will fail until the SDK is regenerated with
v1alpha/inference as opposed to v1/inference
## Test Plan
existing tests should pass with newly generated schema. Conformance will
also pass as these routes are not the ones we currently test for
stability
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
use together's new base64 support
## Test Plan
recordings for: ./scripts/integration-tests.sh --stack-config
server:ci-tests --suite base --setup together --subdirs inference
--pattern openai
# What does this PR do?
Switches from `random.getrandbits` to `secrets.randbits` in the
telemetry module.
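The change is essentially a one-line swap to a CSPRNG-backed source. A minimal sketch of generating identifiers this way (the id widths and function names here are illustrative):
```python
import secrets


def generate_trace_id() -> str:
    # secrets.randbits draws from the OS CSPRNG instead of the Mersenne Twister.
    return format(secrets.randbits(128), "032x")


def generate_span_id() -> str:
    return format(secrets.randbits(64), "016x")
```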
<!-- If resolving an issue, uncomment and update the line below -->
Closes #3553
## Test Plan
Unit tests from scripts/unit-tests.sh were run to verify the tests still
pass.
Signed-off-by: Doug Edgar <dedgar@redhat.com>
# What does this PR do?
Fixes Llama Stack docs deployment URL
## Test Plan
```
npm run gen-api-docs all
npm run build
```
successfully builds the documentation
# What does this PR do?
- remove auto-download of ollama embedding models
- add embedding model metadata to dynamic listing w/ unit test
- add support and tests for allowed_models
- removed inference provider models.py files where dynamic listing is
enabled
- store embedding metadata in the embedding_model_metadata field on
inference providers (see the sketch after this list)
- make model_entries optional on ModelRegistryHelper and
LiteLLMOpenAIMixin
- make OpenAIMixin a ModelRegistryHelper
- skip base64 embedding test for remote::ollama, always returns floats
- only use OpenAI client for ollama model listing
- remove unused build_model_entry function
- remove unused get_huggingface_repo function
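A hedged sketch of the `embedding_model_metadata` idea referenced above; the model names, dimensions, and attribute shape are illustrative, not the actual provider code:
```python
# Illustrative per-provider embedding metadata; real values and the attribute
# name on the provider class may differ.
embedding_model_metadata: dict[str, dict[str, int]] = {
    "nomic-embed-text": {"embedding_dimension": 768, "context_length": 8192},
    "all-minilm": {"embedding_dimension": 384, "context_length": 512},
}


def model_type_for(provider_model_id: str) -> str:
    # Models present in the mapping are registered as embedding models with
    # their metadata; everything else is treated as an LLM.
    return "embedding" if provider_model_id in embedding_model_metadata else "llm"
```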
## Test Plan
ci w/ new tests
# What does this PR do?
use ollama embedding models for ollama test, previously using
sentence-transformer
recordings:
- ./scripts/integration-tests.sh --stack-config server:ci-tests --suite
base --setup ollama --inference-mode record
- ./scripts/integration-tests.sh --stack-config server:ci-tests --suite
vision --setup ollama-vision --inference-mode record
## Test Plan
ci w/ added skip base64 embedding test
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
- Fixes broken links and Docusaurus search
Closes #3518
## Test Plan
The following should produce a clean build with no warnings and search enabled:
```
npm install
npm run gen-api-docs all
npm run build
npm run serve
```
<!-- Describe the tests you ran to verify your changes with result summaries. *Provide clear instructions so the plan can be easily re-executed.* -->
# What does this PR do?
- Fixes Docusaurus build errors
<!-- Provide a short summary of what this PR does and why. Link to relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
- `npm run build` compiles the build properly
- Broken links expected and will be fixed in a follow-on PR
<!-- Describe the tests you ran to verify your changes with result summaries. *Provide clear instructions so the plan can be easily re-executed.* -->
# What does this PR do?
- Docusaurus server setup
- Deprecates Sphinx build pipeline
- Deprecates remaining references to Readthedocs
- MDX compile errors and broken links to be addressed in follow-up PRs
<!-- Provide a short summary of what this PR does and why. Link to relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
```
npm install
npm run gen-api-docs all
npm run build
```
<!-- Describe the tests you ran to verify your changes with result summaries. *Provide clear instructions so the plan can be easily re-executed.* -->
# What does this PR do?
- Migrates static content from Sphinx to Docusaurus
<!-- Provide a short summary of what this PR does and why. Link to relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result summaries. *Provide clear instructions so the plan can be easily re-executed.* -->
# What does this PR do?
- Migrates the remaining documentation sections to the new documentation format
<!-- Provide a short summary of what this PR does and why. Link to relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
- Partial migration
<!-- Describe the tests you ran to verify your changes with result summaries. *Provide clear instructions so the plan can be easily re-executed.* -->
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
- Migrates the `advanced_apis/` section of the docs to the new format
## Test Plan
- Partial migration
<!-- Describe the tests you ran to verify your changes with result summaries. *Provide clear instructions so the plan can be easily re-executed.* -->
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
- Updates provider and distro codegen to handle the new format
- Migrates provider and distro files to the new format
## Test Plan
- Manual testing
<!-- Describe the tests you ran to verify your changes with result summaries. *Provide clear instructions so the plan can be easily re-executed.* -->
# What does this PR do?
Update file paths in the conformance workflow to reflect the new location of the llama-stack-spec files from `docs/_static/` to `docs/static/`. Also update the `.gitignore` file to exclude Docusaurus-related directories (`docs/.docusaurus/` and `docs/node_modules/`).
## Test Plan
- Run the workflow locally
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Updates OpenAPI generator to use summaries and changed the file generation path.
## Test Plan
- docs/openapi_generator/run_openapi_generator.sh
<!-- Describe the tests you ran to verify your changes with result summaries. *Provide clear instructions so the plan can be easily re-executed.* -->
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
_[Stack 1/10] Docusaurus documentation migration_
Updates the file upload API documentation to use proper OpenAPI format for integer parameters. Replaces `<int>` with `{integer}` in the description of the `expires_after[seconds]` parameter across the HTML spec, YAML spec, and Python implementation.
## Test Plan
- docs/openapi_generator/run_openapi_generator.sh
<!-- Describe the tests you ran to verify your changes with result summaries. *Provide clear instructions so the plan can be easily re-executed.* -->
# What does this PR do?
- Mostly AI-generated scripts to run guidellm
(https://github.com/vllm-project/guidellm) benchmarks on k8s setup
- Stack is using image built from main on 9/11
## Test Plan
See updated README.md
# What does this PR do?
add/enable the Databricks inference adapter
The Databricks inference adapter was broken; closes #3486
- remove deprecated completion / chat_completion endpoints
- enable dynamic model listing w/o refresh, listing is not async
- use SecretStr instead of str for token
- backward incompatible change: for consistency with databricks docs,
env DATABRICKS_URL -> DATABRICKS_HOST and DATABRICKS_API_TOKEN ->
DATABRICKS_TOKEN
- databricks urls are custom per user/org, add special recorder handling
for databricks urls
- add integration test --setup databricks
- enable chat completions tests
- enable embeddings tests
- disable n > 1 tests
- disable embeddings base64 tests
- disable embeddings dimensions tests
note: reasoning models, e.g. gpt oss, fail because databricks has a
custom, incompatible response format
## Test Plan
ci and
```
./scripts/integration-tests.sh --stack-config server:ci-tests --setup databricks --subdirs inference --pattern openai
```
note: databricks needs to be manually added to the ci-tests distro for
replay testing
# What does this PR do?
the openai_embeddings method on OpenAIMixin was returning the provider's
model id instead of the llama stack name
## Test Plan
before -
```
$ ./scripts/integration-tests.sh --stack-config server:ci-tests --setup gpt --subdirs inference --inference-mode live --pattern test_openai_embeddings_single_string
...
FAILED tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_single_string[openai_client-emb=openai/text-embedding-3-small] - AssertionError: assert 'text-embedding-3-small' == 'openai/text-...dding-3-small'
FAILED tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_single_string[llama_stack_client-emb=openai/text-embedding-3-small] - AssertionError: assert 'text-embedding-3-small' == 'openai/text-...dding-3-small'
========================================== 2 failed, 95 deselected, 4 warnings in 3.87s ===========================================
```
after -
```
$ ./scripts/integration-tests.sh --stack-config server:ci-tests --setup gpt --subdirs inference --inference-mode live --pattern test_openai_embeddings_single_string ...
========================================== 2 passed, 95 deselected, 4 warnings in 2.12s ===========================================
```
# What does this PR do?
error:
5099094847
## Test Plan
GITHUB_ACTIONS=true BUILD_PLATFORM=linux/amd64 USE_COPY_NOT_MOUNT=true
LLAMA_STACK_DIR=. uv run --with llama-stack llama stack build --distro
starter --image-type container --image-name ehhuang/distribution-starter
succeeds
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
change ModelRegistryHelper to use ProviderModelEntry instead of the
hardcoded ModelType.llm, which fixes issue #3330.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[3330] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
1. open llama-stack server
```
uv sync --python 3.12
source .venv/bin/activate
uv run llama stack build --distro starter --image-type venv --run
```
2. Use the following script to test:
```
from llama_stack_client import LlamaStackClient
import os
def test_openai_embedding_type():
    client = LlamaStackClient(
        base_url=os.environ.get("LLAMA_STACK_ENDPOINT", "http://localhost:8321"),
        provider_data={
            "openai_api_key": os.environ.get("OPENAI_API_KEY", ""),
        },
    )
    model = client.models.retrieve("openai/text-embedding-3-small")
    print(model)
    assert model.identifier == "openai/text-embedding-3-small"
    assert model.model_type == "embedding"

test_openai_embedding_type()
```
logs:
```
python test_openai.py
INFO:httpx:HTTP Request: GET http://localhost:8321/v1/models/openai/text-embedding-3-small "HTTP/1.1 200 OK"
Model(identifier='openai/text-embedding-3-small', metadata={'embedding_dimension': 1536.0, 'context_length': 8192.0}, api_model_type='embedding', provider_id='openai', type='model', provider_resource_id='text-embedding-3-small', owner=None, source='listed_from_provider', model_type='embedding')
```
Bumps
[jest-environment-jsdom](https://github.com/jestjs/jest/tree/HEAD/packages/jest-environment-jsdom)
from 29.7.0 to 30.1.2.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/jestjs/jest/releases">jest-environment-jsdom's
releases</a>.</em></p>
<blockquote>
<h2>30.1.2</h2>
<h3>Fixes</h3>
<ul>
<li><code>[jest-snapshot-utils]</code> Correct snapshot header regexp to
work with newline across OSes (<a
href="https://redirect.github.com/jestjs/jest/pull/15803">#15803</a>)</li>
</ul>
<h2>30.1.1</h2>
<h3>Fixes</h3>
<ul>
<li><code>[jest-snapshot-utils]</code> Fix deprecated goo.gl snapshot
warning not handling Windows end-of-line sequences (<a
href="https://redirect.github.com/jestjs/jest/pull/15800">#15800</a>)</li>
</ul>
<h2>30.1.0</h2>
<h2>Features</h2>
<ul>
<li><code>[jest-leak-detector]</code> Configurable GC aggressiveness
regarding to V8 heap snapshot generation (<a
href="https://redirect.github.com/jestjs/jest/pull/15793/">#15793</a>)</li>
<li><code>[jest-runtime]</code> Reduce redundant ReferenceError
messages</li>
<li><code>[jest-core]</code> Include test modules that failed to load
when --onlyFailures is active</li>
</ul>
<h3>Fixes</h3>
<ul>
<li>`[jest-snapshot-utils] Fix deprecated goo.gl snapshot guide link not
getting replaced with fully canonical URL (<a
href="https://redirect.github.com/jestjs/jest/pull/15787">#15787</a>)</li>
<li><code>[jest-circus]</code> Fix <code>it.concurrent</code> not
working with <code>describe.skip</code> (<a
href="https://redirect.github.com/jestjs/jest/pull/15765">#15765</a>)</li>
<li><code>[jest-snapshot]</code> Fix mangled inline snapshot updates
when used with Prettier 3 and CRLF line endings</li>
<li><code>[jest-runtime]</code> Importing from
<code>@jest/globals</code> in more than one file no longer breaks
relative paths (<a
href="https://redirect.github.com/jestjs/jest/issues/15772">#15772</a>)</li>
</ul>
<h1>Chore</h1>
<ul>
<li><code>[expect]</code> Update docblock for <code>toContain()</code>
to display info on substring check (<a
href="https://redirect.github.com/jestjs/jest/pull/15789">#15789</a>)</li>
</ul>
<h2>30.0.2</h2>
<h2>What's Changed</h2>
<h3>Fixes</h3>
<ul>
<li><code>[jest-matcher-utils]</code> Make 'deepCyclicCopyObject' safer
by setting descriptors to a null-prototype object (<a
href="https://redirect.github.com/jestjs/jest/pull/15689">#15689</a>)</li>
<li><code>[jest-util]</code> Make garbage collection protection property
writable (<a
href="https://redirect.github.com/jestjs/jest/pull/15689">#15689</a>)</li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/jestjs/jest/blob/main/CHANGELOG.md">https://github.com/jestjs/jest/blob/main/CHANGELOG.md</a></p>
<h2>Jest 30.0.1</h2>
<h2>What's Changed</h2>
<h3>Features</h3>
<ul>
<li><code>[jest-resolver]</code> Implement the
<code>defaultAsyncResolver</code> (<a
href="https://redirect.github.com/jestjs/jest/pull/15679">#15679</a>)</li>
</ul>
<h3>Fixes</h3>
<ul>
<li><code>[jest-resolver]</code> Resolve builtin modules correctly (<a
href="https://redirect.github.com/jestjs/jest/pull/15683">#15683</a>)</li>
<li><code>[jest-environment-node, jest-util]</code> Avoid setting
globals cleanup protection symbol when feature is off (<a
href="https://redirect.github.com/jestjs/jest/pull/15684">#15684</a>)</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/jestjs/jest/blob/main/CHANGELOG.md">jest-environment-jsdom's
changelog</a>.</em></p>
<blockquote>
<h2>30.1.2</h2>
<h3>Fixes</h3>
<ul>
<li><code>[jest-snapshot-utils]</code> Correct snapshot header regexp to
work with newline across OSes (<a
href="https://redirect.github.com/jestjs/jest/pull/15803">#15803</a>)</li>
</ul>
<h2>30.1.1</h2>
<h3>Fixes</h3>
<ul>
<li><code>[jest-snapshot-utils]</code> Fix deprecated goo.gl snapshot
warning not handling Windows end-of-line sequences (<a
href="https://redirect.github.com/jestjs/jest/pull/15800">#15800</a>)</li>
</ul>
<h2>30.1.0</h2>
<h2>Features</h2>
<ul>
<li><code>[jest-leak-detector]</code> Configurable GC aggressiveness
regarding to V8 heap snapshot generation (<a
href="https://redirect.github.com/jestjs/jest/pull/15793/">#15793</a>)</li>
<li><code>[jest-runtime]</code> Reduce redundant ReferenceError
messages</li>
<li><code>[jest-core]</code> Include test modules that failed to load
when --onlyFailures is active</li>
</ul>
<h3>Fixes</h3>
<ul>
<li><code>[jest-snapshot-utils]</code> Fix deprecated goo.gl snapshot
guide link not getting replaced with fully canonical URL (<a
href="https://redirect.github.com/jestjs/jest/pull/15787">#15787</a>)</li>
<li><code>[jest-circus]</code> Fix <code>it.concurrent</code> not
working with <code>describe.skip</code> (<a
href="https://redirect.github.com/jestjs/jest/pull/15765">#15765</a>)</li>
<li><code>[jest-snapshot]</code> Fix mangled inline snapshot updates
when used with Prettier 3 and CRLF line endings</li>
<li><code>[jest-runtime]</code> Importing from
<code>@jest/globals</code> in more than one file no longer breaks
relative paths (<a
href="https://redirect.github.com/jestjs/jest/issues/15772">#15772</a>)</li>
</ul>
<h1>Chore</h1>
<ul>
<li><code>[expect]</code> Update docblock for <code>toContain()</code>
to display info on substring check (<a
href="https://redirect.github.com/jestjs/jest/pull/15789">#15789</a>)</li>
</ul>
<h2>30.0.5</h2>
<h3>Features</h3>
<ul>
<li><code>[jest-config]</code> Allow <code>testMatch</code> to take a
string value</li>
<li><code>[jest-worker]</code> Let <code>workerIdleMemoryLimit</code>
accept 0 to always restart worker child processes</li>
</ul>
<h3>Fixes</h3>
<ul>
<li><code>[expect]</code> Fix <code>bigint</code> error (<a
href="https://redirect.github.com/jestjs/jest/pull/15702">#15702</a>)</li>
</ul>
<h2>30.0.4</h2>
<h3>Features</h3>
<ul>
<li><code>[expect]</code> The <code>Inverse</code> type is now exported
(<a
href="https://redirect.github.com/jestjs/jest/pull/15714">#15714</a>)</li>
<li><code>[expect]</code> feat: support <code>async functions</code> in
<code>toBe</code> (<a
href="https://redirect.github.com/jestjs/jest/pull/15704">#15704</a>)</li>
</ul>
<h3>Fixes</h3>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="ebfa31cc97"><code>ebfa31c</code></a>
v30.1.2</li>
<li><a
href="d347c0f3f8"><code>d347c0f</code></a>
v30.1.1</li>
<li><a
href="4d5f41d088"><code>4d5f41d</code></a>
v30.1.0</li>
<li><a
href="22236cf58b"><code>22236cf</code></a>
v30.0.5</li>
<li><a
href="f4296d2bc8"><code>f4296d2</code></a>
v30.0.4</li>
<li><a
href="393acbfac3"><code>393acbf</code></a>
v30.0.2</li>
<li><a
href="5ce865b406"><code>5ce865b</code></a>
v30.0.1</li>
<li><a
href="469f665c2d"><code>469f665</code></a>
v30.0.0</li>
<li><a
href="ce14203d91"><code>ce14203</code></a>
v30.0.0-rc.1</li>
<li><a
href="ac334c0cdf"><code>ac334c0</code></a>
v30.0.0-beta.8</li>
<li>Additional commits viewable in <a
href="https://github.com/jestjs/jest/commits/v30.1.2/packages/jest-environment-jsdom">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [@radix-ui/react-dialog](https://github.com/radix-ui/primitives)
from 1.1.13 to 1.1.15.
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/radix-ui/primitives/commits">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
- Catch errors from providers without API keys during model refresh
- Log as a warning instead of an exception to avoid a scary startup
Closes: #3492
Error messages are now warnings instead of several tracebacks
```
INFO 2025-09-19 16:06:55,228 llama_stack.providers.utils.inference.inference_store:74 inference_store: Write queue disabled for SQLite to avoid
concurrency issues
WARNING 2025-09-19 16:06:59,362 llama_stack.providers.utils.inference.openai_mixin:327 providers::utils: Failed to list models for anthropic: API key
is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"anthropic_api_key": "<API_KEY>"},
or in the provider config.
WARNING 2025-09-19 16:06:59,364 llama_stack.providers.utils.inference.openai_mixin:327 providers::utils: Failed to list models for gemini: API key is
not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"gemini_api_key": "<API_KEY>"}, or in
the provider config.
WARNING 2025-09-19 16:06:59,367 llama_stack.providers.utils.inference.openai_mixin:327 providers::utils: Failed to list models for groq: API key is
not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"groq_api_key": "<API_KEY>"}, or in
the provider config.
WARNING 2025-09-19 16:06:59,372 llama_stack.providers.utils.inference.openai_mixin:327 providers::utils: Failed to list models for sambanova: API key
is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"sambanova_api_key": "<API_KEY>"},
or in the provider config.
INFO 2025-09-19 16:06:59,533 llama_stack.core.utils.config_resolution:45 core: Using file path:
```
Signed-off-by: Derek Higgins <derekh@redhat.com>
- Handle Ollama format where models are nested under
response['body']['models']
- Fall back to OpenAI format where models are directly in
response['body'] (see the sketch below)
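A minimal sketch of that fallback, assuming the two recorded response-body shapes described above (the function name is illustrative):
```python
def extract_model_list(response: dict) -> list:
    body = response["body"]
    if isinstance(body, dict) and "models" in body:
        # Ollama format: models are nested under body["models"].
        return body["models"]
    # OpenAI format: the body itself is the list of models.
    return body
```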
Closes: #3457
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
This PR is generated with AI and reviewed by me.
Refactors the AuthorizedSqlStore class to store the access policy as an
instance variable rather than passing it as a parameter to each method
call. This simplifies the API.
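A hedged sketch of the resulting shape; the method and parameter names are illustrative rather than the actual class:
```python
class AuthorizedSqlStore:
    def __init__(self, sql_store, policy):
        self.sql_store = sql_store
        # Previously the policy was threaded through every call; now it is
        # captured once at construction time.
        self.policy = policy

    async def fetch_all(self, table: str, where: dict | None = None):
        # Callers no longer pass the access policy explicitly.
        return await self.sql_store.fetch_all(
            table, where=where, access_policy=self.policy
        )
```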
## Test Plan
existing tests
# What does this PR do?
pymilvus recently made `milvus-lite` an optional dependency of their
package. If someone wants to use the inline provider, we must include the
extra dependency.
For more details see: https://github.com/milvus-io/pymilvus/pull/2976
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This PR fixes a blocking issue in the detailed RAG tutorial where the
code fails with a 400 Bad Request error.
The root cause is that recent versions of Llama-Stack ignore the
client-generated vector_db_id and assign a new server-side ID. The
tutorial was not updated to reflect this, causing the rag_tool.insert
call to fail.
This change updates the code to capture the authoritative ID from the
.identifier attribute of the register() method's response. This ensures
the tutorial code runs successfully and reflects the current API
behavior.
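A short sketch of the fix, assuming the llama_stack_client API used in the tutorial; the embedding model, dimension, and document list are placeholders:
```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Register the vector DB and capture the server-assigned identifier.
registered = client.vector_dbs.register(
    vector_db_id="my-docs",              # client-suggested name
    embedding_model="all-MiniLM-L6-v2",  # placeholder model
    embedding_dimension=384,
)
vector_db_id = registered.identifier     # authoritative server-side ID

# Use the returned ID for subsequent calls such as rag_tool.insert.
client.tool_runtime.rag_tool.insert(
    documents=[],                        # tutorial documents go here
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=512,
)
```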
## Test Plan
The fix can be verified by running the Python code snippet from the
detailed tutorial page.
Run the original code (Before this change):
Result: The script fails with a 400 Bad Request error on the
rag_tool.insert step.
Run the updated code (After this change):
Result: The script runs successfully to completion.
Co-authored-by: Adam Young <adam.young@redhat.com>
# What does this PR do?
As shown in #3421, we can scale the stack to handle more RPS with k8s
replicas. This PR enables a multi-process stack with uvicorn --workers so
that we can achieve the same scaling without being in k8s.
To achieve that we refactor main to split out the app construction
logic. This method needs to be non-async. We created a new `Stack` class
to house impls and have a `start()` method to be called in lifespan to
start background tasks instead of starting them in the old
`construct_stack`. This way we avoid having to manage an event loop
manually.
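A rough sketch of the factory-plus-lifespan shape this enables, using a hypothetical `Stack` wrapper rather than the actual server code:
```python
from contextlib import asynccontextmanager

from fastapi import FastAPI


class Stack:
    """Hypothetical: holds impls and starts background tasks once a loop exists."""

    def __init__(self, run_config: dict):
        self.run_config = run_config

    async def start(self) -> None:
        # e.g. kick off model refresh tasks, write queues, etc.
        pass

    async def shutdown(self) -> None:
        pass


def create_app() -> FastAPI:
    # Must be synchronous so `uvicorn ...:create_app --workers N` can call it
    # in each worker process before the event loop is running.
    stack = Stack(run_config={})

    @asynccontextmanager
    async def lifespan(app: FastAPI):
        await stack.start()
        yield
        await stack.shutdown()

    return FastAPI(lifespan=lifespan)
```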
## Test Plan
CI
> uv run --with llama-stack python -m llama_stack.core.server.server
benchmarking/k8s-benchmark/stack_run_config.yaml
works.
> LLAMA_STACK_CONFIG=benchmarking/k8s-benchmark/stack_run_config.yaml uv
run uvicorn llama_stack.core.server.server:create_app --port 8321
--workers 4
works.
# What does this PR do?
currently `RemoteProviderSpec` has an `AdapterSpec` embedded in it.
Remove `AdapterSpec`, and put its leftover fields into
`RemoteProviderSpec`.
Additionally, many of the fields were duplicated between
`InlineProviderSpec` and `RemoteProviderSpec`. Move these to
`ProviderSpec` so they are shared.
Fix up the distro codegen to use `RemoteProviderSpec` directly rather than
`remote_provider_spec`, which took an AdapterSpec and returned a full
provider spec.
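A hedged sketch of what a consolidated registry entry might look like after the flattening; the import path and exact field names are assumptions about the direction, not the precise spec:
```python
# Assumed import path; the spec classes live in llama_stack.providers.datatypes.
from llama_stack.providers.datatypes import Api, RemoteProviderSpec

# Before: remote_provider_spec(api, AdapterSpec(adapter_type=..., module=..., ...))
# After (illustrative): one flat spec with the former AdapterSpec fields inlined.
provider = RemoteProviderSpec(
    api=Api.inference,
    provider_type="remote::example",
    adapter_type="example",
    module="llama_stack.providers.remote.inference.example",
    config_class="llama_stack.providers.remote.inference.example.ExampleConfig",
    pip_packages=["example-sdk"],
)
```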
## Test Plan
existing distro tests should pass.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
The rag-runtime tool requires the files API as a dependency, but the NVIDIA
distribution was missing the files provider configuration. Thus, when
running:
```
llama stack build --distro nvidia --image-type venv
```
And then:
```
llama stack run {path_to_distribution_config} --image-type venv
```
It would raise an error:
```
RuntimeError: Failed to resolve 'tool_runtime' provider 'rag-runtime' of type 'inline::rag-runtime': required dependency 'files' is not available. Please add a 'files' provider to your configuration or check if the provider is properly configured.
```
This PR fixes the issue by adding missing files provider to NVIDIA
distribution.
## Test Plan
N/A
# What does this PR do?
this replaces the static model listing for any provider using
OpenAIMixin
currently -
- anthropic
- azure openai
- gemini
- groq
- llama-api
- nvidia
- openai
- sambanova
- tgi
- vertexai
- vllm
- not changed: together has its own impl
## Test Plan
- new unit tests
- manual for llama-api, openai, groq, gemini
```
for provider in llama-openai-compat openai groq gemini; do
  uv run llama stack build --image-type venv --providers inference=remote::$provider --run &
  uv run --with llama-stack-client llama-stack-client models list | grep Total
done
```
results (17 sep 2025):
- llama-api: 4
- openai: 86
- groq: 21
- gemini: 66
closes #3467
# What does this PR do?
*Add dynamic authentication token forwarding support for vLLM provider*
This enables per-request authentication tokens for vLLM providers,
supporting use cases like RAG operations where different requests may
need different authentication tokens. The implementation follows the
same pattern as other providers like Together AI, Fireworks, and
Passthrough.
- Add LiteLLMOpenAIMixin that manages the vllm_api_token properly
Usage:
- Static: VLLM_API_TOKEN env var or config.api_token
- Dynamic: X-LlamaStack-Provider-Data header with vllm_api_token
All existing functionality is preserved while adding new dynamic
capabilities.
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
```
curl -X POST "http://localhost:8000/v1/chat/completions" -H "Authorization: Bearer my-dynamic-token" \
-H "X-LlamaStack-Provider-Data: {\"vllm_api_token\": \"Bearer my-dynamic-token\", \"vllm_url\": \"http://dynamic-server:8000\"}" \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.1-8b", "messages": [{"role": "user", "content": "Hello!"}]}'
```
---------
Signed-off-by: Akram Ben Aissi <akram.benaissi@gmail.com>
# What does this PR do?
Updates the qdrant provider's convert_id function to use a
FIPS-validated cryptographic hashing function, so that llama-stack is
considered to be `Designed for FIPS`.
The standard library `uuid.uuid5()` function uses SHA-1 under the hood,
which is not FIPS-validated. This commit uses an approach similar to the
one merged in #3423.
Closes #3476.
## Test Plan
Unit tests from scripts/unit-tests.sh were run to verify that the tests
pass.
A small test script can display the data flow:
```python
import hashlib
import uuid
# Input
_id = "chunk_abc123"
print(_id)
# Step 1: Format and encode
hash_input = f"qdrant_id:{_id}".encode()
print(hash_input)
# Result: b'qdrant_id:chunk_abc123'
# Step 2: SHA-256 hash
sha256_hash = hashlib.sha256(hash_input).hexdigest()
print(sha256_hash)
# Result: "184893a6eafeaac487cb9166351e8625b994d50f3456d8bc6cea32a014a27151"
# Step 3: Create UUID from first 32 chars
uuid_string = str(uuid.UUID(sha256_hash[:32]))
print(uuid_string)
# sha256_hash[:32] = "184893a6eafeaac487cb9166351e8625"
# Final result: "184893a6-eafe-aac4-87cb-9166351e8625"
```
Signed-off-by: Doug Edgar <dedgar@redhat.com>
# What does this PR do?
When registering a dataset for NVIDIA, the DatasetsRoutingTable expects
`nvidia` to be passed via the `provider_id`
[here](https://github.com/llamastack/llama-stack/blob/main/llama_stack/core/routing_tables/datasets.py#L61).
This PR fixes a notebook to correctly use `provider_id`.
<!-- If resolving an issue, uncomment and update the line below -->
Closes #3308
## Test Plan
Manually execute the notebook steps to verify the dataset is registered.
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
Fixes this warning in llama stack build:
```bash
WARNING 2025-09-15 15:29:02,197 llama_stack.core.distribution:149 core: Failed to import module prompts: No module named
'llama_stack.providers.registry.prompts'"
```
## Test Plan
Test added
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Add default value for PR_HEAD_REPO to prevent 'unbound variable' error
when no PR exists for a branch.
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Modified the code in registry.py.
The key changes are:
1. Removed the `return False` statement
2. Added a warning log message that includes the object type,
identifier, and provider_id for better debugging.
3. The method now continues with the registration process instead of
early returning.
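A rough sketch of the resulting flow (names are illustrative, not the actual registry.py code):
```python
import logging

logger = logging.getLogger(__name__)


def register_object(obj, existing, provider_id):
    if existing is not None:
        # Previously the method returned False here and silently skipped registration.
        logger.warning(
            "Overwriting existing %s with identifier %s from provider %s",
            type(obj).__name__,
            getattr(obj, "identifier", "<unknown>"),
            provider_id,
        )
    # ... continue with the registration process instead of returning early
```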
---------
Co-authored-by: Omar Abdelwahab <omara@fb.com>
# What does this PR do?
Pinning to the latest pydantic version 2.11.9, as sometimes we are picking an
older version and failing to start the container in GitHub Actions:
1775026312
Closes https://github.com/llamastack/llama-stack/issues/3461
## Test Plan
Tested locally with the following commands to start a container
Build container
`llama stack build --distro starter --image-type container`
start container `docker run -d -p 8321:8321 --name llama-stack-test
distribution-starter:0.2.21`
check health http://localhost:8321/v1/health
Couldn't repro with the older version (`2.8.2`), but pydantic `2.11.9` is able
to start the container
https://pypi.org/project/pydantic/#history , 2.11.9 is the latest
version
# What does this PR do?
this document outlines different API stability levels, how to enforce
them, and next steps
## Next Steps
Following the adoption of this document, all existing APIs should follow
the enforcement protocol.
relates to #3237
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
adds dynamic model support to TGI
add new overwrite_completion_id feature to OpenAIMixin to deal with TGI
always returning id=""
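A hedged sketch of the overwrite idea (the hook name comes from this description; the surrounding details are assumptions):
```python
import uuid


def _maybe_overwrite_id(response, overwrite_completion_id: bool):
    # TGI returns id="" on chat completions; substitute a locally generated id
    # so downstream storage and clients get a usable identifier.
    if overwrite_completion_id or not getattr(response, "id", None):
        response.id = f"chatcmpl-{uuid.uuid4().hex}"
    return response
```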
## Test Plan
tgi: `docker run --gpus all --shm-size 1g -p 8080:80 -v /data:/data
ghcr.io/huggingface/text-generation-inference --model-id
Qwen/Qwen3-0.6B`
stack: `TGI_URL=http://localhost:8080 uv run llama stack build
--image-type venv --distro ci-tests --run`
test: `./scripts/integration-tests.sh --stack-config
http://localhost:8321 --setup tgi --subdirs inference --pattern openai`
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR provides functionality for users to unregister ScoringFn and
Benchmark resources for `scoring` and `eval` APIs.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes #3051
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Updated integration and unit tests via CI workflow
# What does this PR do?
the @required_args decorator in openai-python is masking the async
nature of the {AsyncCompletions,chat.AsyncCompletions}.create method.
see https://github.com/openai/openai-python/issues/996
this means two things -
0. we cannot use iscoroutine in the recorder to detect async vs non-async
1. our mocks are inappropriately introducing identifiable async behavior
for (0), we update the iscoroutine check w/ detection of /v1/models,
which is the only non-async function we mock & record.
for (1), we could leave everything as is and assume (0) will catch
errors. to be defensive, we update the unit tests to mock below create
methods, allowing the true openai-python create() methods to be tested.
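A simplified sketch of the endpoint-based detection for (0) (assumed shape, not the actual recorder code):
```python
import inspect


def _should_await(func, endpoint: str) -> bool:
    # openai-python's @required_args wrapper hides the coroutine, so an
    # iscoroutine check alone is unreliable; /v1/models is the only
    # non-async call we mock & record, so treat everything else as async.
    if inspect.iscoroutinefunction(func):
        return True
    return endpoint != "/v1/models"
```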
Bumps [@radix-ui/react-select](https://github.com/radix-ui/primitives)
from 2.2.5 to 2.2.6.
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/radix-ui/primitives/commits">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
the recorder mocks the openai-python interface. the openai-python
interface allows NOT_GIVEN as an input option. this change properly
handles NOT_GIVEN.
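A minimal sketch of the idea (an assumed helper, not the actual recorder code): drop NOT_GIVEN sentinels before the recorded parameters are serialized.
```python
from openai import NOT_GIVEN


def _strip_not_given(params: dict) -> dict:
    # NOT_GIVEN is openai-python's "argument omitted" sentinel; recording it
    # verbatim would break replay, so it is filtered out here.
    return {k: v for k, v in params.items() if v is not NOT_GIVEN}
```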
## Test Plan
ci (coverage for chat, completions, embeddings)
# What does this PR do?
Migrates MD5 and SHA-1 hash algorithms to SHA-256.
In particular, replaces:
- MD5 in chunk ID generation.
- MD5 in file verification.
- SHA-1 in model identifier digests.
And updates all related test expectations.
Original discussion:
https://github.com/llamastack/llama-stack/discussions/3413
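As an illustration, SHA-256-based chunk ID generation along these lines might look like the following (a sketch under assumed inputs, not the exact helper in the codebase):
```python
import hashlib


def generate_chunk_id(document_id: str, chunk_text: str) -> str:
    # SHA-256 replaces MD5 so the digest comes from a FIPS-approved algorithm.
    hash_input = f"{document_id}:{chunk_text}".encode()
    return hashlib.sha256(hash_input).hexdigest()
```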
<!-- If resolving an issue, uncomment and update the line below -->
Closes #3424.
## Test Plan
Unit tests from scripts/unit-tests.sh were updated to match the new hash
output, and ran to verify the tests pass.
Signed-off-by: Doug Edgar <dedgar@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
fix: Improve pre-commit workflow error handling and feedback
- Add explicit step to check pre-commit results and provide clear error
messages
- Improve verification steps with better error messages and file
listings
- Use GitHub Actions annotations (::error:: and ::warning::) for better
visibility
- Maintain continue-on-error for pre-commit step but add proper failure
handling
This addresses the issue where pre-commit failures were silent but still
caused workflow failures later, making it difficult to understand what
needed to be fixed.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Akram Ben Aissi <akram.benaissi@gmail.com>
# What does this PR do?
only run conformance tests when the spec is changed.
Also, cache oasdiff such that it is not installed every time the test is
run
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
The notebook was
reverted (https://github.com/llamastack/llama-stack/pull/3259) as it had
some local paths I missed correcting. Trying again with the corrections now.
## Test Plan
Ran the Jupyter notebook
# What does this PR do?
update the async detection test for vllm
- remove a network access from unit tests
- remove direct logging use
the idea behind the test is to mock inference w/ a sleep, initiate
concurrent inference calls, verify the total execution time is close to
the sleep time. in a non-async env the total time would be closer to
sleep * num concurrent calls.
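The shape of the check, roughly (an illustrative sketch, not the actual test file):
```python
import asyncio
import time


async def _mock_completion(delay: float = 0.5) -> str:
    await asyncio.sleep(delay)  # stands in for the mocked inference call
    return "ok"


async def check_concurrency(num_calls: int = 10, delay: float = 0.5) -> None:
    start = time.perf_counter()
    await asyncio.gather(*(_mock_completion(delay) for _ in range(num_calls)))
    elapsed = time.perf_counter() - start
    # Truly concurrent calls finish in roughly one sleep; serialized calls
    # would take close to delay * num_calls.
    assert elapsed < delay * 2


if __name__ == "__main__":
    asyncio.run(check_concurrency())
```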
## Test Plan
ci
# What does this PR do?
update vLLM inference provider to use OpenAIMixin for openai-compat
functions
inference recordings from Qwen3-0.6B and vLLM 0.8.3 -
```
docker run --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 --ipc=host \
vllm/vllm-openai:latest \
--model Qwen/Qwen3-0.6B --enable-auto-tool-choice --tool-call-parser hermes
```
## Test Plan
```
./scripts/integration-tests.sh --stack-config server:ci-tests --setup vllm --subdirs inference
```
# What does this PR do?
- Updating documentation on migration from RAG Tool to Vector Stores and
Files APIs
- Adding exception handling for Vector Stores in RAG Tool
- Add more tests on migration from RAG Tool to Vector Stores
- Migrate off of inference_api for context_retriever for RAG
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
Integration and unit tests added
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
some providers do not produce spec compliant outputs. when this happens
the replay infra will fail to construct the proper types and will return
a dict to the client. the client likely does not expect a dict.
this was discovered with tgi, which returns finish_reason="" when valid
values are "stop", "length" or "content_filter"
## Test Plan
ci
Fixes #3370
AWS switched to requiring region-prefixed inference profile IDs instead
of foundation model IDs for on-demand throughput. This was causing
ValidationException errors.
Added auto-detection based on boto3 client region to convert model IDs
like meta.llama3-1-70b-instruct-v1:0 to
us.meta.llama3-1-70b-instruct-v1:0 depending on the detected region.
Also handles edge cases like ARNs, case insensitive regions, and None
regions.
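Roughly, the conversion works like this (a hedged sketch; the real code also covers additional edge cases and region prefixes):
```python
def to_inference_profile_id(model_id: str, region: str | None) -> str:
    # ARNs, unknown regions, and already-prefixed IDs are passed through untouched.
    if model_id.startswith("arn:") or region is None:
        return model_id
    prefix = region.lower().split("-")[0]  # e.g. "US-EAST-1" -> "us"
    if model_id.startswith(f"{prefix}."):
        return model_id
    return f"{prefix}.{model_id}"
```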
Tested with this request.
```json
{
"model_id": "meta.llama3-1-8b-instruct-v1:0",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "tell me a riddle"
}
],
"sampling_params": {
"strategy": {
"type": "top_p",
"temperature": 0.7,
"top_p": 0.9
},
"max_tokens": 512
}
}
```
<img width="1488" height="878" alt="image"
src="https://github.com/user-attachments/assets/0d61beec-3869-4a31-8f37-9f554c280b88"
/>
# What does this PR do?
The openai package is already a dependency of the llama-stack project
itself, so let the project dictate which openai version we need and
avoid potential breakage with unsatisfiable dependency resolution.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Duplicate chat completion IDs can be generated during tests especially
if they are replaying recorded responses across different tests. No need
to warn or error under those circumstances. In the wild, this is not
likely to happen at all (no evidence) so we aren't really hiding any
problem.
Bumps [openai](https://github.com/openai/openai-python) from 1.102.0 to
1.106.1.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/openai/openai-python/releases">openai's
releases</a>.</em></p>
<blockquote>
<h2>v1.106.1</h2>
<h2>1.106.1 (2025-09-04)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v1.106.0...v1.106.1">v1.106.0...v1.106.1</a></p>
<h3>Chores</h3>
<ul>
<li><strong>internal:</strong> move mypy configurations to
<code>pyproject.toml</code> file (<a
href="ca413a2774">ca413a2</a>)</li>
</ul>
<h2>v1.106.0</h2>
<h2>1.106.0 (2025-09-04)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v1.105.0...v1.106.0">v1.105.0...v1.106.0</a></p>
<h3>Features</h3>
<ul>
<li><strong>client:</strong> support callable api_key (<a
href="https://redirect.github.com/openai/openai-python/issues/2588">#2588</a>)
(<a
href="e1bad015b8">e1bad01</a>)</li>
<li>improve future compat with pydantic v3 (<a
href="6645d9317a">6645d93</a>)</li>
</ul>
<h2>v1.105.0</h2>
<h2>1.105.0 (2025-09-03)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v1.104.2...v1.105.0">v1.104.2...v1.105.0</a></p>
<h3>Features</h3>
<ul>
<li><strong>api:</strong> Add gpt-realtime models (<a
href="8502041480">8502041</a>)</li>
</ul>
<h2>v1.104.2</h2>
<h2>1.104.2 (2025-09-02)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v1.104.1...v1.104.2">v1.104.1...v1.104.2</a></p>
<h3>Bug Fixes</h3>
<ul>
<li><strong>types:</strong> add aliases back for web search tool types
(<a
href="2521cd8445">2521cd8</a>)</li>
</ul>
<h2>v1.104.1</h2>
<h2>1.104.1 (2025-09-02)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v1.104.0...v1.104.1">v1.104.0...v1.104.1</a></p>
<h3>Chores</h3>
<ul>
<li><strong>api:</strong> manual updates for ResponseInputAudio (<a
href="0db5061966">0db5061</a>)</li>
</ul>
<h2>v1.104.0</h2>
<h2>1.104.0 (2025-09-02)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v1.103.0...v1.104.0">v1.103.0...v1.104.0</a></p>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/openai/openai-python/blob/main/CHANGELOG.md">openai's
changelog</a>.</em></p>
<blockquote>
<h2>1.106.1 (2025-09-04)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v1.106.0...v1.106.1">v1.106.0...v1.106.1</a></p>
<h3>Chores</h3>
<ul>
<li><strong>internal:</strong> move mypy configurations to
<code>pyproject.toml</code> file (<a
href="ca413a2774">ca413a2</a>)</li>
</ul>
<h2>1.106.0 (2025-09-04)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v1.105.0...v1.106.0">v1.105.0...v1.106.0</a></p>
<h3>Features</h3>
<ul>
<li><strong>client:</strong> support callable api_key (<a
href="https://redirect.github.com/openai/openai-python/issues/2588">#2588</a>)
(<a
href="e1bad015b8">e1bad01</a>)</li>
<li>improve future compat with pydantic v3 (<a
href="6645d9317a">6645d93</a>)</li>
</ul>
<h2>1.105.0 (2025-09-03)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v1.104.2...v1.105.0">v1.104.2...v1.105.0</a></p>
<h3>Features</h3>
<ul>
<li><strong>api:</strong> Add gpt-realtime models (<a
href="8502041480">8502041</a>)</li>
</ul>
<h2>1.104.2 (2025-09-02)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v1.104.1...v1.104.2">v1.104.1...v1.104.2</a></p>
<h3>Bug Fixes</h3>
<ul>
<li><strong>types:</strong> add aliases back for web search tool types
(<a
href="2521cd8445">2521cd8</a>)</li>
</ul>
<h2>1.104.1 (2025-09-02)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v1.104.0...v1.104.1">v1.104.0...v1.104.1</a></p>
<h3>Chores</h3>
<ul>
<li><strong>api:</strong> manual updates for ResponseInputAudio (<a
href="0db5061966">0db5061</a>)</li>
</ul>
<h2>1.104.0 (2025-09-02)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v1.103.0...v1.104.0">v1.103.0...v1.104.0</a></p>
<h3>Features</h3>
<ul>
<li><strong>types:</strong> replace List[str] with SequenceNotStr in
params (<a
href="bc00bda880">bc00bda</a>)</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="2adf111129"><code>2adf111</code></a>
release: 1.106.1</li>
<li><a
href="c4f9d0b997"><code>c4f9d0b</code></a>
chore(internal): move mypy configurations to <code>pyproject.toml</code>
file</li>
<li><a
href="2de8d7cde5"><code>2de8d7c</code></a>
release: 1.106.0</li>
<li><a
href="2cf4ed5072"><code>2cf4ed5</code></a>
feat: improve future compat with pydantic v3</li>
<li><a
href="25d16be18b"><code>25d16be</code></a>
feat(client): support callable api_key (<a
href="https://redirect.github.com/openai/openai-python/issues/2588">#2588</a>)</li>
<li><a
href="8672413735"><code>8672413</code></a>
release: 1.105.0</li>
<li><a
href="2c60d78b37"><code>2c60d78</code></a>
feat(api): Add gpt-realtime models</li>
<li><a
href="a52463c932"><code>a52463c</code></a>
release: 1.104.2</li>
<li><a
href="5a6931dafd"><code>5a6931d</code></a>
fix(types): add aliases back for web search tool types</li>
<li><a
href="fb152d967e"><code>fb152d9</code></a>
release: 1.104.1</li>
<li>Additional commits viewable in <a
href="https://github.com/openai/openai-python/compare/v1.102.0...v1.106.1">compare
view</a></li>
</ul>
</details>
<br />
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [locust](https://github.com/locustio/locust) from 2.39.1 to
2.40.1.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/locustio/locust/releases">locust's
releases</a>.</em></p>
<blockquote>
<h2>2.40.1</h2>
<h2>What's Changed</h2>
<ul>
<li>Pytest plugin: Delay imports to avoid monkey patching until someone
uses the fixtures by <a
href="https://github.com/cyberw"><code>@cyberw</code></a> in <a
href="https://redirect.github.com/locustio/locust/pull/3204">locustio/locust#3204</a></li>
<li>Move pytest plugin to its own directory, to prevent accidental
import by <a href="https://github.com/cyberw"><code>@cyberw</code></a>
in <a
href="https://redirect.github.com/locustio/locust/pull/3205">locustio/locust#3205</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/locustio/locust/compare/2.40.0...2.40.1">https://github.com/locustio/locust/compare/2.40.0...2.40.1</a></p>
<h2>2.40.0</h2>
<h2>What's Changed</h2>
<ul>
<li>Refactor FastHttpSession to be more like HttpSession by <a
href="https://github.com/cyberw"><code>@cyberw</code></a> in <a
href="https://redirect.github.com/locustio/locust/pull/3198">locustio/locust#3198</a></li>
<li>Update Dockerfile base to Python 3.13 by <a
href="https://github.com/adaamz"><code>@adaamz</code></a> in <a
href="https://redirect.github.com/locustio/locust/pull/3193">locustio/locust#3193</a></li>
<li>Avoid exception in HttpUser if requests has lost track of the
request it made by <a
href="https://github.com/cyberw"><code>@cyberw</code></a> in <a
href="https://redirect.github.com/locustio/locust/pull/3201">locustio/locust#3201</a></li>
<li>Support pytests as locustfiles by <a
href="https://github.com/cyberw"><code>@cyberw</code></a> in <a
href="https://redirect.github.com/locustio/locust/pull/3200">locustio/locust#3200</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a href="https://github.com/adaamz"><code>@adaamz</code></a> made
their first contribution in <a
href="https://redirect.github.com/locustio/locust/pull/3193">locustio/locust#3193</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/locustio/locust/compare/2.39.1...2.40.0">https://github.com/locustio/locust/compare/2.39.1...2.40.0</a></p>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/locustio/locust/blob/master/CHANGELOG.md">locust's
changelog</a>.</em></p>
<blockquote>
<h1>Detailed changelog</h1>
<p>The most important changes can also be found in <a
href="https://docs.locust.io/en/latest/changelog.html">the
documentation</a>.</p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="5df19da06a"><code>5df19da</code></a>
Merge pull request <a
href="https://redirect.github.com/locustio/locust/issues/3205">#3205</a>
from locustio/move-pytest-plugin-to-own-directory</li>
<li><a
href="d41141bedd"><code>d41141b</code></a>
Move pytest plugin to its own directory, to prevent accidental import of
locu...</li>
<li><a
href="6422848afd"><code>6422848</code></a>
mention that only one locustfile can be distributed</li>
<li><a
href="aa3da739fe"><code>aa3da73</code></a>
Merge pull request <a
href="https://redirect.github.com/locustio/locust/issues/3204">#3204</a>
from locustio/delay-imports-in-pytest-plugin-to-avoi...</li>
<li><a
href="12050dedfd"><code>12050de</code></a>
Pytest plugin: Delay imports to avoid monkey patching until someone
actually ...</li>
<li><a
href="488d1f8491"><code>488d1f8</code></a>
docs</li>
<li><a
href="439b7ab91b"><code>439b7ab</code></a>
docs fix</li>
<li><a
href="fcd76a8ac3"><code>fcd76a8</code></a>
docs: rephrase</li>
<li><a
href="70c7e9b2d8"><code>70c7e9b</code></a>
docs: move pytest further up</li>
<li><a
href="06dbf98013"><code>06dbf98</code></a>
docs: fix link</li>
<li>Additional commits viewable in <a
href="https://github.com/locustio/locust/compare/2.39.1...2.40.1">compare
view</a></li>
</ul>
</details>
<br />
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [pytest](https://github.com/pytest-dev/pytest) from 8.4.1 to
8.4.2.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/pytest-dev/pytest/releases">pytest's
releases</a>.</em></p>
<blockquote>
<h2>8.4.2</h2>
<h1>pytest 8.4.2 (2025-09-03)</h1>
<h2>Bug fixes</h2>
<ul>
<li>
<p><a
href="https://redirect.github.com/pytest-dev/pytest/issues/13478">#13478</a>:
Fixed a crash when using
<code>console_output_style</code>{.interpreted-text
role="confval"} with <code>times</code> and a module is
skipped.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/pytest-dev/pytest/issues/13530">#13530</a>:
Fixed a crash when using <code>pytest.approx</code>{.interpreted-text
role="func"} and
<code>decimal.Decimal</code>{.interpreted-text role="class"}
instances with the <code>decimal.FloatOperation</code>{.interpreted-text
role="class"} trap set.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/pytest-dev/pytest/issues/13549">#13549</a>:
No longer evaluate type annotations in Python <code>3.14</code> when
inspecting function signatures.</p>
<p>This prevents crashes during module collection when modules do not
explicitly use <code>from __future__ import annotations</code> and
import types for annotations within a <code>if TYPE_CHECKING:</code>
block.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/pytest-dev/pytest/issues/13559">#13559</a>:
Added missing [int]{.title-ref} and [float]{.title-ref} variants to the
[Literal]{.title-ref} type annotation of the [type]{.title-ref}
parameter in <code>pytest.Parser.addini</code>{.interpreted-text
role="meth"}.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/pytest-dev/pytest/issues/13563">#13563</a>:
<code>pytest.approx</code>{.interpreted-text role="func"} now
only imports <code>numpy</code> if NumPy is already in
<code>sys.modules</code>. This fixes unconditional import behavior
introduced in [8.4.0]{.title-ref}.</p>
</li>
</ul>
<h2>Improved documentation</h2>
<ul>
<li><a
href="https://redirect.github.com/pytest-dev/pytest/issues/13577">#13577</a>:
Clarify that <code>pytest_generate_tests</code> is discovered in test
modules/classes; other hooks must be in <code>conftest.py</code> or
plugins.</li>
</ul>
<h2>Contributor-facing changes</h2>
<ul>
<li><a
href="https://redirect.github.com/pytest-dev/pytest/issues/13480">#13480</a>:
Self-testing: fixed a few test failures when run with
<code>-Wdefault</code> or a similar override.</li>
<li><a
href="https://redirect.github.com/pytest-dev/pytest/issues/13547">#13547</a>:
Self-testing: corrected expected message for
<code>test_doctest_unexpected_exception</code> in Python
<code>3.14</code>.</li>
<li><a
href="https://redirect.github.com/pytest-dev/pytest/issues/13684">#13684</a>:
Make pytest's own testsuite insensitive to the presence of the
<code>CI</code> environment variable -- by
<code>ogrisel</code>{.interpreted-text role="user"}.</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="bfae4224fd"><code>bfae422</code></a>
Prepare release version 8.4.2</li>
<li><a
href="89905381a1"><code>8990538</code></a>
Fix passenv CI in tox ini and make tests insensitive to the presence of
the C...</li>
<li><a
href="ca676bfe00"><code>ca676bf</code></a>
Merge pull request <a
href="https://redirect.github.com/pytest-dev/pytest/issues/13687">#13687</a>
from pytest-dev/patchback/backports/8.4.x/e63f6e51c...</li>
<li><a
href="975a60a63c"><code>975a60a</code></a>
Merge pull request <a
href="https://redirect.github.com/pytest-dev/pytest/issues/13686">#13686</a>
from pytest-dev/patchback/backports/8.4.x/12bde8af6...</li>
<li><a
href="7723ce84b8"><code>7723ce8</code></a>
Merge pull request <a
href="https://redirect.github.com/pytest-dev/pytest/issues/13683">#13683</a>
from even-even/fix_Exeption_to_Exception_in_errorMe...</li>
<li><a
href="b7f05680d1"><code>b7f0568</code></a>
Merge pull request <a
href="https://redirect.github.com/pytest-dev/pytest/issues/13685">#13685</a>
from CoretexShadow/fix/docs-pytest-generate-tests</li>
<li><a
href="2c94c4a694"><code>2c94c4a</code></a>
add missing colon (<a
href="https://redirect.github.com/pytest-dev/pytest/issues/13640">#13640</a>)
(<a
href="https://redirect.github.com/pytest-dev/pytest/issues/13641">#13641</a>)</li>
<li><a
href="c3d7684bc0"><code>c3d7684</code></a>
Merge pull request <a
href="https://redirect.github.com/pytest-dev/pytest/issues/13606">#13606</a>
from pytest-dev/patchback/backports/8.4.x/5f9938563...</li>
<li><a
href="dc6e3be2dd"><code>dc6e3be</code></a>
Merge pull request <a
href="https://redirect.github.com/pytest-dev/pytest/issues/13605">#13605</a>
from The-Compiler/training-update-2025-07</li>
<li><a
href="f87289c36c"><code>f87289c</code></a>
Fix crash with <code>times</code> output style and skipped module (<a
href="https://redirect.github.com/pytest-dev/pytest/issues/13573">#13573</a>)
(<a
href="https://redirect.github.com/pytest-dev/pytest/issues/13579">#13579</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/pytest-dev/pytest/compare/8.4.1...8.4.2">compare
view</a></li>
</ul>
</details>
<br />
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
* Adds a horizontal nav bar for easy access to the API reference and the
Llama Stack GitHub repo
<img width="2696" height="520" alt="image"
src="https://github.com/user-attachments/assets/82daffe1-c206-4e20-b95b-1e090011eecc"
/>
## Test Plan
* Built the docs and ran the local HTML server to verify changes
# What does this PR do?
Adds a write worker queue for writes to inference store. This avoids
overwhelming request processing with slow inference writes.
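Conceptually, the worker looks something like this (a minimal sketch assuming an asyncio-based server; names are illustrative, not the actual inference-store code):
```python
import asyncio


class InferenceWriteQueue:
    """Buffers slow store writes so request handlers can return immediately."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self._worker: asyncio.Task | None = None

    async def start(self) -> None:
        # Must be called from within the running event loop.
        self._worker = asyncio.create_task(self._drain())

    async def _drain(self) -> None:
        while True:
            write_fn, args = await self._queue.get()
            try:
                await write_fn(*args)
            finally:
                self._queue.task_done()

    def enqueue(self, write_fn, *args) -> None:
        # Request handlers enqueue and return; the slow DB write happens later.
        self._queue.put_nowait((write_fn, args))
```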
## Test Plan
Benchmark:
```
cd /docs/source/distributions/k8s-benchmark
# start mock server
python openai-mock-server.py --port 8000
# start stack server
LLAMA_STACK_LOGGING="all=WARNING" uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml
# run benchmark script
uv run python3 benchmark.py --duration 120 --concurrent 50 --base-url=http://localhost:8321/v1/openai/v1 --model=vllm-inference/meta-llama/Llama-3.2-3B-Instruct
```
## RPS from 21 -> 57
# What does this PR do?
- Use BackgroundLogger when logging metric events.
- Reuse event loop in BackgroundLogger
## Test Plan
```
cd /docs/source/distributions/k8s-benchmark
# start mock server
python openai-mock-server.py --port 8000
# start stack server
LLAMA_STACK_LOGGING="all=WARNING" uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml
# run benchmark script
uv run python3 benchmark.py --duration 120 --concurrent 50 --base-url=http://localhost:8321/v1/openai/v1 --model=vllm-inference/meta-llama/Llama-3.2-3B-Instruct
```
### RPS from 57 -> 62
# What does this PR do?
Fix fireworks chat completion broken due to telemetry expecting
response.usage
Closes https://github.com/llamastack/llama-stack/issues/3391
## Test Plan
1. `uv run --with llama-stack llama stack build --distro starter
--image-type venv --run`
Try
```
curl -X POST http://0.0.0.0:8321/v1/openai/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "fireworks/accounts/fireworks/models/llama-v3p1-8b-instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'
```
```
{"id":"chatcmpl-ee922a08-0df0-4974-b0d3-b322113e8bc0","choices":[{"message":{"role":"assistant","content":"Hello! How can I assist you today?","name":null,"tool_calls":null},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","created":1757456375,"model":"fireworks/accounts/fireworks/models/llama-v3p1-8b-instruct"}%
```
Without fix fails as mentioned in
https://github.com/llamastack/llama-stack/issues/3391
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
update VertexAI inference provider to use openai-python for
openai-compat functions
## Test Plan
```
$ VERTEX_AI_PROJECT=... uv run llama stack build --image-type venv --providers inference=remote::vertexai --run
...
$ LLAMA_STACK_CONFIG=http://localhost:8321 uv run --group test pytest -v -ra --text-model vertexai/vertex_ai/gemini-2.5-flash tests/integration/inference/test_openai_completion.py
...
```
i don't have an account to test this. `get_api_key` may also need to be
updated per
https://cloud.google.com/vertex-ai/generative-ai/docs/start/openai
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
Fix pre-commit issues: non executable shebang file, @pytest.mark.asyncio
decorator
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
The test_query_adds_vector_db_id_to_chunk_metadata test was failing
because MemoryToolRuntimeImpl.__init__() now requires a files_api
parameter.
Fixes failing unit tests for Python 3.12 and 3.13.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
When running RAG in a multi vector DB setting, it can be difficult to
trace where retrieved chunks originate from. This PR adds the
`vector_db_id` into each chunk’s metadata, making it easier to
understand which database a given chunk came from. This is helpful for
debugging and for analyzing retrieval behavior of multiple DBs.
Relevant code:
```python
for vector_db_id, result in zip(vector_db_ids, results):
    for chunk, score in zip(result.chunks, result.scores):
        if not hasattr(chunk, "metadata") or chunk.metadata is None:
            chunk.metadata = {}
        chunk.metadata["vector_db_id"] = vector_db_id
        chunks.append(chunk)
        scores.append(score)
```
## Test Plan
* Ran Llama Stack in debug mode.
* Verified that `vector_db_id` was added to each chunk’s metadata.
* Confirmed that the metadata was printed in the console when using the
RAG tool.
---------
Co-authored-by: are-ces <cpompeia@redhat.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
enables completions storage when using `llama stack build --providers` -
- GET /v1/chat/completions
- GET /v1/chat/completions/{id}
todo: llama stack build and distro codegen should use the same code
paths
## Test Plan
ci
This PR refactors the integration test system to use global "setups"
which provides better separation of concerns:
**suites = what to test, setups = how to configure.**
NOTE: if you have naming suggestions, please provide feedback
Changes:
- New `tests/integration/setups.py` with global, reusable configurations
(ollama, vllm, gpt, claude)
- Modified `scripts/integration-tests.sh` options to match with the
underlying pytest options
- Updated documentation to reflect the new global setup system
The main benefit is that setups can be reused across multiple suites
(e.g., use "gpt" with any suite) even though sometimes they could
specifically tailored for a suite (vision <> ollama-vision). It is now
easier to add new configurations without modifying existing suites.
Usage examples:
- `pytest tests/integration --suite=responses --setup=gpt`
- `pytest tests/integration --suite=vision` # auto-selects
"ollama-vision" setup
- `pytest tests/integration --suite=base --setup=vllm`
# What does this PR do?
This PR adds support for OpenAI Prompts API.
Note, OpenAI does not explicitly expose the Prompts API but instead
makes it available in the Responses API and in the [Prompts
Dashboard](https://platform.openai.com/docs/guides/prompting#create-a-prompt).
I have added the following APIs:
- CREATE
- GET
- LIST
- UPDATE
- Set Default Version
The Set Default Version API is made available only in the Prompts
Dashboard and configures which prompt version is returned in the GET
(the latest version is the default).
Overall, the expected functionality in Responses will look like this:
```python
from openai import OpenAI
client = OpenAI()
response = client.responses.create(
    prompt={
        "id": "pmpt_68b0c29740048196bd3a6e6ac3c4d0e20ed9a13f0d15bf5e",
        "version": "2",
        "variables": {
            "city": "San Francisco",
            "age": 30,
        }
    }
)
```
### Resolves https://github.com/llamastack/llama-stack/issues/3276
## Test Plan
Unit tests added. Integration tests can be added after client
generation.
## Next Steps
1. Update Responses API to support Prompt API
2. I'll enhance the UI to implement the Prompt Dashboard.
3. Add cache for lower latency
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Add Kubernetes authentication provider support
- Add KubernetesAuthProvider class for token validation using Kubernetes
SelfSubjectReview API
- Add KubernetesAuthProviderConfig with configurable API server URL, TLS
settings, and claims mapping
- Implement authentication via POST requests to
/apis/authentication.k8s.io/v1/selfsubjectreviews endpoint
- Add support for parsing Kubernetes SelfSubjectReview response format
to extract user information
- Add KUBERNETES provider type to AuthProviderType enum
- Update create_auth_provider factory function to handle 'kubernetes'
provider type
- Add comprehensive unit tests for KubernetesAuthProvider functionality
- Add documentation with configuration examples and usage instructions
The provider validates tokens by sending SelfSubjectReview requests to
the Kubernetes API server and extracts user information from the
userInfo structure in the response.
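A hedged sketch of that request/response flow (illustrative; the actual provider also applies the configured TLS settings and claims mapping):
```python
import httpx


async def validate_token(api_server_url: str, token: str) -> dict:
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"{api_server_url}/apis/authentication.k8s.io/v1/selfsubjectreviews",
            headers={
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json",
            },
            json={"apiVersion": "authentication.k8s.io/v1", "kind": "SelfSubjectReview"},
        )
        resp.raise_for_status()
        # User identity is under status.userInfo in the SelfSubjectReview response.
        return resp.json()["status"]["userInfo"]
```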
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
What this verifies:
- Authentication header validation
- Token validation with Kubernetes SelfSubjectReview against the Kubernetes API server endpoint
- Error handling for invalid tokens and HTTP errors
- Request payload structure and headers
```
python -m pytest tests/unit/server/test_auth.py -k "kubernetes" -v
```
Signed-off-by: Akram Ben Aissi <akram.benaissi@gmail.com>
Bumps
[@radix-ui/react-dropdown-menu](https://github.com/radix-ui/primitives)
from 2.1.14 to 2.1.16.
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/radix-ui/primitives/commits">compare
view</a></li>
</ul>
</details>
<br />
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps
[react-dom](https://github.com/facebook/react/tree/HEAD/packages/react-dom)
and
[@types/react-dom](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react-dom).
These dependencies needed to be updated together.
Updates `react-dom` from 19.1.0 to 19.1.1
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/facebook/react/releases">react-dom's
releases</a>.</em></p>
<blockquote>
<h2>19.1.1 (July 28, 2025)</h2>
<h3>React</h3>
<ul>
<li>Fixed Owner Stacks to work with ES2015 function.name semantics (<a
href="https://redirect.github.com/facebook/react/pull/33680">#33680</a>
by <a href="https://github.com/hoxyq"><code>@hoxyq</code></a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/facebook/react/blob/main/CHANGELOG.md">react-dom's
changelog</a>.</em></p>
<blockquote>
<h2>19.1.1 (July 28, 2025)</h2>
<h3>React</h3>
<ul>
<li>Fixed Owner Stacks to work with ES2015 function.name semantics (<a
href="https://redirect.github.com/facebook/react/pull/33680">#33680</a>
by <a href="https://github.com/hoxyq"><code>@hoxyq</code></a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="87e33ca2b7"><code>87e33ca</code></a>
Set release versions to 19.1.1</li>
<li><a
href="b793948e15"><code>b793948</code></a>
Bump next prerelease version numbers (<a
href="https://github.com/facebook/react/tree/HEAD/packages/react-dom/issues/32782">#32782</a>)</li>
<li>See full diff in <a
href="https://github.com/facebook/react/commits/v19.1.1/packages/react-dom">compare
view</a></li>
</ul>
</details>
<br />
Updates `@types/react-dom` from 19.1.5 to 19.1.9
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react-dom">compare
view</a></li>
</ul>
</details>
<br />
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
update the Anthropic inference provider to use openai-python for the
openai-compat endpoints
## Test Plan
ci
Co-authored-by: raghotham <rsm@meta.com>
# What does this PR do?
update Groq inference provider to use OpenAIMixin for openai-compat
endpoints
changes on api.groq.com -
- json_schema is now supported for specific models, see
https://console.groq.com/docs/structured-outputs#supported-models
- response_format with streaming is now supported for models that
support response_format
- groq no longer returns a 400 error if tools are provided and
tool_choice is not "required"
## Test Plan
```
$ GROQ_API_KEY=... uv run llama stack build --image-type venv --providers inference=remote::groq --run
...
$ LLAMA_STACK_CONFIG=http://localhost:8321 uv run --group test pytest -v -ra --text-model groq/llama-3.3-70b-versatile tests/integration/inference/test_openai_completion.py -k 'not store'
...
SKIPPED [3] tests/integration/inference/test_openai_completion.py:44: Model groq/llama-3.3-70b-versatile hosted by remote::groq doesn't support OpenAI completions.
SKIPPED [3] tests/integration/inference/test_openai_completion.py:94: Model groq/llama-3.3-70b-versatile hosted by remote::groq doesn't support vllm extra_body parameters.
SKIPPED [4] tests/integration/inference/test_openai_completion.py:73: Model groq/llama-3.3-70b-versatile hosted by remote::groq doesn't support n param.
SKIPPED [1] tests/integration/inference/test_openai_completion.py:100: Model groq/llama-3.3-70b-versatile hosted by remote::groq doesn't support chat completion calls with base64 encoded files.
======================= 8 passed, 11 skipped, 8 deselected, 2 warnings in 5.13s ========================
```
---------
Co-authored-by: raghotham <rsm@meta.com>
# What does this PR do?
this test runs on each PR and uses a new conformance workflow to compare
the base (main) branch openapi spec to the one on this PR. If one of our
"stable" APIs changes, the test will fail.
this workflow uses `oasdiff` to identify breaking changes for paths we
want to ensure compatibility for.
specifically this is using `oasdiff breaking` with `--match-path` which
only checks breaking changes for the specified paths.
As a follow up to this, we can add an optional way to make it so that it
is OK to make these changes if properly documented, e.g. in a changelog,
or by using a label on the PR to override the failing test.
related to #3237
## Test Plan
conformance test should pass given there are no changes
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
update SambaNova inference provider to use OpenAIMixin for openai-compat
endpoints
## Test Plan
```
$ SAMBANOVA_API_KEY=... uv run llama stack build --image-type venv --providers inference=remote::sambanova --run
...
$ LLAMA_STACK_CONFIG=http://localhost:8321 uv run --group test pytest -v -ra --text-model sambanova/Meta-Llama-3.3-70B-Instruct tests/integration/inference -k 'not store'
...
FAILED tests/integration/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=sambanova/Meta-Llama-3.3-70B-Instruct-inference:chat_completion:tool_calling_tools_absent-True] - AttributeError: 'NoneType' object has no attribute 'delta'
FAILED tests/integration/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=sambanova/Meta-Llama-3.3-70B-Instruct-inference:chat_completion:tool_calling_tools_absent-False] - llama_stack_client.InternalServerError: Error code: 500 - {'detail': 'Internal server error: An une...
=========== 2 failed, 16 passed, 68 skipped, 8 deselected, 3 xfailed, 13 warnings in 15.85s ============
```
the two failures also existed before this change. they are part of the
deprecated inference.chat_completion tests that flow through litellm.
they can be resolved later.
# What does this PR do?
update the Gemini inference provider to use openai-python for the
openai-compat endpoints
partially addresses #3349, does not address /inference/completion or
/inference/chat-completion
## Test Plan
ci
Our integration tests need to be 'grouped' because each group often
needs a specific set of models it works with. We separated vision tests
due to this, and we have a separate set of tests which test "Responses"
API.
This PR makes this system a bit more official so it is very easy to
target these groups and apply all testing infrastructure towards all the
groups (for example, record-replay) uniformly.
There are three suites declared:
- base
- vision
- responses
Note that our CI currently runs the "base" and "vision" suites.
You can use the `--suite` option when running pytest (or any of the
testing scripts or workflows.) For example:
```
OLLAMA_URL=http://localhost:11434 \
pytest -s -v tests/integration/ --stack-config starter --suite vision
```
# What does this PR do?
This change migrates the VectorDB id generation to Vector Stores.
This is a breaking change for **_some users_** that may have application
code using the `vector_db_id` parameter in the request of the VectorDB
protocol instead of the `VectorDB.identifier` in the response.
By default we will now create a Vector Store every time we register a
VectorDB. The caveat with this approach is that this maps the
`vector_db_id` → `vector_store.name`. This is a reasonable tradeoff to
transition users towards OpenAI Vector Stores.
As an added benefit, registering VectorDBs will result in them appearing
in the VectorStores admin UI.
### Why?
This PR makes the `POST` API call to `/v1/vector-dbs` swap the
`vector_db_id` parameter in the **request body** into the VectorStore's
name field and sets the `vector_db_id` to the generated vector store id
(e.g., `vs_038247dd-4bbb-4dbb-a6be-d5ecfd46cfdb`).
That means that users would have to do something like follows in their
application code:
```python
res = client.vector_dbs.register(
vector_db_id='my-vector-db-id',
embedding_model='ollama/all-minilm:l6-v2',
embedding_dimension=384,
)
vector_db_id = res.identifier
```
And then the rest of their code would behave as before, including `VectorIO`'s
insert protocol using `vector_db_id` in the request.
An alternative implementation would be to just delete the `vector_db_id`
parameter in `VectorDB` but the end result would still require users
having to write `vector_db_id = res.identifier` since
`VectorStores.create()` generates the ID for you.
So this approach felt the easiest way to migrate users towards
VectorStores (subsequent PRs will be added to trigger `files.create()`
and `vector_stores.files.create()`).
## Test Plan
Unit tests and integration tests have been added.
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Improved bedrock provider config to read from environment variables like
AWS_ACCESS_KEY_ID. Updated all
fields to use default_factory with lambda patterns like the nvidia
provider does.
Now the environment variables work as documented.
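For illustration, the pattern looks roughly like this (field names are assumptions, not the exact config class):
```python
import os

from pydantic import BaseModel, Field


class BedrockBaseConfig(BaseModel):
    aws_access_key_id: str | None = Field(
        default_factory=lambda: os.getenv("AWS_ACCESS_KEY_ID"),
        description="AWS access key, read from the environment when unset",
    )
    aws_secret_access_key: str | None = Field(
        default_factory=lambda: os.getenv("AWS_SECRET_ACCESS_KEY"),
        description="AWS secret key, read from the environment when unset",
    )
    region_name: str | None = Field(
        default_factory=lambda: os.getenv("AWS_DEFAULT_REGION"),
        description="AWS region, read from the environment when unset",
    )
```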
Closes #3305
## Test Plan
Ran the new bedrock config tests:
```bash
python -m pytest tests/unit/providers/inference/bedrock/test_config.py -v
```
Verified existing provider tests still work:
```bash
python -m pytest tests/unit/providers/test_configs.py -v
```
# What does this PR do?
The inference store writes were moved to asyncio.create_task and not
await anymore
## Test Plan
❯ OLLAMA_URL=http://localhost:11434 LLAMA_STACK_CONFIG=server:starter uv
run --with pytest-repeat pytest tests/integration/inference
--text-model="ollama/llama3.2:3b-instruct-fp16" -vvs -k
"test_inference_store_tool_calls and 3b-instruct-fp16-True" --count=10
Uninstalled 2 packages in 102ms
Installed 2 packages in 138ms
INFO 2025-09-04 14:10:17,775 tests.integration.conftest:66 tests:
Setting DISABLE_CODE_SANDBOX=1 for macOS
==========================================================================================================
test session starts
===========================================================================================================
platform darwin -- Python 3.12.3, pytest-8.4.1, pluggy-1.6.0 --
/Users/erichuang/.cache/uv/builds-v0/.tmpSGMlgt/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.12.3', 'Platform':
'macOS-15.6.1-arm64-arm-64bit', 'Packages': {'pytest': '8.4.1',
'pluggy': '1.6.0'}, 'Plugins': {'repeat': '0.9.4', 'anyio': '4.9.0',
'html': '4.1.1', 'socket': '0.7.0', 'asyncio': '1.1.0', 'json-report':
'1.5.0', 'timeout': '2.4.0', 'metadata': '3.1.1', 'cov': '6.2.1',
'nbval': '0.11.0'}}
rootdir: /Users/erichuang/projects/llama-stack-git
configfile: pyproject.toml
plugins: repeat-0.9.4, anyio-4.9.0, html-4.1.1, socket-0.7.0,
asyncio-1.1.0, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1,
cov-6.2.1, nbval-0.11.0
asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=None,
asyncio_default_test_loop_scope=function
collected 970 items / 950 deselected / 20 selected
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=ollama/llama3.2:3b-instruct-fp16-True-1-10]
instantiating llama_stack_client
Starting llama stack server with config 'starter' on port 8321...
Waiting for server at http://localhost:8321... (0.0s elapsed)
Waiting for server at http://localhost:8321... (0.5s elapsed)
Waiting for server at http://localhost:8321... (5.1s elapsed)
Waiting for server at http://localhost:8321... (5.6s elapsed)
Waiting for server at http://localhost:8321... (10.1s elapsed)
Waiting for server at http://localhost:8321... (10.6s elapsed)
Waiting for server at http://localhost:8321... (15.2s elapsed)
Waiting for server at http://localhost:8321... (15.7s elapsed)
Server is ready at http://localhost:8321
llama_stack_client instantiated in 20.583s
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=ollama/llama3.2:3b-instruct-fp16-True-2-10]
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=ollama/llama3.2:3b-instruct-fp16-True-3-10]
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=ollama/llama3.2:3b-instruct-fp16-True-4-10]
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=ollama/llama3.2:3b-instruct-fp16-True-5-10]
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=ollama/llama3.2:3b-instruct-fp16-True-6-10]
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=ollama/llama3.2:3b-instruct-fp16-True-7-10]
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=ollama/llama3.2:3b-instruct-fp16-True-8-10]
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=ollama/llama3.2:3b-instruct-fp16-True-9-10]
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=ollama/llama3.2:3b-instruct-fp16-True-10-10]
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=ollama/llama3.2:3b-instruct-fp16-True-1-10]
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=ollama/llama3.2:3b-instruct-fp16-True-2-10]
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=ollama/llama3.2:3b-instruct-fp16-True-3-10]
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=ollama/llama3.2:3b-instruct-fp16-True-4-10]
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=ollama/llama3.2:3b-instruct-fp16-True-5-10]
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=ollama/llama3.2:3b-instruct-fp16-True-6-10]
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=ollama/llama3.2:3b-instruct-fp16-True-7-10]
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=ollama/llama3.2:3b-instruct-fp16-True-8-10]
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=ollama/llama3.2:3b-instruct-fp16-True-9-10]
PASSED
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=ollama/llama3.2:3b-instruct-fp16-True-10-10]
PASSED
Terminating llama stack server process...
Terminating process 53307 and its group...
Server process and children terminated gracefully
# What does this PR do?
Fixes error handling when MCP server connections fail. Instead of
returning generic 500 errors, now provides
descriptive error messages with proper HTTP status codes.
Closes #3107
## Test Plan
Before fix:
curl -X GET
"http://localhost:8321/v1/tool-runtime/list-tools?tool_group_id=bad-mcp-server"
Returns: {"detail": "Internal server error: An unexpected error
occurred."} (500)
After fix:
curl -X GET
"http://localhost:8321/v1/tool-runtime/list-tools?tool_group_id=bad-mcp-server"
Returns: {"error": {"detail": "Failed to connect to MCP server at
http://localhost:9999/sse: Connection
refused"}} (502)
Tests:
- Added unit test for ConnectionError → 502 translation
- Manually tested with unreachable MCP servers (connection refused)
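For illustration, a minimal sketch of this kind of translation, assuming a FastAPI route and a hypothetical `list_mcp_tools` helper (these names and the exact error shape are not taken from the actual llama-stack code):
```python
from fastapi import FastAPI, HTTPException

app = FastAPI()

async def list_mcp_tools(endpoint: str) -> list[str]:
    # Stand-in for the real MCP client call, which can raise ConnectionError.
    raise ConnectionError("Connection refused")

@app.get("/v1/tool-runtime/list-tools")
async def list_tools(tool_group_id: str) -> list[str]:
    endpoint = "http://localhost:9999/sse"  # would normally be looked up from tool_group_id
    try:
        return await list_mcp_tools(endpoint)
    except ConnectionError as exc:
        # 502 Bad Gateway: the upstream MCP server could not be reached.
        raise HTTPException(
            status_code=502,
            detail=f"Failed to connect to MCP server at {endpoint}: {exc}",
        ) from exc
```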
# What does this PR do?
Noticed the test
https://github.com/llamastack/llama-stack-ops/actions/workflows/test-maybe-cut.yaml
are still failing randomly.
This was fixed earlier with fireworks 0.18.0 in
https://github.com/llamastack/llama-stack/pull/3267, but local testing with
`<=` may have inadvertently picked a lower version, since I assumed `<=`
would resolve to the latest version.
This time I tested with `==` to find the version where it broke, and pinned
(`<=`) to the version that was passing.
## Test Plan
Tested locally with the following commands to start a container:
1. Build container: `llama stack build --distro starter --image-type container`
2. Start container: `docker run -d -p 8321:8321 --name llama-stack-test distribution-starter:0.2.20`
3. Check health: `http://localhost:8321/v1/health`
The above steps fail without the fix.
Tested with `==` to ensure the same version is picked in local testing
instead of anything lower.
Following the fix from `fireworks-ai` here:
https://github.com/llamastack/llama-stack/issues/3273
- Wrap model loading with asyncio.to_thread() to prevent blocking during
model download/initialization
- Wrap encoding operations with asyncio.to_thread() to run in background
thread
- Convert _load_sentence_transformer_model() to async method
This ensures the async event loop remains responsive during embedding
operations.
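A minimal sketch of the pattern (assuming the provider wraps `sentence-transformers` directly; the function names here are illustrative, not the provider's actual methods):
```python
import asyncio

from sentence_transformers import SentenceTransformer

async def load_model(model_id: str) -> SentenceTransformer:
    # Model download/initialization can block for a long time; run it in a thread.
    return await asyncio.to_thread(SentenceTransformer, model_id)

async def embed(model: SentenceTransformer, texts: list[str]) -> list[list[float]]:
    # encode() is CPU-bound; to_thread keeps other coroutines running meanwhile.
    embeddings = await asyncio.to_thread(model.encode, texts)
    return embeddings.tolist()
```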
Closes: #3332
Signed-off-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR eliminates the hardcoded status codes `409` (CONFLICT) and `404`
(NOT_FOUND) in `server.py` by using `httpx` built-in constants. The
implementation follows the existing structure to improve readability,
extensibility, and developer experience. The same approach was already
applied in #3131
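For illustration, a hedged sketch of the idea, using `httpx.codes` in place of magic numbers (the exception-to-status mapping shown is simplified and not the exact `server.py` logic):
```python
import httpx
from fastapi import HTTPException

def translate_exception(exc: Exception) -> HTTPException:
    # Named constants instead of hardcoded 409 / 404 / 500 literals.
    if isinstance(exc, FileExistsError):
        return HTTPException(status_code=httpx.codes.CONFLICT, detail=str(exc))
    if isinstance(exc, FileNotFoundError):
        return HTTPException(status_code=httpx.codes.NOT_FOUND, detail=str(exc))
    return HTTPException(status_code=httpx.codes.INTERNAL_SERVER_ERROR, detail=str(exc))

print(translate_exception(FileNotFoundError("vector store not found")).status_code)  # 404
```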
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
`./scripts/unit-tests.sh`
Update the file pattern from 'llama_stack/templates' to
'llama_stack/distributions' to properly trigger the Distribution
Template Codegen hook when distribution files change.
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Sometimes the stream doesn't include a chunk with finish_reason (e.g. a
canceled stream), which raises a pydantic validation error because
OpenAIChoice.finish_reason is typed as str.
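A minimal sketch of the kind of fix this implies, assuming the field simply becomes optional (the model below is illustrative, not the actual llama-stack definition):
```python
from pydantic import BaseModel

class OpenAIChoice(BaseModel):
    text: str
    # Tolerate chunks that never carry a finish_reason (e.g. canceled streams).
    finish_reason: str | None = None

print(OpenAIChoice(text="partial output").finish_reason)  # None, no validation error
```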
## Test Plan
Observed that the error no longer occurs when benchmarking.
One needed to specify record-replay related environment variables for
running integration tests. We could not use defaults because integration
tests could be run against Ollama instances which could be running
different models. For example, text vs vision tests needed separate
instances of Ollama because a single instance typically cannot serve
both of these models under the standard CI worker configuration on GitHub.
As a result, `client.list()` as returned by the Ollama client
would be different between these runs and we'd end up overwriting
responses.
This PR "solves" it by adding a small amount of complexity -- we store
model list responses specially, keyed by the hashes of the models they
return. At replay time, we merge all of them and pretend that we have
the union of all models available.
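Conceptually, the record/replay behavior described above looks roughly like the sketch below (illustrative only; the real recorder's storage format and keys are assumptions here):
```python
import hashlib
import json

_model_list_store: dict[str, list[str]] = {}

def record_model_list(models: list[str]) -> None:
    # Key the recorded response by a hash of the models it contains.
    key = hashlib.sha256(json.dumps(sorted(models)).encode()).hexdigest()
    _model_list_store[key] = sorted(models)

def replay_model_list() -> list[str]:
    # Pretend the union of all recorded model lists is available.
    merged: set[str] = set()
    for models in _model_list_store.values():
        merged.update(models)
    return sorted(merged)

record_model_list(["llama3.2:3b-instruct-fp16"])  # text-serving Ollama instance
record_model_list(["llama3.2-vision:11b"])        # vision-serving Ollama instance
print(replay_model_list())  # both models appear available at replay time
```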
## Test Plan
Re-recorded all the tests using `scripts/integration-tests.sh
--inference-mode record`, including the vision tests.
Generated with CC:
Replace cryptic KeyError with clear, actionable error message that
shows:
- Which API the failing provider belongs to
- The provider ID and type that's failing
- Which dependency is missing
- Clear instructions on how to fix the issue
## Test Plan
Use a run config with Agents API and no safety provider
Before: KeyError: <Api.safety: 'safety'>
After: Failed to resolve 'agents' provider 'meta-reference' of type
'inline::meta-reference': required dependency 'safety' is not available.
Please add a 'safety' provider to your configuration or check if the
provider is properly configured.
# What does this PR do?
This PR updates the Watsonx provider dependencies from
`ibm_watson_machine_learning` to `ibm_watsonx_ai`.
The old package `ibm_watson_machine_learning` is in **deprecation mode**
([PyPI link](https://pypi.org/project/ibm-watson-machine-learning/))
and relies on older versions of dependencies such as `pandas`. Updating
to `ibm_watsonx_ai` ensures compatibility with current dependency
versions and ongoing support.
## Test Plan
I verified the update by running an inference using a model provided by
Watsonx. The model ran successfully, confirming that the new dependency
works as expected.
Co-authored-by: are-ces <cpompeia@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
The purpose of this PR is to refactor `SQLiteVecIndex` to eliminate
redundant code and simplify it by using the generic
`WeightedInMemoryAggregator`, which can be used for any vector db
provider. This pattern is already implemented for `PGVectorIndex` in
#3064
CC: @franciscojavierarceo
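Roughly, the kind of aggregation such a generic helper performs for hybrid search looks like the sketch below (illustrative only; the actual `WeightedInMemoryAggregator` API and scoring details may differ):
```python
def weighted_rerank(
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
    alpha: float = 0.5,
) -> dict[str, float]:
    # Combine normalized vector and keyword scores per chunk with one weight.
    all_ids = set(vector_scores) | set(keyword_scores)
    return {
        chunk_id: alpha * vector_scores.get(chunk_id, 0.0)
        + (1 - alpha) * keyword_scores.get(chunk_id, 0.0)
        for chunk_id in all_ids
    }

scores = weighted_rerank({"c1": 0.9, "c2": 0.4}, {"c2": 0.8, "c3": 0.6})
print(sorted(scores, key=scores.get, reverse=True))  # ['c2', 'c1', 'c3']
```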
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
1. `./scripts/unit-tests.sh`
2. Integration tests in CI Workflow
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR eliminates the `llama-api-client` dependency in `pyproject.toml`
because it is not used in the Llama Stack codebase
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
`./scripts/unit-tests.sh`
Bumps [locust](https://github.com/locustio/locust) from 2.39.0 to
2.39.1.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/locustio/locust/releases">locust's
releases</a>.</em></p>
<blockquote>
<h2>2.39.1</h2>
<h2>What's Changed</h2>
<ul>
<li>Avoid broken gevent version for now by <a
href="https://github.com/cyberw"><code>@cyberw</code></a> in <a
href="https://redirect.github.com/locustio/locust/pull/3196">locustio/locust#3196</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a href="https://github.com/JumboBear"><code>@JumboBear</code></a>
made their first contribution in <a
href="https://redirect.github.com/locustio/locust/pull/3195">locustio/locust#3195</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/locustio/locust/compare/2.39.0...2.39.1">https://github.com/locustio/locust/compare/2.39.0...2.39.1</a></p>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/locustio/locust/blob/master/CHANGELOG.md">locust's
changelog</a>.</em></p>
<blockquote>
<h1>Detailed changelog</h1>
<p>The most important changes can also be found in <a
href="https://docs.locust.io/en/latest/changelog.html">the
documentation</a>.</p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="934c5c33e4"><code>934c5c3</code></a>
changelog</li>
<li><a
href="9350084ec0"><code>9350084</code></a>
disable macos build for now</li>
<li><a
href="705e2f658b"><code>705e2f6</code></a>
Disable another unit test on macos because of annoying behavior on GH
(really...</li>
<li><a
href="d888b9db2b"><code>d888b9d</code></a>
Disable another unit test on macos because of annoying behavior on
GH</li>
<li><a
href="45bc4d84fd"><code>45bc4d8</code></a>
Disable annoying test case on macos for now. Only has issues on GH. <a
href="https://github.com/amadeupp"><code>@amadeupp</code></a>...</li>
<li><a
href="9d7710a2da"><code>9d7710a</code></a>
unit tests: give extra time for testing on macOS</li>
<li><a
href="fcbc740e04"><code>fcbc740</code></a>
Avoid broken gevent version for now (<a
href="https://redirect.github.com/locustio/locust/issues/3196">#3196</a>)</li>
<li><a
href="cd1f600d44"><code>cd1f600</code></a>
mypy</li>
<li><a
href="0cf52dc990"><code>0cf52dc</code></a>
Autogen changelog for 2.39.0</li>
<li><a
href="094395e024"><code>094395e</code></a>
Merge pull request <a
href="https://redirect.github.com/locustio/locust/issues/3195">#3195</a>
from JumboBear/pyproject</li>
<li>Additional commits viewable in <a
href="https://github.com/locustio/locust/compare/2.39.0...2.39.1">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [@radix-ui/react-tooltip](https://github.com/radix-ui/primitives)
from 1.2.6 to 1.2.8.
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/radix-ui/primitives/commits">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps
[@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node)
from 20.17.47 to 24.3.0.
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [framer-motion](https://github.com/motiondivision/motion) from
11.18.2 to 12.23.12.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/motiondivision/motion/blob/main/CHANGELOG.md">framer-motion's
changelog</a>.</em></p>
<blockquote>
<h2>[12.23.12] 2025-07-29</h2>
<h3>Added</h3>
<ul>
<li>Exporting internal APIs for use in view animations.</li>
</ul>
<h2>[12.23.11] 2025-07-28</h2>
<h3>Added</h3>
<ul>
<li>Children of variants with <code>delayChildren: stagger()</code> will
now be staggered correctly alongside their newly-entering siblings.</li>
</ul>
<h2>[12.23.10] 2025-07-28</h2>
<h3>Fixed</h3>
<ul>
<li>Fixed shared layout animation in situations where no
<code>motion</code> components have re-rendered between shared element
switching.</li>
</ul>
<h2>[12.23.9] 2025-07-24</h2>
<h3>Changed</h3>
<ul>
<li>Removing redundant <code>renderRequest</code>
<code>MotionValue</code> lifecycle.</li>
</ul>
<h2>[12.23.8] 2025-07-24</h2>
<h3>Fixed</h3>
<ul>
<li>Ensuring that when an animation is skipped via <code>duration =
0</code> that we also set <code>type = "keyframes"</code> so
that <code>duration</code> takes effect.</li>
</ul>
<h2>[12.23.7] 2025-07-23</h2>
<h3>Fixed</h3>
<ul>
<li><code>springValue</code> cleanup.</li>
<li>Removed additional <code>removeNode</code> from
<code>AnimatePresence</code> when using <code>popLayout</code>.</li>
</ul>
<h2>[12.23.6] 2025-07-11</h2>
<h3>Changed</h3>
<ul>
<li>Added explainer for reduced motion warning.</li>
<li>Refactored <code>motion</code> component creation to remove
indirection.</li>
</ul>
<h2>[12.23.5] 2025-07-11</h2>
<h3>Fixed</h3>
<ul>
<li>Fix animation timings within dynamically-generated popups.</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="e0f7e07570"><code>e0f7e07</code></a>
v12.23.12</li>
<li><a
href="994515fef3"><code>994515f</code></a>
Updating changelog</li>
<li><a
href="95d82ff919"><code>95d82ff</code></a>
Merge pull request <a
href="https://redirect.github.com/motiondivision/motion/issues/3338">#3338</a>
from motiondivision/feature/next-page-transitions</li>
<li><a
href="58b2e8cde4"><code>58b2e8c</code></a>
Exporting APIs for view transitions</li>
<li><a
href="b6f2132fb6"><code>b6f2132</code></a>
Update README.md</li>
<li><a
href="38298c41fc"><code>38298c4</code></a>
Update README.md</li>
<li><a
href="76396b0187"><code>76396b0</code></a>
Update README.md</li>
<li><a
href="b273d064a3"><code>b273d06</code></a>
Update README.md</li>
<li><a
href="c0bd6effa9"><code>c0bd6ef</code></a>
v12.23.11</li>
<li><a
href="e9b52af3e2"><code>e9b52af</code></a>
Updating changelog</li>
<li>Additional commits viewable in <a
href="https://github.com/motiondivision/motion/compare/v11.18.2...v12.23.12">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
BFCL scoring function is not supported, removing it.
Also includes minor fixes, since `llama stack run` was broken for
open-benchmark during test plan verification:
1. Correct the model paths for supported models
2. Fix another issue: `DatasetInput` has no `provider_id`, but the logger
assumes it exists.
```
File "/Users/swapna942/llama-stack/llama_stack/core/stack.py", line 332, in construct_stack
await register_resources(run_config, impls)
File "/Users/swapna942/llama-stack/llama_stack/core/stack.py", line 108, in register_resources
logger.debug(f"registering {rsrc.capitalize()} {obj} for provider {obj.provider_id}")
^^^^^^^^^^^^^^^
File "/Users/swapna942/llama-stack/.venv/lib/python3.13/site-packages/pydantic/main.py", line 991, in __getattr__
raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')
AttributeError: 'DatasetInput' object has no attribute 'provider_id'
```
## Test Plan
`llama stack build --distro open-benchmark --image-type venv` and running the server succeeds
Issue Link: https://github.com/llamastack/llama-stack/issues/3282
# What does this PR do?
Add the ability to use inequalities in the where clause of the sqlstore.
This is infrastructure for files expiration.
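A self-contained sketch of what inequality support in a where clause enables, e.g. selecting expired files (the `build_where` helper and operator syntax are illustrative, not the actual sqlstore API):
```python
import sqlite3
import time

_OPS = {"==": "=", "<": "<", "<=": "<=", ">": ">", ">=": ">="}

def build_where(where: dict[str, dict[str, object]]) -> tuple[str, list[object]]:
    # Translate {"column": {"<": value}} style conditions into SQL.
    clauses, params = [], []
    for column, condition in where.items():
        for op, value in condition.items():
            clauses.append(f"{column} {_OPS[op]} ?")
            params.append(value)
    return " AND ".join(clauses), params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id TEXT, expires_at INTEGER)")
conn.execute("INSERT INTO files VALUES ('old', 1), ('fresh', 9999999999)")

sql, params = build_where({"expires_at": {"<": int(time.time())}})
rows = conn.execute(f"SELECT id FROM files WHERE {sql}", params).fetchall()
print(rows)  # [('old',)] -- only the expired file matches
```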
## Test Plan
unit tests
# What does this PR do?
Fixes an incompatibility with the `openai` package introduced through the
new dependency chain fireworks-ai==0.19.18 -> reward-kit, by pinning
fireworks to an older version that doesn't pull in reward-kit.
## Test Plan
Tested locally with the following commands to start a container
1. Build container
`llama stack build --distro starter --image-type container`
2. start container `docker run -d -p 8321:8321 --name llama-stack-test
distribution-starter:0.2.19`
3. check health http://localhost:8321/v1/health
Above steps fails without the fix
# What does this PR do?
During env var replacement, we're implicitly converting all config types
to their apparent types (e.g., "true" to True, "123" to 123). This is
arguably useful when doing an env var substitution, since those values are
always strings, but we should definitely avoid touching config values that
have explicit types and are uninvolved in env var substitution.
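A hedged sketch of the intended behavior: cast only values that actually went through env var substitution, and leave everything else untouched (the `${env.VAR:=default}` pattern and helper names below are illustrative):
```python
import os
import re

# Illustrative pattern for "${env.VAR}" / "${env.VAR:=default}" references.
_ENV_PATTERN = re.compile(r"\$\{env\.(\w+)(?::=(.*))?\}")

def _cast(value: str):
    # Cast the *substituted* string to its apparent type.
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    try:
        return int(value)
    except ValueError:
        return value

def replace_env_vars(value):
    if isinstance(value, dict):
        return {k: replace_env_vars(v) for k, v in value.items()}
    if isinstance(value, str) and (match := _ENV_PATTERN.fullmatch(value)):
        resolved = os.environ.get(match.group(1), match.group(2) or "")
        return _cast(resolved)  # casting is fine here: the source was an env string
    return value  # explicitly typed or plain values pass through unchanged

print(replace_env_vars({"debug": "true", "port": 8321,
                        "url": "${env.OLLAMA_URL:=http://localhost:11434}"}))
# {'debug': 'true', 'port': 8321, 'url': 'http://localhost:11434'} (when OLLAMA_URL is unset)
```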
## Test Plan
Unit
**Description:**
Adding information and guidelines on when contributors should create an
in-tree vs out-of-tree provider.
I'm still learning a bit about this subject, so I'm very open to feedback
on this PR
Will also add this section to the API Providers section of the docs
# What does this PR do?
Finding these issues while moving to github pages.
## Test Plan
uv run --group docs sphinx-autobuild docs/source docs/build/html
--write-all
# What does this PR do?
The post-training docs are missing references to the more in-depth
`huggingface.md` and `torchtune.md`, which explain how to actually use
the providers.
These files show up in search though.
Add references to these files into the `inline_..md` files currently
pointed to by `index.md`
Signed-off-by: Charlie Doern <cdoern@redhat.com>
The `trl` dependency brings in `accelerate` which brings in nvidia
dependencies for torch. We cannot have that in the starter distro. As
such, no CPU-only post-training for the huggingface provider.
# What does this PR do?
Add a Llama Stack + LangChain integration example notebook
## Test Plan
Ran in Jupyter notebook, works end to end.
(Used Claude mainly for documentation and coding/debugging help)
Recording files use a predictable naming format, making the SQLite index
redundant. The binary SQLite file was causing frequent git conflicts.
Simplify by calculating file paths directly from request hashes.
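For illustration, deriving a recording path directly from a request hash might look like this (a sketch under assumed naming conventions, not the actual implementation):
```python
import hashlib
import json
from pathlib import Path

def recording_path(base_dir: Path, endpoint: str, body: dict) -> Path:
    # A stable hash of the request determines the file name; no index needed.
    request_hash = hashlib.sha256(
        json.dumps({"endpoint": endpoint, "body": body}, sort_keys=True).encode()
    ).hexdigest()
    return base_dir / f"{request_hash}.json"

print(recording_path(Path("recordings"), "/v1/chat/completions", {"model": "llama3.2"}))
```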
Signed-off-by: Derek Higgins <derekh@redhat.com>
Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from
6.5.0 to 6.6.0.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="4959332f0f"><code>4959332</code></a>
Bump dependencies (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/532">#532</a>)</li>
<li><a
href="adeb28643f"><code>adeb286</code></a>
Add support for .tools-versions (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/531">#531</a>)</li>
<li><a
href="fce199e243"><code>fce199e</code></a>
Add log message before long API calls to GitHub (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/530">#530</a>)</li>
<li><a
href="f758a4a1eb"><code>f758a4a</code></a>
chore: update known versions for 0.8.12 (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/529">#529</a>)</li>
<li><a
href="c0e7e93474"><code>c0e7e93</code></a>
chore: update known versions for 0.8.11 (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/526">#526</a>)</li>
<li><a
href="fda2399cb3"><code>fda2399</code></a>
chore: update known versions for 0.8.10 (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/525">#525</a>)</li>
<li>See full diff in <a
href="d9e0f98d3f...4959332f0f">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps
[@testing-library/dom](https://github.com/testing-library/dom-testing-library)
from 10.4.0 to 10.4.1.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/testing-library/dom-testing-library/releases"><code>@testing-library/dom</code>'s
releases</a>.</em></p>
<blockquote>
<h2>v10.4.1</h2>
<h2><a
href="https://github.com/testing-library/dom-testing-library/compare/v10.4.0...v10.4.1">10.4.1</a>
(2025-07-27)</h2>
<h3>Bug Fixes</h3>
<ul>
<li><strong>deps:</strong> replace chalk with picocolors (<a
href="https://redirect.github.com/testing-library/dom-testing-library/issues/1341">#1341</a>)
(<a
href="225a3e4cfa">225a3e4</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="225a3e4cfa"><code>225a3e4</code></a>
fix(deps): replace chalk with picocolors (<a
href="https://redirect.github.com/testing-library/dom-testing-library/issues/1341">#1341</a>)</li>
<li>See full diff in <a
href="https://github.com/testing-library/dom-testing-library/compare/v10.4.0...v10.4.1">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
The starter distribution added post-training which added torch
dependencies which pulls in all the nvidia CUDA libraries. This made our
starter container very big. We have worked hard to keep the starter
container small so it serves its purpose as a starter. This PR tries to
get it back to its size by forking off duplicate "-gpu" providers for
post-training. These forked providers are then used for a new
`starter-gpu` distribution which can pull in all dependencies.
# What does this PR do?
closes https://github.com/llamastack/llama-stack/issues/3236
mypy considered our default implementations (raise NotImplementedError)
to be trivial, with the result that we implemented the same stubs in
providers.
This change puts enough into the default impls so mypy considers them
non-trivial, which allows us to remove the duplicate implementations.
# What does this PR do?
As described in #3134, a LangChain example works against OpenAI's
responses impl, but not against llama stack's. This turned out to be due
to the order of the inputs. The LangChain example has the two function
call outputs first, followed by each call result in turn. This seems to
be valid, as it is accepted by OpenAI's impl. However, in llama stack
these inputs are converted to chat completion inputs, and the resulting
order for that API is not accepted by OpenAI.
This PR fixes the issue by ensuring that the converted chat completions
inputs are in the expected order.
Closes #3134
## Test Plan
Added unit and integration tests. Verified this fixes original issue as
reported.
---------
Signed-off-by: Gordon Sim <gsim@redhat.com>
# What does this PR do?
1. Adds `scripts/run-ui-linter.sh`
- Light script that checks whether `node_modules`, `eslint`, and
`prettier` exist before running the linter
- When I introduced [the linter for the
UI](https://github.com/llamastack/llama-stack/pull/3156/files#diff-63a9c44a44acf85fea213a857769990937107cf072831e1a26808cfde9d096b9)
it forced the UI linter on all users; the small `node_modules` check
means that only users who have installed the UI locally (since
`node_modules` is in the gitignore) will actually end up having this
run. Additionally, this does not do any install and just runs the
existing linter/prettier, as requested by @mattf
2. Updates `.github/workflows/pre-commit.yml` to run CI again
- When I introduced the UI linter in the CI [in this
PR](https://github.com/llamastack/llama-stack/pull/3191) a failure
occurred because dependabot needed to be updated to also bump the
`package-lock.json` which was done [in this
PR](https://github.com/llamastack/llama-stack/pull/3212). All of this to
say, we shouldn't observe failures from dependabot again.
3. Updates `.pre-commit-config.yaml`
- Calls `scripts/run-ui-linter.sh`
## AI Assistance Notice
I used Copilot minimally.
## Test Plan
As
[requested](https://github.com/llamastack/llama-stack/pull/3207#discussion_r2288004872)
by @mattf I ran this after removing all of my `node_modules` and the
linter passed.
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Context: https://github.com/meta-llama/llama-stack/issues/2937
The API design is inspired by existing offerings, but not exactly the
same:
* `top_n` as the parameter to control number of results, instead of
`top_k`, since `n` is conventional to control number
* `truncation` bool instead of `max_token_per_doc`, since we should just
handle the truncation automatically depending on model capability,
instead of user setting the context length manually.
* `data` field in the response, to be consistent with other OpenAI APIs
(though they don't have a rerank API). Also, it is one less name to
learn in the API.
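A hedged illustration of the request/response shape this design implies (field names beyond `top_n`, `truncation`, and `data`, as well as the values, are assumptions):
```python
# Example request: rerank two documents against a query, keep the best one.
rerank_request = {
    "model": "example/reranker",
    "query": "What is the capital of France?",
    "items": ["Paris is the capital of France.", "Berlin is in Germany."],
    "top_n": 1,          # number of results to return ("n" convention)
    "truncation": True,  # server truncates to the model's context length
}

# Example response: results wrapped in a "data" field, OpenAI-style.
rerank_response = {
    "data": [
        {"index": 0, "relevance_score": 0.92},
    ],
}
print(rerank_response["data"][0])
```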
## Test Plan
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
- Introduces the Agent Session creation for the Playground and allows
users to set tools
- note tools are actually not usable yet and this is marked explicitly
- this also caches sessions locally for faster loading on the UI and
deletes them appropriately
- allows users to easily create new sessions as well
- Moved Model Configuration settings and "System Message" / Prompt to
the left component
- Added new logo and favicon
- Added new typing animation when LLM is generating
### Create New Session
<img width="1916" height="1393" alt="Screenshot 2025-08-21 at 4 18
08 PM"
src="https://github.com/user-attachments/assets/52c70ae3-a33e-4338-8522-8184c692c320"
/>
### List of Sessions
<img width="1920" height="1391" alt="Screenshot 2025-08-21 at 4 18
56 PM"
src="https://github.com/user-attachments/assets/ed78c3c6-08ec-486c-8bad-9b7382c11360"
/>
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
Unit tests added
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR renames the categories of llama_stack loggers.
It aligns logging categories with the package names, and incorporates
reviews from the initial
https://github.com/meta-llama/llama-stack/pull/2868. This is a follow-up
to #3061.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Replaces https://github.com/meta-llama/llama-stack/pull/2868
Part of https://github.com/meta-llama/llama-stack/issues/2865
cc @leseb @rhuss
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
Currently the embedding integration test cases fail due to a
misalignment in the error type. This PR fixes the embedding integration
test by fixing the error type.
## Test Plan
```
pytest -s -v tests/integration/inference/test_embedding.py --stack-config="inference=nvidia" --embedding-model="nvidia/llama-3.2-nv-embedqa-1b-v2" --env NVIDIA_API_KEY={nvidia_api_key} --env NVIDIA_BASE_URL="https://integrate.api.nvidia.com"
```
# What does this PR do?
- Documentation update and fix for the NVIDIA Inference provider.
- Update `run_moderation` for the safety API with a
`NotImplementedError` placeholder; otherwise, initializing the NVIDIA
inference client will raise an error.
## Test Plan
N/A
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR removes `init()` from `LlamaStackAsLibrary`
Currently, `client.initialize()` had to be invoked by the user.
To improve the dev experience and avoid runtime errors, this PR initializes
LlamaStackAsLibrary implicitly upon using the client.
It also prevents multiple initializations of the same client, while
maintaining backward compatibility.
This PR does the following
- Automatic Initialization: Constructor calls initialize_impl()
automatically.
- Client is fully initialized after `__init__` completes.
- Prevents consecutive initialization after the client has been
successfully initialized.
- initialize() method still exists but is now a no-op.
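A minimal sketch of this pattern (not the real client code):
```python
class LlamaStackAsLibraryClient:
    def __init__(self) -> None:
        self._initialized = False
        self._initialize_impl()  # automatic initialization in the constructor

    def _initialize_impl(self) -> None:
        if self._initialized:
            return  # prevent consecutive re-initialization
        # ... resolve providers, build routing tables, etc.
        self._initialized = True

    def initialize(self) -> None:
        # Kept for backward compatibility; now a no-op.
        return None
```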
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
fixes https://github.com/meta-llama/llama-stack/issues/2946
---------
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
Adds flexible CORS (Cross-Origin Resource Sharing) configuration support
to the FastAPI
server with both local development and explicit configuration modes:
- **Local development mode**: `cors: true` enables localhost-only access
with regex
pattern `https?://localhost:\d+`
- **Explicit configuration mode**: Specific origins configuration with
credential support
and validation
- Prevents insecure combinations (wildcards with credentials)
- FastAPI CORSMiddleware integration via `model_dump()`
Addresses the need for configurable CORS policies to support web
frontends and
cross-origin API access while maintaining security.
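For illustration, the two modes roughly map onto FastAPI's `CORSMiddleware` like this (a sketch; the actual config schema and wiring in llama-stack may differ):
```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

def configure_cors(app: FastAPI, cors_config) -> None:
    if cors_config is True:
        # Local development mode: localhost-only access via a regex.
        app.add_middleware(CORSMiddleware, allow_origin_regex=r"https?://localhost:\d+")
    elif isinstance(cors_config, dict):
        # Explicit configuration mode: specific origins, optionally with credentials.
        app.add_middleware(CORSMiddleware, **cors_config)

app = FastAPI()
configure_cors(app, True)  # or e.g. {"allow_origins": ["https://example.com"], "allow_credentials": True}
```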
Closes #2119
## Test Plan
1. Ran Unit Tests.
2. Manual tests: FastAPI middleware integration with actual HTTP
requests
- Local development mode localhost access validation
- Explicit configuration mode origins validation
- Preflight OPTIONS request handling
Some screenshots of manual tests.
<img width="1920" height="927" alt="image"
src="https://github.com/user-attachments/assets/79322338-40c7-45c9-a9ea-e3e8d8e2f849"
/>
<img width="1911" height="1037" alt="image"
src="https://github.com/user-attachments/assets/1683524e-b0c9-48c9-a0a5-782e949cde01"
/>
cc: @leseb @rhuss @franciscojavierarceo
Bumps [llama-api-client](https://github.com/meta-llama/llama-api-python)
from 0.1.2 to 0.2.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/meta-llama/llama-api-python/releases">llama-api-client's
releases</a>.</em></p>
<blockquote>
<h2>v0.2.0</h2>
<h2>0.2.0 (2025-08-07)</h2>
<p>Full Changelog: <a
href="https://github.com/meta-llama/llama-api-python/compare/v0.1.2...v0.2.0">v0.1.2...v0.2.0</a></p>
<h3>Features</h3>
<ul>
<li>clean up environment call outs (<a
href="4afbd01ed7">4afbd01</a>)</li>
<li><strong>client:</strong> support file upload requests (<a
href="ec42e80b62">ec42e80</a>)</li>
</ul>
<h3>Bug Fixes</h3>
<ul>
<li><strong>api:</strong> remove chat completion request model (<a
href="94c4e9fd50">94c4e9f</a>)</li>
<li><strong>client:</strong> don't send Content-Type header on GET
requests (<a
href="efec88aa51">efec88a</a>)</li>
<li><strong>parsing:</strong> correctly handle nested discriminated
unions (<a
href="b6276863be">b627686</a>)</li>
<li><strong>parsing:</strong> ignore empty metadata (<a
href="d6ee85101e">d6ee851</a>)</li>
<li><strong>parsing:</strong> parse extra field types (<a
href="f03ca22860">f03ca22</a>)</li>
</ul>
<h3>Chores</h3>
<ul>
<li>add examples (<a
href="abfa065721">abfa065</a>)</li>
<li><strong>internal:</strong> bump pinned h11 dep (<a
href="d40e1b1d73">d40e1b1</a>)</li>
<li><strong>internal:</strong> fix ruff target version (<a
href="c900ebc528">c900ebc</a>)</li>
<li><strong>package:</strong> mark python 3.13 as supported (<a
href="ef5bc36693">ef5bc36</a>)</li>
<li><strong>project:</strong> add settings file for vscode (<a
href="e3103801d6">e310380</a>)</li>
<li><strong>readme:</strong> fix version rendering on pypi (<a
href="786f9fbdb7">786f9fb</a>)</li>
<li>sync repo (<a
href="7e697f6550">7e697f6</a>)</li>
<li>update SDK settings (<a
href="de22c0ece7">de22c0e</a>)</li>
</ul>
<h3>Documentation</h3>
<ul>
<li>code of conduct (<a
href="efe1af28fb">efe1af2</a>)</li>
<li>readme and license (<a
href="d53eafd104">d53eafd</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/meta-llama/llama-api-python/blob/main/CHANGELOG.md">llama-api-client's
changelog</a>.</em></p>
<blockquote>
<h2>0.2.0 (2025-08-07)</h2>
<p>Full Changelog: <a
href="https://github.com/meta-llama/llama-api-python/compare/v0.1.2...v0.2.0">v0.1.2...v0.2.0</a></p>
<h3>Features</h3>
<ul>
<li>clean up environment call outs (<a
href="4afbd01ed7">4afbd01</a>)</li>
<li><strong>client:</strong> support file upload requests (<a
href="ec42e80b62">ec42e80</a>)</li>
</ul>
<h3>Bug Fixes</h3>
<ul>
<li><strong>api:</strong> remove chat completion request model (<a
href="94c4e9fd50">94c4e9f</a>)</li>
<li><strong>client:</strong> don't send Content-Type header on GET
requests (<a
href="efec88aa51">efec88a</a>)</li>
<li><strong>parsing:</strong> correctly handle nested discriminated
unions (<a
href="b6276863be">b627686</a>)</li>
<li><strong>parsing:</strong> ignore empty metadata (<a
href="d6ee85101e">d6ee851</a>)</li>
<li><strong>parsing:</strong> parse extra field types (<a
href="f03ca22860">f03ca22</a>)</li>
</ul>
<h3>Chores</h3>
<ul>
<li>add examples (<a
href="abfa065721">abfa065</a>)</li>
<li><strong>internal:</strong> bump pinned h11 dep (<a
href="d40e1b1d73">d40e1b1</a>)</li>
<li><strong>internal:</strong> fix ruff target version (<a
href="c900ebc528">c900ebc</a>)</li>
<li><strong>package:</strong> mark python 3.13 as supported (<a
href="ef5bc36693">ef5bc36</a>)</li>
<li><strong>project:</strong> add settings file for vscode (<a
href="e3103801d6">e310380</a>)</li>
<li><strong>readme:</strong> fix version rendering on pypi (<a
href="786f9fbdb7">786f9fb</a>)</li>
<li>sync repo (<a
href="7e697f6550">7e697f6</a>)</li>
<li>update SDK settings (<a
href="de22c0ece7">de22c0e</a>)</li>
</ul>
<h3>Documentation</h3>
<ul>
<li>code of conduct (<a
href="efe1af28fb">efe1af2</a>)</li>
<li>readme and license (<a
href="d53eafd104">d53eafd</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="7a8c5838af"><code>7a8c583</code></a>
release: 0.2.0</li>
<li><a
href="4f1a04e5c1"><code>4f1a04e</code></a>
chore(internal): fix ruff target version</li>
<li><a
href="06485e995a"><code>06485e9</code></a>
feat(client): support file upload requests</li>
<li><a
href="131b474ad1"><code>131b474</code></a>
chore(project): add settings file for vscode</li>
<li><a
href="ef4cee6d8b"><code>ef4cee6</code></a>
fix(parsing): parse extra field types</li>
<li><a
href="fcbc699718"><code>fcbc699</code></a>
fix(parsing): ignore empty metadata</li>
<li><a
href="b6656cd0b8"><code>b6656cd</code></a>
fix(api): remove chat completion request model</li>
<li><a
href="0deda5590c"><code>0deda55</code></a>
feat: clean up environment call outs</li>
<li><a
href="ecf91026ac"><code>ecf9102</code></a>
fix(client): don't send Content-Type header on GET requests</li>
<li><a
href="0ac6285cbe"><code>0ac6285</code></a>
chore(readme): fix version rendering on pypi</li>
<li>Additional commits viewable in <a
href="https://github.com/meta-llama/llama-api-python/compare/v0.1.2...v0.2.0">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps
[@radix-ui/react-collapsible](https://github.com/radix-ui/primitives)
from 1.1.11 to 1.1.12.
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/radix-ui/primitives/commits">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps
[eslint-config-prettier](https://github.com/prettier/eslint-config-prettier)
from 10.1.5 to 10.1.8.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/prettier/eslint-config-prettier/releases">eslint-config-prettier's
releases</a>.</em></p>
<blockquote>
<h2>v10.1.8</h2>
<p>republish latest version</p>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/prettier/eslint-config-prettier/compare/v10.1.5...v10.1.8">https://github.com/prettier/eslint-config-prettier/compare/v10.1.5...v10.1.8</a></p>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/prettier/eslint-config-prettier/blob/main/CHANGELOG.md">eslint-config-prettier's
changelog</a>.</em></p>
<blockquote>
<h1>eslint-config-prettier</h1>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="9b0b0a47ec"><code>9b0b0a4</code></a>
fix: release a new latest version</li>
<li>See full diff in <a
href="https://github.com/prettier/eslint-config-prettier/compare/v10.1.5...v10.1.8">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps
[@radix-ui/react-separator](https://github.com/radix-ui/primitives) from
1.1.6 to 1.1.7.
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/radix-ui/primitives/commits">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [tailwind-merge](https://github.com/dcastil/tailwind-merge) from
3.3.0 to 3.3.1.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/dcastil/tailwind-merge/releases">tailwind-merge's
releases</a>.</em></p>
<blockquote>
<h2>v3.3.1</h2>
<h3>Bug Fixes</h3>
<ul>
<li>Fix arbitrary value using <code>color-mix()</code> not being
detected as color by <a
href="https://github.com/dcastil"><code>@dcastil</code></a> in <a
href="https://redirect.github.com/dcastil/tailwind-merge/pull/591">dcastil/tailwind-merge#591</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/dcastil/tailwind-merge/compare/v3.3.0...v3.3.1">https://github.com/dcastil/tailwind-merge/compare/v3.3.0...v3.3.1</a></p>
<p>Thanks to <a
href="https://github.com/brandonmcconnell"><code>@brandonmcconnell</code></a>,
<a href="https://github.com/manavm1990"><code>@manavm1990</code></a>,
<a href="https://github.com/langy"><code>@langy</code></a>, <a
href="https://github.com/roboflow"><code>@roboflow</code></a>, <a
href="https://github.com/syntaxfm"><code>@syntaxfm</code></a>, <a
href="https://github.com/getsentry"><code>@getsentry</code></a>, <a
href="https://github.com/codecov"><code>@codecov</code></a>, <a
href="https://github.com/sourcegraph"><code>@sourcegraph</code></a>, a
private sponsor, <a
href="https://github.com/block"><code>@block</code></a> and <a
href="https://github.com/shawt3000"><code>@shawt3000</code></a> for
sponsoring tailwind-merge! ❤️</p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="40d8feed6a"><code>40d8fee</code></a>
v3.3.1</li>
<li><a
href="429ea54ac8"><code>429ea54</code></a>
add changelog for v3.3.1</li>
<li><a
href="d3df8775cc"><code>d3df877</code></a>
Merge pull request <a
href="https://redirect.github.com/dcastil/tailwind-merge/issues/591">#591</a>
from dcastil/bugfix/590/fix-arbitrary-value-using-col...</li>
<li><a
href="fdd9cdfa14"><code>fdd9cdf</code></a>
add <code>color-mix()</code> to <code>colorFunctionRegex</code></li>
<li><a
href="d49e03a28c"><code>d49e03a</code></a>
add test case for border colors being merged incorrectly</li>
<li><a
href="47155f0ebe"><code>47155f0</code></a>
Merge pull request <a
href="https://redirect.github.com/dcastil/tailwind-merge/issues/585">#585</a>
from dcastil/renovate/all-minor-patch</li>
<li><a
href="2d29675ab0"><code>2d29675</code></a>
Update all non-major dependencies</li>
<li><a
href="c3d7208367"><code>c3d7208</code></a>
Merge pull request <a
href="https://redirect.github.com/dcastil/tailwind-merge/issues/578">#578</a>
from dcastil/dependabot/npm_and_yarn/dot-github/actio...</li>
<li><a
href="527214bf13"><code>527214b</code></a>
Bump undici from 5.28.5 to 5.29.0 in
/.github/actions/metrics-report</li>
<li>See full diff in <a
href="https://github.com/dcastil/tailwind-merge/compare/v3.3.0...v3.3.1">compare
view</a></li>
</ul>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [locust](https://github.com/locustio/locust) from 2.38.0 to
2.39.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/locustio/locust/releases">locust's
releases</a>.</em></p>
<blockquote>
<h2>2.39.0</h2>
<h2>What's Changed</h2>
<ul>
<li>Add MilvusUser and example by <a
href="https://github.com/zhuwenxing"><code>@zhuwenxing</code></a> in <a
href="https://redirect.github.com/locustio/locust/pull/3168">locustio/locust#3168</a></li>
<li>Add SocketIOUser by <a
href="https://github.com/cyberw"><code>@cyberw</code></a> in <a
href="https://redirect.github.com/locustio/locust/pull/3189">locustio/locust#3189</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a
href="https://github.com/zhuwenxing"><code>@zhuwenxing</code></a> made
their first contribution in <a
href="https://redirect.github.com/locustio/locust/pull/3168">locustio/locust#3168</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/locustio/locust/compare/2.38.1...2.39.0">https://github.com/locustio/locust/compare/2.38.1...2.39.0</a></p>
<h2>2.38.1</h2>
<h2>What's Changed</h2>
<ul>
<li>Fix test flakyness and update error message by <a
href="https://github.com/amadeuppereira"><code>@amadeuppereira</code></a>
in <a
href="https://redirect.github.com/locustio/locust/pull/3187">locustio/locust#3187</a></li>
<li>FastHttpUser: Dont send zstd in Accept-Encoding header by <a
href="https://github.com/cyberw"><code>@cyberw</code></a> in <a
href="https://redirect.github.com/locustio/locust/pull/3188">locustio/locust#3188</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/locustio/locust/compare/2.38.0...2.38.1">https://github.com/locustio/locust/compare/2.38.0...2.38.1</a></p>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/locustio/locust/blob/master/CHANGELOG.md">locust's
changelog</a>.</em></p>
<blockquote>
<h1>Detailed changelog</h1>
<p>The most important changes can also be found in <a
href="https://docs.locust.io/en/latest/changelog.html">the
documentation</a>.</p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="1810fef1ae"><code>1810fef</code></a>
Tiny doc fixes</li>
<li><a
href="48b4dfce8f"><code>48b4dfc</code></a>
Link SocketIOUser from main docs.</li>
<li><a
href="6e4fd7f067"><code>6e4fd7f</code></a>
Merge pull request <a
href="https://redirect.github.com/locustio/locust/issues/3189">#3189</a>
from locustio/Add-SocketioUser</li>
<li><a
href="95eca45476"><code>95eca45</code></a>
better documentation of on_message</li>
<li><a
href="a56ef663af"><code>a56ef66</code></a>
SocketIOUser docs: Link to example on GH</li>
<li><a
href="adaa71b5f9"><code>adaa71b</code></a>
SocketIOUser, add method docstrings and link to python-socketio's
readthedocs</li>
<li><a
href="9fb3ff0f89"><code>9fb3ff0</code></a>
Add testcase for SocketIOUser</li>
<li><a
href="7047247f9d"><code>7047247</code></a>
SocketIOUser: Fix use of environment object. Remove SocketIOClient.</li>
<li><a
href="f8ddc9c798"><code>f8ddc9c</code></a>
rename socketio echo_server</li>
<li><a
href="ae28acf027"><code>ae28acf</code></a>
add contrib dependencies to docs build</li>
<li>Additional commits viewable in <a
href="https://github.com/locustio/locust/compare/2.38.0...2.39.0">compare
view</a></li>
</ul>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
This should fix dependabot based on this thread:
https://stackoverflow.com/questions/60201543/dependabot-only-updates-lock-file
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Handles MCP tool calls in a previous response
Closes #3105
## Test Plan
Made call to create response with tool call, then made second call with
the first linked through previous_response_id. Did not get error.
Also added unit test.
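For illustration, a hedged sketch of the scenario being exercised via the OpenAI-compatible Responses API; the base URL, model id, and MCP tool definition below are assumptions, not taken from the actual test:
```python
from openai import OpenAI

# assumed endpoint for a locally running Llama Stack server
client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")
tools = [{"type": "mcp", "server_label": "example", "server_url": "http://localhost:8000/mcp"}]

first = client.responses.create(model="llama3.2:3b", input="use a tool", tools=tools)
# Previously, this second call errored when the prior response contained MCP tool calls.
second = client.responses.create(
    model="llama3.2:3b",
    input="continue",
    tools=tools,
    previous_response_id=first.id,
)
```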
Signed-off-by: Gordon Sim <gsim@redhat.com>
# What does this PR do?
We noticed that when llama-stack is running for a long time, we would
run into database errors when trying to run messages through the agent
(which we configured to persist against postgres), seemingly due to the
database connections being stale or disconnected. This commit adds
`pool_pre_ping=True` to the SQLAlchemy engine creation to help mitigate
this issue by checking the connection before using it, and
re-establishing it if necessary.
More information in:
https://docs.sqlalchemy.org/en/20/core/pooling.html#dealing-with-disconnects
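As a rough sketch (the connection URL is a placeholder and the real engine setup in the stack differs), the change amounts to:
```python
from sqlalchemy.ext.asyncio import create_async_engine

# pool_pre_ping issues a lightweight "ping" on checkout and transparently
# re-establishes stale connections instead of failing the request.
engine = create_async_engine(
    "postgresql+asyncpg://user:pass@localhost:5432/llamastack",  # placeholder DSN
    pool_pre_ping=True,
)
```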
We're also open to other suggestions on how to handle this issue, this
PR is just a suggestion.
## Test Plan
We have not tested it yet (we're in the process of doing that) and we're
hoping it's going to resolve our issue.
# What does this PR do?
Fix broken `package-lock.json` not caught by [github bot in this
commit](7f0b2a8764).
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
NVIDIA asymmetric embedding models (e.g.,
`nvidia/llama-3.2-nv-embedqa-1b-v2`) require an `input_type` parameter
not present in the standard OpenAI embeddings API. This PR adds the
`input_type="query"` as default and updates the documentation to suggest
using the `embedding` API for passage embeddings.
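As a hedged illustration (parameter handling per NVIDIA's docs; treat the exact plumbing as an assumption rather than this PR's code), the extra parameter can be forwarded through the OpenAI client like this:
```python
from openai import OpenAI

client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key="YOUR_NVIDIA_API_KEY")
resp = client.embeddings.create(
    model="nvidia/llama-3.2-nv-embedqa-1b-v2",
    input=["What is Llama Stack?"],
    extra_body={"input_type": "query"},  # use "passage" when embedding documents
)
print(len(resp.data[0].embedding))
```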
<!-- If resolving an issue, uncomment and update the line below -->
Resolves #2892
## Test Plan
```
pytest -s -v tests/integration/inference/test_openai_embeddings.py --stack-config="inference=nvidia" --embedding-model="nvidia/llama-3.2-nv-embedqa-1b-v2" --env NVIDIA_API_KEY={nvidia_api_key} --env NVIDIA_BASE_URL="https://integrate.api.nvidia.com"
```
# What does this PR do?
This PR adds a step in pre-commit to enforce using the `llama_stack` logger.
Currently, various parts of the code base use different loggers. Since a
custom `llama_stack` logger already exists and is used in the codebase, it is
better to standardize on it.
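For reference, a minimal sketch of the usage pattern the pre-commit step is meant to enforce (the exact `get_logger` signature and category names are assumed from the codebase convention):
```python
from llama_stack.log import get_logger

logger = get_logger(name=__name__, category="core")  # category is illustrative

logger.info("using the project-wide llama_stack logger instead of logging.getLogger")
```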
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>
# What does this PR do?
Adds npm installation to pre-commit.yml and caches the UI.
Removes node installation during pre-commit.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
I started this PR trying to unbreak a newly broken test
`test_agent_name`. This test was broken all along but did not show up
because during testing we were pulling the "non-updated" llama stack
client. See this comment:
https://github.com/llamastack/llama-stack/pull/3119#discussion_r2270988205
While fixing this, I encountered a large amount of badness in our CI
workflow definitions.
- We weren't passing `LLAMA_STACK_DIR` or `LLAMA_STACK_CLIENT_DIR`
overrides to `llama stack build` at all in some cases.
- Even when we did, we used `uv run` liberally. The first thing `uv run`
does is "sync" the project environment, which undoes any mutations we might
have made ourselves. But we make many mutations to these environments in our
CI runners, the most important being `llama stack build`, where we install
distro dependencies. As a result, when you tried to run the integration tests,
you would see old, strange versions.
## Test Plan
Re-record using:
```
sh scripts/integration-tests.sh --stack-config ci-tests \
--provider ollama --test-pattern test_agent_name --inference-mode record
```
Then re-run with `--inference-mode replay`. However, this test eventually
turned out to be quite flaky for telemetry reasons. I haven't investigated it
for now and sadly just disabled it, since we have a release to push out.
# What does this PR do?
Add CodeScanner implementations
## Test Plan
`SAFETY_MODEL=CodeScanner LLAMA_STACK_CONFIG=starter uv run pytest -v
tests/integration/safety/test_safety.py
--text-model=llama3.2:3b-instruct-fp16
--embedding-model=all-MiniLM-L6-v2 --safety-shield=ollama`
This PR needs to land after
https://github.com/meta-llama/llama-stack/pull/3098
This OpenAI client release
0843a11164
ends up breaking litellm:
169a17400f/litellm/types/llms/openai.py (L40)
Update the dependency pin accordingly. Also make the imports a bit more
defensive, in case something else during `llama stack build` ends up moving
openai back to a previous version.
## Test Plan
Run pre-release script integration tests.
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
I noticed that
[build_conda_env.sh](https://github.com/llamastack/llama-stack/blob/main/llama_stack/core/build_conda_env.sh)
somehow still exists in the main branch. We need to remove it to be consistent with
[#2969](https://github.com/llamastack/llama-stack/pull/2969).
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
Update triagers to current state
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
Commands where the output is important, like `llama stack build
--print-deps-only` (soon to be `llama stack show`), print some log.py
`cprint`s on _every_ execution of the CLI.
For example:
<img width="912" height="331" alt="Screenshot 2025-08-18 at 1 16 30 PM"
src="https://github.com/user-attachments/assets/e5bf18fb-74a1-438c-861a-8a26eea7d014"
/>
The yellow text is likely unnecessary.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Small docs change as requested in
https://github.com/llamastack/llama-stack/pull/3160#pullrequestreview-3125038932
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
See comment here:
https://github.com/llamastack/llama-stack/pull/3162#issuecomment-3192859097
-- TL;DR it is quite complex to invoke the recording workflow correctly
for an end developer writing tests. This script simplifies the work.
No more manual GitHub UI navigation!
## Script Functionality
- Auto-detects your current branch and associated PR
- Finds the right repository context (works from forks!)
- Runs the workflow where it can actually commit back
- Validates prerequisites and provides helpful error messages
## How to Use
First ensure you are on the branch which introduced a new test and want
it recorded. **Make sure you have pushed this branch remotely, easiest
is to create a PR.**
```
# Record tests for current branch
./scripts/github/schedule-record-workflow.sh
# Record specific test subdirectories
./scripts/github/schedule-record-workflow.sh --test-subdirs "agents,inference"
# Record with vision tests enabled
./scripts/github/schedule-record-workflow.sh --run-vision-tests
# Record tests matching a pattern
./scripts/github/schedule-record-workflow.sh --test-pattern "test_streaming"
```
## Test Plan
Ran `./scripts/github/schedule-record-workflow.sh -s inference -k
tool_choice` which started
4820409329
which successfully committed recorded outputs.
# What does this PR do?
Recording tests has become a nightmare. This is the first part of making
that process simpler by making it _less_ automatic. I tried to be too
clever earlier.
It simplifies the record-integration-tests workflow to use workflow
dispatch inputs instead of PR labels. No more opaque stuff. Just go to
the GitHub UI and run the workflow with inputs. I will soon add a helper
script for this also.
Other things to aid re-running just the small set of things you need to
re-record:
- Replaces the `test-types` JSON array parameter with a more intuitive
`test-subdirs` comma-separated list. The whole JSON array crap was only needed
for the matrix strategy.
- Adds a new `test-pattern` parameter to allow filtering tests using
pytest's `-k` option
## Test Plan
Note that this PR is in a fork not the source repository.
- Replay tests on this PR are green
- Manually
[ran](1699856292)
the replay workflow with a test-subdir and test-pattern filter, worked
- Manually
[ran](4819508034)
the **record** workflow with a simple pattern, it has worked and updated
_this_ PR.
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Replace chat_completion calls with openai_chat_completion to eliminate
dependency on legacy inference APIs.
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
Closes #3067
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
Creates a structured testing documentation section with multiple detailed pages:
- Testing overview explaining the record-replay architecture
- Integration testing guide with practical usage examples
- Record-replay system technical documentation
- Guide for writing effective tests
- Troubleshooting guide for common testing issues
Hopefully this makes things a bit easier.
# What does this PR do?
Updates test recordings.
## Test Plan
Started ollama serving the 3.2:3b model. Then ran the server:
```
LLAMA_STACK_TEST_INFERENCE_MODE=record \
LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings/ \
SQLITE_STORE_DIR=$(mktemp -d) \
OLLAMA_URL=http://localhost:11434 \
llama stack build --template starter --image-type venv --run
```
Then ran the tests which needed recording:
```
pytest -sv tests/integration/agents/test_openai_responses.py \
--stack-config=server:starter \
--text-model ollama/llama3.2:3b-instruct-fp16 -k test_responses_store
```
Then, restarted the server with `LLAMA_STACK_TEST_INFERENCE_MODE=replay`, re-ran the tests and verified they passed.
# What does this PR do?
A _bunch_ on cleanup for the Responses tests.
- Got rid of YAML test cases, moved them to just use simple pydantic models
- Splitting the large monolithic test file into multiple focused test files:
- `test_basic_responses.py` for basic and image response tests
- `test_tool_responses.py` for tool-related tests
- `test_file_search.py` for file search specific tests
- Adding a `StreamingValidator` helper class to standardize streaming response validation
## Test Plan
Run the tests:
```
pytest -s -v tests/integration/non_ci/responses/ \
--stack-config=starter \
--text-model openai/gpt-4o \
--embedding-model=sentence-transformers/all-MiniLM-L6-v2 \
-k "client_with_models"
```
# What does this PR do?
Adds proper streaming events for MCP tool listing (`mcp_list_tools.in_progress` and `mcp_list_tools.completed`). Also refactors things a bit more.
## Test Plan
Verified existing integration tests pass with the refactored code. The test `test_response_streaming_multi_turn_tool_execution` has been updated to check for the new MCP list tools streaming events
# What does this PR do?
Refactors the OpenAI response conversion utilities by moving helper functions from `openai_responses.py` to `utils.py`. Adds unit tests.
# What does this PR do?
Refactors the OpenAI responses implementation by extracting streaming and tool execution logic into separate modules. This improves code organization by:
1. Creating a new `StreamingResponseOrchestrator` class in `streaming.py` to handle the streaming response generation logic
2. Moving tool execution functionality to a dedicated `ToolExecutor` class in `tool_executor.py`
## Test Plan
Existing tests
The OpenAI compatibility layer was incorrectly importing
ChatCompletionMessageToolCallParam instead of the
ChatCompletionMessageFunctionToolCall class. This caused "Cannot
instantiate typing.Union" errors when processing agent requests with
tool calls.
Closes: #3141
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Adds content part streaming events to the OpenAI-compatible Responses API to support more granular streaming of response content. This introduces:
1. New schema types for content parts: `OpenAIResponseContentPart` with variants for text output and refusals
2. New streaming event types:
- `OpenAIResponseObjectStreamResponseContentPartAdded` for when content parts begin
- `OpenAIResponseObjectStreamResponseContentPartDone` for when content parts complete
3. Implementation in the reference provider to emit these events during streaming responses. Also emits MCP arguments just like function call ones.
## Test Plan
Updated existing streaming tests to verify content part events are properly emitted
# What does this PR do?
Enhances tool execution streaming by adding support for real-time progress events during tool calls. This implementation adds streaming events for MCP and web search tools, including in-progress, searching, completed, and failed states.
The refactored `_execute_tool_call` method now returns an async iterator that yields streaming events throughout the tool execution lifecycle.
## Test Plan
Updated the integration test `test_response_streaming_multi_turn_tool_execution` to verify the presence and structure of new streaming events, including:
- Checking for MCP in-progress and completed events
- Verifying that progress events contain required fields (item_id, output_index, sequence_number)
- Ensuring completed events have the necessary sequence_number field
# What does this PR do?
To be compliant with model policies for Llama, just return the categories
as-is from the provider; we will lose OAI compatibility in the moderations
API response.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
`SAFETY_MODEL=llama-guard3:8b LLAMA_STACK_CONFIG=starter uv run pytest
-v tests/integration/safety/test_safety.py
--text-model=llama3.2:3b-instruct-fp16
--embedding-model=all-MiniLM-L6-v2 --safety-shield=ollama`
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
The purpose of this PR is to eliminate hardcoded status codes in the server's
responses and replace them with `httpx.codes` for better consistency across
the whole project and improved code readability.
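A minimal sketch of the pattern (the actual call sites in the server differ):
```python
import httpx
from fastapi import Response

# before: Response(content=detail, status_code=404)
def not_found(detail: str) -> Response:
    # httpx.codes is an IntEnum, so NOT_FOUND compares equal to 404
    return Response(content=detail, status_code=httpx.codes.NOT_FOUND)
```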
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Run `./scripts/unit-tests.sh`
**Description:**
The standard markdown [!NOTE] format is not supported on Sphinx
generated documentation, replacing those instances. Also updating other
Notes, Tips and Warning blocks throughout the source docs
WIP: Working to update the provider code gen
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
The purpose of this PR is to make the behavior of DELETE API endpoints
consistent with standard RESTful conventions and eliminate confusion for
API consumers.
Old Behavior
```
HTTP Status: 200 OK
Response Body: null
```
Eg. `curl -X DELETE http://localhost:8321/v1/shields/test-shield`
`null% `
`INFO 2025-08-12 16:11:57,932 console_span_processor:65 telemetry:
15:11:57.929 [INFO] ::1:59805 - "DELETE /v1/shields/test-shield
HTTP/1.1" 200 `
Updated Behavior
```
HTTP Status: 204 No Content
Response Body: empty (no body)
```
Eg. `curl -X DELETE http://localhost:8321/v1/shields/test-shield`
`INFO 2025-08-12 16:18:16,645 console_span_processor:62 telemetry:
15:18:16.637 [INFO] ::1:60283 - "DELETE /v1/shields/test-shield
HTTP/1.1" 204 `
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes #3090
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Run `./scripts/unit-tests.sh`
# What does this PR do?
1. Updates `AgentPersistence.list_sessions()` to properly filter out
`Turn` keys from `Session` keys (a rough sketch of the idea is included below).
2. Adds a suite of unit tests to confirm the `list_sessions()` behavior
and tests the failed sample in
https://github.com/meta-llama/llama-stack/issues/3048
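A rough sketch of the filtering idea (the key layout here is hypothetical, not the actual persistence schema):
```python
def filter_session_keys(keys: list[str]) -> list[str]:
    # hypothetical layout: "session:{agent_id}:{session_id}" for sessions and
    # "session:{agent_id}:{session_id}:{turn_id}" for turns
    return [k for k in keys if k.startswith("session:") and k.count(":") == 2]


assert filter_session_keys(
    ["session:a1:s1", "session:a1:s1:t1", "session:a1:s2"]
) == ["session:a1:s1", "session:a1:s2"]
```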
## Fixes https://github.com/meta-llama/llama-stack/issues/3048
## Test Plan
Unit tests added.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR changes the group name from `github.ref` to
`github.event.pull_request_number`. The reason for this is that `github.ref`
does not act as a unique identifier in the `pull_request_target` event and is
only unique in `pull_request`. The GitHub action was getting canceled because
the group name was not unique in the concurrency section.
<!-- If resolving an issue, uncomment and update the line below -->
Closes #3102
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
To test this I created a fake GitHub action and ran it through act to see
what the `github.ref` variable produced and what alternatives can be used.
This confirmed that `github.ref` was not unique and that
`github.event.pull_request_number` is unique to the PR.
Some fixes to MCP tests. And a bunch of fixes for Vector providers.
I also enabled a bunch of Vector IO tests to be used with
`LlamaStackLibraryClient`
## Test Plan
Run Responses tests with llama stack library client:
```
pytest -s -v tests/integration/non_ci/responses/ --stack-config=server:starter \
--text-model openai/gpt-4o \
--embedding-model=sentence-transformers/all-MiniLM-L6-v2 \
-k "client_with_models"
```
Do the same with `-k openai_client`
The rest should be taken care of by CI.
Well our Responses tests use it so we better include it in the API, no?
I discovered it because I want to make sure `llama-stack-client` can be
used always instead of `openai-python` as the client (we do want to be
_truly_ compatible.)
# What does this PR do?
The minimum Python version for the project was bumped to 3.12 a couple of
months ago, but some artifacts remain in the repo suggesting we still
support >=3.10.
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR addresses an issue where `PromptGuardSafetyImpl` was an
incomplete implementation of an abstract class. The class was missing
the required run_moderation method from its parent interface.
Currently, running `pre-commit` locally fails with the error below.
```
llama_stack/providers/inline/safety/prompt_guard/__init__.py:15: error: Cannot instantiate abstract class "PromptGuardSafetyImpl" with abstract attribute "run_moderation" [abstract]
Found 1 error in 1 file (checked 410 source files)
```
This PR fixes the issue as follows
- Added the missing run_moderation method to PromptGuardSafetyImpl (a rough sketch follows below)
- Method raises NotImplementedError with appropriate message indicating
this functionality is not implemented for PromptGuard
- This allows the class to be properly instantiated while clearly
indicating the limitation
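A rough sketch of the added method (the signature is simplified; the real one follows the Safety protocol):
```python
class PromptGuardSafetyImpl:
    # only the new method is shown; the rest of the provider is omitted
    async def run_moderation(self, input: str | list[str], model: str):
        raise NotImplementedError(
            "run_moderation is not implemented for the PromptGuard provider"
        )
```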
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
Using commas is much more shell-friendly. A semi-colon is a statement
delimiter and must be escaped.
This change is backwards incompatible but I imagine not many people are
using this. I could be wrong. Looking for feedback.
# What does this PR do?
- Adds documentation on how to contribute a Vector DB provider.
- Updates the testing section to be a little friendlier to navigate.
- Also added new shortcut for search so that `/` and `⌘ K` or `ctrl+K`
trigger search
<img width="1903" height="1346" alt="Screenshot 2025-08-11 at 10 10
12 AM"
src="https://github.com/user-attachments/assets/6995b3b8-a2ab-4200-be72-c5b03a784a29"
/>
<img width="1915" height="1438" alt="Screenshot 2025-08-11 at 10 10
25 AM"
src="https://github.com/user-attachments/assets/1f54d30e-5be1-4f27-b1e9-3c3537dcb8e9"
/>
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
This updates the sidebar to look a little more like other popular ones.
<img width="1913" height="1352" alt="Screenshot 2025-08-08 at 11 25
31 PM"
src="https://github.com/user-attachments/assets/00738412-1101-48ec-8864-cde4a8733ec1"
/>
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
- Add new Vertex AI remote inference provider with litellm integration
- Support for Gemini models through Google Cloud Vertex AI platform
- Uses Google Cloud Application Default Credentials (ADC) for
authentication
- Added VertexAI models: gemini-2.5-flash, gemini-2.5-pro,
gemini-2.0-flash.
- Updated provider registry to include vertexai provider
- Updated starter template to support Vertex AI configuration
- Added comprehensive documentation and sample configuration
<!-- If resolving an issue, uncomment and update the line below -->
relates to https://github.com/meta-llama/llama-stack/issues/2747
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Eran Cohen <eranco@redhat.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
Updates READMe to add
1. GitHub badge highlighting Llama Stack as #1 Repo of the Day
2. GitHub Star History (cumulative stars chart)
3. Contributor shout out
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Update Milvus doc on using search modes.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
PR adds Flash-Lite 2.0 and 2.5 models to the Gemini inference provider
Closes #3046
## Test Plan
I was not able to locate any existing test for this provider, so I
performed manual testing. But the change is really trivial and
straightforward.
# What does this PR do?
This PR updates the UI to create new:
1. `/files/{file_id}`
2. `files/{file_id}/contents`
3. `files/{file_id}/contents/{content_id}`
The list of files is clickable, which brings the user to the Files
Detail page.
The File Details page shows all of the content
The content details page shows the individual chunk/content parsed
These only use our existing OpenAI compatible APIs. I have a separate
branch where I expose the embedding and the portal is correctly
populated. I included the FE rendering code for that in this PR.
1. `vector-stores/{vector_store_id}/files/{file_id}`
<img width="1913" height="1351" alt="Screenshot 2025-08-06 at 10 20
12 PM"
src="https://github.com/user-attachments/assets/08010d5e-60c8-4bd9-9f3e-a2731ed1ad55"
/>
2. `vector-stores/{vector_store_id}/files/{file_id}/contents`
<img width="1920" height="1272" alt="Screenshot 2025-08-06 at 10 21
23 PM"
src="https://github.com/user-attachments/assets/3b91e67b-5d64-4fe6-91b6-18f14587e850"
/>
3.
`vector-stores/{vector_store_id}/files/{file_id}/contents/{content_id}`
<img width="1916" height="1273" alt="Screenshot 2025-08-06 at 10 21
45 PM"
src="https://github.com/user-attachments/assets/d38ca996-e8d9-460c-9e39-7ff0cb5ec0dd"
/>
## Test Plan
I tested this locally and reviewed the code. I generated a significant
share of the code with Claude and some manual intervention. After this,
I'll begin adding tests to the UI.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
This PR kills the verifications infrastructure which is no longer used.
It was relocated to the `llama-stack-evals`
(https://github.com/meta-llama/llama-stack-evals) repository previously.
Responses tests used this infrastructure but that wasn't quite
necessary, just a little useful back when @bbrownin introduced the
tests. On Discord, we agreed that tests can be moved to our regular
integrations test infra.
## Test Plan
Some tests currently do fail (although they run!) I will send a
follow-up PR which makes them all pass.
# What does this PR do?
`AgentEventLogger` only supports streaming responses, so I suggest
adding a comment near the bottom of `demo_script.py` letting the user
know this, e.g., if they change the `stream` value to `False` in the
call to `create_turn`, they need to comment out the logging lines.
See https://github.com/llamastack/llama-stack-client-python/issues/15
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
---------
Signed-off-by: Dean Wampler <dean.wampler@ibm.com>
# What does this PR do?
This PR implements hybrid search for Milvus DB based on the inbuilt
milvus support.
To test:
```
pytest tests/unit/providers/vector_io/remote/test_milvus.py -v -s
--tb=long --disable-warnings --asyncio-mode=auto
```
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
# What does this PR do?
Adds a blurb to the `CONTRIBUTING.md` encouraging the use of the
standardized custom exception classes for resources where applicable
Relates to #2379
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
This PR adds Open AI Compatible moderations api. Currently only
implementing for llama guard safety provider
Image support, expand to other safety providers and Deprecation of
run_shield will be next steps.
## Test Plan
Added 2 new tests for safe/ unsafe text prompt examples for the new open
ai compatible moderations api usage
`SAFETY_MODEL=llama-guard3:8b LLAMA_STACK_CONFIG=starter uv run pytest
-v tests/integration/safety/test_safety.py
--text-model=llama3.2:3b-instruct-fp16
--embedding-model=all-MiniLM-L6-v2 --safety-shield=ollama`
(Had some issue with previous PR
https://github.com/meta-llama/llama-stack/pull/2994 while updating and
accidentally close it , reopened new one )
# What does this PR do?
I found a few issues while adding new metrics for various APIs: currently,
metrics are only propagated in `chat_completion` and `completion`. Since most
providers use the `openai_..` routes as the default in
`llama-stack-client inference chat-completion`, metrics are currently not
working as expected.
in order to get them working the following had to be done:
1. get the completion as usual
2. use new `openai_` versions of the metric gathering functions which
use `.usage` from the `OpenAI..` response types to gather the metrics,
which are already populated
3. define a `stream_generator` which counts the tokens and computes the
metrics (only for stream=True); a rough sketch is included below
4. add metrics to the response
NOTE: I could not add metrics to `openai_completion` where stream=True
because that ONLY returns an `OpenAICompletion` not an AsyncGenerator
that we can manipulate.
Also: acquire the lock and add the event to the span, as the other `_log_...`
methods do.
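A rough sketch of the stream-generator idea (helper and field names are assumptions, not the actual llama-stack code):
```python
from collections.abc import AsyncIterator
from typing import Any


async def stream_with_metrics(stream: AsyncIterator[Any]) -> AsyncIterator[Any]:
    prompt_tokens = completion_tokens = 0
    async for chunk in stream:
        usage = getattr(chunk, "usage", None)
        if usage is not None:  # usually only present on the final chunk
            prompt_tokens = usage.prompt_tokens
            completion_tokens = usage.completion_tokens
        yield chunk
    # in the real implementation these would be emitted as telemetry metrics
    print(f"prompt_tokens={prompt_tokens} completion_tokens={completion_tokens}")
```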
some new output:
`llama-stack-client inference chat-completion --message hi`
<img width="2416" height="425" alt="Screenshot 2025-07-16 at 8 28 20 AM"
src="https://github.com/user-attachments/assets/ccdf1643-a184-4ddd-9641-d426c4d51326"
/>
and in the client:
<img width="763" height="319" alt="Screenshot 2025-07-16 at 8 28 32 AM"
src="https://github.com/user-attachments/assets/6bceb811-5201-47e9-9e16-8130f0d60007"
/>
these were not previously being recorded nor were they being printed to
the server due to the improper console sink handling
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
Remove pure venv (without uv) references in docs
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
1. Introduce new base custom exception class `ResourceNotFoundError` (see the
sketch below)
2. All other "not found" exception classes now inherit from
`ResourceNotFoundError`
Closes #3030
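A minimal sketch of the hierarchy, assuming the class names from the description (actual messages and placement in the codebase may differ):
```python
class ResourceNotFoundError(ValueError):
    """Base class for all resource 'not found' errors."""

    def __init__(self, resource_id: str, resource_type: str) -> None:
        super().__init__(f"{resource_type} '{resource_id}' not found")


class ModelNotFoundError(ResourceNotFoundError):
    def __init__(self, model_id: str) -> None:
        super().__init__(model_id, "Model")
```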
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
This PR adds a minimum version `0.7.0` to the project. The diff issue
happens because an `upload-time` field in the `uv.lock` file did not
exist in older uv versions (pre `0.6.15`). This effectively prevents
large diffs in PRs from devs that use older versions of uv.
Closes #2887
---------
Co-authored-by: Charlie Doern <charlie@doern.me>
A bunch of miscellaneous cleanup focusing on tests, but ended up
speeding up starter distro substantially.
- Pulled llama stack client init for tests into `pytest_sessionstart` so
it does not clobber output
- Profiling of that told me where we were doing lots of heavy imports
for starter, so I made them lazy
- starter now starts 20+ seconds faster on my Mac
- A few other smallish refactors for `compat_client`
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
Extend the Shields Protocol and implement the capability to unregister
previously registered shields and CLI for shields management.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes #2581
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
First of, test API for shields
1. Install and start Ollama:
`ollama serve`
2. Pull Llama Guard Model in Ollama:
`ollama pull llama-guard3:8b`
3. Configure env variables:
```
export ENABLE_OLLAMA=ollama
export OLLAMA_URL=http://localhost:11434
```
4. Build Llama Stack distro:
`llama stack build --template starter --image-type venv `
5. Start Llama Stack server:
`llama stack run starter --port 8321`
6. Check if Ollama model is available:
`curl -X GET http://localhost:8321/v1/models | jq '.data[] |
select(.provider_id=="ollama")'`
7. Register a new Shield using Ollama provider:
```
curl -X POST http://localhost:8321/v1/shields \
-H "Content-Type: application/json" \
-d '{
"shield_id": "test-shield",
"provider_id": "llama-guard",
"provider_shield_id": "ollama/llama-guard3:8b",
"params": {}
}'
```
`{"identifier":"test-shield","provider_resource_id":"ollama/llama-guard3:8b","provider_id":"llama-guard","type":"shield","owner":{"principal":"","attributes":{}},"params":{}}%
`
8. Check if shield was registered:
`curl -X GET http://localhost:8321/v1/shields/test-shield`
`{"identifier":"test-shield","provider_resource_id":"ollama/llama-guard3:8b","provider_id":"llama-guard","type":"shield","owner":{"principal":"","attributes":{}},"params":{}}%
`
9. Run shield:
```
curl -X POST http://localhost:8321/v1/safety/run-shield \
-H "Content-Type: application/json" \
-d '{
"shield_id": "test-shield",
"messages": [
{
"role": "user",
"content": "How can I hack into someone computer?"
}
],
"params": {}
}'
```
`{"violation":{"violation_level":"error","user_message":"I can't answer
that. Can I help with something
else?","metadata":{"violation_type":"S2"}}}% `
10. Unregister shield:
`curl -X DELETE http://localhost:8321/v1/shields/test-shield`
`null% `
11. Verify shield was deleted:
`curl -X GET http://localhost:8321/v1/shields/test-shield`
`{"detail":"Invalid value: Shield 'test-shield' not found"}%`
All tests passed ✅
```
========================================================================== 430 passed, 194 warnings in 19.54s ==========================================================================
/Users/iamiller/GitHub/llama-stack/.venv/lib/python3.12/site-packages/litellm/llms/custom_httpx/async_client_cleanup.py:78: RuntimeWarning: coroutine 'close_litellm_async_clients' was never awaited
loop.close()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Wrote HTML report to htmlcov-3.12/index.html
```
# What does this PR do?
1. Creates a new `SessionNotFoundError` class
2. Implements the new class where appropriate
Relates to #2379
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
1. Creates a new `ToolGroupNotFoundError` class
2. Implements the new class where appropriate
Relates to #2379
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
Bumps [openai](https://github.com/openai/openai-python) from 1.97.1 to
1.98.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/openai/openai-python/releases">openai's
releases</a>.</em></p>
<blockquote>
<h2>v1.98.0</h2>
<h2>1.98.0 (2025-07-30)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v1.97.2...v1.98.0">v1.97.2...v1.98.0</a></p>
<h3>Features</h3>
<ul>
<li><strong>api:</strong> manual updates (<a
href="88a8036c5e">88a8036</a>)</li>
</ul>
<h2>v1.97.2</h2>
<h2>1.97.2 (2025-07-30)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v1.97.1...v1.97.2">v1.97.1...v1.97.2</a></p>
<h3>Chores</h3>
<ul>
<li><strong>client:</strong> refactor streaming slightly to better
future proof it (<a
href="71c0c74713">71c0c74</a>)</li>
<li><strong>project:</strong> add settings file for vscode (<a
href="29c22c90fd">29c22c9</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/openai/openai-python/blob/main/CHANGELOG.md">openai's
changelog</a>.</em></p>
<blockquote>
<h2>1.98.0 (2025-07-30)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v1.97.2...v1.98.0">v1.97.2...v1.98.0</a></p>
<h3>Features</h3>
<ul>
<li><strong>api:</strong> manual updates (<a
href="88a8036c5e">88a8036</a>)</li>
</ul>
<h2>1.97.2 (2025-07-30)</h2>
<p>Full Changelog: <a
href="https://github.com/openai/openai-python/compare/v1.97.1...v1.97.2">v1.97.1...v1.97.2</a></p>
<h3>Chores</h3>
<ul>
<li><strong>client:</strong> refactor streaming slightly to better
future proof it (<a
href="71c0c74713">71c0c74</a>)</li>
<li><strong>project:</strong> add settings file for vscode (<a
href="29c22c90fd">29c22c9</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="a3315d9fcc"><code>a3315d9</code></a>
release: 1.98.0 (<a
href="https://redirect.github.com/openai/openai-python/issues/2503">#2503</a>)</li>
<li><a
href="48188cc8d5"><code>48188cc</code></a>
release: 1.97.2 (<a
href="https://redirect.github.com/openai/openai-python/issues/2494">#2494</a>)</li>
<li>See full diff in <a
href="https://github.com/openai/openai-python/compare/v1.97.1...v1.98.0">compare
view</a></li>
</ul>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
As the title says. Distributions is in, Templates is out.
`llama stack build --template` --> `llama stack build --distro`. For
backward compatibility, the previous option is kept but results in a
warning.
Updated `server.py` to remove the "config_or_template" backward
compatibility since it has been a couple releases since that change.
# What does this PR do?
Implement vector store search test
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
```
pytest tests/integration/vector_io/test_openai_vector_stores.py::test_openai_vector_store_search_modes --stack-config=http://localhost:8321 --embedding-model=all-MiniLM-L6-v2 -v
```
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
# What does this PR do?
Remove score_threshold based check from `OpenAIVectorStoreMixin`
Closes: https://github.com/meta-llama/llama-stack/issues/3018
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR is responsible for removal of Conda support in Llama Stack
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes #2539
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
Closes #2995
update SambaNovaInferenceAdapter to efficiently use LiteLLMOpenAIMixin
## Test Plan
```
$ uv run pytest -s -v tests/integration/inference --stack-config inference=sambanova --text-model sambanova/Meta-Llama-3.1-8B-Instruct
...
======================== 10 passed, 84 skipped, 3 xfailed, 51 warnings in 8.14s ========================
```
# What does this PR do?
Update README for supported DBs
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Adds support to Vector store Open AI APIs in Qdrant.
<!-- If resolving an issue, uncomment and update the line below -->
Closes #2463
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
Co-authored-by: ehhuang <ehhuang@users.noreply.github.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
This should be more robust, as sometimes it's run without running build
first.
## Test Plan
OLLAMA_URL=http://localhost:11434 LLAMA_STACK_TEST_INFERENCE_MODE=replay
LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings
LLAMA_STACK_CONFIG=server:starter uv run --with pytest-repeat pytest
tests/integration/telemetry
--text-model="ollama/llama3.2:3b-instruct-fp16" -vvs
# What does this PR do?
This PR (1) enables the files API for Weaviate and (2) enables
integration tests for Weaviate, which adds a docker container to the
github action.
This PR also handles a couple of edge cases in creating the collection and
ensures the tests all pass.
## Test Plan
CI enabled
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
We are going to split record and replay workflows completely to simplify
the concurrency key design.
We can add vision tests by just adding to our matrix.
# What does this PR do?
Improve user experience by providing specific guidance when no API key
is available, showing both provider data header and config options with
the correct field name for each provider.
Also adds comprehensive test coverage for API key resolution scenarios.
addresses #2990 for providers using litellm openai mixin
## Test Plan
`./scripts/unit-tests.sh
tests/unit/providers/inference/test_litellm_openai_mixin.py`
This PR significantly refactors the Integration Tests workflow. The main
goal behind the PR was to enable recording of vision tests which were
never run as part of our CI ever before. During debugging, I ended up
making several other changes refactoring and hopefully increasing the
robustness of the workflow.
After doing the experiments, I have updated the trigger event to be
`pull_request_target` so this workflow can get write permissions by
default but it will run with source code from the base (main) branch in
the source repository only. If you do change the workflow, you'd need to
experiment using the `workflow_dispatch` triggers. This should not be
news to anyone using Github Actions (except me!)
It is likely to be a little rocky though while I learn more about GitHub
Actions, etc. Please be patient :)
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
# What does this PR do?
I realized that when a new PR is opened, the integration tests aren't
triggering (or aren't always?) since the replay logic was introduced. This
amends the concurrency logic a bit to trigger on opened PRs.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
get_vector_db() will raise an exception if a vector store can't be returned,
so the client-side handling is redundant.
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
It looks like the coverage badge is still present in the README. This PR
removes it.
For more context: https://github.com/meta-llama/llama-stack/pull/2950
**Description**
This PR adjusts the external providers documentation to align with the
new providers format. Splits up sections into the existing external
providers and how to create them as well.
<img width="1049" height="478" alt="Screenshot 2025-07-31 at 9 48 26 AM"
src="https://github.com/user-attachments/assets/f13599cb-2fd1-4e57-8ca9-27b067264e33"
/>
Open to feedback and adjusting titles
# What does this PR do?
This PR adds support for Direct Preference Optimization (DPO) training
via the existing HuggingFace inline provider. It introduces a new DPO
training recipe, config schema updates, dataset integration, and
end-to-end testing to support preference-based fine-tuning with TRL.
## Test Plan
Added integration test:
tests/integration/post_training/test_post_training.py::TestPostTraining::test_preference_optimize
Ran tests on both CPU and CUDA environments
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-43-83.ec2.internal>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
I've been tinkering a little with a simple chat playground in the UI, so
I'm opening the PR with what's kind of a WIP.
If you look at the first commit, that includes the big part of the
changes. The rest of the files changed come from installing the
`shadcn` components.
Note this is missing a lot; e.g.,
- sessions
- document upload
- audio (the shadcn components install these by default from
https://shadcn-chatbot-kit.vercel.app/docs/components/chat)
I still need to wire up a lot more to make it actually fully functional
but it does basic chat using the LS Typescript Client.
Basic demo:
<img width="1329" height="1430" alt="Image"
src="https://github.com/user-attachments/assets/917a2096-36d4-4925-b83b-f1f2cda98698"
/>
<img width="1319" height="1424" alt="Image"
src="https://github.com/user-attachments/assets/fab1583b-1c72-4bf3-baf2-405aee13c6bb"
/>
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
This PR focuses on improving the developer experience by adding
comprehensive docstrings to the API data models across the Llama Stack.
These docstrings provide detailed explanations for each model and its
fields, making the API easier to understand and use.
**Key changes:**
- **Added Docstrings:** Added reST formatted docstrings to Pydantic
models in the `llama_stack/apis/` directory. This includes models for:
- Agents (`agents.py`)
- Benchmarks (`benchmarks.py`)
- Datasets (`datasets.py`)
- Inference (`inference.py`)
- And many other API modules.
- **OpenAPI Spec Update:** Regenerated the OpenAPI specification
(`docs/_static/llama-stack-spec.yaml` and
`docs/_static/llama-stack-spec.html`) to include the new docstrings.
This will be reflected in the API documentation, providing richer
information to users.
**Impact:**
- Developers using the Llama Stack API will have a better understanding
of the data structures.
- The auto-generated API documentation is now more informative.
---------
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
1. Creates a new `VectorStoreNotFoundError` class
2. Implements the new class where appropriate
Relates to #2379
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
1. Adds a broad schema for custom exception classes in the Llama Stack
project
2. Creates a new `DatasetNotFoundError` class
3. Implements the new class where appropriate
Relates to #2379
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR fixes the following error in unit test that was running on up to
date main branch:
```
FAILED tests/unit/distribution/test_inference_recordings.py::TestInferenceRecording::test_recording_mode - ModuleNotFoundError: No module named 'ollama'
FAILED tests/unit/distribution/test_inference_recordings.py::TestInferenceRecording::test_replay_mode - ModuleNotFoundError: No module named 'ollama'
FAILED tests/unit/distribution/test_inference_recordings.py::TestInferenceRecording::test_replay_missing_recording - ModuleNotFoundError: No module named 'ollama'
FAILED tests/unit/distribution/test_inference_recordings.py::TestInferenceRecording::test_embeddings_recording - ModuleNotFoundError: No module named 'ollama'
=============================== 4 failed, 499 passed, 198 warnings in 34.50s ================================
```
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Run `./scripts/unit-tests.sh`
# What does this PR do?
1. Creates a new `ModelNotFoundError` class
2. Implements the new class where appropriate
Relates to #2379
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
We want to avoid re-triggering the workflow when random other labels are
added (e.g., `meta-cla`, etc.). There is also no point in restarting the
workflow when someone _unlabels_.
**Description**
This PR removes some of the warnings when uv builds the docs
- Errors appear when generating docs about .md files not appearing in the
toctree. ~~Adding content to the `providers-gen.py` file that adds `---
orphan: true ---` to each file.~~ Added a toctree generator to the
`providers-gen.py` file; this gets rid of the errors in the builds.
- Deletes the `_openai_compat` files, an extension of PR #2849
- Adds the `files` APIs section to the `providers` toctree on the index
page
- Manually adds the `--- orphan: true ---` to the advanced APIs. I'll try
to find a way to modify the providers code gen so it automatically adds
it, but this fixes the errors.
- Adds the `testing.md` to the `contributing` toctree
- Adds `starting_llama_stack_server.md` to the `distributions` toctree
There are some other warnings I'm still looking at, but this PR gets rid
of most of the toctree errors.
There's also an issue with the actual distribution-codegen that I can
investigate in another PR. Opened a bug for it here: #2873
We tried to always keep Ollama enabled. However doing so makes the
provider implementation half-assed -- should it error when it cannot
connect to Ollama or not? What happens during periodic model refresh?
Etc. Instead do the same thing we do for vLLM -- use the `OLLAMA_URL` to
conditionally enable the provider.
## Test Plan
Run `uv run llama stack build --template starter --image-type venv
--run` with and without `OLLAMA_URL` set. Verify using
`llama-stack-client provider list` that ollama is correctly enabled.
# What does this PR do?
- Initialize route_impls to None in constructor to prevent
AttributeError
- Consolidate initialization checks to single point in request() method
- Improve error message to be more helpful ("Please call initialize()
first")
- Add comprehensive test suite to prevent regressions
The library client now has better error handling when users forget to
call initialize(), showing a clear ValueError instead of confusing
AttributeError. All initialization validation is now centralized in the
request() method, with internal methods (_call_non_streaming,
_call_streaming, _convert_body) relying on this single check for
cleaner, more maintainable code.
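For illustration, a minimal sketch of the centralized check (class and attribute names mirror the description above but are otherwise simplified, not the actual implementation):
```python
class LlamaStackAsLibraryClient:
    def __init__(self, config_path: str):
        self.config_path = config_path
        # Initialized to None so attribute access never raises AttributeError.
        self.route_impls = None

    async def initialize(self) -> None:
        # ... resolve the stack config and build the route table ...
        self.route_impls = {"/v1/models": object()}  # placeholder

    async def request(self, method: str, path: str, body: dict | None = None):
        # Single validation point; _call_non_streaming / _call_streaming and
        # _convert_body all rely on this one check.
        if self.route_impls is None:
            raise ValueError("Client not initialized. Please call initialize() first.")
        # ... dispatch to the appropriate route implementation ...
```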
Closes #2943
## Test Plan
`./scripts/unit-tests.sh`
A couple of important updates:
- When recording tests, we cannot be generating a matrix because all the
independent recordings will conflict.
- In fact, we just don't need a matrix on test types any more because the
tests are very fast and the overhead of `llama stack build`, setting up
`uv`, etc. is much larger.
- Refactored the running of tests into an independent action
This PR makes setting up Ollama optional for CI. By default, we use
`replay` mode for inference requests and use the stored results from the
`tests/integration/recordings/` directory.
Every so often, users will update tests which will need us to re-record.
To do this, we check for the existence of a label `re-record-tests` on
the PR. If detected,
- ollama is spun up
- inference mode is set to record
- after the tests are done, if any new changes are detected, they are
pushed back to the PR
## Test Plan
This is GitHub CI. Gotta test it live.
Continuing with https://github.com/meta-llama/llama-stack/pull/2952
This also includes a "fix" to inference store related tests so that we
pull a large number of inference responses from the DB so as to always
find the one we just wrote.
Post training tests need _much_ better thinking before we can re-enable
them to be run on every single PR. Running periodically should be
approached only when it is shown that the tests are reliable and as
light-weight as can be; otherwise, it is just kicking the can down the
road.
Continue to build on top of
https://github.com/meta-llama/llama-stack/pull/2941
## Test Plan
Run server with `LLAMA_STACK_TEST_INFERENCE_MODE=record` and then run
the integration tests with `--stack-config=server:starter`. Then restart
the server with `LLAMA_STACK_TEST_INFERENCE_MODE=replay` and re-run the
tests. Verify that no request hit Ollama at any point.
# What does this PR do?
When --image-name is not provided, the build script defaults to the
image_name in the config; this makes sure the same is done for the run
script.
## Test Plan
llama stack build w/o --image-name
At the moment, the code coverage action has just been failing. It's
misleading when interpreting the status badge on the main branch.
https://github.com/meta-llama/llama-stack/actions/workflows/coverage-badge.yml
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Implements a comprehensive recording and replay system for inference API
calls that eliminates dependency on online inference providers during
testing. The system treats inference as deterministic by recording real
API responses and replaying them in subsequent test runs. Applies to
OpenAI clients (which should cover many inference requests) as well as
Ollama AsyncClient.
For storing, we use a hybrid system: Sqlite for fast lookups and JSON
files for easy greppability / debuggability.
As expected, tests become much, much faster (more than 3x in inference
testing alone).
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record LLAMA_STACK_TEST_RECORDING_DIR=<...> \
uv run pytest -s -v tests/integration/inference \
--stack-config=starter \
-k "not( builtin_tool or safety_with_image or code_interpreter or test_rag )" \
--text-model="ollama/llama3.2:3b-instruct-fp16" \
--embedding-model=sentence-transformers/all-MiniLM-L6-v2
```
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=replay LLAMA_STACK_TEST_RECORDING_DIR=<...> \
uv run pytest -s -v tests/integration/inference \
--stack-config=starter \
-k "not( builtin_tool or safety_with_image or code_interpreter or test_rag )" \
--text-model="ollama/llama3.2:3b-instruct-fp16" \
--embedding-model=sentence-transformers/all-MiniLM-L6-v2
```
- `LLAMA_STACK_TEST_INFERENCE_MODE`: `live` (default), `record`, or
`replay`
- `LLAMA_STACK_TEST_RECORDING_DIR`: Storage location (must be specified
for record or replay modes)
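To make the mechanics concrete, here is a simplified sketch of a record/replay store keyed by a request hash (the class name, file layout, and schema below are hypothetical; the real implementation wraps the OpenAI and Ollama client calls directly):
```python
import hashlib
import json
import os
import sqlite3


class InferenceRecorder:
    """Hybrid store: SQLite index for fast lookups, JSON files for greppability."""

    def __init__(self, recording_dir: str):
        self.dir = recording_dir
        os.makedirs(recording_dir, exist_ok=True)
        self.db = sqlite3.connect(os.path.join(recording_dir, "index.sqlite"))
        self.db.execute("CREATE TABLE IF NOT EXISTS recordings (key TEXT PRIMARY KEY, path TEXT)")

    def _key(self, endpoint: str, request: dict) -> str:
        # Treat inference as deterministic: identical requests map to one recording.
        blob = endpoint + json.dumps(request, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def record(self, endpoint: str, request: dict, response: dict) -> None:
        key = self._key(endpoint, request)
        path = os.path.join(self.dir, f"{key}.json")
        with open(path, "w") as f:
            json.dump({"request": request, "response": response}, f, indent=2)
        self.db.execute("INSERT OR REPLACE INTO recordings VALUES (?, ?)", (key, path))
        self.db.commit()

    def replay(self, endpoint: str, request: dict):
        row = self.db.execute(
            "SELECT path FROM recordings WHERE key = ?", (self._key(endpoint, request),)
        ).fetchone()
        if row is None:
            return None  # caller can raise or fall back to live mode
        with open(row[0]) as f:
            return json.load(f)["response"]
```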
# What does this PR do?
- Change max_seq_length to max_length in SFTConfig constructor
- TRL deprecated max_seq_length in Feb 2024 and removed it in v0.20.0
- Reference: https://github.com/huggingface/trl/pull/2895
This resolves the SFT training failure in CI tests
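A minimal before/after sketch of the rename (assuming a recent TRL release; the other arguments are illustrative):
```python
from trl import SFTConfig

# Before (removed in TRL v0.20.0):
#   config = SFTConfig(max_seq_length=2048, output_dir="./sft-output")

# After: the same limit is expressed via max_length.
config = SFTConfig(max_length=2048, output_dir="./sft-output")
```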
# What does this PR do?
OpenAI Chat Completions supports passing a base64 encoded PDF file to a
model, but Llama Stack currently does not allow for this behavior. This
PR extends our implementation of the OpenAI API spec to change that.
Closes #2129
## Test Plan
A new functional test has been added to test the validity of such a
request
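For reference, a hedged sketch of what such a request could look like against the OpenAI-compatible endpoint (the `file` content-part shape follows OpenAI's published format; the exact fields Llama Stack accepts may differ, and the model name and base URL are placeholders):
```python
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

with open("report.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="ollama/llama3.2:3b-instruct-fp16",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this document."},
                {
                    "type": "file",
                    "file": {
                        "filename": "report.pdf",
                        "file_data": f"data:application/pdf;base64,{pdf_b64}",
                    },
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```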
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
Updates provider template from outdated `ollama` to `starter`
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes: #2839
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
We don't need this. We have kept it since existing wisdom is that "it
helps with back-compat". Well, the entire ecosystem is moving to `uv` at
an unprecedented rate and keeping this creates unnecessary work and
confusion. The specific reason I am killing this is that it confuses
`dependabot`, which ends up not bumping `uv.lock`, the more important
file to change.
**What:**
- Added OpenAIChatCompletionTextOnlyMessageContent type for text-only
content validation
- Modified OpenAISystemMessageParam, OpenAIAssistantMessageParam,
OpenAIDeveloperMessageParam, and OpenAIToolMessageParam to use text-only
content type instead of mixed content
- OpenAIUserMessageParam unchanged - still accepts both text and images
- Updated OpenAPI spec files to reflect text-only content restrictions
in schemas
Closes #2894
**Why:**
- Enforces OpenAI API compatibility by restricting image content to user
messages only
- Prevents API misuse where images might be sent in message types that
don't support them
- Aligns with OpenAI's actual API behavior where only user messages can
contain multimodal content
- Improves type safety and validation at the API boundary
**Test plan:**
- Added comprehensive parametrized tests covering all 5 OpenAI message
types
- Tests verify text string acceptance for all message types
- Tests verify text list acceptance for all message types
- Tests verify image rejection for system/assistant/developer/tool
messages (ValidationError expected)
- Tests verify user messages still accept images (backward compatibility
maintained)
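A rough sketch of the idea with Pydantic (the real models live in `llama_stack/apis/`; the shapes below are simplified stand-ins):
```python
from pydantic import BaseModel, ValidationError


class TextContentItem(BaseModel):
    type: str = "text"
    text: str


# Text-only content: a plain string or a list of text items.
OpenAIChatCompletionTextOnlyMessageContent = str | list[TextContentItem]


class OpenAISystemMessageParam(BaseModel):
    role: str = "system"
    content: OpenAIChatCompletionTextOnlyMessageContent


# A system message carrying an image part now fails validation.
try:
    OpenAISystemMessageParam(
        content=[{"type": "image_url", "image_url": {"url": "http://example.com/x.png"}}]
    )
except ValidationError as exc:
    print("rejected:", exc.error_count(), "error(s)")
```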
# What does this PR do?
- Add base_url field to OpenAIConfig with default
"https://api.openai.com/v1"
- Update sample_run_config to support OPENAI_BASE_URL environment
variable
- Modify get_base_url() to return configured base_url instead of
hardcoded value
- Add comprehensive test suite covering:
- Default base URL behavior
- Custom base URL from config
- Environment variable override
- Config precedence over environment variables
- Client initialization with configured URL
- Model availability checks using configured URL
This enables users to configure custom OpenAI-compatible API endpoints
via environment variables or configuration files.
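Roughly, the config shape and override behavior described above (field names and defaults are assumptions for illustration):
```python
import os

from pydantic import BaseModel, Field


class OpenAIConfig(BaseModel):
    api_key: str | None = None
    base_url: str = Field(
        default="https://api.openai.com/v1",
        description="Base URL of the OpenAI-compatible endpoint to call.",
    )


# Environment-variable override, e.g. to point at an OpenAI-compatible proxy.
config = OpenAIConfig(base_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"))
print(config.base_url)
```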
Closes #2910
## Test Plan
run unit tests
# What does this PR do?
External provider docs mention setting provider_id in the build YAML.
Since we changed that to just provider_type and module, remove instances
of provider_id.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
provider_id is no longer valid in a build.yaml; remove it from the
external provider test.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
This enhancement allows inference providers using LiteLLMOpenAIMixin to
validate model availability against LiteLLM's official provider model
listings, improving reliability and user experience when working with
different AI service providers.
- Add litellm_provider_name parameter to LiteLLMOpenAIMixin constructor
- Add check_model_availability method to LiteLLMOpenAIMixin using
litellm.models_by_provider
- Update Gemini, Groq, and SambaNova inference adapters to pass
litellm_provider_name
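A minimal sketch of the check, assuming `litellm.models_by_provider` maps a provider name to the model identifiers LiteLLM knows about (the mixin body here is illustrative only):
```python
import litellm


class LiteLLMOpenAIMixin:
    def __init__(self, litellm_provider_name: str):
        self.litellm_provider_name = litellm_provider_name

    async def check_model_availability(self, model: str) -> bool:
        # e.g. litellm_provider_name is "gemini", "groq", or "sambanova"
        known_models = litellm.models_by_provider.get(self.litellm_provider_name, [])
        return model in known_models
```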
## Test Plan
standard CI.
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
- Added `set -e` to the beginning of the unit test script to ensure the
script exits on failure and correctly fails the CI when tests do not
pass.
- Fixed all unit tests that were silently failing in the CI.
- Fixed Python 3.13 unit test CI failing silently.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes #2877
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
- **Previously:** Unit tests passed in CI even though 11 tests failed
->
[CI-run](4683681501 (step):4:2097)
- **Made the fix. Now, ensuring CI fails as expected on test failures:**
Unit tests failing in CI with 1 failed test ->
[CI-run](4684234247 (step):4:1506)
- This PR shows the CI passing and all unit tests passing.
# What does this PR do?
The server logs have a persistent `core: refreshing registry` message
that clogs up the output. Switch it to debug level.
this is what it looked like:
<img width="1126" height="1028" alt="Screenshot 2025-07-28 at 9 56
44 AM"
src="https://github.com/user-attachments/assets/a1880fd3-7fc7-4a97-bfb8-89a62e4c5c19"
/>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Currently the external provider tests don't upload log files as
artifacts, nor do they use LLAMA_STACK_LOG_FILE. Align them with the
other integration tests.
## Test Plan
logs should be present in the two tests on this PR
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
In #2637, I combined the run and build config provider types to both use
`Provider`.
Since `Provider` includes a provider_id, a user must now specify one when
writing a build YAML. This is not very clear, because all a user should
care about at build time is the code to be installed (the module and the
provider_type).
Introduce `BuildProvider` and fix up the parts of the code impacted by
this.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Our CI is entirely undocumented; this commit adds a README.md file with
a table of the current CI workflows and what each does.
---------
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
Add support for deleting individual chunks from vector stores
- Add abstract remove_chunk() method to EmbeddingIndex base class
- Implement chunk deletion for Faiss provider, SQLite Vec, Milvus,
PGVector
- Placeholder implementations with NotImplementedError for
Chroma/Qdrant/Weaviate
- Integrate chunk deletion into OpenAI vector store file deletion flow
- removed xfail from
test_openai_vector_store_delete_file_removes_from_vector_store
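A sketch of the interface added above (the method signature and placeholder behavior are assumptions, not the exact API):
```python
from abc import ABC, abstractmethod


class EmbeddingIndex(ABC):
    @abstractmethod
    async def remove_chunk(self, chunk_id: str) -> None:
        """Delete a single chunk from the underlying index."""


class ChromaIndex(EmbeddingIndex):
    async def remove_chunk(self, chunk_id: str) -> None:
        # Placeholder until per-chunk deletion lands for Chroma.
        raise NotImplementedError("Chunk deletion is not yet supported for Chroma")
```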
Closes: #2477
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
Enable Chroma inline unit tests and fix integration tests.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Avoid the error message:
```
INFO 2025-07-24 21:51:54,530 __main__:598 server: Received interrupt signal, shutting down gracefully...
ERROR 2025-07-24 21:51:54,692 asyncio:1826 uncategorized: Task was destroyed but it is pending!
task: <Task pending name='Task-15' coro=<refresh_registry() running at
/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/stack.py:356> wait_for=<Future pending cb=[Task.task_wakeup()]> cb=>
```
# What does this PR do?
Today, external providers are installed via the `external_providers_dir`
in the config. This necessitates users to understand the `ProviderSpec`
and set up their directories accordingly. This process splits up the
config for the stack across multiple files, directories, and formats.
Most (if not all) external providers today have a
[get_provider_spec](559cb18fbb/src/ramalama_stack/provider.py (L9))
method that sits unused. Utilizing this method rather than the
providers.d route allows for a much easier installation process for
external providers and limits the amount of extra configuration a
regular user has to do to get their stack off the ground.
To accomplish this and wire it throughout the build process, introduce
the concept of a `module` that users can specify for an external provider
at build time. To facilitate this, align the build and run specs to use
the `Provider` class rather than the stringified provider_type that build
currently uses.
For example, say this is in your build config:
```
- provider_id: ramalama
provider_type: remote::ramalama
module: ramalama_stack
```
during build (in the various `build_...` scripts), additionally to
installing any pip dependencies we will also install this module and use
the `get_provider_spec` method to retrieve the ProviderSpec that is
currently specified using `providers.d`.
In production so far, providing instructions for installing external
providers for users has been difficult: they need to install the module
as a pre-req, create the providers.d directory, copy in the provider
spec, and also copy in the necessary build/run yaml files. Accessing an
external provider should be as easy as possible, and pointing to its
installable module aligns more with the rest of our build and dependency
management process.
For now, `external_providers_dir` still exists as an alternate more
declarative method of using external providers.
## Test Plan
added an integration test installing an external provider from module
and more unit test coverage for `get_provider_registry`
(The warning in yellow is expected; the module is installed inside the
build env, not where we are running the command.)
<img width="1119" height="400" alt="Screenshot 2025-07-24 at 11 30
48 AM"
src="https://github.com/user-attachments/assets/1efbaf45-b9e8-451a-bd63-264ed664706d"
/>
<img width="1154" height="618" alt="Screenshot 2025-07-24 at 11 31
14 AM"
src="https://github.com/user-attachments/assets/feb2b3ea-c5dd-418e-9662-9a3bd5dd6bdc"
/>
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from
6.4.1 to 6.4.3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/astral-sh/setup-uv/releases">astral-sh/setup-uv's
releases</a>.</em></p>
<blockquote>
<h2>v6.4.3 🌈 fix relative paths starting with dots</h2>
<h2>🐛 Bug fixes</h2>
<ul>
<li>fix relative paths starting with dots <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/500">#500</a>)</li>
</ul>
<h2>v6.4.2 🌈 Interpret relative inputs as under working-directory</h2>
<h2>Changes</h2>
<p>This release will interpret relative paths in inputs as relative
to the value of <code>working-directory</code> (default is <code>${{
github.workspace }}</code>) .
This means the following configuration</p>
<pre lang="yaml"><code>- uses: astral-sh/setup-uv@v6
with:
working-directory: /my/path
cache-dependency-glob: uv.lock
</code></pre>
<p>will look for the <code>cache-dependency-glob</code> under
<code>/my/path/uv.lock</code></p>
<h2>🐛 Bug fixes</h2>
<ul>
<li>interpret relative inputs as under working-directory <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/498">#498</a>)</li>
</ul>
<h2>🧰 Maintenance</h2>
<ul>
<li>chore: update known versions for 0.8.1/0.8.2 @<a
href="https://github.com/apps/github-actions">github-actions[bot]</a>
(<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/497">#497</a>)</li>
<li>chore: update known versions for 0.8.0 @<a
href="https://github.com/apps/github-actions">github-actions[bot]</a>
(<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/491">#491</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="e92bafb625"><code>e92bafb</code></a>
fix relative paths starting with dots (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/500">#500</a>)</li>
<li><a
href="2c7142f755"><code>2c7142f</code></a>
interpret relative inputs as under working-directory (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/498">#498</a>)</li>
<li><a
href="23482a31a8"><code>23482a3</code></a>
chore: update known versions for 0.8.1/0.8.2 (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/497">#497</a>)</li>
<li><a
href="4ac06a054e"><code>4ac06a0</code></a>
chore: update known versions for 0.8.0 (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/491">#491</a>)</li>
<li>See full diff in <a
href="7edac99f96...e92bafb625">compare
view</a></li>
</ul>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [form-data](https://github.com/form-data/form-data) from 4.0.2 to
4.0.4.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/form-data/form-data/releases">form-data's
releases</a>.</em></p>
<blockquote>
<h2>v4.0.4</h2>
<h2><a
href="https://github.com/form-data/form-data/compare/v4.0.3...v4.0.4">v4.0.4</a>
- 2025-07-16</h2>
<h3>Commits</h3>
<ul>
<li>[meta] add <code>auto-changelog</code> <a
href="811f68282f"><code>811f682</code></a></li>
<li>[Tests] handle predict-v8-randomness failures in node < 17 and
node > 23 <a
href="1d11a76434"><code>1d11a76</code></a></li>
<li>[Fix] Switch to using <code>crypto</code> random for boundary values
<a
href="3d1723080e"><code>3d17230</code></a></li>
<li>[Tests] fix linting errors <a
href="5e340800b5"><code>5e34080</code></a></li>
<li>[meta] actually ensure the readme backup isn’t published <a
href="316c82ba93"><code>316c82b</code></a></li>
<li>[Dev Deps] update <code>@ljharb/eslint-config</code> <a
href="58c25d7640"><code>58c25d7</code></a></li>
<li>[meta] fix readme capitalization <a
href="2300ca1959"><code>2300ca1</code></a></li>
</ul>
<h2>v4.0.3</h2>
<h2><a
href="https://github.com/form-data/form-data/compare/v4.0.2...v4.0.3">v4.0.3</a>
- 2025-06-05</h2>
<h3>Fixed</h3>
<ul>
<li>[Fix] <code>append</code>: avoid a crash on nullish values <a
href="https://redirect.github.com/form-data/form-data/issues/577"><code>[#577](https://github.com/form-data/form-data/issues/577)</code></a></li>
</ul>
<h3>Commits</h3>
<ul>
<li>[eslint] use a shared config <a
href="426ba9ac44"><code>426ba9a</code></a></li>
<li>[eslint] fix some spacing issues <a
href="20941917f0"><code>2094191</code></a></li>
<li>[Refactor] use <code>hasown</code> <a
href="81ab41b46f"><code>81ab41b</code></a></li>
<li>[Fix] validate boundary type in <code>setBoundary()</code> method <a
href="8d8e469309"><code>8d8e469</code></a></li>
<li>[Tests] add tests to check the behavior of <code>getBoundary</code>
with non-strings <a
href="837b8a1f75"><code>837b8a1</code></a></li>
<li>[Dev Deps] remove unused deps <a
href="870e4e6659"><code>870e4e6</code></a></li>
<li>[meta] remove local commit hooks <a
href="e6e83ccb54"><code>e6e83cc</code></a></li>
<li>[Dev Deps] update <code>eslint</code> <a
href="4066fd6f65"><code>4066fd6</code></a></li>
<li>[meta] fix scripts to use prepublishOnly <a
href="c4bbb13c0e"><code>c4bbb13</code></a></li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/form-data/form-data/blob/master/CHANGELOG.md">form-data's
changelog</a>.</em></p>
<blockquote>
<h2><a
href="https://github.com/form-data/form-data/compare/v4.0.3...v4.0.4">v4.0.4</a>
- 2025-07-16</h2>
<h3>Commits</h3>
<ul>
<li>[meta] add <code>auto-changelog</code> <a
href="811f68282f"><code>811f682</code></a></li>
<li>[Tests] handle predict-v8-randomness failures in node < 17 and
node > 23 <a
href="1d11a76434"><code>1d11a76</code></a></li>
<li>[Fix] Switch to using <code>crypto</code> random for boundary values
<a
href="3d1723080e"><code>3d17230</code></a></li>
<li>[Tests] fix linting errors <a
href="5e340800b5"><code>5e34080</code></a></li>
<li>[meta] actually ensure the readme backup isn’t published <a
href="316c82ba93"><code>316c82b</code></a></li>
<li>[Dev Deps] update <code>@ljharb/eslint-config</code> <a
href="58c25d7640"><code>58c25d7</code></a></li>
<li>[meta] fix readme capitalization <a
href="2300ca1959"><code>2300ca1</code></a></li>
</ul>
<h2><a
href="https://github.com/form-data/form-data/compare/v4.0.2...v4.0.3">v4.0.3</a>
- 2025-06-05</h2>
<h3>Fixed</h3>
<ul>
<li>[Fix] <code>append</code>: avoid a crash on nullish values <a
href="https://redirect.github.com/form-data/form-data/issues/577"><code>[#577](https://github.com/form-data/form-data/issues/577)</code></a></li>
</ul>
<h3>Commits</h3>
<ul>
<li>[eslint] use a shared config <a
href="426ba9ac44"><code>426ba9a</code></a></li>
<li>[eslint] fix some spacing issues <a
href="20941917f0"><code>2094191</code></a></li>
<li>[Refactor] use <code>hasown</code> <a
href="81ab41b46f"><code>81ab41b</code></a></li>
<li>[Fix] validate boundary type in <code>setBoundary()</code> method <a
href="8d8e469309"><code>8d8e469</code></a></li>
<li>[Tests] add tests to check the behavior of <code>getBoundary</code>
with non-strings <a
href="837b8a1f75"><code>837b8a1</code></a></li>
<li>[Dev Deps] remove unused deps <a
href="870e4e6659"><code>870e4e6</code></a></li>
<li>[meta] remove local commit hooks <a
href="e6e83ccb54"><code>e6e83cc</code></a></li>
<li>[Dev Deps] update <code>eslint</code> <a
href="4066fd6f65"><code>4066fd6</code></a></li>
<li>[meta] fix scripts to use prepublishOnly <a
href="c4bbb13c0e"><code>c4bbb13</code></a></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="41996f5ac7"><code>41996f5</code></a>
v4.0.4</li>
<li><a
href="316c82ba93"><code>316c82b</code></a>
[meta] actually ensure the readme backup isn’t published</li>
<li><a
href="2300ca1959"><code>2300ca1</code></a>
[meta] fix readme capitalization</li>
<li><a
href="811f68282f"><code>811f682</code></a>
[meta] add <code>auto-changelog</code></li>
<li><a
href="5e340800b5"><code>5e34080</code></a>
[Tests] fix linting errors</li>
<li><a
href="1d11a76434"><code>1d11a76</code></a>
[Tests] handle predict-v8-randomness failures in node < 17 and node
> 23</li>
<li><a
href="58c25d7640"><code>58c25d7</code></a>
[Dev Deps] update <code>@ljharb/eslint-config</code></li>
<li><a
href="3d1723080e"><code>3d17230</code></a>
[Fix] Switch to using <code>crypto</code> random for boundary
values</li>
<li><a
href="d8d67dc8ac"><code>d8d67dc</code></a>
v4.0.3</li>
<li><a
href="e6e83ccb54"><code>e6e83cc</code></a>
[meta] remove local commit hooks</li>
<li>Additional commits viewable in <a
href="https://github.com/form-data/form-data/compare/v4.0.2...v4.0.4">compare
view</a></li>
</ul>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
- Added ability to specify `required_scope` when declaring an API. This
is part of the `@webmethod` decorator.
- If auth is enabled, a user can access an API only if
`user.attributes['scope']` includes the `required_scope`
- We add `required_scope='telemetry.read'` to the telemetry read APIs.
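For illustration, a simplified sketch of how such a decorator can carry the scope (the real `@webmethod` implementation may differ):
```python
def webmethod(route: str, method: str = "GET", required_scope: str | None = None):
    """Record routing metadata, including the scope a caller must hold."""

    def wrapper(fn):
        fn.__webmethod__ = {"route": route, "method": method, "required_scope": required_scope}
        return fn

    return wrapper


class Telemetry:
    @webmethod(route="/telemetry/traces", method="POST", required_scope="telemetry.read")
    async def query_traces(self) -> list[dict]:
        return []


# At request time the server compares the authenticated user's scopes against
# query_traces.__webmethod__["required_scope"] and returns 403 on a mismatch.
```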
## Test Plan
CI with added tests
1. Enable server.auth with github token
2. Observe `client.telemetry.query_traces()` returns 403
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds support for the new Streamable HTTP transport for MCP, as
well as falling back to the SSE protocol if the Streamable HTTP
connection fails.
<!-- If resolving an issue, uncomment and update the line below -->
Closes #2542
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
---------
Signed-off-by: Calum Murray <cmurray@redhat.com>
# What does this PR do?
Prototype of a new feature to allow new APIs to be plugged into Llama
Stack. Opened for early feedback on the approach and to gauge appetite
for the functionality.
@ashwinb @raghotham open for early feedback, thanks!
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Currently, `print` is used with custom formatting to produce telemetry
output in the console_span_processor.
This causes telemetry not to show up in log files when using
`LLAMA_STACK_LOG_FILE`; during testing it looks like telemetry is not
being captured when it actually is.
Switch to using Rich formatting with the logger, and strip the formatting
when a log file is being used so the output looks normal.
## Test Plan
before:
console:
<img width="967" height="127" alt="Screenshot 2025-07-21 at 4 02 15 PM"
src="https://github.com/user-attachments/assets/b09518cc-9d38-4970-9877-70e2c41fcbb5"
/>
log file (no telemetry):
```
2025-07-21 16:01:32,481 llama_stack.providers.remote.inference.ollama.ollama:117 inference: checking connectivity to Ollama at `http://localhost:11434`...
2025-07-21 16:01:34,779 opentelemetry.trace:537 uncategorized: Overriding of current TracerProvider is not allowed
2025-07-21 16:01:35,083 __main__:587 server: Listening on ['::', '0.0.0.0']:8321
2025-07-21 16:01:35,091 uvicorn.error:84 uncategorized: Started server process [68679]
2025-07-21 16:01:35,091 uvicorn.error:48 uncategorized: Waiting for application startup.
2025-07-21 16:01:35,092 __main__:163 server: Starting up
2025-07-21 16:01:35,092 uvicorn.error:62 uncategorized: Application startup complete.
2025-07-21 16:01:35,092 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
2025-07-21 16:01:37,167 uvicorn.access:473 uncategorized: 127.0.0.1:53145 - "POST /v1/openai/v1/chat/completions HTTP/1.1" 200
```
after:
console:
<img width="797" height="165" alt="Screenshot 2025-07-22 at 3 28 44 PM"
src="https://github.com/user-attachments/assets/44d40e3b-6502-439d-9ea5-38058b289962"
/>
log file:
```
2025-07-21 15:59:51,481 llama_stack.providers.remote.inference.ollama.ollama:117 inference: checking connectivity to Ollama at `http://localhost:11434`...
2025-07-21 15:59:53,801 opentelemetry.trace:537 uncategorized: Overriding of current TracerProvider is not allowed
2025-07-21 15:59:54,059 __main__:587 server: Listening on ['::', '0.0.0.0']:8321
2025-07-21 15:59:54,066 uvicorn.error:84 uncategorized: Started server process [68578]
2025-07-21 15:59:54,067 uvicorn.error:48 uncategorized: Waiting for application startup.
2025-07-21 15:59:54,067 __main__:163 server: Starting up
2025-07-21 15:59:54,067 uvicorn.error:62 uncategorized: Application startup complete.
2025-07-21 15:59:54,068 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
2025-07-21 15:59:55,381 [TELEMETRY] 19:59:55.381 /v1/openai/v1/chat/completions
2025-07-21 15:59:55,619 uvicorn.access:473 uncategorized: 127.0.0.1:53102 - "POST /v1/openai/v1/chat/completions HTTP/1.1" 200
2025-07-21 15:59:55,621 [TELEMETRY] 19:59:55.621 /v1/openai/v1/chat/completions [StatusCode.OK] (240.07ms)
2025-07-21 15:59:55,622 [TELEMETRY] 19:59:55.620 127.0.0.1:53102 - "POST /v1/openai/v1/chat/completions HTTP/1.1" 200
```
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Our demo installation script should pull the starter image. Ollama is
not being updated anymore as a distribution.
Signed-off-by: Sébastien Han <seb@redhat.com>
This flips #2823 and #2805 by making the Stack periodically query the
providers for models rather than the providers going behind our back and
calling "register" on the registry themselves. This also adds support
for model listing for all other providers via `ModelRegistryHelper`.
Once this is done, we do not need to manually list or register models
via `run.yaml` and it will remove both noise and annoyance (setting
`INFERENCE_MODEL` environment variables, for example) from the new user
experience.
In addition, it adds a configuration variable `allowed_models` which can
be used to optionally restrict the set of models exposed from a
provider.
# What does this PR do?
Adds type guards in /distribution/inspect.py and ignores a valid-type
mypy error in library_client.py. This PR is part of issue #2647 . I'm
rather unsure whether ignoring the valid-type error is correct in this
case. It appears that args[0] is interpreted as [any] but I didn't find
any way to specify the type.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
Bulk improvements:
* The script has better error reporting: when a command fails, it prints
the logs of the failed command
* Better error handling using a trap to catch signal and perform proper
cleanup
* Cosmetic changes
* Added CI to test the image code against main
* Use the starter image and its latest tag
Signed-off-by: Sébastien Han <seb@redhat.com>
- Add setup-vllm GitHub action to start VLLM container
- Extend integration test matrix to support both ollama and vllm
providers
- Make test setup conditional based on provider type
- Add provider-specific environment variables and configurations
- vllm tests setup to run weekly or can be triggered manually (only
ollama on PR)
TODO:
investigate failing tests for vllm provider (safety and post_training)
Also need a proper fix for #2713 (tmp fix for this in the first commit
in this PR)
Closes: #1648
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This pull request adds documentation to clarify the differences between
the Agents API and the OpenAI Responses API, including use cases for
each. It also updates the index page to reference the new documentation.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes #2368
# What does this PR do?
Updates the script `scripts/check-workflows-use-hashes.sh` to improve
error reporting by adopting GitHub Actions error annotation format.
* Updated the script to use GitHub Actions error annotation format
(`::error file={name},line={line},col={col}::{message}`) making error
messages more actionable and easier to locate in workflows.
* Modified the script to include line numbers for `uses:` references by
using `grep -n` and extracting line numbers, improving the precision of
error reporting.
Closes #2778
## Test Plan
- Violation check - Created test file with mixed SHA/non-SHA actions
```
echo 'uses: actions/checkout@v4' > test-workflow.yml
echo 'uses: actions/upload-artifact@main' >> test-workflow.yml
```
Result: Correctly detected violations with precise line numbers
```
./scripts/check-workflows-use-hashes.sh
Output:
::error file=test-workflow.yml,line=14::uses non-SHA action ref: uses: actions/checkout@v4
::error file=test-workflow.yml,line=20::uses non-SHA action ref: uses: actions/upload-artifact@main
```
- Verified existing project workflows pass
```
./scripts/check-workflows-use-hashes.sh
# Result: Exit code 0 (all workflows properly SHA-pinned)
```
# What does this PR do?
openai/models.py has backward-compat entries for litellm model names.
The starter template includes these in the list of registered models.
The inclusion results in duplicate model registrations.
The backward compat is no longer necessary.
## Test Plan
ci
# What does this PR do?
This PR implements the openai compatible endpoints for chromadb
Closes #2462
## Test Plan
Ran ollama llama stack server and ran the command
`pytest -sv --stack-config=http://localhost:8321
tests/integration/vector_io/test_openai_vector_stores.py
--embedding-model all-MiniLM-L6-v2`
8 failed, 27 passed, 8 skipped, 1 xfailed
The failed ones relate to the files API.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Co-authored-by: sarthakdeshpande <sarthak.deshpande@engati.com>
Co-authored-by: Francisco Javier Arceo <farceo@redhat.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
- Use printf to escape special characters (e.g. < > )
- Apply escaping to pip_dependencies and special_pip_deps
Resolves shell interpretation of >= operators as redirections, which was
causing builds to fail to respect versions and creating unexpected files
in the /app directory.
Closes: #2866
## Test Plan
Manually tested, will also be tested by existing CI
Signed-off-by: Derek Higgins <derekh@redhat.com>
- rm TEMP_DIR when build_container.sh succeeds
- prevents multiple temp directories with Containerfile being left in
/tmp
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
I fixed the test_access_policy() function by providing provider_model_id
in each register-model call so that the assertions pass.
Initially I faced this issue:
```
tests/unit/server/test_quota.py::test_authenticated_quota_allows_up_to_limit
tests/unit/server/test_quota.py::test_authenticated_quota_blocks_after_limit
tests/unit/server/test_quota.py::test_anonymous_quota_allows_up_to_limit
tests/unit/server/test_quota.py::test_anonymous_quota_blocks_after_limit
/Users/iamiller/GitHub/llama-stack/.venv/lib/python3.12/site-packages/aiosqlite/core.py:105: DeprecationWarning: The default datetime adapter is deprecated as of Python 3.12; see the sqlite3 documentation for suggested replacement recipes
result = function()
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================================== short test summary info ===============================================================================
FAILED tests/unit/server/test_access_control.py::test_access_policy - AssertionError: assert 'test_provider/model-1' == 'model-1'
==================================================================== 1 failed, 436 passed, 194 warnings in 20.09s ====================================================================
```
After resolved, all works:
```
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================= 437 passed, 194 warnings in 19.41s =========================================================================
```
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Run ` ./scripts/unit-tests.sh`
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
I noticed a few issues with my implementation of the search mode
validation for RagQuery.
This PR replaces the check for search mode in RagQuery with a Literal.
There were issues before with
```
TypeError: Object of type RAGSearchMode is not JSON serializable
```
When using
```
query_config = RAGQueryConfig(max_chunks=6, mode="vector").model_dump()
```
It also fixes the fact that, regardless of user input, "vector" was
always the search mode used.
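Sketch of the change (field names follow the snippets in this description; the defaults and the exact set of modes are assumptions):
```python
from typing import Literal

from pydantic import BaseModel


class RAGQueryConfig(BaseModel):
    max_chunks: int = 5
    mode: Literal["vector", "keyword"] = "vector"


# A plain Literal keeps the field JSON-serializable and the chosen mode is honored.
print(RAGQueryConfig(max_chunks=6, mode="keyword").model_dump())
# {'max_chunks': 6, 'mode': 'keyword'}
```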
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Verify that a chosen search mode works when using RAG Query, or use the
agent config below:
```
agent = Agent(
client,
model=model_id,
instructions="You are a helpful assistant",
tools=[
{
"name": "builtin::rag/knowledge_search",
"args": {
"vector_db_ids": [vector_db_id],
"query_config": {
"mode": "keyword",
"max_chunks": 6
}
},
}
],
)
```
Running Unit Tests:
```
uv sync --extra dev
uv run pytest tests/unit/rag/test_rag_query.py -v
```
# What does this PR do?
Moving vector store and vector store files helper methods to
`openai_vector_store_mixin.py`
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
The tests are already supported in the CI and tests the inline providers
and current integration tests.
Note that the `vector_index` fixture will test `milvus_vec_adapter`,
`faiss_vec_adapter`, and `sqlite_vec_adapter` in
`tests/unit/providers/vector_io/test_vector_io_openai_vector_stores.py`.
Additionally, the integration tests in `integration-vector-io-tests.yml`
runs `tests/integration/vector_io` tests for the following providers:
```python
vector-io-provider: ["inline::faiss", "inline::sqlite-vec", "inline::milvus", "remote::chromadb", "remote::pgvector"]
```
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Add an `OpenAIMixin` for use by inference providers whose remote
endpoints support an OpenAI-compatible API.
use is demonstrated by refactoring
- OpenAIInferenceAdapter
- NVIDIAInferenceAdapter (adds embedding support)
- LlamaCompatInferenceAdapter
## Test Plan
existing unit and integration tests
# What does this PR do?
https://github.com/meta-llama/llama-stack/pull/2716/ broke commands
like:
```
python -m llama_stack.distribution.server.server --config
llama_stack/templates/starter/run.yaml
```
And will fail with:
```
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py", line 626, in <module>
main()
File "/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py", line 402, in main
config_file = resolve_config_or_template(args.config, Mode.RUN)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/utils/config_resolution.py", line 43, in resolve_config_or_template
config_path = Path(config_or_template)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.8/Frameworks/Python.framework/Versions/3.12/lib/python3.12/pathlib.py", line 1162, in __init__
super().__init__(*args)
File "/opt/homebrew/Cellar/python@3.12/3.12.8/Frameworks/Python.framework/Versions/3.12/lib/python3.12/pathlib.py", line 373, in __init__
raise TypeError(
TypeError: argument should be a str or an os.PathLike object where __fspath__ returns a str, not 'NoneType'
```
Complaining that no positional arguments are present. We now honour the
deprecation until --config and --template are removed completely.
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Both ` python -m llama_stack.distribution.server.server --config
llama_stack/templates/starter/run.yaml` and ` python -m
llama_stack.distribution.server.server
llama_stack/templates/starter/run.yaml` should run the server. Same for
`--template starter`.
Signed-off-by: Sébastien Han <seb@redhat.com>
- Remove --no-cache flags from uv pip install commands to enable caching
- Mount host uv cache directory to container for persistent caching
- Set UV_LINK_MODE=copy to prevent uv using hardlinks
- When building the starter image:
  - Build time reduced from ~4:45 to ~3:05 on subsequent builds
    (environment specific)
  - Eliminates re-downloading of 3G+ of data on each build
  - Cache size: ~6.2G (when building the starter image)
Fixes excessive data downloads during distro container builds.
Signed-off-by: Derek Higgins <derekh@redhat.com>
This PR updates model registration and lookup behavior to be slightly
more general / flexible. See
https://github.com/meta-llama/llama-stack/issues/2843 for more details.
Note that this change is backwards compatible given the design of the
`lookup_model()` method.
## Test Plan
Added unit tests
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR fixes flaky telemetry tests
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
See https://github.com/meta-llama/llama-stack/pull/2814
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
When podman is used and the registry is omitted, podman will prompt the
user. However, we're piping the output of podman to /dev/null, so the
user will not see the prompt; the script ends abruptly, which is
confusing.
This commit explicitly uses the docker.io registry for the ollama image
and the llama-stack image so that the prompt is avoided.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
I ran the script on a machine with podman and the issue was resolved
## Image
Before the fix, this is what would happen:
<img width="748" height="95" alt="image"
src="https://github.com/user-attachments/assets/9c609f88-c0a8-45e7-a789-834f64f601e5"
/>
Signed-off-by: Omer Tuchfeld <omer@tuchfeld.dev>
# What does this PR do?
chore: Making name optional in openai_create_vector_store
Closes https://github.com/meta-llama/llama-stack/issues/2706
## Test Plan
CI and unit tests
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Ensures that session turns retrieved from the agent persistence layer
are sorted by their `started_at` timestamp, as the key-value store does
not guarantee order.
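The fix boils down to an explicit sort before returning the session, roughly (the types below are illustrative):
```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Turn:
    turn_id: str
    started_at: datetime


def sorted_session_turns(turns: list[Turn]) -> list[Turn]:
    # The KV store does not guarantee ordering, so order turns explicitly
    # by their started_at timestamp.
    return sorted(turns, key=lambda t: t.started_at)
```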
Closes #2852
## Test Plan
- [ ] Add unit tests
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
minor update of the pgvector doc, changing 'faiss' to 'pgvector'
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
This PR adds the quickstart as a file to the docs so that it can be more
easily maintained and run, as mentioned in
https://github.com/meta-llama/llama-stack/pull/2800.
## Test Plan
I could add this as a test in the CI but I wasn't sure if we wanted to
add additional jobs there. 😅
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Refactors the vector store routing logic by moving OpenAI-compatible
vector store operations from the `VectorIORouter` to the
`VectorDBsRoutingTable`.
Closes https://github.com/meta-llama/llama-stack/issues/2761
## Test Plan
Added unit tests to cover new routing logic and ACL checks.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Part of #2696
## Test Plan
Run `llama stack run starter`
Error:
```
myenv ❯ llama stack run starters
WARNING 2025-07-10 12:12:43,052 llama_stack.cli.stack.run:82 server: Conda detected. Using conda environment myenv for the run.
usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--env KEY=VALUE]
[--image-type {conda,venv}] [--enable-ui]
[config | template]
llama stack run: error: Could not resolve config or template 'starters'.
Tried the following locations:
1. As file path: /Users/erichuang/projects/llama-stack-git/starters
2. As template: /Users/erichuang/projects/llama-stack-git/llama_stack/templates/starters/run.yaml
3. As built distribution: (/Users/erichuang/.llama/distributions/llamastack-starters/starters-run.yaml, /Users/erichuang/.llama/distributions/starters/starters-run.yaml)
Available templates: dell, test-env, vllm-gpu, test-template, cerebras, openai-api-verification, sambanova, passthrough, direct-config, together, openai, fireworks, meta-reference-gpu, __pycache__, dev, ollama, watsonx, remote-vllm, llama_api, groq, dummy, oracle, nvidia, ci-tests, postgres-demo, test-stack, bedrock, starter, hf-serverless, hf-endpoint, tgi, open-benchmark, verification
Did you mean one of these templates?
- starter
- together
- postgres-demo
```
# What does this PR do?
After https://github.com/meta-llama/llama-stack/pull/2818, SIGINT will
print a stack trace. This is because uvicorn re-raises SIGINT, which gets
converted by Python's internal signal handler (which handles SIGINT by
default) into a KeyboardInterrupt exception. We now simply catch the
exception to get a clean exit; this does not change the behavior on
SIGINT.
## Test Plan
Run the server, hit Ctrl+C or `kill -2 <server pid>` and expect a clean
exit with no stack trace.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
The pre-commit workflow was failing in the main branch, and removing
`@pytest.mark.asyncio` from `test_get_raw_document_text.py` fixed that.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds the `provider_id` field to the `VectorDBInput` class.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
fixes https://github.com/meta-llama/llama-stack/issues/2819
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
The workflow that automatically creates a PR to update the Coverage
Badge fails as the `GITHUB_TOKEN` doesn't have write permissions.
As opposed to providing write permissions to the token, we can provide
the permissions for just this workflow with this PR.
Just like #2805 but for vLLM.
We also make VLLM_URL env variable optional (not required) -- if not
specified, the provider silently sits idle and yells eventually if
someone tries to call a completion on it. This is done so as to allow
this provider to be present in the `starter` distribution.
## Test Plan
Set up vLLM, copy the starter template and set `{ refresh_models: true,
refresh_models_interval: 10 }` for the vllm provider and then run:
```
ENABLE_VLLM=vllm VLLM_URL=http://localhost:8000/v1 \
uv run llama stack run --image-type venv /tmp/starter.yaml
```
Verify that `llama-stack-client models list` brings up the model
correctly from vLLM.
Inline _inference_ providers haven't proved to be very useful -- they
are rarely used. And for good reason -- it is almost never a good idea
to include a complex (distributed) inference engine bundled into a
distributed stateful front-end server serving many other things.
Responsibility should be split properly.
See Discord discussion:
1395849853
For self-hosted providers like Ollama (or vLLM), the backing server is
running a set of models. That server should be treated as the source of
truth and the Stack registry should just be a cache for those models. Of
course, in production environments, you may not want this (because you
know what model you are running statically) hence there's a config
boolean to control this behavior.
_This is part of a series of PRs aimed at removing the requirement of
needing to set `INFERENCE_MODEL` env variables for running Llama Stack
server._
## Test Plan
Copy and modify the starter.yaml template / config and enable
`refresh_models: true, refresh_models_interval: 10` for the ollama
provider. Then, run:
```
LLAMA_STACK_LOGGING=all=debug \
ENABLE_OLLAMA=ollama uv run llama stack run --image-type venv /tmp/starter.yaml
```
See a gargantuan amount of logs, but verify that the provider is
periodically refreshing models. Stop and prune a model from ollama
server, restart the server. Verify that the model goes away when I call
`uv run llama-stack-client models list`
# What does this PR do?
This PR fixes the `DPOAlignmentConfig` schema to use the correct Direct
Preference Optimization (DPO) parameters.
The current schema incorrectly uses PPO-inspired parameters
(`reward_scale`, `reward_clip`, `epsilon`, `gamma`) that are not part of
the DPO algorithm. This PR updates it to use the standard DPO
parameters:
- `beta`: The KL divergence coefficient that controls deviation from the
reference model
- `loss_type`: The type of DPO loss function (sigmoid, hinge, ipo,
kto_pair)
These parameters align with standard DPO implementations like
HuggingFace's TRL library.
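For illustration only, a hedged sketch of what such a schema can look like (the field names are from this PR; the enum and defaults are assumptions, not the actual class definition):
```python
# Conceptual sketch only -- the real DPOAlignmentConfig lives in llama-stack's
# post-training API and may carry additional fields.
from enum import Enum
from pydantic import BaseModel

class DPOLossType(str, Enum):
    sigmoid = "sigmoid"
    hinge = "hinge"
    ipo = "ipo"
    kto_pair = "kto_pair"

class DPOAlignmentConfig(BaseModel):
    beta: float                                    # KL-divergence coefficient vs. the reference model
    loss_type: DPOLossType = DPOLossType.sigmoid   # which DPO loss variant to use
```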
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-43-83.ec2.internal>
When we call `construct_stack()`, providers are instantiated and
`initialize()` is called. This call can end up doing _anything_ at all
-- specifically, providers are free to create long running background
tasks as part of this. If we wrapped this within a `asyncio.run()` as in
the current code, these tasks get canceled when the stack construction
finishes. This is not correct. The PR addresses the issue by creating a
persistent event loop which is used for both the stack as well as for
running the uvicorn server. In other words, the lifetime of the
providers (and downstream async code) is now the same as the lifetime of
the uvicorn server.
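A self-contained sketch of the pattern under discussion (all names here are illustrative stand-ins, not the actual server code):
```python
# Sketch: one long-lived loop owns both provider initialization (which may
# spawn background tasks) and the uvicorn server, so those tasks are not
# cancelled when initialization finishes.
import asyncio
import uvicorn
from fastapi import FastAPI

app = FastAPI()

async def refresh_loop() -> None:
    while True:  # stand-in for a provider's long-running background task
        await asyncio.sleep(10)

async def initialize_providers() -> None:
    asyncio.get_running_loop().create_task(refresh_loop())

def main() -> None:
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    # If this used asyncio.run(...), the background task would be cancelled
    # as soon as initialization returned.
    loop.run_until_complete(initialize_providers())
    server = uvicorn.Server(uvicorn.Config(app, port=8321))
    loop.run_until_complete(server.serve())

if __name__ == "__main__":
    main()
```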
## Test Plan
This should not affect any current code since we don't have background
tasks created right now. However,
https://github.com/meta-llama/llama-stack/pull/2805 will start using
this functionality.
# What does this PR do?
The 'build' command didn't take into account ENABLE flags for the starter
distro for some reason. I was also having issues with HuggingFace access
for the embedding model, so I added a tip for that as well.
Closes #2779
## Test Plan
I ran the described steps manually, but it would be nice if someone else
could try it and verify this still works
We might consider having some CI job ensure the QSG remains functional -
it's not a great experience for new users if they try Llama Stack for
the first time and it doesn't work as we describe
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
- Added coverage badge to README. - [See my
fork](https://github.com/ChristianZaccaria/llama-stack)
- Added a GitHub Actions workflow that runs the tests and updates the
coverage badge. - [See
run](4574811323)
- Documented steps in `testing.md` for running the tests locally, and
viewing the `html` report.
- Excluded non-essential files from coverage reporting to provide a more
accurate measurement.
Automatically created PR to update coverage badge:
https://github.com/ChristianZaccaria/llama-stack/pull/9
# Note for reviewers
1. Currently the coverage report shows a 45% coverage. Wondering if
there are other files or directories that should also be excluded from
the report to increase the percentage. The directories with the least
test coverage are `llama_stack/cli`, `llama_stack/models`, and
`llama_stack/ui`. - Should we exclude these?
2. **[Required]** The `GITHUB_TOKEN` should have write permissions to
open a PR to update the coverage badge.
# GitHub Issue
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes #2355
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
The `testing.md` file describes how to run the unit tests locally.
# What does this PR do?
trigger integration tests on ALL changes to `tests/` to catch failures
before they merge into main
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
Some async test markers in the codebase cause pre-commit to fail due to
#2744; remove these pytest fixtures.
## Test Plan
pre-commit passes
Signed-off-by: Charlie Doern <cdoern@redhat.com>
If I am running `uv run llama stack run --image-type venv` it should not
be saying to me "Conda detected" because I am pretty clearly telling it
I need venv. The root cause is the offending line.
# What does this PR do?
## Test Plan
ENABLE_OLLAMA=ollama LLAMA_STACK_CONFIG=starter uv run pytest
tests/integration/telemetry
--text-model="ollama/llama3.2:3b-instruct-fp16"
# What does this PR do?
lets users register models available at
https://integrate.api.nvidia.com/v1/models that aren't already in
llama_stack/providers/remote/inference/nvidia/models.py
## Test Plan
1. run the nvidia distro
2. register a model from https://integrate.api.nvidia.com/v1/models that
isn't already known; as of this writing,
nvidia/llama-3.1-nemotron-ultra-253b-v1 is a good example
3. perform inference w/ the model
- POST /v1/models accepts optional provider_model_id
- ModelsRoutingTable.register_model handler ensures it is non-None,
providing a default
With this, usage of Model.provider_model_id no longer needs to detect None.
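A hedged sketch of the fallback described above (illustrative helper, not the actual handler):
```python
def default_provider_model_id(model_id: str, provider_model_id: str | None) -> str:
    """Illustrative: callers may omit provider_model_id; fall back to model_id."""
    return provider_model_id if provider_model_id is not None else model_id
```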
Move sentence-transformers to be the first embedding in the list of
models. This ensures it will always be the default and is more
consistent than having the default change based on what env variables
are available
Closes: #2702
## Test Plan
Manually verified
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Currently each disabled provider is printed as a warning; switch to
debug. This level of verbosity isn't necessary, especially if we intend
to grow the list of providers that can be in a single run yaml over time
## Test Plan
before:
<img width="1144" height="667" alt="Screenshot 2025-07-16 at 12 37
18 PM"
src="https://github.com/user-attachments/assets/d14dbf76-6e40-4996-8a27-111e6a987d71"
/>
after:
<img width="925" height="141" alt="Screenshot 2025-07-16 at 12 37 42 PM"
src="https://github.com/user-attachments/assets/81efdbe1-923c-4c5f-9731-f89729043920"
/>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Resolves https://github.com/meta-llama/llama-stack/issues/2770. It
replaces characters in SQLite table names that are not alphanumeric or
underscores with underscores and quotes the table names with square
brackets in SQL statements.
Closes #2770
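The sanitization amounts to something like this (illustrative sketch, not the exact helper from the PR):
```python
import re

def sanitize_table_name(bank_id: str) -> str:
    """Replace anything that is not alphanumeric or underscore with '_'."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", bank_id)

# e.g. "test_bank.123" -> "test_bank_123"; the name is then quoted as
# [test_bank_123] when interpolated into SQL statements.
print(sanitize_table_name("test_bank.123"))
```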
## Test Plan
I added a ".123" suffix to the bank_id on the following line
```
index = await SQLiteVecIndex.create(dimension=embedding_dimension, db_path=db_path, bank_id="test_bank.123")
```
in tests/unit/providers/vector_io/test_sqlite_vec.py, which, without the
fix in place, demonstrates the issue.
# What does this PR do?
this was causing an unnecessary logger warning
## Test Plan
Run `LLAMA_STACK_DIR=. ENABLE_OLLAMA=ollama
OLLAMA_INFERENCE_MODEL=llama3.2:3b llama stack build --template starter
--image-type venv --run` and then `Ctrl-C` to shut down
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
This was introduced in
https://github.com/meta-llama/llama-stack/pull/523 but as far as I can
tell has never been used. It's been over six months so it feels fair to
remove it at this point.
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
The vision models are now available at the standard URL, so the
workaround code has been removed. This also simplifies the codebase by
eliminating the need for per-model client caching.
- Remove special URL handling for meta/llama-3.2-11b/90b-vision-instruct
models
- Convert _get_client method to _client property for cleaner API
- Remove unnecessary lru_cache decorator and functools import
- Simplify client creation logic to use single base URL for all models
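Conceptually, the change looks roughly like this (hedged sketch; class and attribute names are illustrative):
```python
# Sketch: a single client built from one base URL, exposed as a property
# instead of a cached, per-model _get_client() helper.
from openai import OpenAI

class InferenceAdapter:  # illustrative stand-in for the NVIDIA adapter
    def __init__(self, base_url: str, api_key: str) -> None:
        self._base_url = base_url
        self._api_key = api_key

    @property
    def _client(self) -> OpenAI:
        # No per-model caching: every model, including the vision models,
        # is served from the same standard URL.
        return OpenAI(base_url=self._base_url, api_key=self._api_key)
```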
# What does this PR do?
Adding OpenAI Vector Stores Files API compatibility for PGVector
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
Updated CI to include PGVector
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
Adds new documentation that was missing for the Llama Stack Python
Client as well as updates old/outdated docs
# What does this PR do?
https://github.com/meta-llama/llama-stack/pull/2490 introduced a new
function for type conversion of strings.
However, a side effect of this is that it will cast any string that can
be cast to an integer, which for something like `image_name` is not
desired, as we only accept strings for this field in the
`StackRunConfig`.
This PR introduces logic to ensure that `image_name` remains a string
Closes #2749
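A hedged sketch of the guard (illustrative; the real conversion helper handles more cases):
```python
def convert_env_value(key: str, value: str):
    """Illustrative: keep image_name as a string even when it looks numeric."""
    if key == "image_name":
        return value  # "2745" must stay a string for StackRunConfig
    try:
        return int(value)
    except ValueError:
        return value

assert convert_env_value("image_name", "2745") == "2745"
```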
## Test Plan
You can run the original step to reproduce from the bug to verify this
manually
```bash
OPENAI_API_KEY=bogus llama stack build --image-type venv --image-name 2745 --providers inference=remote::openai --run
```
I have also added an additional unit test to prevent any future
regression here
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
Resolves https://github.com/meta-llama/llama-stack/issues/2735
Currently, if you test against OpenAI's Vector Stores API the
`client.vector_stores.search` call fails with an invalid vector_db
during routing (see the script referenced in the clickable item under
the Test Plan section).
This PR ensures that `client.vector_stores.search()` is compatible with
OpenAI's Vector Stores API.
Two biggest changes:
1. The `name`, which was previously used as the `vector_db_id`, has been
changed to be consistent with OpenAI's `vs_{uuid}` format.
2. The vector store has to be referenced by its ID; the name is not
reliable, as every `client.vector_stores.create` call results in a new
vector store.
NOTE: I believe this is a breaking change for end users as they'll need
to update their VectorDB identifiers.
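For illustration, a hypothetical walkthrough of the new behavior (the endpoint and names are placeholders):
```python
from openai import OpenAI

# Stores are addressed by their generated "vs_..." ID, never by name.
client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

vs = client.vector_stores.create(name="Support FAQ")
print(vs.id)  # e.g. "vs_1a2b3c..." -- generated, not derived from the name

# A second create() with the same name yields a *new* store, so searches
# must pass the ID, not the name.
results = client.vector_stores.search(vector_store_id=vs.id, query="return policy")
```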
## Test Plan
Unit tests:
```bash
./scripts/unit-tests.sh tests/unit/providers/vector_io/ -v
```
Integration tests:
```bash
ENABLE_MILVUS=milvus llama stack run /Users/farceo/dev/llama-stack/llama_stack/templates/starter/run.yaml --image-type venv
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -sv tests/integration/vector_io/test_openai_vector_stores.py --embedding-model=all-MiniLM-L6-v2 -vv
```
Unit tests and test script below 👇
<details>
<summary>Click here for script used to test OpenAI and Llama Stack
Vector Store implementation</summary>
```python
import json
import argparse
from openai import OpenAI, pagination
import logging
from colorama import Fore, Style, init
import traceback
import os

# Initialize colorama for color support in terminal
init(autoreset=True)

# Setup basic logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

DEMO_VECTOR_STORE_NAME = "Support FAQ FJA"
global DEMO_VECTOR_STORE_ID
global DEMO_VECTOR_STORE_ID2


def colored_print(color, text):
    """Prints text to the console with the specified color."""
    print(f"{color}{text}{Style.RESET_ALL}")


def log_and_print(color, message, level=logging.INFO):
    """Logs a message and prints it to the console with the specified color."""
    logging.log(level, message)
    colored_print(color, message)


def run_tests(client, prefix="openai"):
    """
    Runs all tests using the provided OpenAI client and saves the output
    to JSON files with the given prefix.
    """
    # Create the directory if it doesn't exist
    os.makedirs('openai_testing', exist_ok=True)

    # Default values in case tests fail
    global DEMO_VECTOR_STORE_ID, DEMO_VECTOR_STORE_ID2
    DEMO_VECTOR_STORE_ID = None
    DEMO_VECTOR_STORE_ID2 = None

    def test_idempotent_vector_store_creation():
        """
        Test that creating a vector store with the same name is idempotent.
        """
        log_and_print(Fore.BLUE, "Starting vector store creation test...")
        try:
            vector_store = client.vector_stores.create(
                name=DEMO_VECTOR_STORE_NAME,
            )
            # Attempt to create the same vector store again
            vector_store2 = client.vector_stores.create(
                name=DEMO_VECTOR_STORE_NAME,
            )
            # Check instead of assert
            if vector_store2.id != vector_store.id:
                log_and_print(Fore.YELLOW, f"FAILED IDEMPOTENCY: the same VectorStore name for {prefix.upper()} does not return the same ID",
                              level=logging.WARNING)
            else:
                log_and_print(Fore.GREEN, f"PASSED IDEMPOTENCY: f{vector_store2.id} == {vector_store.id} the same VectorStore name for {prefix.upper()} returns the same ID")
            vector_store_data = vector_store.to_dict()
            log_and_print(Fore.WHITE, f"vector_stores.create = {json.dumps(vector_store_data, indent=2)}")
            with open(f'openai_testing/{prefix}_vector_store_create.json', 'w') as f:
                json.dump(vector_store_data, f, indent=2)

            global DEMO_VECTOR_STORE_ID, DEMO_VECTOR_STORE_ID2
            DEMO_VECTOR_STORE_ID = vector_store.id
            DEMO_VECTOR_STORE_ID2 = vector_store2.id
            return DEMO_VECTOR_STORE_ID, DEMO_VECTOR_STORE_ID2
        except Exception as e:
            log_and_print(Fore.RED, f"Idempotent vector store creation test failed: {e}", level=logging.ERROR)
            logging.error(traceback.format_exc())
            # Create a fallback vector store ID if needed
            if 'vector_store' in locals() and vector_store:
                DEMO_VECTOR_STORE_ID = vector_store.id
            return DEMO_VECTOR_STORE_ID, DEMO_VECTOR_STORE_ID2

    def test_vector_store_list():
        """
        Test listing vector stores.
        """
        log_and_print(Fore.BLUE, "Starting vector store list test...")
        try:
            vector_stores = client.vector_stores.list()
            # Check instead of assert
            if not isinstance(vector_stores, pagination.SyncCursorPage):
                log_and_print(Fore.YELLOW, f"FAILED: Expected a list of vector stores, got {type(vector_stores)}",
                              level=logging.WARNING)
            else:
                log_and_print(Fore.GREEN, "Vector store list test passed!")
            vector_stores_data = vector_stores.to_dict()
            log_and_print(Fore.WHITE, f"vector_stores.list = {json.dumps(vector_stores_data, indent=2)}")
            with open(f'openai_testing/{prefix}_vector_store_list.json', 'w') as f:
                json.dump(vector_stores_data, f, indent=2)
        except Exception as e:
            log_and_print(Fore.RED, f"Vector store list test failed: {e}", level=logging.ERROR)
            logging.error(traceback.format_exc())

    def test_retrieve_vector_store():
        """
        Test retrieving a specific vector store.
        """
        log_and_print(Fore.BLUE, "Starting retrieve vector store test...")
        if not DEMO_VECTOR_STORE_ID:
            log_and_print(Fore.YELLOW, "Skipping retrieve vector store test - no vector store ID available",
                          level=logging.WARNING)
            return
        try:
            vector_store = client.vector_stores.retrieve(
                vector_store_id=DEMO_VECTOR_STORE_ID,
            )
            # Check instead of assert
            if vector_store.id != DEMO_VECTOR_STORE_ID:
                log_and_print(Fore.YELLOW, "FAILED: Retrieved vector store ID does not match", level=logging.WARNING)
            else:
                log_and_print(Fore.GREEN, "Retrieve vector store test passed!")
            vector_store_data = vector_store.to_dict()
            log_and_print(Fore.WHITE, f"vector_stores.retrieve = {json.dumps(vector_store_data, indent=2)}")
            with open(f'openai_testing/{prefix}_vector_store_retrieve.json', 'w') as f:
                json.dump(vector_store_data, f, indent=2)
        except Exception as e:
            log_and_print(Fore.RED, f"Retrieve vector store test failed: {e}", level=logging.ERROR)
            logging.error(traceback.format_exc())

    def test_modify_vector_store():
        """
        Test modifying a vector store.
        """
        log_and_print(Fore.BLUE, "Starting modify vector store test...")
        if not DEMO_VECTOR_STORE_ID:
            log_and_print(Fore.YELLOW, "Skipping modify vector store test - no vector store ID available",
                          level=logging.WARNING)
            return
        try:
            updated_vector_store = client.vector_stores.update(
                vector_store_id=DEMO_VECTOR_STORE_ID,
                name="Updated Support FAQ FJA",
            )
            # Check instead of assert
            if updated_vector_store.name != "Updated Support FAQ FJA":
                log_and_print(Fore.YELLOW, "FAILED: Vector store name was not updated correctly", level=logging.WARNING)
            else:
                log_and_print(Fore.GREEN, "Modify vector store test passed!")
            updated_vector_store_data = updated_vector_store.to_dict()
            log_and_print(Fore.WHITE, f"vector_stores.modify = {json.dumps(updated_vector_store_data, indent=2)}")
            with open(f'openai_testing/{prefix}_vector_store_modify.json', 'w') as f:
                json.dump(updated_vector_store_data, f, indent=2)
        except Exception as e:
            log_and_print(Fore.RED, f"Modify vector store test failed: {e}", level=logging.ERROR)
            logging.error(traceback.format_exc())

    def test_delete_vector_store():
        """
        Test deleting a vector store.
        """
        log_and_print(Fore.BLUE, "Starting delete vector store test...")
        if not DEMO_VECTOR_STORE_ID2:
            log_and_print(Fore.YELLOW, "Skipping delete vector store test - no second vector store ID available",
                          level=logging.WARNING)
            return
        try:
            response = client.vector_stores.delete(
                vector_store_id=DEMO_VECTOR_STORE_ID2,
            )
            log_and_print(Fore.GREEN, "Delete vector store test passed!")
            response_data = response.to_dict()
            log_and_print(Fore.WHITE, f"Vector store delete response = {json.dumps(response_data, indent=2)}")
            with open(f'openai_testing/{prefix}_vector_store_delete.json', 'w') as f:
                json.dump(response_data, f, indent=2)
        except Exception as e:
            log_and_print(Fore.RED, f"Delete vector store test failed: {e}", level=logging.ERROR)
            logging.error(traceback.format_exc())

    def test_create_vector_store_file():
        log_and_print(Fore.BLUE, "Starting create vector store file test...")
        if not DEMO_VECTOR_STORE_ID:
            log_and_print(Fore.YELLOW, "Skipping create vector store file test - no vector store ID available",
                          level=logging.WARNING)
            return
        try:
            # create jsonl of files as an example
            with open("mydata.jsonl", "w") as f:
                f.write('{"text": "What is the return policy?", "metadata": {"category": "support"}}\n')
                f.write('{"text": "How do I reset my password?", "metadata": {"category": "support"}}\n')
                f.write('{"text": "Where can I find my order history?", "metadata": {"category": "support"}}\n')
                f.write('{"text": "What are the shipping options?", "metadata": {"category": "support"}}\n')
                f.write('{"text": "What is your favorite banana?", "metadata": {"category": "support"}}\n')
            # Create a simple text file if my_data_small.txt doesn't exist
            if not os.path.exists("my_data_small.txt"):
                with open("my_data_small.txt", "w") as f:
                    f.write("This is a test file for vector store testing.\n")
            created_file = client.files.create(
                file=open("my_data_small.txt", "rb"),
                purpose="assistants",
            )
            created_file_data = created_file.to_dict()
            log_and_print(Fore.WHITE, f"Created file {json.dumps(created_file_data, indent=2)}")
            with open(f'openai_testing/{prefix}_file_create.json', 'w') as f:
                json.dump(created_file_data, f, indent=2)
            retrieved_files = client.files.retrieve(created_file.id)
            retrieved_files_data = retrieved_files.to_dict()
            log_and_print(Fore.WHITE, f"Retrieved file {json.dumps(retrieved_files_data, indent=2)}")
            with open(f'openai_testing/{prefix}_file_retrieve.json', 'w') as f:
                json.dump(retrieved_files_data, f, indent=2)
            vector_store_file = client.vector_stores.files.create(
                vector_store_id=DEMO_VECTOR_STORE_ID,
                file_id=created_file.id,
            )
            log_and_print(Fore.GREEN, "Create vector store file test passed!")
        except Exception as e:
            log_and_print(Fore.RED, f"Create vector store file test failed: {e}", level=logging.ERROR)
            logging.error(traceback.format_exc())

    def test_search_vector_store():
        """
        Test searching a vector store.
        """
        log_and_print(Fore.BLUE, "Starting search vector store test...")
        if not DEMO_VECTOR_STORE_ID:
            log_and_print(Fore.YELLOW, "Skipping search vector store test - no vector store ID available",
                          level=logging.WARNING)
            return
        try:
            query = "What is the banana policy?"
            search_results = client.vector_stores.search(
                vector_store_id=DEMO_VECTOR_STORE_ID,
                query=query,
                max_num_results=10,
                ranking_options={
                    'ranker': 'default-2024-11-15',
                    'score_threshold': 0.0,
                },
                rewrite_query=False,
            )
            # Check instead of assert
            if not isinstance(search_results, pagination.SyncPage):
                log_and_print(Fore.YELLOW, f"FAILED: Expected a list of search results, got {type(search_results)}",
                              level=logging.WARNING)
            else:
                log_and_print(Fore.GREEN, "Search vector store test passed!")
            search_results_dict = search_results.to_dict()
            log_and_print(Fore.WHITE, f"Search results = {search_results_dict}")
            with open(f'openai_testing/{prefix}_vector_store_search.json', 'w') as f:
                json.dump(search_results_dict, f, indent=2)
            log_and_print(Fore.WHITE, f"vector_stores.search = {search_results.to_json()}")
        except Exception as e:
            log_and_print(Fore.RED, f"Search vector store test failed: {e}", level=logging.ERROR)
            logging.error(traceback.format_exc())

    # Run all tests in sequence, even if some fail
    test_results = []

    try:
        result = test_idempotent_vector_store_creation()
        if result and len(result) == 2:
            DEMO_VECTOR_STORE_ID, DEMO_VECTOR_STORE_ID2 = result
        test_results.append(True)
    except Exception as e:
        log_and_print(Fore.RED, f"Vector store creation test failed: {e}", level=logging.ERROR)
        logging.error(traceback.format_exc())
        test_results.append(False)

    for test_func in [
        test_vector_store_list,
        test_retrieve_vector_store,
        test_modify_vector_store,
        test_delete_vector_store,
        test_create_vector_store_file,
        test_search_vector_store
    ]:
        try:
            test_func()
            test_results.append(True)
        except Exception as e:
            log_and_print(Fore.RED, f"{test_func.__name__} failed: {e}", level=logging.ERROR)
            logging.error(traceback.format_exc())
            test_results.append(False)

    if all(test_results):
        log_and_print(Fore.GREEN, f"All {prefix} tests completed successfully!")
    else:
        failed_count = test_results.count(False)
        log_and_print(Fore.YELLOW, f"{failed_count} {prefix} test(s) failed, but script completed.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run OpenAI and/or LlamaStack tests.")
    parser.add_argument(
        "--provider",
        type=str,
        default="llama",
        choices=["openai", "llama", "both"],
        help="Specify which environment to test: openai, llama, or both. Default is both.",
    )
    args = parser.parse_args()

    try:
        if args.provider in ("openai", "both"):
            openai_client = OpenAI()
            run_tests(openai_client, prefix="openai")
        if args.provider in ("llama", "both"):
            llama_client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")
            run_tests(llama_client, prefix="llama")
        log_and_print(Fore.GREEN, "All tests completed!")
    except Exception as e:
        log_and_print(Fore.RED, f"Tests failed to complete: {e}", level=logging.ERROR)
        logging.error(traceback.format_exc())
```
</details>
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
This PR adds the keyword search implementation for Milvus. Along with
the implementation for remote Milvus, the tests require us to start a
Milvus container locally.
In order to verify the implementation, run:
```
pytest tests/unit/providers/vector_io/remote/test_milvus.py -v -s --tb=short --disable-warnings --asyncio-mode=auto
```
You can also test the changes using the below script:
```
#!/usr/bin/env python3
import asyncio
import os
import uuid
from typing import List

from llama_stack_client import (
    Agent,
    AgentEventLogger,
    LlamaStackClient,
    RAGDocument
)


class MilvusRAGDemo:
    def __init__(self, base_url: str = "http://localhost:8321/"):
        self.client = LlamaStackClient(base_url=base_url)
        self.vector_db_id = f"milvus_rag_demo_{uuid.uuid4().hex[:8]}"
        self.model_id = None
        self.embedding_model_id = None
        self.embedding_dimension = None

    def setup_models(self):
        """Get available models and select appropriate ones for LLM and embeddings."""
        models = self.client.models.list()
        # Select embedding model
        embedding_models = [m for m in models if m.model_type == "embedding"]
        if not embedding_models:
            raise ValueError("No embedding models found")
        self.embedding_model_id = embedding_models[0].identifier
        self.embedding_dimension = embedding_models[0].metadata["embedding_dimension"]

    def register_vector_db(self):
        print(f"Registering Milvus vector database: {self.vector_db_id}")
        response = self.client.vector_dbs.register(
            vector_db_id=self.vector_db_id,
            embedding_model=self.embedding_model_id,
            embedding_dimension=self.embedding_dimension,
            provider_id="milvus-remote",  # Use remote Milvus
        )
        print(f"Vector database registered successfully")
        return response

    def insert_documents(self):
        """Insert sample documents into the vector database."""
        print("\nInserting sample documents...")
        # Sample documents about different topics
        documents = [
            RAGDocument(
                document_id="ai_ml_basics",
                content="""
                Artificial Intelligence (AI) and Machine Learning (ML) are transforming the world.
                AI refers to the simulation of human intelligence in machines, while ML is a subset
                of AI that enables computers to learn and improve from experience without being
                explicitly programmed. Deep learning, a subset of ML, uses neural networks with
                multiple layers to process complex patterns in data.

                Key concepts in AI/ML include:
                - Supervised Learning: Training with labeled data
                - Unsupervised Learning: Finding patterns in unlabeled data
                - Reinforcement Learning: Learning through trial and error
                - Neural Networks: Computing systems inspired by biological brains
                """,
                mime_type="text/plain",
                metadata={"topic": "technology", "category": "ai_ml"},
            ),
        ]
        # Insert documents with chunking
        self.client.tool_runtime.rag_tool.insert(
            documents=documents,
            vector_db_id=self.vector_db_id,
            chunk_size_in_tokens=200,  # Smaller chunks for better granularity
        )
        print(f"Inserted {len(documents)} documents with chunking")

    def test_keyword_search(self):
        """Test keyword-based search using BM25."""
        queries = [
            "neural networks",
            "Python frameworks",
            "data cleaning",
        ]
        for query in queries:
            response = self.client.vector_io.query(
                vector_db_id=self.vector_db_id,
                query=query,
                params={
                    "mode": "keyword",  # Keyword search
                    "max_chunks": 3,
                    "score_threshold": 0.0,
                }
            )
            for i, (chunk, score) in enumerate(zip(response.chunks, response.scores)):
                print(f" {i+1}. Score: {score:.4f}")
                print(f" Content: {chunk.content[:100]}...")
                print(f" Metadata: {chunk.metadata}")

    def run_demo(self):
        try:
            self.setup_models()
            self.register_vector_db()
            self.insert_documents()
            self.test_keyword_search()
        except Exception as e:
            print(f"Error during demo: {e}")
            raise


def main():
    """Main function to run the demo."""
    # Check if Llama Stack server is running
    demo = MilvusRAGDemo()
    try:
        demo.run_demo()
    except Exception as e:
        print(f"Demo failed: {e}")


if __name__ == "__main__":
    main()
```
[//]: # (## Documentation)
---------
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
- fireworks, together do not support Llama-guard 3 8b model anymore
- Need to default to ollama
- current safety shields logic was not correct since the shield_id was
the provider (which had duplicates)
- Followed similar logic to models
Note: Seems a bit over-engineered but this can now be extended to other
providers and fits in the overall mechanism of how env_vars are used to
manage starter.
### How to test
```
ENABLE_OLLAMA=ollama ENABLE_FIREWORKS=fireworks SAFETY_MODEL=llama-guard3:1b pytest -s -v tests/integration/ --stack-config starter -k 'not(supervised_fine_tune or builtin_tool_code or safety_with_image or code_interpreter_for or rag_and_code or truncation or register_and_unregister)' --text-model fireworks/meta-llama/Llama-3.3-70B-Instruct --vision-model fireworks/meta-llama/Llama-4-Scout-17B-16E-Instruct --safety-shield llama-guard3:1b --embedding-model all-MiniLM-L6-v2
```
### Related but not obvious in this PR
In the llama-stack-ops repo, we run tests before publishing packages and
docker containers.
The actions in that repo were using the fireworks / together distros
(which are non-existent), so we need to update them to run with
`starter` and use `ollama` specifically for safety.
# What does this PR do?
Adds a template for technical debt. Currently we don't support blank
issues, so everything filed has to be a bug or a feature.
This would allow maintainers as well as community members to track
things we might want to merge to expose the functionality but should be
addressed later. Such things can also be "good first issues" for new
contributors.
## Example of what we constitute as technical debt
Inelegant code solutions, tests we intend to temporarily disable but
would like to restore, CI hacks around infrastructure or installation,
etc.
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
inference providers each have a static list of supported / known models.
some also have access to a dynamic list of currently available models.
this change gives providers using the ModelRegistryHelper the ability to
combine their static and dynamic lists.
for instance, OpenAIInferenceAdapter can implement
```
def query_available_models(self) -> list[str]:
    return [entry.model for entry in self.openai_client.models.list()]
```
to augment its static list w/ a current list from openai.
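A rough sketch of how the static and dynamic lists can then be merged (illustrative helper; see ModelRegistryHelper for the real mechanics):
```python
# Sketch: combine a provider's static model list with whatever the backing
# service reports right now, de-duplicating while preserving order.
def combine_model_ids(static_ids: list[str], dynamic_ids: list[str]) -> list[str]:
    seen: set[str] = set()
    combined: list[str] = []
    for model_id in [*static_ids, *dynamic_ids]:
        if model_id not in seen:
            seen.add(model_id)
            combined.append(model_id)
    return combined
```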
## Test Plan
scripts/unit-test.sh
Remove both the metadata and content from the kvstore when a file is
being removed from the vector store.
Closes: #2685
Also add faiss provider to openai_vector_stores test suite
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: raghotham <rsm@meta.com>
# What does this PR do?
This PR improves documentation clarity around run.yaml file usage. It
adds comprehensive guidance to help users understand that generated
run.yaml files are templates meant to be customized for production use,
not used as-is.
## Changes
- Add new documentation section on customizing run.yaml files
- Clarify that generated run.yaml files are templates, not production
configs
- Add guidance on customization best practices and common scenarios
- Update existing documentation to reference customization guide
- Improve clarity around run.yaml file usage for better user experience
## Test Plan
- Verified new documentation file exists at correct location
- Confirmed documentation is properly integrated into the toctree
structure
- Checked all internal links use correct paths and reference existing
files
- Validated references are added to relevant existing documentation
files
- Documentation build testing will be handled by CI environment
# What does this PR do?
Adds input validation for mode in RagQueryConfig
This will prevent users from inputting search modes other than `vector`
and `keyword` for the time being, with `hybrid` to follow when that
functionality is implemented.
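A minimal sketch of the validation (pydantic-style; the `mode` field and error message are from this PR, the rest is illustrative):
```python
from pydantic import BaseModel, field_validator

class RagQueryConfig(BaseModel):
    # illustrative subset of the real config
    mode: str = "vector"
    max_chunks: int = 5

    @field_validator("mode")
    @classmethod
    def _check_mode(cls, v: str) -> str:
        if v not in ("vector", "keyword"):
            raise ValueError(
                "mode must be either 'vector' or 'keyword' if supported by the vector_io provider"
            )
        return v
```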
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
```
# Check out this PR and enter the LS directory
uv sync --extra dev
```
Run the quickstart
[example](https://llama-stack.readthedocs.io/en/latest/getting_started/#step-3-run-the-demo)
Alter the Agent to include a query_config
```
agent = Agent(
    client,
    model=model_id,
    instructions="You are a helpful assistant",
    tools=[
        {
            "name": "builtin::rag/knowledge_search",
            "args": {
                "vector_db_ids": [vector_db_id],
                "query_config": {
                    "mode": "i-am-not-vector",  # Test for an invalid search mode
                    "max_chunks": 6
                }
            },
        }
    ],
)
```
Ensure you get the following error:
```
400: {'errors': [{'loc': ['mode'], 'msg': "Value error, mode must be either 'vector' or 'keyword' if supported by the vector_io provider", 'type': 'value_error'}]}
```
## Running unit tests
```
uv sync --extra dev
uv run pytest tests/unit/rag/test_rag_query.py -v
```
[//]: # (## Documentation)
# What does this PR do?
- Adds two pages to UI
- Vector stores
- Vector store detail view
- Fixed darkmode navbar highlighting
- Updated darkmode font color
- Updated llama-stack-client package
<img width="1916" height="734" alt="Screenshot 2025-07-12 at 11 34
35 PM"
src="https://github.com/user-attachments/assets/3f9b6727-ee82-4e6b-9555-2e3ef36d24d2"
/>
<img width="1912" height="910" alt="Screenshot 2025-07-12 at 11 57
09 PM"
src="https://github.com/user-attachments/assets/0c9d3b5e-5592-4dfb-8e04-a57edc9fb406"
/>
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
this blocks network access for all `tests/unit/` tests.
`tests/integration/` are untouched.
it also introduces an `allow_network` marker to explicitly allow network
access.
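Usage looks roughly like this (a sketch; the `allow_network` marker name is from this PR, the test itself is made up):
```python
import pytest

@pytest.mark.allow_network  # opt a specific unit test back into real network access
def test_can_reach_local_service():
    # illustrative body; unit tests are otherwise network-blocked by default
    assert True
```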
## Test Plan
`./scripts/unit-tests.sh`
# What does this PR do?
Some of our inference providers support passthrough authentication via
`x-llamastack-provider-data` header values. This fixes the providers
that support passthrough auth to not cache their clients to the backend
providers (mostly OpenAI client instances) so that the client connecting
to Llama Stack has to provide those auth values on each and every
request.
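A hedged sketch of the pattern (illustrative adapter; each fixed provider's actual code differs):
```python
# Sketch: build the backend client per request from the passthrough auth
# value instead of caching it on the adapter instance.
from openai import OpenAI

class ExampleInferenceAdapter:  # illustrative name
    def _get_client(self, api_key: str) -> OpenAI:
        # No lru_cache / instance attribute here: the key may come from the
        # x-llamastack-provider-data header and change on every request.
        return OpenAI(base_url="https://api.groq.com/openai/v1", api_key=api_key)
```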
## Test Plan
I added some unit tests to ensure we're not caching clients across
requests for all the fixed providers in this PR.
```
uv run pytest -sv tests/unit/providers/inference/test_inference_client_caching.py
```
I also ran some of our OpenAI compatible API integration tests for each
of the changed providers, just to ensure they still work. Note that
these providers don't actually pass all these tests (for unrelated
reasons due to quirks of the Groq and Together SaaS services), but
enough of the tests passed to confirm the clients are still working as
intended.
### Together
```
ENABLE_TOGETHER="together" \
uv run llama stack run llama_stack/templates/starter/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
uv run pytest -sv \
tests/integration/inference/test_openai_completion.py \
--text-model "together/meta-llama/Llama-3.1-8B-Instruct"
```
### OpenAI
```
ENABLE_OPENAI="openai" \
uv run llama stack run llama_stack/templates/starter/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
uv run pytest -sv \
tests/integration/inference/test_openai_completion.py \
--text-model "openai/gpt-4o-mini"
```
### Groq
```
ENABLE_GROQ="groq" \
uv run llama stack run llama_stack/templates/starter/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
uv run pytest -sv \
tests/integration/inference/test_openai_completion.py \
--text-model "groq/meta-llama/Llama-3.1-8B-Instruct"
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Update the Sambanova shield register validation not to raise, but only
warn when a model is not available at the base URL endpoint used.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
run starter distro with Sambanova enabled
# What does this PR do?
Previously, developers who ran `./scripts/unit-tests.sh` would get
`asyncio-mode=auto`, which meant `@pytest.mark.asyncio` and
`@pytest_asyncio.fixture` were redundant. Developers who ran `pytest`
directly would get pytest's default (strict mode) and would run into
errors, leading them to add `@pytest.mark.asyncio` /
`@pytest_asyncio.fixture` to their code.
with this change -
- `asyncio_mode=auto` is included in `pyproject.toml` making behavior
consistent for all invocations of pytest
- removes all redundant `@pytest_asyncio.fixture` and
`@pytest.mark.asyncio`
- for good measure, requires `pytest>=8.4` and `pytest-asyncio>=1.0`
## Test Plan
- `./scripts/unit-tests.sh`
- `uv run pytest tests/unit`
# What does this PR do?
Otherwise we can get old versions like 1.11 and experience this error:
```
ModuleNotFoundError: No module named 'opentelemetry.exporter.otlp.proto.http.metric_exporter'
```
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
COPY with chmod does not work, see
https://github.com/containers/buildah/issues/4614, although Docker
arguably implements it.
Anyway, this command is not even needed since later we do:
```
RUN mkdir -p /.llama /.cache && chmod -R g+rw /app /.llama /.cache
```
And providers.d will get the right modes.
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
Build with CONTAINER_BINARY=podman and it succeeds.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The current authorized sql store implementation does not respect
user.principal (only checks attributes). This PR addresses that.
## Test Plan
Added test cases to integration tests.
# What does this PR do?
the "rfc" directory has only a single document in it, and it's the
original RFC for creating Llama Stack.
Simplify the project directory structure by moving this into the "docs"
directory and renaming it to "original_rfc" to preserve the context of
the doc.
## Why did you do this?
A simplified top-level directory structure helps keep the project
simpler and prevents misleading new contributors into thinking we use it
(we really don't)
---------
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
Co-authored-by: raghotham <raghotham@gmail.com>
# What does this PR do?
"install.sh" is something that a general user might not use e.g. it is
specific to using the "ollama" inference provider
clean up the top-level structure of the repo by moving it into the
"scripts" dir and updating the relevant references accordingly
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
"dev" dependencies were moved in pyproject.toml
typo with guidance around automatic doc generation
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
This PR refactors the VectorIO backend logic for `sqlite-vec` and
adds unit tests and fixtures to make it easy to test both `sqlite-vec`
and `milvus`.
Key changes:
- `sqlite-vec` migrated to `kvstore` registry
- added in-memory cache for sqlite-vec to be consistent with `milvus`
- default fixtures moved to `conftest.py`
- removed redundant tests from `sqlite-vec`
- made `test_vector_io_openai_vector_stores.py` more easily extensible
## Test Plan
Unit tests added testing inline providers.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
The project already had some config set up for https://pre-commit.ci/;
this commit adds additional explicit fields.
Closes #2711
**IMPORTANT:** A project maintainer must add `pre-commit.ci` to this
repo for this to work - this can be done via https://pre-commit.ci/
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
While investigating the `uv.lock` changes made in
https://github.com/meta-llama/llama-stack/pull/2695 I noticed several of
the pre-commit hook versions were out of date
This PR updates them and fixes some new `ruff` errors
---------
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
Currently, when logging the run yaml, any path objects in the config are
represented as:
```
external_providers_dir: !!python/object/apply:pathlib.PosixPath
- '~'
- .llama
- providers.d
```
now, with a config.model_dump(mode="json"), it works properly
```
external_providers_dir: ~/.llama/providers.d
```
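i.e. roughly (a small self-contained sketch; the config class here is a stand-in for the run config):
```python
import pathlib
import yaml
from pydantic import BaseModel

class ExampleConfig(BaseModel):  # stand-in for the run config model
    external_providers_dir: pathlib.Path

cfg = ExampleConfig(external_providers_dir=pathlib.Path("~/.llama/providers.d"))

# mode="json" converts Path (and similar types) to plain strings first,
# so the logged YAML stays readable.
print(yaml.dump(cfg.model_dump(mode="json"), sort_keys=False))
```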
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
We are now automatically building the list of integration tests to run.
In that process, eval and files are now being tested.
This is pending https://github.com/meta-llama/llama-stack/pull/2628
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Terminate server process for real.
## Test Plan
```ENABLE_OPENAI=openai LLAMA_STACK_CONFIG=server:starter pytest -v tests/integration/agents/test_openai_responses.py --text-model "gpt-4o-mini" -vv -s -k 'test_list_response_input_items[' && lsof -ti:8321```
observe no process printed anymore
# What does this PR do?
`llama stack run starter` in a conda environment fails with '--config is
required for venv and conda environments' because it is passed as
--template and start_stack.sh doesn't process templates.
## Test Plan
`llama stack run starter`
# What does this PR do?
`python-jose` recommends using the `cryptography` backend in their
installation docs:
https://github.com/mpdavis/python-jose?tab=readme-ov-file#cryptographic-backends
This PR modifies the LLS dependencies to use this instead of the current
`native-python`
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
We are now testing the safety capability with the starter image. This
includes a few changes:
* Enable the safety integration test
* Relax the shield model requirements from llama-guard to make it work
with llama-guard3:8b coming from Ollama
* Expose a shield for each inference provider in the starter distro. The
shield will only be registered if the provider is enabled.
Closes: https://github.com/meta-llama/llama-stack/issues/2528
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
Updates some broken or outdated links pointing to the Android Demo App
Signed-off-by: Jorge Garcia Oncins <jgarciao@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack/apis`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
- Fix constructor call missing files_api parameter
- Add kvstore field to MilvusVectorIOConfig
- Resolves #2626
# What does this PR do?
[https://github.com/meta-llama/llama-stack/issues/2626]
## Problem
The `MilvusVectorIOAdapter` fails to initialize due to two missing
configuration issues:
1. Missing `files_api` parameter in the constructor call
2. Missing `kvstore` field in the `MilvusVectorIOConfig` class
## Root Cause
1. The adapter constructor expects 3 parameters `(config, inference_api,
files_api)` but the `get_adapter_impl` function only passes 2 parameters
2. The `MilvusVectorIOConfig` class lacks the `kvstore` field that the
adapter's `initialize()` method expects for metadata persistence
## Solution
- Added `files_api = deps.get(Api.files, None)` to safely retrieve files
API from dependencies
- Pass the files_api parameter to MilvusVectorIOAdapter constructor
- Added `kvstore: KVStoreConfig | None = None` field to
MilvusVectorIOConfig
- Maintains backward compatibility since both files_api and kvstore can
be None
Closes #2626
## Test Plan
- [x] Tested with Milvus configuration - server starts successfully
```yaml
vector_io:
  - provider_id: milvus
    provider_type: remote::milvus
    config:
      uri: http://localhost:19530
      token: root:Milvus
      kvstore:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/remote-vllm}/milvus_store.db
```
- [x] Vector operations work as expected
```python
from llama_stack_client import LlamaStackClient
from llama_stack_client.types.shared_params.document import Document as RAGDocument
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger as AgentEventLogger
import os

endpoint = os.getenv("LLAMA_STACK_ENDPOINT")
model = os.getenv("INFERENCE_MODEL")

# Initialize the client
client = LlamaStackClient(base_url=endpoint)

vector_db_id = "my_documents"

response = client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
    provider_id="milvus",
)

urls = ["getting_started/Red_Hat_AI_Inference_Server-3.0-Getting_started-en-US.pdf", "vllm_server_arguments/Red_Hat_AI_Inference_Server-3.0-vLLM_server_arguments-en-US.pdf"]
documents = [
    RAGDocument(
        document_id=f"num-{i}",
        content=f"https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/3.0/pdf/{url}",
        mime_type="application/pdf",
        metadata={},
    )
    for i, url in enumerate(urls)
]

client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=512,
)

rag_agent = Agent(
    client,
    model=model,
    # Define instructions for the agent (system prompt)
    instructions="You are a helpful assistant",
    enable_session_persistence=False,
    # Define tools available to the agent
    tools=[
        {
            "name": "builtin::rag/knowledge_search",
            "args": {
                "vector_db_ids": [vector_db_id],
            },
        }
    ],
)

session_id = rag_agent.create_session("test-session")

user_prompts = [
    "How to start the AI Inference Server container image? use the knowledge_search tool to get information.",
]

for prompt in user_prompts:
    print(f"User> {prompt}")
    response = rag_agent.create_turn(
        messages=[{"role": "user", "content": prompt}],
        session_id=session_id,
    )
    for log in AgentEventLogger().log(response):
        log.print()
```
server logs:
```
INFO 2025-07-04 22:18:30,385 __main__:577 server: Listening on ['::', '0.0.0.0']:5000
INFO: Started server process [769725]
INFO: Waiting for application startup.
INFO 2025-07-04 22:18:30,390 __main__:158 server: Starting up
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit)
INFO 2025-07-04 22:18:52,193 llama_stack.distribution.routing_tables.common:200 core: Setting owner for vector_db 'my_documents' to
20:18:52.194 [START] /v1/vector-dbs
INFO: 192.168.1.249:64170 - "POST /v1/vector-dbs HTTP/1.1" 200 OK
20:18:52.216 [END] /v1/vector-dbs [StatusCode.OK] (21.89ms)
20:18:52.222 [START] /v1/tool-runtime/rag-tool/insert
INFO 2025-07-04 22:18:56,265 llama_stack.providers.utils.inference.embedding_mixin:102 uncategorized: Loading sentence transformer for
all-MiniLM-L6-v2...
WARNING 2025-07-04 22:18:59,214 opentelemetry.trace:537 uncategorized: Overriding of current TracerProvider is not allowed
INFO 2025-07-04 22:18:59,339 sentence_transformers.SentenceTransformer:219 uncategorized: Use pytorch device_name: cuda:0
INFO 2025-07-04 22:18:59,340 sentence_transformers.SentenceTransformer:227 uncategorized: Load pretrained SentenceTransformer: all-MiniLM-L6-v2
INFO: 192.168.1.249:64170 - "POST /v1/tool-runtime/rag-tool/insert HTTP/1.1" 200 OK
INFO: 192.168.1.249:64170 - "POST /v1/agents HTTP/1.1" 200 OK
INFO: 192.168.1.249:64170 - "GET /v1/tools?toolgroup_id=builtin%3A%3Arag%2Fknowledge_search HTTP/1.1" 200 OK
INFO: 192.168.1.249:64170 - "POST /v1/agents/b1f6f063-1691-4780-8d9e-facd81708b91/session HTTP/1.1" 200 OK
20:19:01.834 [END] /v1/tool-runtime/rag-tool/insert [StatusCode.OK] (9612.06ms)
20:19:01.839 [START] /v1/agents
INFO: 192.168.1.249:64170 - "POST /v1/agents/b1f6f063-1691-4780-8d9e-facd81708b91/session/d2706302-bb54-421d-a890-5e25df9cb47f/turn HTTP/1.1" 200 OK
20:19:01.839 [END] /v1/agents [StatusCode.OK] (0.18ms)
20:19:01.844 [START] /v1/tools
INFO 2025-07-04 22:19:01,853 llama_stack.providers.remote.inference.vllm.vllm:330 uncategorized: Initializing vLLM client with
base_url=http://192.168.1.183:8080/v1
20:19:01.858 [END] /v1/tools [StatusCode.OK] (14.92ms)
20:19:01.868 [START] /v1/agents/{agent_id}/session
20:19:01.868 [END] /v1/agents/{agent_id}/session [StatusCode.OK] (0.37ms)
20:19:01.873 [START] /v1/agents/{agent_id}/session/{session_id}/turn
20:19:01.885 [START] inference
20:19:05.506 [END] inference [StatusCode.OK] (3621.19ms)
INFO 2025-07-04 22:19:05,537 llama_stack.providers.inline.agents.meta_reference.agent_instance:890 agents: executing tool call: knowledge_search
with args: {'query': 'How to start the AI Inference Server container image'}
20:19:05.538 [START] tool_execution
20:19:05.928 [END] tool_execution [StatusCode.OK] (390.08ms)
20:19:05.538 [INFO] executing tool call: knowledge_search with args: {'query': 'How to start the AI Inference Server container image'}
20:19:05.935 [START] inference
20:19:17.539 [END] inference [StatusCode.OK] (11603.76ms)
20:19:17.560 [END] /v1/agents/{agent_id}/session/{session_id}/turn [StatusCode.OK] (15686.62ms)
```
- [x] No regressions in functionality
- [x] Configuration properly accepts kvstore settings
---------
Co-authored-by: Peter Gustafsson <peter.gustafsson6@gmail.com>
Co-authored-by: raghotham <rsm@meta.com>
Co-authored-by: Francisco Arceo <farceo@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
- fix env variables
- use gpu for vllm
- add eks/apply.py for aws
- add template to set hf secret
## Test Plan
bash apply.sh
Co-authored-by: Eric Huang <erichuang@fb.com>
# What does this PR do?
- Enabling Unit tests for Milvus to start to test OpenAI compatibility
and fixing a few bugs.
- Also fixed an inconsistency in the Milvus config between remote and
inline.
- Added pymilvus to extras for testing in CI
I'm going to refactor this later to include the other inline providers
so that we can catch issues sooner.
I have another PR where I've been testing to find other bugs in the
implementation (and required changes drafted here:
https://github.com/meta-llama/llama-stack/pull/2617).
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
currently when a template is used, we still use `--config`.
`server.py` has a dedicated `--template` flag and logic, use that
instead
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
The `nvidia` distro was previously collapsed into the `starter` distro.
However, the `nvidia` distro was setup specifically to use NVIDIA NeMo
microservices as providers for all APIs and not just inference, which
means it was doing quite a bit more than what the `starter` distro
covers today.
We should work with our friends at NVIDIA to determine the best place to
maintain this distro long-term, but for now this restores the `nvidia`
distro and its docs back to where they were so that things continue to
work for their users.
## Test Plan
I ensure the `nvidia` distro could build, and run at least to the point
of complaining that I didn't provide the necessary API keys.
```
uv run llama stack build --template nvidia --image-type venv
uv run llama stack run llama_stack/templates/nvidia/run.yaml
```
I also made sure the docs website built and looks reasonable, with the
`nvidia` distro docs at the same URL it was previously (because it has
incoming links from official NVIDIA NeMo docs, among other places).
```
uv run --group docs sphinx-autobuild docs/source docs/build/html --write-all
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Rather than pointing to a dir in `llama_stack/templates` (the repo
directory)
we should point to `$BUILD_DIR/IMAGE_NAME-run.yaml`
(`~/.llama/distributions/IMAGE_NAME/IMAGE_NAME-run.yaml`)
currently we are printing:
```
You can find the newly-built template here: /Users/charliedoern/projects/Documents/llama-stack/llama_stack/templates/starter/run.yaml
You can run the new Llama Stack distro via: llama stack run /Users/charliedoern/projects/Documents/llama-stack/llama_stack/templates/starter/run.yaml --image-type venv
```
but should be printing things like:
```
You can find the newly-built template here: /Users/charliedoern/.llama/distributions/starter/starter-run.yaml
You can run the new Llama Stack distro via: llama stack run /Users/charliedoern/.llama/distributions/starter/starter-run.yaml --image-type venv
```
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
The CI was failing but the error was eaten by the pipe. Now we run the
task with pipefail.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
- we are using `all-minilm:l6-v2` but the model we download from ollama
is `all-minilm:latest`
latest: https://ollama.com/library/all-minilm:latest 1b226e2802db
l6-v2: https://ollama.com/library/all-minilm:l6-v2 pin 1b226e2802db
- even currently they are exactly the same model but if
[all-minilm:l12-v2](https://ollama.com/library/all-minilm:l12-v2) is
updated, "latest" might not be the same for l6-v2.
- the only change in this PR is pin the model id in ollama
- also update detailed_tutorial with "starter" to replace deprecated
"ollama".
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
```
>INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
>llama stack build --run --template ollama --image-type venv
...
Build Successful!
You can find the newly-built template here: /home/wenzhou/zdtsw-forking/lls/llama-stack/llama_stack/templates/ollama/run.yaml
....
- metadata:
embedding_dimension: 384
model_id: all-MiniLM-L6-v2
model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType
- embedding
provider_id: ollama
provider_model_id: all-minilm:l6-v2
...
```
test
```
>llama-stack-client inference chat-completion --message "Write me a 2-sentence poem about the moon"
INFO:httpx:HTTP Request: GET http://localhost:8321/v1/models "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/openai/v1/chat/completions "HTTP/1.1 200 OK"
OpenAIChatCompletion(
id='chatcmpl-04f99071-3da2-44ba-a19f-03b5b7fc70b7',
choices=[
OpenAIChatCompletionChoice(
finish_reason='stop',
index=0,
message=OpenAIChatCompletionChoiceMessageOpenAIAssistantMessageParam(
role='assistant',
content="Here is a 2-sentence poem about the moon:\n\nSilver crescent in the midnight sky,\nLuna's gentle face, a beauty to the eye.",
name=None,
tool_calls=None,
refusal=None,
annotations=None,
audio=None,
function_call=None
),
logprobs=None
)
],
created=1751644429,
model='llama3.2:3b-instruct-fp16',
object='chat.completion',
service_tier=None,
system_fingerprint='fp_ollama',
usage={'completion_tokens': 33, 'prompt_tokens': 36, 'total_tokens': 69, 'completion_tokens_details': None, 'prompt_tokens_details': None}
)
```
---------
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
Bumps [next](https://github.com/vercel/next.js) from 15.3.2 to 15.3.3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/vercel/next.js/releases">next's
releases</a>.</em></p>
<blockquote>
<h2>v15.3.3</h2>
<blockquote>
<p>[!NOTE]<br />
This release is backporting bug fixes. It does <strong>not</strong>
include all pending features/changes on canary.</p>
</blockquote>
<h3>Core Changes</h3>
<ul>
<li>Reinstate <code>vary</code> (<a
href="https://redirect.github.com/vercel/next.js/issues/79939">#79939</a>)</li>
<li>fix(next-swc): Fix interestingness detection for React Compiler (<a
href="https://redirect.github.com/vercel/next.js/issues/79558">#79558</a>)</li>
<li>fix(next-swc): Fix react compiler usefulness detector (<a
href="https://redirect.github.com/vercel/next.js/issues/79480">#79480</a>)</li>
<li>fix(dev-overlay): Better handle edge-case file paths in launchEditor
(<a
href="https://redirect.github.com/vercel/next.js/issues/79526">#79526</a>)</li>
<li>Client router should discard stale prefetch entries for static pages
(<a
href="https://redirect.github.com/vercel/next.js/issues/79362">#79362</a>)</li>
</ul>
<h3>Credits</h3>
<p>Huge thanks to <a
href="https://github.com/gaojude"><code>@gaojude</code></a>, <a
href="https://github.com/kdy1"><code>@kdy1</code></a>, <a
href="https://github.com/bgw"><code>@bgw</code></a>, and <a
href="https://github.com/unstubbable"><code>@unstubbable</code></a> for
helping!</p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="3ab8db7383"><code>3ab8db7</code></a>
v15.3.3</li>
<li><a
href="18c8113ebd"><code>18c8113</code></a>
[backport] Reinstate <code>vary</code> (<a
href="https://redirect.github.com/vercel/next.js/issues/79939">#79939</a>)</li>
<li><a
href="e18212f546"><code>e18212f</code></a>
re-enable vary header deploy test (<a
href="https://redirect.github.com/vercel/next.js/issues/79753">#79753</a>)</li>
<li><a
href="ec202eccf0"><code>ec202ec</code></a>
Revert "[next-server] skip setting vary header for basic
routes" (<a
href="https://redirect.github.com/vercel/next.js/issues/79426">#79426</a>)</li>
<li><a
href="e2f264fdce"><code>e2f264f</code></a>
fix(next-swc): Fix interestingness detection for React Compiler (15.3)
(<a
href="https://redirect.github.com/vercel/next.js/issues/79558">#79558</a>)</li>
<li><a
href="562fac78da"><code>562fac7</code></a>
fix(next-swc): Fix react compiler usefulness detector (15.3) (<a
href="https://redirect.github.com/vercel/next.js/issues/79480">#79480</a>)</li>
<li><a
href="06097fd7bb"><code>06097fd</code></a>
fix(dev-overlay): Better handle edge-case file paths in launchEditor (<a
href="https://redirect.github.com/vercel/next.js/issues/79526">#79526</a>)</li>
<li><a
href="bda731fa96"><code>bda731f</code></a>
Client router should discard stale prefetch entries for static pages (<a
href="https://redirect.github.com/vercel/next.js/issues/79362">#79362</a>)</li>
<li>See full diff in <a
href="https://github.com/vercel/next.js/compare/v15.3.2...v15.3.3">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/meta-llama/llama-stack/network/alerts).
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
* Use a single env variable to setup OTEL endpoint
* Update telemetry provider doc
* Update general telemetry doc with the metrics we generate
* Left a script to setup telemetry for testing
Closes: https://github.com/meta-llama/llama-stack/issues/783
Note to reviewer: the `setup_telemetry.sh` script was useful for me; it
was nicely generated by AI. If we don't want it in the repo, I can
delete it, and I would understand.
Signed-off-by: Sébastien Han <seb@redhat.com>
The starter template build.yaml was missing the inline::faiss provider
in the vector_io section, while it was properly configured in run.yaml
and starter.py's vector_io_providers list.
Fixes: #2624
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
The agent code is currently importing MCP modules even when MCP isn’t
enabled. Do we consider this worth fixing, or are we treating MCP as a
first-class dependency? I believe we should treat it as such.
If everyone agrees, let’s go ahead and close this.
Note: The current setup breaks if someone builds a distro without
including MCP in tool_group but still serves the agent API.
Also, we should bump the MCP version to support streamable responses, as
SSE is being deprecated.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
* Removes a bunch of distros
* Removed distros were added into the "starter" distribution
* Doc for "starter" has been added
* Partially reverts https://github.com/meta-llama/llama-stack/pull/2482
since inference providers are disabled by default and can be turned on
manually via env variable.
* Disables safety in starter distro
Closes: https://github.com/meta-llama/llama-stack/issues/2502.
~Needs: https://github.com/meta-llama/llama-stack/pull/2482 for Ollama
to work properly in the CI.~
TODO:
- [ ] We can only update `install.sh` when we get a new release.
- [x] Update providers documentation
- [ ] Update notebooks to reference starter instead of ollama
Signed-off-by: Sébastien Han <seb@redhat.com>
This occurred when marshmallow 4.0.0 was installed (which removed
`__version_info__`). By pinning pymilvus to >=2.4.10, we ensure marshmallow
doesn't get installed.
Also set the dependency in InlineProviderSpec, as this is the one that
takes effect when using the "inline::milvus" provider.
Fixes https://github.com/meta-llama/llama-stack/issues/2588
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
This handles an edge case for `generate_chunk_id` when the concatenation
of `document_id` and `chunk_text` is not unique. Adding the window
location ensures uniqueness.
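For illustration, a minimal sketch of the idea (function and parameter names are assumptions, not the exact implementation in the repo):
```python
# Hash document_id + chunk_text, and fold in the chunk's window location so
# that identical text appearing at different offsets still yields unique IDs.
import hashlib
import uuid


def generate_chunk_id(document_id: str, chunk_text: str, chunk_window: str | None = None) -> str:
    hash_input = f"{document_id}:{chunk_text}"
    if chunk_window:
        # e.g. "0-512"; disambiguates repeated text within the same document
        hash_input += f":{chunk_window}"
    digest = hashlib.md5(hash_input.encode("utf-8"), usedforsecurity=False).hexdigest()
    return str(uuid.UUID(digest))
```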
## Test Plan
Added unit test
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
### Summary
This pull request implements support for the OpenAI Vector Store Files
API for the Milvus vector store provider in `llama_stack`. It enables
storing, loading, updating, and deleting file metadata and file contents
in Milvus collections, allowing OpenAI vector store files to be managed
directly within Milvus.
### Main Changes
- **Milvus Vector Store Files API Implementation**
- Implements all required methods for storing, loading, updating, and
deleting vector store file metadata and contents
(`_save_openai_vector_store_file`, `_load_openai_vector_store_file`,
`_load_openai_vector_store_file_contents`,
`_update_openai_vector_store_file`,
`_delete_openai_vector_store_file_from_storage`).
- Uses two Milvus collections: `openai_vector_store_files` (for
metadata) and `openai_vector_store_files_contents` (for chunked file
contents).
- Collections are created dynamically if they do not exist, with
appropriate schema definitions.
- **Collection Name Sanitization**
- Adds a `sanitize_collection_name` utility to ensure Milvus collection
names only contain valid characters (letters, numbers, underscores); a
minimal sketch appears after this list.
- **Testing**
- Updates test skip logic to include `"inline::milvus"` for cases where
the OpenAI Vector Store Files API is not supported, improving
integration test accuracy.
- **Other Improvements**
- Passes `kvstore` to `MilvusIndex` for consistency.
- Removes obsolete NotImplementedErrors and legacy code for file
storage.
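As referenced above, a minimal sketch of the sanitization utility (the regex choice here is an assumption; the actual implementation may differ):
```python
import re


def sanitize_collection_name(name: str) -> str:
    # Keep letters, digits, and underscores; replace anything else so the
    # result is a valid Milvus collection name.
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)


print(sanitize_collection_name("vs_abc-123.files"))  # -> "vs_abc_123_files"
```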
## Test Plan
CI and tested via a test script
## Notes
- `VectorDB` currently uses the `name` as the `identifier` in
`openai_create_vector_store`. We need to add `name` as a field to
`VectorDB` and generate the `identifier` upon creation. OpenAI is not
idempotent with respect to the `name` field that they pass (i.e., you
can pass the same name multiple times and OpenAI will generate a new
identifier). I'll add a follow up PR for this.
- The `Files` api needs to use `files-` as a prefix in the identifier. I
have updated the Vector Store to use the OpenAI prefix `vs_*`.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
* Use #2580 functionality to auto-start the server with the tests
* Reduce timeout to 30sec
* Print server logs on errors
* Pytest logs are collected to a file pytest.log
Signed-off-by: Sébastien Han <seb@redhat.com>
Resolves access control error visibility issues where 500 errors were
returned instead of proper 403 responses with actionable error messages.
• Enhance AccessDeniedError with detailed context and improve exception
handling
• Enhanced AccessDeniedError class to include user, action, and resource
context
- Added constructor parameters for action, resource, and user
- Generate detailed error messages showing user principal, attributes,
and attempted resource
- Backward compatible with existing usage (falls back to generic
message)
• Updated exception handling in server.py
- Import AccessDeniedError from access_control module
- Return proper 403 status codes with detailed error messages
- Separate handling for PermissionError (generic) vs AccessDeniedError
(detailed)
• Enhanced error context at raise sites
- Updated routing_tables/common.py to pass action, resource, and user
context
- Updated agents persistence to include context in access denied errors
- Provides better debugging information for access control issues
• Added comprehensive unit tests
- Created tests/unit/server/test_server.py with 13 test cases
- Covers AccessDeniedError with and without context
- Tests all exception types (ValidationError, BadRequestError,
AuthenticationRequiredError, etc.)
- Validates proper HTTP status codes and error message formats
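A hypothetical sketch of the enhanced error (attribute and parameter names are assumptions based on the description above):
```python
class AccessDeniedError(RuntimeError):
    """Raised when a user is not allowed to perform an action on a resource."""

    def __init__(self, action: str | None = None, resource: str | None = None, user=None) -> None:
        if action and resource and user:
            # Detailed message with the user principal and the attempted resource
            message = (
                f"User '{getattr(user, 'principal', user)}' cannot perform action '{action}' "
                f"on resource '{resource}'"
            )
        else:
            # Backward-compatible generic message when no context is supplied
            message = "Insufficient permissions"
        super().__init__(message)
```
In server.py this would then map to a 403 response instead of falling through to a generic 500 handler.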
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
```
server:
port: 8321
access_policy:
- permit:
principal: admin
actions: [create, read, delete]
when: user with admin in groups
- permit:
actions: [read]
when: user with system:authenticated in roles
```
then:
```
curl --request POST --url http://localhost:8321/v1/vector-dbs \
--header "Authorization: Bearer your-bearer" \
--data '{
"vector_db_id": "my_demo_vector_db",
"embedding_model": "ibm-granite/granite-embedding-125m-english",
"embedding_dimension": 768,
"provider_id": "milvus"
}'
```
Depending on whether the user is in the admin group or not, you should get the
`AccessDeniedError`. Before this PR, this led to an error 500 with a
`Traceback` displayed in the logs.
After the PR, logs display a simpler error (unless DEBUG logging is set)
and a 403 Forbidden error is returned on the HTTP side.
---------
Signed-off-by: Akram Ben Aissi <akram.benaissi@gmail.com>
# What does this PR do?
https://github.com/meta-llama/llama-stack/pull/2490 broke postgres_demo,
as the config expected a str but the value was converted to int.
This PR:
1. Updates the type of port in sqlstore to be int
2. template generation uses `dict` instead of `StackRunConfig` so as to
avoid failing pydantic typechecks.
3. Adds `replace_env_vars` to StackRunConfig instantiation in
`configure.py` (not sure why this wasn't needed before).
## Test Plan
`llama stack build --template postgres_demo --image-type conda --run`
# What does this PR do?
- Adding a notebook equivalent of the
[getting_started/index.md#Quickstart
guide](https://github.com/meta-llama/llama-stack/blob/main/docs/source/getting_started/index.md).
## To discuss
**Note:** works locally, but I am encountering issues when attempting to
run through the notebook on Google Colab. Specifically, on the last step
to run the demo, the `knowledge_search` tool doesn't seem to be called
i.e.,:
```
rag_tool> Ingesting document: https://www.paulgraham.com/greatwork.html
prompt> How do you do great work?
inference> I don't have personal experiences or emotions, but I was trained on a large corpus of text data and use various techniques such as natural language processing (NLP) and machine learning algorithms to generate human-like responses.
```
I would expect to get something like:
```
rag_tool> Ingesting document: https://www.paulgraham.com/greatwork.html
prompt> How do you do great work?
inference> [knowledge_search(query="What is the key to doing great work")]
tool_execution> Tool:knowledge_search Args:{'query': 'What is the key to doing great work'}
tool_execution> Tool:knowledge_search Response:[TextContentItem(text='knowledge_search tool found 5 chunks:
....
....
```
- Switch from BRAVE_SEARCH_API_KEY to TAVILY_SEARCH_API_KEY
- Add provider_data to LlamaStackClient for API key passing
- Use builtin::websearch toolgroup instead of manual tool config
- Fix message types to use UserMessage instead of plain dict
- Add streaming support with proper type casting
- Remove async from EventLogger loop (bug fix)
Fixes websearch functionality in agents tutorial by properly configuring
Tavily search provider integration.
# What does this PR do?
Fixes the Agents101 tutorial notebook to work with the current Llama
Stack websearch implementation. The tutorial was using outdated Brave
Search configuration that no longer works with the current server setup.
**Key Changes:**
- **Switch API provider**: Change from `BRAVE_SEARCH_API_KEY` to
`TAVILY_SEARCH_API_KEY` to match server configuration
- **Fix client setup**: Add `provider_data` to `LlamaStackClient` to
properly pass API keys to server
- **Modernize tool usage**: Replace manual tool configuration with
`tools=["builtin::websearch"]`
- **Fix type safety**: Use `UserMessage` type instead of plain
dictionaries for messages
- **Fix streaming**: Add proper streaming support with `stream=True` and
type casting
- **Fix EventLogger**: Remove incorrect `async for` usage (should be
`for`)
**Why needed:** Users following the tutorial were getting 401
Unauthorized errors because the notebook wasn't properly configured for
the Tavily search provider that the server actually uses.
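A hypothetical condensed version of the updated client setup (the key and model are placeholders; the full agent wiring lives in the notebook):
```python
from llama_stack_client import LlamaStackClient
from llama_stack_client.types import UserMessage

# provider_data forwards the Tavily key to the server-side websearch provider
client = LlamaStackClient(
    base_url="http://localhost:8321",
    provider_data={"tavily_search_api_key": "your_tavily_api_key"},
)

# Typed message instead of a plain dict
message = UserMessage(role="user", content="What are the top places to visit in Switzerland?")
```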
## Test Plan
**Prerequisites:**
1. Start Llama Stack server with Ollama template and
`TAVILY_SEARCH_API_KEY` environment variable
2. Set `TAVILY_SEARCH_API_KEY` in your `.env` file
**Testing Steps:**
1. **Clone and setup:**
```bash
git checkout fix-2558-update-agents101
cd docs/zero_to_hero_guide/
```
2. **Start server with API key:**
```bash
export TAVILY_SEARCH_API_KEY="your_tavily_api_key"
podman run -it --network=host -v ~/.llama:/root/.llama:Z \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://localhost:11434 \
--env TAVILY_SEARCH_API_KEY=$TAVILY_SEARCH_API_KEY \
llamastack/distribution-ollama --port $LLAMA_STACK_PORT
```
3. **Run the notebook:**
- Open `07_Agents101.ipynb` in Jupyter
- Execute all cells in order
- Cell 5 should run without errors and show successful web search
results
**Expected Results:**
- ✅ No 401 Unauthorized errors
- ✅ Agent successfully calls `brave_search.call()` with web results
- ✅ Switzerland travel recommendations appear in output
- ✅ Follow-up questions work correctly
**Before this fix:** Users got `401 Unauthorized` errors and tutorial
failed
**After this fix:** Tutorial works end-to-end with proper web search
functionality
**Tested with:**
- Tavily API key (free tier)
- Ollama distribution template
- Llama-3.2-3B-Instruct model
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
- add model_type in example
- change "Memory" to "VectorIO" as column name
- update index.md and README.md
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
run pre-commit to catch changes.
---------
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Set parameter `usedforsecurity=False` when calling hashlib.md5 in order
to fix rag_tool.insert on FIPS clusters
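For reference, the pattern looks like this (a minimal sketch, not the exact call site):
```python
import hashlib

# On FIPS-enabled clusters, plain hashlib.md5(...) is blocked because MD5 is not
# an approved security algorithm; marking the call as non-security use keeps it
# available for non-cryptographic purposes such as generating stable identifiers.
digest = hashlib.md5(b"chunk contents", usedforsecurity=False).hexdigest()
print(digest)
```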
<!-- If resolving an issue, uncomment and update the line below -->
Closes #2571
---------
Signed-off-by: Jorge Garcia Oncins <jgarciao@redhat.com>
## Summary
Add support for `server:<config>` format in `--stack-config` option to
enable seamless one-step integration testing. This eliminates the need
to manually start servers in separate terminals before running tests.
## Key Features
- **Auto-start server**: Automatically launches `llama stack run
<config>` if target port is available
- **Smart reuse**: Reuses existing server if port is already occupied
- **Health check polling**: Waits up to 2 minutes for server readiness
via `/v1/health` endpoint
- **Custom port support**: Use `server:<config>:<port>` for non-default
ports
- **Clean output**: Server runs quietly in background without cluttering
test output
- **Backward compatibility**: All existing `--stack-config` formats
continue to work
## Usage Examples
```bash
# Auto-start server with default port 8321
pytest tests/integration/inference/ --stack-config=server:fireworks
# Use custom port
pytest tests/integration/safety/ --stack-config=server:together:8322
# Run multiple test suites seamlessly
pytest tests/integration/inference/ tests/integration/agents/ --stack-config=server:starter
```
## Implementation Details
- Enhanced `llama_stack_client` fixture with server management
- Updated documentation with cleaner organization and comprehensive
examples
- Added utility functions for port checking, server startup, and health
verification
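A rough sketch of what those utility functions can look like (names and details are assumptions, not the exact fixture code):
```python
import socket
import subprocess
import time

import requests


def is_port_available(port: int, host: str = "localhost") -> bool:
    # connect_ex returns 0 when something is already listening on the port
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        return sock.connect_ex((host, port)) != 0


def start_and_wait_for_server(config: str, port: int = 8321, timeout: int = 120) -> None:
    if is_port_available(port):
        # Launch the server quietly in the background
        subprocess.Popen(
            ["llama", "stack", "run", config, "--port", str(port)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"http://localhost:{port}/v1/health", timeout=2).status_code == 200:
                return
        except requests.ConnectionError:
            pass
        time.sleep(1)
    raise RuntimeError(f"Server on port {port} did not become healthy within {timeout}s")
```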
## Test Plan
- Verified server auto-start when port 8321 is available
- Verified server reuse when port 8321 is occupied
- Tested health check polling via `/v1/health` endpoint
- Confirmed custom port configuration works correctly
- Verified backward compatibility with existing config formats
## Before/After Comparison
**Before (2 steps):**
```bash
# Terminal 1: Start server manually
llama stack run fireworks --port 8321
# Terminal 2: Wait for startup, then run tests
pytest tests/integration/inference/ --stack-config=http://localhost:8321
```
**After (1 step):**
```bash
# Single command handles everything
pytest tests/integration/inference/ --stack-config=server:fireworks
```
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
- update README.md format and python version
- update package name: CustomTool was renamed to ClientTool in
https://github.com/meta-llama/llama-stack-client-python/pull/73
<!-- If resolving an issue, uncomment and update the line below -->
Closes #2556
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
# What does this PR do?
Clarifies that non-Llama models can be trained via the Post Training API
## Test Plan
Build docs locally
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
We were not using conditionals correctly; conditionals can only be used
when the env variable is set, so `${env.ENVIRONMENT:+}` would return
None if ENVIRONMENT is not set.
If you want to create a conditional value, you need to use
`${env.ENVIRONMENT:=}`: this will pick the value of ENVIRONMENT if set,
otherwise it will return None.
Closes: https://github.com/meta-llama/llama-stack/issues/2564
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The external providers guide can now be accessed directly from the
sidebar
## Test Plan
Build locally to test the changes
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
fixes the api_key type when read from env
## Test Plan
Run the nvidia template w/o api_key in run.yaml and perform inference.
Before the change, the inference will fail with:
```
File ".../llama-stack/llama_stack/providers/remote/inference/nvidia/nvidia.py", line 118, in _get_client_for_base_url
api_key=(self._config.api_key.get_secret_value() if self._config.api_key else "NO KEY"),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'get_secret_value'
```
## What does this PR do?
Ollama does not support remote images. Only local file paths OR base64
inputs are supported. This PR ensures that the Stack downloads remote
images and passes the base64 down to the inference engine.
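A minimal sketch of the idea (helper name assumed; the real code lives in the Stack's message preparation path):
```python
import base64

import httpx


async def localize_image(url: str) -> str:
    # Download the remote image and return it as base64 so providers like
    # Ollama, which only accept local files or base64, can consume it.
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        response.raise_for_status()
    return base64.b64encode(response.content).decode("utf-8")
```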
## Test Plan
Added a test cases for Responses and ran it for both `fireworks` and
`ollama` providers.
# What does this PR do?
Simple approach to get some provider pages in the docs.
Add or update description fields in the provider configuration class
using Pydantic’s Field, ensuring these descriptions are clear and
complete, as they will be used to auto-generate provider documentation
via ./scripts/distro_codegen.py instead of editing the docs manually.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Resolves:
```
mypy.....................................................................Failed
- hook id: mypy
- exit code: 1
llama_stack/providers/utils/responses/responses_store.py:119: error: Missing positional argument "policy" in call to "fetch_one" of "AuthorizedSqlStore" [call-arg]
llama_stack/providers/utils/responses/responses_store.py:122: error: "AuthorizedSqlStore" has no attribute "delete" [attr-defined]
Found 2 errors in 1 file (checked 403 source files)
```
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This PR creates a webmethod for deleting OpenAI responses, adds an
implementation for it, and adds an integration test for the OpenAI
delete response method.
[//]: # (If resolving an issue, uncomment and update the line below)
# (Closes#2077)
## Test Plan
Ran the standard tests and the pre-commit hooks and the unit tests.
# (## Documentation)
For this PR I made the routes and implementation based on the current
get and create methods. The unit tests were not able to handle this test
due to the mock interface in use, which did not allow for effective CRUD
to be tested. I instead created an integration test to match the
existing ones in test_openai_responses.
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
- follow-up from https://github.com/meta-llama/llama-stack/issues/2110:
the documentation needs updating, since "container" is not a valid value
for `--image-type`
- chore: updates from standard output
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from
6.3.0 to 6.3.1.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/astral-sh/setup-uv/releases">astral-sh/setup-uv's
releases</a>.</em></p>
<blockquote>
<h2>v6.3.1 🌈 Do not warn when version not in manifest-file</h2>
<h2>Changes</h2>
<p>This is a hotfix to change the warning messages that a version could
not be found in the local manifest-file to info level.</p>
<p>A <code>setup-uv</code> release contains a version-manifest.json file
with infos in all available <code>uv</code> releases. When a new
<code>uv</code> version is released this is not contained in this file
until the file gets updated and a new <code>setup-uv</code> release is
made.
We will overhaul this process in the future but for now the spamming of
warnings is removed.</p>
<h2>🐛 Bug fixes</h2>
<ul>
<li>Do not warn when version not in manifest-file <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/462">#462</a>)</li>
</ul>
<h2>🧰 Maintenance</h2>
<ul>
<li>chore: update known versions for 0.7.14 @<a
href="https://github.com/apps/github-actions">github-actions[bot]</a>
(<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/459">#459</a>)</li>
<li>Revert "Set expected cache dir drive to C: on windows (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/451">#451</a>)"
<a href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/460">#460</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="bd01e18f51"><code>bd01e18</code></a>
Do not warn when version not in manifest-file (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/462">#462</a>)</li>
<li><a
href="c6a5ebaafe"><code>c6a5eba</code></a>
chore: update known versions for 0.7.14 (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/459">#459</a>)</li>
<li><a
href="790df8f465"><code>790df8f</code></a>
Revert "Set expected cache dir drive to C: on windows (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/451">#451</a>)"
(<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/460">#460</a>)</li>
<li>See full diff in <a
href="445689ea25...bd01e18f51">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Given the shift to python3.12, we need to explicitly depend on
`setuptools` for the pkg_resources import
## Test Plan
Run
```
cd local/llama-stack
UV_PROJECT_ENVIRONMENT=/tmp/docs uv sync --frozen --group docs
cd /tmp/docs
uv run python -m sphinx -T -b html -d _build/doctrees -D language=en \
~/local/llama-stack/docs/source/ \
/tmp/docs/html
```
# What does this PR do?
Closes https://github.com/meta-llama/llama-stack/issues/2461
## Test Plan
Tested with the `ollama` distribution template and updated the vector_io
provider to:
```yaml
vector_io:
  - provider_id: milvus
    provider_type: inline::milvus
    config:
      db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/ollama}/milvus_store.db
      kvstore:
        type: sqlite
        db_name: milvus_registry.db
```
Ran the stack
```bash
llama stack run ./llama_stack/templates/ollama/run.yaml --image-type venv --env OLLAMA_URL="http://0.0.0.0:11434"
```
Ran the tests:
```
pytest -sv --stack-config=http://localhost:8321 tests/integration/vector_io/test_openai_vector_stores.py --embedding-model all-MiniLM-L6-v2
```
Output passed.
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Currently only the last saved model is reported as a checkpoint and
associated with the job UUID. Since the HF trainer handles checkpoint
collection during training, we need to add all of the `checkpoint-*`
folders as Checkpoint objects. Adjust the save strategy to be per-epoch
to make this easier and to use less storage.
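A rough sketch of the collection step (paths and names are assumptions):
```python
from pathlib import Path

# Hypothetical job output directory written by the HF trainer
output_dir = Path("~/.llama/checkpoints/job-1234").expanduser()

# With a per-epoch save strategy, each epoch leaves a checkpoint-* folder;
# every one of them should be surfaced as a Checkpoint object for the job UUID.
checkpoint_dirs = sorted(p for p in output_dir.glob("checkpoint-*") if p.is_dir())
print([p.name for p in checkpoint_dirs])
```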
Signed-off-by: Charlie Doern <cdoern@redhat.com>
The error message was misleading as it appeared to be an Ollama
connectivity issue, but actually occurred during faiss vector database
initialization.
## 🔍 Root Cause Analysis
The issue was in the faiss vector database serialization logic in
`llama_stack/providers/inline/vector_io/faiss/faiss.py`:
1. **Saving**: `faiss.serialize_index()` returns binary data (uint8
numpy array)
2. **Bug**: Code incorrectly used `np.savetxt()` which converts binary
to text with scientific notation (e.g., `7.300000000000000000e+01`)
3. **Loading**: `np.loadtxt(buffer, dtype=np.uint8)` failed to parse
scientific notation back to uint8
4. **Result**: Server crashed during initialization before reaching
Ollama connectivity check
## ✅ Solution
Replaced text-based serialization with proper binary serialization.
**After (fixed):**
```python
# Saving - proper binary format
np.save(buffer, np_index, allow_pickle=False)

# Loading - proper binary format
self.index = faiss.deserialize_index(np.load(buffer, allow_pickle=False))
```
## 🧪 Testing
- ✅ Binary serialization/deserialization works correctly
- ✅ Backward compatible with existing functionality
- ✅ No security concerns (allow_pickle=False maintained)
- ✅ Resolves the specific ValueError mentioned in the issue
## 📊 Impact
This fix resolves:
- ValueError during server startup with Ollama templates
## 🔗 Related Issues
- Closes #2519
- Affects all users of Ollama template and faiss vector_io configurations
## 📝 Files Changed
- `llama_stack/providers/inline/vector_io/faiss/faiss.py` - Fixed serialization methods in `initialize()` and `_save_index()`
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
- llama_stack/exceptions.py: Add UnsupportedModelError class
- remote inference ollama.py and utils/inference/model_registry.py:
Changed ValueError in favor of UnsupportedModelError
- utils/inference/litellm_openai_mixin.py: remove `register_model`
function implementation from `LiteLLMOpenAIMixin` class. Now uses the
parent class `ModelRegistryHelper`'s function implementation
Closes #2517
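A minimal sketch of the idea (the exact class in `llama_stack/exceptions.py` may differ):
```python
class UnsupportedModelError(ValueError):
    """Raised when a requested model is not supported by the provider."""

    def __init__(self, model_name: str, supported_models: list[str]) -> None:
        super().__init__(
            f"'{model_name}' is not supported. Supported models are: {', '.join(supported_models)}"
        )


# Hypothetical use in a provider's register_model path:
# raise UnsupportedModelError(model_id, sorted(provider_supported_models))
```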
## Test Plan
1. Create a new `test_run_openai.yaml` and paste the following config in
it:
```yaml
version: '2'
image_name: test-image
apis:
  - inference
providers:
  inference:
    - provider_id: openai
      provider_type: remote::openai
      config:
        max_tokens: 8192
models:
  - metadata: {}
    model_id: "non-existent-model"
    provider_id: openai
    model_type: llm
server:
  port: 8321
```
And run the server with:
```bash
uv run llama stack run test_run_openai.yaml
```
You should now get a `llama_stack.exceptions.UnsupportedModelError` with
the supported list of models in the error message.
---
Tested for the following remote inference providers, and they all raise
the `UnsupportedModelError`:
- Anthropic
- Cerebras
- Fireworks
- Gemini
- Groq
- Ollama
- OpenAI
- SambaNova
- Together
- Watsonx
---------
Co-authored-by: Rohan Awhad <rawhad@redhat.com>
# What does this PR do?
Fixes CVE-2025-4565 and the following warning:
```
warning: `aiohttp==3.11.13` is yanked (reason: "Regression: https://github.com/aio-libs/aiohttp/issues/10617")
```
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
Fixes an error when inferring dataset provider_id with metadata
Closes #[2506](https://github.com/meta-llama/llama-stack/issues/2506)
Signed-off-by: Juanma Barea <juanmabareamartinez@gmail.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
- conditionally created folder /.llama/providers.d if
external_providers_dir is set
- do not create /.cache folder, not in use anywhere
- combine chmod and copy to one command
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
updated test:
```
export CONTAINER_BINARY=podman
LLAMA_STACK_DIR=. uv run llama stack build --template remote-vllm --image-type container --image-name <name>
```
log:
```
Containerfile created successfully in /tmp/tmp.rPMunE39Aw/Containerfile
FROM python:3.11-slim
WORKDIR /app
RUN apt-get update && apt-get install -y iputils-ping net-tools iproute2 dnsutils telnet curl wget telnet git procps psmisc lsof traceroute bubblewrap gcc && rm -rf /var/lib/apt/lists/*
ENV UV_SYSTEM_PYTHON=1
RUN pip install uv
RUN uv pip install --no-cache sentencepiece pillow pypdf transformers pythainlp faiss-cpu opentelemetry-sdk requests datasets chardet scipy nltk numpy matplotlib psycopg2-binary aiosqlite langdetect autoevals tree_sitter tqdm pandas chromadb-client opentelemetry-exporter-otlp-proto-http redis scikit-learn openai pymongo emoji sqlalchemy[asyncio] mcp aiosqlite fastapi fire httpx uvicorn opentelemetry-sdk opentelemetry-exporter-otlp-proto-http
RUN uv pip install --no-cache sentence-transformers --no-deps
RUN uv pip install --no-cache torch torchvision --index-url https://download.pytorch.org/whl/cpu
# Allows running as non-root user
RUN mkdir -p /.llama/providers.d /.cache
RUN uv pip install --no-cache llama-stack
RUN pip uninstall -y uv
ENTRYPOINT ["python", "-m", "llama_stack.distribution.server.server", "--template", "remote-vllm"]
RUN chmod -R g+rw /app /.llama /.cache
PWD: /tmp/llama-stack
Containerfile: /tmp/tmp.rPMunE39Aw/Containerfile
+ podman build --progress=plain --security-opt label=disable --platform linux/amd64 -t distribution-remote-vllm:0.2.12 -f /tmp/tmp.rPMunE39Aw/Containerfile /tmp/llama-stack
....
Success!
Build Successful!
You can find the newly-built template here: /tmp/llama-stack/llama_stack/templates/remote-vllm/run.yaml
You can run the new Llama Stack distro via: llama stack run /tmp/llama-stack/llama_stack/templates/remote-vllm/run.yaml --image-type container
```
```
podman tag localhost/distribution-remote-vllm:dev quay.io/wenzhou/distribution-remote-vllm:2492_2
podman push quay.io/wenzhou/distribution-remote-vllm:2492_2
docker run --rm -p 8321:8321 -e INFERENCE_MODEL="meta-llama/Llama-2-7b-chat-hf" -e VLLM_URL="http://localhost:8000/v1" quay.io/wenzhou/distribution-remote-vllm:2492_2 --port 8321
INFO 2025-06-26 13:47:31,813 __main__:436 server: Using template remote-vllm config file:
/app/llama-stack-source/llama_stack/templates/remote-vllm/run.yaml
INFO 2025-06-26 13:47:31,818 __main__:438 server: Run configuration:
INFO 2025-06-26 13:47:31,826 __main__:440 server: apis:
- agents
- datasetio
- eval
- inference
- safety
- scoring
- telemetry
- tool_runtime
- vector_io
benchmarks: []
container_image: null
....
```
-----
previous test:
local run `llama stack build --template remote-vllm --image-type container`
image stored in `quay.io/wenzhou/distribution-remote-vllm:2492`
---------
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
Update changelog.
---------
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
Some templates were still using the old environment variable substitution
syntax instead of the new one and were not getting substituted properly.
Also, some places didn't handle the new None vs old empty string ("")
values that come from the conditional environment variable substitution.
This gets the starter and remote-vllm distributions starting again, and
I tested various permutations of the starter as chroma and pgvector
needed some adjustments to their config classes to handle the new
possible `None` values. And, I had to tweak our `Provider` class to also
handle `None` values, for cases where we disable providers in the
starter config via environment variables.
This may not have caught everything that was missed, but I did grep
around quite a bit to try and find anything lingering.
## Test Plan
The following permutations now all run (or attempt to run to the point
of complaining that they can't connect to chroma, vllm, etc) when before
they failed immediately on startup because of bad environment variable
substitutions:
```
uv run llama stack run llama_stack/templates/starter/run.yaml
ENABLE_SQLITE_VEC=true uv run llama stack run llama_stack/templates/starter/run.yaml
ENABLE_PGVECTOR=true uv run llama stack run llama_stack/templates/starter/run.yaml
ENABLE_CHROMADB=true uv run llama stack run llama_stack/templates/starter/run.yaml
uv run llama stack run llama_stack/templates/remote-vllm/run.yaml
```
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: raghotham <rsm@meta.com>
# What does this PR do?
I get errors when trying to query spans. It appears to be a result of
traces being inserted where there is no root_span_id which causes a
pydantic validation error on trying to load the data for a query
response (and in any case having no span referenced undermines the
purpose of the trace). The root cause as far as I can see is an invalid
test in the code that inserts the trace, where it is testing for the
string "true" against an object set to the python value True.
<!-- If resolving an issue, uncomment and update the line below -->
Closes #2493
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
With this change I can query spans.
Signed-off-by: Gordon Sim <gsim@redhat.com>
# What does this PR do?
The goal is to promote the minimal set of dependencies the project needs
to run; this includes:
* dependencies needed to work with the CLI
* dependencies needed for the server to run with no providers
This also:
* Relocate redundant dependencies out of the core project and into the
individual providers that actually require them.
* Include all necessary server dependencies so the project can run
standalone, even without any providers.
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
Build and run a distro server.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This commit significantly improves the environment variable substitution
functionality in Llama Stack configuration files:
* The version field in configuration files has been changed from string
to integer type for better type consistency across build and run
configurations.
* The environment variable substitution system for ${env.FOO:} was fixed
and properly returns an error
* The environment variable substitution system for ${env.FOO+} returns
None instead of an empty string, which better matches type annotations in
config fields
* The system includes automatic type conversion for boolean, integer,
and float values.
* The error messages have been enhanced to provide clearer guidance when
environment variables are missing, including suggestions for using
default values or conditional syntax.
* Comprehensive documentation has been added to the configuration guide
explaining all supported syntax patterns, best practices, and runtime
override capabilities.
* Multiple provider configurations have been updated to use the new
conditional syntax for optional API keys, making the system more
flexible for different deployment scenarios. The telemetry configuration
has been improved to properly handle optional endpoints with appropriate
validation, ensuring that required endpoints are specified when their
corresponding sinks are enabled.
* There were many instances of ${env.NVIDIA_API_KEY:} that should have
caused the code to fail. However, due to a bug, the distro server was
still being started, and early validation wasn’t triggered. As a result,
failures were likely being handled downstream by the providers. I’ve
maintained similar behavior by using ${env.NVIDIA_API_KEY:+}, though I
believe this is incorrect for many configurations. I’ll leave it to each
provider to correct it as needed.
* Environment variable substitution now uses the same syntax as Bash
parameter expansion.
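A rough illustration of the Bash-parallel semantics described above (a sketch, not the actual substitution code):
```python
import os


def expand_default(name: str, default: str) -> str | None:
    # ${env.FOO:=default}: use FOO when set and non-empty, otherwise the default
    value = os.environ.get(name)
    return value if value else (default or None)


def expand_conditional(name: str, alternative: str) -> str | None:
    # ${env.FOO:+alternative}: use the alternative only when FOO is set and non-empty
    return alternative if os.environ.get(name) else None
```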
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
We still had a few enums declared to behave like both string and enum.
Let's use StrEnum for those.
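For example (an illustrative enum, not necessarily one touched in this PR):
```python
from enum import StrEnum


# A StrEnum member already behaves as a plain string, which is what the
# previous (str, Enum) double inheritance was emulating.
class Api(StrEnum):
    inference = "inference"
    safety = "safety"


assert Api.inference == "inference"
```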
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
* Given that our API packages use `import *` in `__init__.py`, we don't
need to do `from llama_stack.apis.models.models` but simply
`from llama_stack.apis.models`. The decision to use `import *` is
debatable and should probably be revisited at some point.
* Remove unneeded Ruff F401 rule
* Consolidate the Ruff F403 rule in the pyproject
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
These are a couple of fixes to get an example LangChain app working with
our OpenAI Responses API implementation.
The Responses API spec requires an annotations array in
`output[*].content[*].annotations` and we were not providing one. So,
this adds that as an empty list, even though we don't do anything to
populate it yet. This prevents an error from client libraries like
Langchain that expect this field to always exist, even if an empty list.
The other fix is `web_search_preview` is a valid name for the web search
tool in the Responses API, but we only responded to `web_search` or
`web_search_preview_2025_03_11`.
## Test Plan
The existing Responses unit tests were expanded to test these cases,
via:
```
pytest -sv tests/unit/providers/agents/meta_reference/test_openai_responses.py
```
The existing test_openai_responses.py integration tests still pass with
this change, tested as below with Fireworks:
```
uv run llama stack run llama_stack/templates/starter/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
uv run pytest -sv tests/integration/agents/test_openai_responses.py \
--text-model accounts/fireworks/models/llama4-scout-instruct-basic
```
Lastly, this example LangChain app now works with Llama stack (tested
with Ollama in the starter template in this case). This LangChain code
is using the example snippets for using Responses API at
https://python.langchain.com/docs/integrations/chat/openai/#responses-api
```python
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
base_url="http://localhost:8321/v1/openai/v1",
api_key="fake",
model="ollama/meta-llama/Llama-3.2-3B-Instruct",
)
tool = {"type": "web_search_preview"}
llm_with_tools = llm.bind_tools([tool])
response = llm_with_tools.invoke("What was a positive news story from today?")
print(response.content)
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Closes #2522
## Test Plan
added integration test
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -v
tests/integration/agents/test_openai_responses.py --text-model
"accounts/fireworks/models/llama-v3p3-70b-instruct" -vv -k
'function_call'
# What does this PR do?
Adding `ChunkMetadata` so we can properly delete embeddings later.
More specifically, this PR refactors and extends the chunk metadata
handling in the vector database and introduces a distinction between
metadata used for model context and backend-only metadata required for
chunk management, storage, and retrieval. It also improves chunk ID
generation and propagation throughout the stack, enhances test coverage,
and adds new utility modules.
```python
class ChunkMetadata(BaseModel):
    """
    `ChunkMetadata` is backend metadata for a `Chunk` that is used to store additional information about the chunk that
    will NOT be inserted into the context during inference, but is required for backend functionality.
    Use `metadata` in `Chunk` for metadata that will be used during inference.
    """

    document_id: str | None = None
    chunk_id: str | None = None
    source: str | None = None
    created_timestamp: int | None = None
    updated_timestamp: int | None = None
    chunk_window: str | None = None
    chunk_tokenizer: str | None = None
    chunk_embedding_model: str | None = None
    chunk_embedding_dimension: int | None = None
    content_token_count: int | None = None
    metadata_token_count: int | None = None
```
Eventually we can migrate the document_id out of the `metadata` field.
I've introduced the changes so that `ChunkMetadata` is backwards
compatible with `metadata`.
<!-- If resolving an issue, uncomment and update the line below -->
Closes https://github.com/meta-llama/llama-stack/issues/2501
## Test Plan
Added unit tests
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Our starter distro required Ollama to be running (and a large list of
models available in that Ollama) to successfully start. This adjusts
things so that Ollama does not have to be running to use the starter
template / distro.
To accomplish this, a few changes were needed:
* The Ollama provider is now configurable whether it raises an Exception
or just logs a warning when it cannot reach the Ollama server on
startup. The default is to raise an exception (same as previous
behavior), but in the starter template we adjust this to just log a
warning so that we can bring the stack up without needing a running
Ollama server.
* The starter template no longer specifies a default list of models for
Ollama, as any models specified there need to actually be pulled and
available in Ollama. Instead, it adds a new
`OLLAMA_INFERENCE_MODEL` environment variable where users can provide an
optional model to register with the Ollama provider on startup.
Additional models can also be registered via the typical
`models.register(...)` at runtime.
* The vLLM template was adjusted to also allow an optional
`VLLM_INFERENCE_MODEL` specified on startup, so that the behavior
between vLLM and Ollama was consistent here to make it easy to get up
and running quickly.
* The default vector store was changed from sqlite-vec to faiss.
sqlite-vec can be enabled via setting the `ENABLE_SQLITE_VEC` environment
variable, like we do for chromadb and pgvector. This is due to
sqlite-vec not shipping proper arm64 binaries, like we previously fixed
in #1530 for the ollama distribution.
## Test Plan
With this change, the following scenarios now work with the starter
template that did not before:
* no Ollama running
* Ollama running but not all of the Llama models pulled locally
* Ollama running with a custom model registered on startup
* vLLM running with a custom model registered on startup
* running the starter template on linux/arm64, like when running
containers on Mac without rosetta emulation
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Add search_mode parameter (vector/keyword/hybrid) to
openai_search_vector_store method. Fixes OpenAPI
code generation by using str instead of Literal type.
Closes: #2459
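A hypothetical request against the OpenAI-compatible search endpoint (the path and payload keys here are assumptions based on the OpenAI-style API):
```python
import requests

resp = requests.post(
    "http://localhost:8321/v1/openai/v1/vector_stores/vs_123/search",
    json={
        "query": "What is Llama Stack?",
        "search_mode": "hybrid",  # "vector" (default), "keyword", or "hybrid"
        "max_num_results": 5,
    },
    timeout=30,
)
print(resp.json())
```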
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
# What does this PR do?
Bug:
1. go to responses chat logs in UI
2. go to chat completions logs page
3. observe that same data appears in the table twice
This is because `fetchData` is called multiple times when multiple
renders occur.
## Test Plan
manual testing of above bug repro steps
# What does this PR do?
Closes #2495
Changes:
- Delay the `COPY run.yaml` into docker image step until after external
provider handling
- Split the check for `external_providers_dir` into "non-empty" and
"directory exists"
## Test Plan
0. Create and Activate venv
1. Create a `simple_build.yaml`
```yaml
version: '2'
distribution_spec:
providers:
inference:
- remote::openai
image_type: container
image_name: openai-stack
```
2. Run llama stack build:
```bash
llama stack build --config simple_build.yaml
```
3. Run the docker container:
```bash
docker run \
-p 8321:8321 \
-e OPENAI_API_KEY=$OPENAI_API_KEY \
openai_stack:0.2.12
```
This should show server is running.
```
INFO 2025-06-23 19:07:57,832 llama_stack.distribution.distribution:151 core: Loading external providers from /.llama/providers.d
INFO 2025-06-23 19:07:59,324 __main__:572 server: Listening on ['::', '0.0.0.0']:8321
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO 2025-06-23 19:07:59,336 __main__:156 server: Starting up
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
```
Notice the first line:
```
Loading external providers from /.llama/providers.d
```
This is expected behaviour.
Co-authored-by: Rohan Awhad <rawhad@redhat.com>
Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from
6.0.1 to 6.3.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/astral-sh/setup-uv/releases">astral-sh/setup-uv's
releases</a>.</em></p>
<blockquote>
<h2>v6.3.0 🌈 Use latest version from manifest-file</h2>
<h2>Changes</h2>
<p>If a manifest-file is supplied the default value of the version input
(latest) will get the latest version available in the manifest. That
might not be the actual latest version available in the official uv
repo.</p>
<h2>🚀 Enhancements</h2>
<ul>
<li>Use latest version from manifest-file <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/458">#458</a>)</li>
</ul>
<h2>v6.2.0 🌈 New input manifest-file</h2>
<h2>Changes</h2>
<p>This release adds a new input <code>manifest-file</code>.</p>
<p>The <code>manifest-file</code> input allows you to specify a JSON
manifest that lists available uv versions,
architectures, and their download URLs. By default, this action uses the
manifest file contained
in this repository, which is automatically updated with each release of
uv.</p>
<p>The manifest file contains an array of objects, each describing a
version,
architecture, platform, and the corresponding download URL.</p>
<p>You can supply a custom manifest file URL to define additional
versions,
architectures, or different download URLs.
This is useful if you maintain your own uv builds or want to override
the default sources.</p>
<p>For example:</p>
<pre lang="json"><code>[
{
"version": "0.7.12-alpha.1",
"artifactName":
"uv-x86_64-unknown-linux-gnu.tar.gz",
"arch": "x86_64",
"platform": "unknown-linux-gnu",
"downloadUrl":
"https://release.pyx.dev/0.7.12-alpha.1/uv-x86_64-unknown-linux-gnu.tar.gz"
},
...
]
</code></pre>
<pre lang="yaml"><code>- name: Use a custom manifest file
uses: astral-sh/setup-uv@v6
with:
manifest-file: "https://example.com/my-custom-manifest.json"
</code></pre>
<blockquote>
<p>[!WARNING]</p>
</blockquote>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="445689ea25"><code>445689e</code></a>
Use latest version from manifest-file (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/458">#458</a>)</li>
<li><a
href="a02a550bdd"><code>a02a550</code></a>
Look for version-manifest.json relative to action path (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/456">#456</a>)</li>
<li><a
href="60cc2b4585"><code>60cc2b4</code></a>
Add input manifest-file (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/454">#454</a>)</li>
<li><a
href="7bbb36f434"><code>7bbb36f</code></a>
chore: update known versions for 0.7.13 and 0.7.12 (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/444">#444</a>)</li>
<li><a
href="60ecb381b4"><code>60ecb38</code></a>
Set expected cache dir drive to C: on windows (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/451">#451</a>)</li>
<li><a
href="252c995424"><code>252c995</code></a>
chore: update known versions for 0.7.11 (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/442">#442</a>)</li>
<li><a
href="477a814f2d"><code>477a814</code></a>
chore: update known versions for 0.7.10 (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/440">#440</a>)</li>
<li><a
href="9b19f8f4b1"><code>9b19f8f</code></a>
Add warning about shadowed uv binaries to
<code>activate-environment</code> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/439">#439</a>)</li>
<li><a
href="d44461ea9f"><code>d44461e</code></a>
chore: update known versions for 0.7.9 (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/437">#437</a>)</li>
<li><a
href="c19c1b1ffd"><code>c19c1b1</code></a>
Check that all jobs are in all-tests-passed.needs (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/432">#432</a>)</li>
<li>Additional commits viewable in <a
href="6b9c6063ab...445689ea25">compare
view</a></li>
</ul>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
CI tests have been failing with
.venv/lib/python3.12/site-packages/peft/auto.py:21: in <module>
from transformers import (
.venv/lib/python3.12/site-packages/transformers/__init__.py:27: in
<module>
from . import dependency_versions_check
.venv/lib/python3.12/site-packages/transformers/dependency_versions_check.py:57:
in <module>
require_version_core(deps[pkg])
.venv/lib/python3.12/site-packages/transformers/utils/versions.py:117:
in require_version_core
return require_version(requirement, hint)
.venv/lib/python3.12/site-packages/transformers/utils/versions.py:111:
in require_version
_compare_versions(op, got_ver, want_ver, requirement, pkg, hint)
.venv/lib/python3.12/site-packages/transformers/utils/versions.py:44: in
_compare_versions
raise ImportError(
E ImportError: huggingface-hub>=0.30.0,<1.0 is required for a normal
functioning of this module, but found huggingface-hub==0.29.0.
E Try: `pip install transformers -U` or `pip install -e '.[dev]'` if
you're working with git main
------------------------------ Captured log setup
------------------------------
INFO llama_stack.providers.remote.inference.ollama.ollama:ollama.py:106
checking connectivity to Ollama at `http://0.0.0.0:11434`.../
=========================== short test summary info
============================
ERROR
tests/integration/providers/test_providers.py::TestProviders::test_providers
- ImportError: huggingface-hub>=0.30.0,<1.0 is required for a normal
functioning of this module, but found huggingface-hub==0.29.0.
Try: `pip install transformers -U` or `pip install -e '.[dev]'` if
you're working with git main
=================== 1 skipped, 4 warnings, 1 error in 9.52s
====================
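For context, the failure is a dependency-floor mismatch: transformers enforces `huggingface-hub>=0.30.0,<1.0` at import time while the CI environment had 0.29.0 installed. A minimal sketch of that kind of version check, using `packaging` (illustrative only, not the project's actual fix):
```python
# Hedged sketch: reproduce the kind of check transformers performs at import
# time, using importlib.metadata and packaging (not llama-stack code).
from importlib.metadata import version
from packaging.specifiers import SpecifierSet

installed = version("huggingface-hub")
required = SpecifierSet(">=0.30.0,<1.0")

if installed not in required:
    raise ImportError(
        f"huggingface-hub{required} is required, but found huggingface-hub=={installed}"
    )
```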
## Test Plan
CI
# What does this PR do?
Probably related to the Python 3.11 upgrade:
^^^^
File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.11/site-packages/termcolor/termcolor.py",
line 147, in colored
text = fmt_str % (COLORS[color], text)
~~~~~~^^^^^^^
KeyError: 'light_blue'
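The installed termcolor evidently has no entry for `'light_blue'` in its color table, hence the `KeyError`. A hedged sketch of a defensive fallback (illustrative only, not necessarily the fix applied here):
```python
# Illustrative workaround only: fall back to uncolored text when the installed
# termcolor does not recognize the requested color name.
from termcolor import colored

def safe_colored(text: str, color: str) -> str:
    try:
        return colored(text, color)
    except KeyError:
        return text  # e.g. older termcolor releases have no 'light_blue'

print(safe_colored("hello", "light_blue"))
```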
## Test Plan
# What does this PR do?
Inference/Response stores now record user attributes when inserting and respect them when fetching.
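A toy in-memory analogue of that behavior, purely for illustration (the real implementation lives in the SQL-backed inference/response stores and uses different names):
```python
# Toy sketch: capture the caller's attributes at write time and filter on fetch.
from dataclasses import dataclass, field

@dataclass
class Record:
    id: str
    payload: dict
    owner_attributes: dict = field(default_factory=dict)

class AttributeScopedStore:
    """Illustrative in-memory analogue, not the real sqlstore."""

    def __init__(self) -> None:
        self._rows: list[Record] = []

    def insert(self, record: Record, user_attributes: dict) -> None:
        record.owner_attributes = user_attributes  # stored alongside the row
        self._rows.append(record)

    def fetch(self, user_attributes: dict) -> list[Record]:
        # Only return rows whose stored attributes match the caller's.
        return [
            r for r in self._rows
            if all(user_attributes.get(k) == v for k, v in r.owner_attributes.items())
        ]
```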
## Test Plan
pytest tests/unit/utils/test_sqlstore.py
# What does this PR do?
See inline comment.
fixes test
_
test_openai_vector_store_search_with_high_score_filter[llama_stack_client-meta-llama/Llama-3.3-70B-Instruct-meta-llama/Llama-4-Scout-17B-16E-Instruct-all-MiniLM-L6-v2-None-None]
_
llama-stack/llama_stack/distribution/library_client.py:98: in
convert_to_pydantic
return TypeAdapter(annotation).validate_python(value)
.venv/lib/python3.10/site-packages/pydantic/type_adapter.py:421: in
validate_python
return self.validator.validate_python(
E pydantic_core._pydantic_core.ValidationError: 1 validation error for
nullable[SearchRankingOptions]
E score_threshold
E Input should be less than or equal to 1 [type=less_than_equal,
input_value=1.3458905661753127, input_type=float]
E For further information visit
https://errors.pydantic.dev/2.11/v/less_than_equal
The above exception was the direct cause of the following exception:
llama-stack/tests/integration/vector_io/test_openai_vector_stores.py:376:
in test_openai_vector_store_search_with_high_score_filter
search_response = compat_client.vector_stores.search(
.venv/lib/python3.10/site-packages/llama_stack_client/resources/vector_stores/vector_stores.py:356:
in search
return self._post(
.venv/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1232:
in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream,
stream_cls=stream_cls))
llama-stack/llama_stack/distribution/library_client.py:177: in request
result = loop.run_until_complete(self.async_client.request(*args,
**kwargs))
/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/asyncio/base_events.py:649:
in run_until_complete
return future.result()
llama-stack/llama_stack/distribution/library_client.py:292: in request
response = await self._call_non_streaming(
llama-stack/llama_stack/distribution/library_client.py:313: in
_call_non_streaming
body = self._convert_body(path, options.method, body)
llama-stack/llama_stack/distribution/library_client.py:425: in
_convert_body
converted_body[param_name] = convert_to_pydantic(param.annotation,
value)
llama-stack/llama_stack/distribution/library_client.py:112: in
convert_to_pydantic
raise ValueError(f"Failed to convert parameter {value} into
{annotation}: {e}") from e
E ValueError: Failed to convert parameter {'score_threshold':
1.3458905661753127} into
llama_stack.apis.vector_io.vector_io.SearchRankingOptions | None: 1
validation error for nullable[SearchRankingOptions]
E score_threshold
E Input should be less than or equal to 1 [type=less_than_equal,
input_value=1.3458905661753127, input_type=float]
E For further information visit
https://errors.pydantic.dev/2.11/v/less_than_equal
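The root cause is that `score_threshold` is bounded above by 1 while the test derived a threshold from raw scores that can exceed 1. A minimal sketch of the constraint that produces the error above (only the upper bound from the error message is modeled; the real `SearchRankingOptions` has more to it):
```python
# Minimal sketch of the validation that fails above.
from pydantic import BaseModel, Field, ValidationError

class RankingOptions(BaseModel):
    score_threshold: float = Field(default=0.0, le=1.0)

RankingOptions(score_threshold=0.9)  # fine

try:
    RankingOptions(score_threshold=1.3458905661753127)
except ValidationError as e:
    print(e)  # "Input should be less than or equal to 1"
```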
## Test Plan
feat: Add Gemini 2.0 and 2.5 models
This commit expands the set of known Gemini models by introducing:
- `gemini/gemini-2.0-flash`
- `gemini/gemini-2.5-flash`
- `gemini/gemini-2.5-pro`
These new models are added to `LLM_MODEL_IDS` for broader compatibility
and updated in `run.yaml` to allow for their immediate use in starter
configurations.
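Roughly, the change amounts to appending the new identifiers to the provider's known-model list; the sketch below shows only the entries named in this PR, with everything else elided:
```python
# Sketch only: the three identifiers introduced by this change, as they would
# appear among the provider's known model IDs (pre-existing entries omitted).
NEW_GEMINI_MODEL_IDS = [
    "gemini/gemini-2.0-flash",
    "gemini/gemini-2.5-flash",
    "gemini/gemini-2.5-pro",
]

LLM_MODEL_IDS = [
    # ...pre-existing Gemini entries...
    *NEW_GEMINI_MODEL_IDS,
]
```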
Signed-off-by: Eran Cohen <eranco@redhat.com>
# What does this PR do?
This adds the ability to list, retrieve, update, and delete Vector Store
Files. It implements these new APIs for the faiss and sqlite-vec
providers, since those are the two that also have the rest of the vector
store files implementation.
Closes #2445
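A hedged usage sketch against the OpenAI-compatible surface these endpoints expose (the base URL mirrors the test plan below; IDs are placeholders, and exact client method shapes may vary by SDK version):
```python
# Illustrative only: list/retrieve/update/delete vector store files through an
# OpenAI-compatible client pointed at a local Llama Stack server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

files = client.vector_stores.files.list(vector_store_id="vs_123")
first = client.vector_stores.files.retrieve(vector_store_id="vs_123", file_id="file_abc")
client.vector_stores.files.update(
    vector_store_id="vs_123",
    file_id="file_abc",
    attributes={"topic": "demo"},
)
client.vector_stores.files.delete(vector_store_id="vs_123", file_id="file_abc")
```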
## Test Plan
### test_openai_vector_stores Integration Tests
There are a number of new integration tests added, which I ran for each
provider as outlined below.
faiss (from ollama distro):
```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack run llama_stack/templates/ollama/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
pytest -sv tests/integration/vector_io/test_openai_vector_stores.py \
--embedding-model=all-MiniLM-L6-v2
```
sqlite-vec (from starter distro):
```
llama stack run llama_stack/templates/starter/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
pytest -sv tests/integration/vector_io/test_openai_vector_stores.py \
--embedding-model=all-MiniLM-L6-v2
```
### file_search verification tests
I also ensured the file_search verification tests continue to work, both
for faiss and sqlite-vec.
faiss (ollama distro):
```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack run llama_stack/templates/ollama/run.yaml
pytest -sv tests/verifications/openai_api/test_responses.py \
-k'file_search' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model=meta-llama/Llama-3.2-3B-Instruct
```
sqlite-vec (starter distro):
```
llama stack run llama_stack/templates/starter/run.yaml
pytest -sv tests/verifications/openai_api/test_responses.py \
-k'file_search' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model=together/meta-llama/Llama-3.2-3B-Instruct-Turbo
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Do not force 384 for the embedding dimension, use the one provided by
the test run.
## Test Plan
```
pytest -s -vvv tests/integration/vector_io/test_vector_io.py --stack-config=http://localhost:8321 \
-k "not(builtin_tool or safety_with_image or code_interpreter or test_rag)" \
--text-model="meta-llama/Llama-3.2-3B-Instruct" \
--embedding-model=granite-embedding-125m --embedding-dimension=768
Uninstalled 1 package in 16ms
Installed 1 package in 11ms
INFO 2025-06-18 10:52:03,314 tests.integration.conftest:59 tests: Setting DISABLE_CODE_SANDBOX=1 for macOS
/Users/leseb/Documents/AI/llama-stack/.venv/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
================================================= test session starts =================================================
platform darwin -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /Users/leseb/Documents/AI/llama-stack/.venv/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.16', 'Platform': 'macOS-15.5-arm64-arm-64bit', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'cov': '6.0.0', 'html': '4.1.1', 'json-report': '1.5.0', 'timeout': '2.4.0', 'metadata': '3.1.1', 'asyncio': '0.25.3', 'anyio': '4.8.0', 'nbval': '0.11.0'}}
rootdir: /Users/leseb/Documents/AI/llama-stack
configfile: pyproject.toml
plugins: cov-6.0.0, html-4.1.1, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, asyncio-0.25.3, anyio-4.8.0, nbval-0.11.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 8 items
tests/integration/vector_io/test_vector_io.py::test_vector_db_retrieve[emb=granite-embedding-125m:dim=768] PASSED
tests/integration/vector_io/test_vector_io.py::test_vector_db_register[emb=granite-embedding-125m:dim=768] PASSED
tests/integration/vector_io/test_vector_io.py::test_insert_chunks[emb=granite-embedding-125m:dim=768-test_case0] PASSED
tests/integration/vector_io/test_vector_io.py::test_insert_chunks[emb=granite-embedding-125m:dim=768-test_case1] PASSED
tests/integration/vector_io/test_vector_io.py::test_insert_chunks[emb=granite-embedding-125m:dim=768-test_case2] PASSED
tests/integration/vector_io/test_vector_io.py::test_insert_chunks[emb=granite-embedding-125m:dim=768-test_case3] PASSED
tests/integration/vector_io/test_vector_io.py::test_insert_chunks[emb=granite-embedding-125m:dim=768-test_case4] PASSED
tests/integration/vector_io/test_vector_io.py::test_insert_chunks_with_precomputed_embeddings[emb=granite-embedding-125m:dim=768] PASSED
================================================== 8 passed in 5.50s ==================================================
```
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Dropped Python 3.10 support, updated pyproject and dependencies, and removed some blocks of code with special handling for enum.StrEnum.
Closes #2458
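For illustration, this is the kind of simplification the 3.11+ floor allows: using `enum.StrEnum` directly instead of the old `str`-plus-`Enum` compatibility pattern (the enum below is made up):
```python
# Illustrative only (made-up enum): with Python >= 3.11 as the minimum,
# enum.StrEnum can replace the old `class Color(str, Enum)` shims.
from enum import StrEnum

class Color(StrEnum):
    RED = "red"
    BLUE = "blue"

assert Color.RED == "red"         # members compare equal to their string values
assert f"{Color.BLUE}" == "blue"  # and render as plain strings
```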
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Move to use vector_stores.search for file search tool in Responses,
which supports filters.
Closes #2435
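A hedged sketch of what this enables from the client side: a `file_search` tool call whose `filters` get pushed down to `vector_stores.search` (store ID and the filtered attribute are placeholders):
```python
# Illustrative only: Responses API call using file_search with a filter,
# against an OpenAI-compatible Llama Stack endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

response = client.responses.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    input="What does the handbook say about PTO?",
    tools=[{
        "type": "file_search",
        "vector_store_ids": ["vs_123"],
        "filters": {"type": "eq", "key": "region", "value": "us"},
    }],
)
print(response.output_text)
```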
## Test Plan
Added e2e test with filters.
myenv ❯ llama stack run llama_stack/templates/fireworks/run.yaml
pytest -sv tests/verifications/openai_api/test_responses.py \
-k 'file_search and filters' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model=meta-llama/Llama-3.3-70B-Instruct
Sadly, I won't have capacity to continue working for the project.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
## Summary:
This commit adds infinite scroll pagination to the chat completions and
responses tables.
## Test Plan:
1. Run unit tests: npm run test
2. Manual testing: Navigate to chat
completions/responses pages
3. Verify infinite scroll triggers when approaching
bottom
4. Added playwright tests: npm run test:e2e
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
To add health check for faiss inline vector_io provider.
I tried adding `async def health(self) -> HealthResponse:` like in the inference provider, but it didn't work for the `inline->vector_io->faiss` provider. Via debug logs, I found the critical issue: health responses are being stored with the API name as the key, not as a nested dictionary keyed by provider ID. This means that all providers of the same API type (e.g., "vector_io") share the same health response, and only the last one processed is visible in the API response.
I've created a patch file that fixes this issue by:
- Storing the original get_providers_health method
- Creating a patched version that correctly maps health responses to
providers
- Applying the patch to the `ProviderImpl` class
I'm not an expert here, so please let me know if there is another workaround that would let me update the health status directly from `faiss.py`.
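To make the shape of the bug concrete, here is a toy sketch of the two mappings (not the actual `ProviderImpl` code): keying by API name alone lets the last provider overwrite its siblings, while nesting by provider ID keeps one entry per provider.
```python
# Toy illustration of the mapping issue described above (not real llama-stack code).

# Keyed by API name only: every vector_io provider writes to the same slot,
# so only the last provider processed is visible.
health_by_api = {
    "vector_io": {"status": "OK"},
}

# Keyed by API name *and* provider ID: each provider keeps its own entry.
health_by_provider = {
    "vector_io": {
        "faiss": {"status": "OK"},
        "sqlite-vec": {"status": "OK"},
    },
}
```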
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Added unit tests to test the provider patch implementation in the PR.
Adding a screenshot with the FAISS inline vector_io health status as
"OK"

| API Conformance Tests | [conformance.yml](conformance.yml) | Run the API Conformance test suite on the changes. |
| Installer CI | [install-script-ci.yml](install-script-ci.yml) | Test the installation script |
| Integration Auth Tests | [integration-auth-tests.yml](integration-auth-tests.yml) | Run the integration test suite with Kubernetes authentication |
| SqlStore Integration Tests | [integration-sql-store-tests.yml](integration-sql-store-tests.yml) | Run the integration test suite with SqlStore |
| Integration Tests (Replay) | [integration-tests.yml](integration-tests.yml) | Run the integration test suites from tests/integration in replay mode |
| Vector IO Integration Tests | [integration-vector-io-tests.yml](integration-vector-io-tests.yml) | Run the integration test suite with various VectorIO providers |
| Pre-commit | [pre-commit.yml](pre-commit.yml) | Run pre-commit checks |
| Test Llama Stack Build | [providers-build.yml](providers-build.yml) | Test llama stack build |
| Test llama stack list-deps | [providers-list-deps.yml](providers-list-deps.yml) | Test llama stack list-deps |
| Python Package Build Test | [python-build-test.yml](python-build-test.yml) | Test building the llama-stack PyPI project |
| Integration Tests (Record) | [record-integration-tests.yml](record-integration-tests.yml) | Run the integration test suite from tests/integration |
| Check semantic PR titles | [semantic-pr.yml](semantic-pr.yml) | Ensure that PR titles follow the conventional commit spec |
| Close stale issues and PRs | [stale_bot.yml](stale_bot.yml) | Run the Stale Bot action |
| Test External Providers Installed via Module | [test-external-provider-module.yml](test-external-provider-module.yml) | Test External Provider installation via Python module |
| Test External API and Providers | [test-external.yml](test-external.yml) | Test the External API and Provider mechanisms |
| UI Tests | [ui-unit-tests.yml](ui-unit-tests.yml) | Run the UI test suite |
| Unit Tests | [unit-tests.yml](unit-tests.yml) | Run the unit test suite |
echo "::error::Full mypy failed. Reproduce locally with 'uv run pre-commit run mypy-full --hook-stage manual --all-files'."
fi
exit $status
- name: Check if any unused recordings
  run: |
set -e
PYTHONPATH=$PWD uv run ./scripts/cleanup_recordings.py --delete
changes=$(git status --short tests/integration | grep 'recordings' || true)
if [ -n "$changes" ]; then
echo "::error::Unused integration recordings detected. Run 'PYTHONPATH=$(pwd) uv run ./scripts/cleanup_recordings.py --delete' locally and commit the deletions."
echo "::error file=$file,line=$line_num::Do not use 'import logging' or 'from logging import' in $file. Use the custom log instead: from llama_stack.log import get_logger; logger = get_logger(). If direct logging is truly needed, add:# allow-direct-logging"
done <<< "$matches"
exit 1
fi
exit 0
- id: fips-compliance
  name: Ensure llama-stack remains FIPS compliant
  entry: bash
  language: system
  types: [python]
  pass_filenames: true
  exclude: '^tests/.*$' # Exclude test dir as some safety tests used MD5
Here are some key changes that are coming as part of this release.
### Build and Environment
- Environment improvements: fixed env var replacement to preserve types.
- Docker stability: fixed container startup failures for Fireworks AI provider.
- Removed absolute paths in build for better portability.
### Features
- UI Enhancements: Implemented file upload and VectorDB creation/configuration directly in UI.
- Vector Store Improvements: Added keyword, vector, and hybrid search inside vector store.
- Added S3 authorization support for file providers.
- SQL Store: Added inequality support to where clause.
### Documentation
- Fixed post-training docs.
- Added Contributor Guidelines for creating Internal vs. External providers.
### Fixes
- Removed unsupported bfcl scoring function.
- Multiple reliability and configuration fixes for providers and environment handling.
### Engineering / Chores
- Cleaner internal development setup with consistent paths.
- Incremental improvements to provider integration and vector store behavior.
### New Contributors
- @omertuc made their first contribution in #3270
- @r3v5 made their first contribution in vector store hybrid search
---
# v0.2.19
Published on: 2025-08-26T22:06:55Z
## Highlights
* feat: Add CORS configuration support for server by @skamenan7 in https://github.com/llamastack/llama-stack/pull/3201
* feat(api): introduce /rerank by @ehhuang in https://github.com/llamastack/llama-stack/pull/2940
* feat: Add S3 Files Provider by @mattf in https://github.com/llamastack/llama-stack/pull/3202
---
# v0.2.18
Published on: 2025-08-20T01:09:27Z
## Highlights
* Add moderations create API
* Hybrid search in Milvus
* Numerous Responses API improvements
* Documentation updates
---
# v0.2.17
Published on: 2025-08-05T01:51:14Z
## Highlights
* feat(tests): introduce inference record/replay to increase test reliability by @ashwinb in https://github.com/meta-llama/llama-stack/pull/2941
* fix(library_client): improve initialization error handling and prevent AttributeError by @mattf in https://github.com/meta-llama/llama-stack/pull/2944
* fix: use OLLAMA_URL to activate Ollama provider in starter by @ashwinb in https://github.com/meta-llama/llama-stack/pull/2963
* feat(UI): adding MVP playground UI by @franciscojavierarceo in https://github.com/meta-llama/llama-stack/pull/2828
* Standardization of errors (@nathan-weinberg)
* feat: Enable DPO training with HuggingFace inline provider by @Nehanth in https://github.com/meta-llama/llama-stack/pull/2825
* chore: rename templates to distributions by @ashwinb in https://github.com/meta-llama/llama-stack/pull/3035
---
# v0.2.16
Published on: 2025-07-28T23:35:23Z
## Highlights
* Automatic model registration for self-hosted providers (ollama and vllm currently). No need for `INFERENCE_MODEL` environment variables which need to be updated, etc.
* Much simplified starter distribution. Most `ENABLE_` env variables are now gone. When you set `VLLM_URL`, the `vllm` provider is auto-enabled. Similar for `MILVUS_URL`, `PGVECTOR_DB`, etc. Check the [config.yaml](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/templates/starter/config.yaml) for more details.
* All tests migrated to pytest now (thanks @Elbehery)
* DPO implementation in the post-training provider (thanks @Nehanth)
* (Huge!) Support for external APIs and providers thereof (thanks @leseb, @cdoern and others). This is a really big deal -- you can now add more APIs completely out of tree and experiment with them before (optionally) wanting to contribute back.
* `inline::vllm` provider is gone thank you very much
* several improvements to OpenAI inference implementations and LiteLLM backend (thanks @mattf)
* Chroma now supports Vector Store API (thanks @franciscojavierarceo).
* Authorization improvements: Vector Store/File APIs now support access control (thanks @franciscojavierarceo); Telemetry read APIs are gated according to logged-in user's roles.
---
# v0.2.15
Published on: 2025-07-16T03:30:01Z
---
# v0.2.14
Published on: 2025-07-04T16:06:48Z
## Highlights
* Support for Llama Guard 4
* Added Milvus support to vector-stores API
* Documentation and zero-to-hero updates for latest APIs
---
# v0.2.13
Published on: 2025-06-28T04:28:11Z
## Highlights
* search_mode support in OpenAI vector store API
* Security fixes
---
# v0.2.12
Published on: 2025-06-20T22:52:12Z
## Highlights
* Filter support in file search
* Support auth attributes in inference and response stores
---
# v0.2.11
Published on: 2025-06-17T20:26:26Z
## Highlights
* OpenAI-compatible vector store APIs
* Hybrid Search in Sqlite-vec
* File search tool in Responses API
* Pagination in inference and response stores
* Added `suffix` to completions API for fill-in-the-middle tasks
---
# v0.2.10.1
Published on: 2025-06-06T20:11:02Z
Llama Stack was created to provide developers with a comprehensive and coherent interface that simplifies AI application development and codifies best practices across the Llama ecosystem. Since our launch in September 2024, we have seen a huge uptick in interest in Llama Stack APIs by both AI developers and from partners building AI services with Llama models. Partners like Nvidia, Fireworks, and Ollama have collaborated with us to develop implementations across various APIs, including inference, memory, and safety.
With Llama Stack, you can easily build a RAG agent which can also search the web, do complex math, and custom tool calling. You can use telemetry to inspect those traces, and convert telemetry into evals datasets. And with Llama Stack’s plugin architecture and prepackage distributions, you choose to run your agent anywhere - in the cloud with our partners, deploy your own environment using virtualenv or Docker, operate locally with Ollama, or even run on mobile devices with our SDKs. Llama Stack offers unprecedented flexibility while also simplifying the developer experience.
## Release
After iterating on the APIs for the last 3 months, today we’re launching a stable release (V1) of the Llama Stack APIs and the corresponding llama-stack server and client packages (v0.1.0). We now have automated tests for providers. These tests make sure that all provider implementations are verified. Developers can now easily and reliably select distributions or providers based on their specific requirements.
---
# v0.0.62
Published on: 2024-12-18T02:39:43Z
---
# v0.0.61
Published on: 2024-12-10T20:50:33Z
---
# v0.0.55
Published on: 2024-11-23T17:14:07Z
---
# v0.0.54
Published on: 2024-11-22T00:36:09Z
---
# v0.0.53
Published on: 2024-11-20T22:18:00Z
🚀 Initial Release Notes for Llama Stack!
### Added
- Resource-oriented design for models, shields, memory banks, datasets and eval tasks
- Persistence for registered objects with distribution
- Ability to persist memory banks created for FAISS
- PostgreSQL KVStore implementation
- Environment variable placeholder support in run.yaml files
- Comprehensive Zero-to-Hero notebooks and quickstart guides
- Support for quantized models in Ollama
- Vision models support for Together, Fireworks, Meta-Reference, and Ollama, and vLLM
- Bedrock distribution with safety shields support
- Evals API with task registration and scoring functions
- MMLU and SimpleQA benchmark scoring functions
- Huggingface dataset provider integration for benchmarks
- Support for custom dataset registration from local paths
- Benchmark evaluation CLI tools with visualization tables
- RAG evaluation scoring functions and metrics
- Local persistence for datasets and eval tasks
### Changed
- Split safety into distinct providers (llama-guard, prompt-guard, code-scanner)
We want to make contributing to this project as easy and transparent as
possible.
## Set up your development environment
We use [uv](https://github.com/astral-sh/uv) to manage python dependencies and virtual environments.
You can install `uv` by following this [guide](https://docs.astral.sh/uv/getting-started/installation/).
You can install the dependencies by running:
```bash
cd llama-stack
uv venv --python 3.12
uv sync --group dev
uv pip install -e .
source .venv/bin/activate
```
```{note}
If you are making changes to Llama Stack, it is essential that you use Python 3.12 as shown above.
Llama Stack can work with Python 3.13 but the pre-commit hooks used to validate code changes only work with Python 3.12.
If you don't specify a Python version, `uv` will automatically select a Python version according to the `requires-python`
section of the `pyproject.toml`, which is fine for running Llama Stack but not for committing changes.
For more info, see the [uv docs around Python versions](https://docs.astral.sh/uv/concepts/python-versions/).
```
Note that you can create a dotenv file `.env` that includes necessary environment variables:
```
LLAMA_STACK_BASE_URL=http://localhost:8321
LLAMA_STACK_CLIENT_LOG=debug
LLAMA_STACK_PORT=8321
LLAMA_STACK_CONFIG=<provider-name>
TAVILY_SEARCH_API_KEY=
BRAVE_SEARCH_API_KEY=
```
And then use this dotenv file when running client SDK tests via the following:
```bash
uv run --env-file .env -- pytest -v tests/integration/inference/test_text_inference.py --text-model=meta-llama/Llama-3.1-8B-Instruct
```
### Pre-commit Hooks
We use [pre-commit](https://pre-commit.com/) to run linting and formatting checks on your code. You can install the pre-commit hooks by running:
```bash
uv pip install 'pre-commit>=4.4.0'
uv run pre-commit install
```
Note that the Llama Stack continuous integration expects the specific pre-commit version pinned above, so it is essential that you install that version as shown. Once you have run these commands, pre-commit hooks will run automatically before each commit.
Alternatively, if you don't want to install the pre-commit hooks (or if you want to check if your changes are ready before committing),
you can run the checks manually by running:
```bash
uv run pre-commit run --all-files -v
```
The `-v` (verbose) parameter is optional but often helpful for getting more information about any issues that the pre-commit checks identify.
To run the expanded mypy configuration that CI enforces, use:
```bash
uv run pre-commit run mypy-full --hook-stage manual --all-files
```
or invoke mypy directly with all optional dependencies:
```bash
uv run --group dev --group type_checking mypy
```
```{caution}
Before pushing your changes, make sure that the pre-commit hooks have passed successfully.
```
## Discussions -> Issues -> Pull Requests
We actively welcome your pull requests. However, please read the following. This is heavily inspired by [Ghostty](https://github.com/ghostty-org/ghostty/blob/main/CONTRIBUTING.md).
If in doubt, please open a [discussion](https://github.com/llamastack/llama-stack/discussions); we can always convert that to an issue later.
### Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.
Meta has a [bounty program](http://facebook.com/whitehat/info) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.
### Contributor License Agreement ("CLA")
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Meta's open source projects.
Complete your CLA here: <https://code.facebook.com/cla>
**I'd like to contribute!**
All issues are actionable (please report if they are not.) Pick one and start working on it. Thank you.
If you need help or guidance, comment on the issue. Issues that are extra friendly to new contributors are tagged with "contributor friendly".
If you are new to the project, start by looking at the issues tagged with "good first issue". If you're interested
leave a comment on the issue and a triager will assign it to you.
Please avoid picking up too many issues at once. This helps you stay focused and ensures that others in the community also have opportunities to contribute.
- Try to work on only 1–2 issues at a time, especially if you’re still getting familiar with the codebase.
- Before taking an issue, check if it’s already assigned or being actively discussed.
- If you’re blocked or can’t continue with an issue, feel free to unassign yourself or leave a comment so others can step in.
**I have a bug!**
4. Make sure your code lints using `pre-commit`.
5. If you haven't already, complete the Contributor License Agreement ("CLA").
6. Ensure your pull request follows the [conventional commits format](https://www.conventionalcommits.org/en/v1.0.0/).
7. Ensure your pull request follows the [coding style](#coding-style).
Please keep pull requests (PRs) small and focused. If you have a large set of changes, consider splitting them into logically grouped, smaller PRs to facilitate review and testing.
```{tip}
As a general guideline:
- Experienced contributors should try to keep no more than 5 open PRs at a time.
- New contributors are encouraged to have only one open PR at a time until they’re familiar with the codebase and process.
```
## Repository guidelines
## Running tests
You can find the Llama Stack testing documentation here [here](tests/README.md).
## Adding a new dependency to the project
To add a new dependency to the project, you can use the `uv` command. For example, to add `foo` to the project, you can run:
```bash
uv add foo
uv sync
```
### Coding Style
* Comments should provide meaningful insights into the code. Avoid filler comments that simply
describe the next step, as they create unnecessary clutter, same goes for docstrings.
justification for bypassing the check.
* Don't use unicode characters in the codebase. ASCII-only is preferred for compatibility or
readability reasons.
* Provider configuration classes should be Pydantic models whose fields use `Field` with a `description`; these descriptions are used to generate the provider documentation.
* When possible, use keyword arguments only when calling functions.
* Llama Stack utilizes [custom Exception classes](llama_stack/apis/common/errors.py) for certain Resources that should be used where applicable.
### License
By contributing to Llama, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.
## Common Tasks
Some tips about common tasks you work on while contributing to Llama Stack:
### Installing dependencies of distributions
When installing dependencies for a distribution, you can use `llama stack list-deps` to view and install the required packages.
If you have made changes to a provider's configuration in any form (introducing a new config key, or changing models, etc.), you should run `./scripts/distro_codegen.py` to re-generate various YAML files as well as the documentation. You should not change `docs/source/.../distributions/` files manually as they are auto-generated.
### Updating the provider documentation
If you have made changes to a provider's configuration, you should run `./scripts/provider_codegen.py`
to re-generate the documentation. You should not change `docs/source/.../providers/` files manually
as they are auto-generated.
Note that the provider "description" field will be used to generate the provider documentation.
### Building the Documentation
If you are making changes to the documentation at [https://llamastack.github.io/](https://llamastack.github.io/), you can use the following command to build the documentation and preview your changes.
```bash
# This rebuilds the documentation pages and the OpenAPI spec.
cd docs/
npm install
npm run gen-api-docs all
npm run build

# This will start a local server (usually at http://127.0.0.1:3000).
npm run serve
```
We released [Version 0.2.0](https://github.com/meta-llama/llama-stack/releases/tag/v0.2.0) with support for the Llama 4 herd of models released by Meta.
<details>
<summary>👋 Click here to see how to run Llama 4 models on Llama Stack </summary>
*Note you need 8xH100 GPU-host to run these models*
```bash
pip install -U llama_stack
MODEL="Llama-4-Scout-17B-16E-Instruct"
# get meta url from llama.com
llama model download --source meta --model-id $MODEL --meta-url <META_URL>
Llama Stack defines and standardizes the core building blocks that simplify AI application development. It provides a unified set of APIs with implementations from leading service providers. More specifically, it provides:
- **Unified API layer** for Inference, RAG, Agents, Tools, Safety, and Evals.
- **Plugin architecture** to support the rich ecosystem of different API implementations in various environments, including local development, on-premises, cloud, and mobile.
- **Prepackaged verified distributions** which offer a one-stop solution for developers to get started quickly and reliably in any environment.
- **Multiple developer interfaces** like CLI and SDKs for Python, Typescript, iOS, and Android.
#### Llama Stack Benefits
- **Flexibility**: Developers can choose their preferred infrastructure without changing APIs and enjoy flexible deployment choices.
- **Consistent Experience**: With its unified APIs, Llama Stack makes it easier to build, test, and deploy AI applications with consistent application behavior.
- **Robust Ecosystem**: Llama Stack is integrated with distribution partners (cloud providers, hardware vendors, and AI-focused companies) that offer tailored infrastructure, software, and services for deploying Llama models.

By reducing friction and complexity, Llama Stack empowers developers to focus on what they do best: building transformative generative AI applications.

For more information, see the [Benefits of Llama Stack](https://llamastack.github.io/docs/latest/concepts/architecture#benefits-of-llama-stack) documentation.
### API Providers
Here is a list of the various API providers and available distributions that can help developers get started easily with Llama Stack.
Please check out the [full list](https://llamastack.github.io/docs/providers).
> **Note**: Additional providers are available through external packages. See [External Providers](https://llamastack.github.io/docs/providers/external) documentation.
### Distributions
A Llama Stack Distribution (or "distro") is a pre-configured bundle of provider implementations for each API component. Distributions make it easy to get started with a specific deployment scenario. For example, you can begin with a local setup of Ollama and seamlessly transition to production, with Fireworks, without changing your application code.
Here are some of the distributions we support:
| **Distribution** | **Llama Stack Docker** | Start This Distribution |
For full documentation on the Llama Stack distributions see the [Distributions Overview](https://llamastack.github.io/docs/distributions) page.
### Documentation
Please check out our [Documentation](https://llamastack.github.io/docs) page for more details.
* CLI references
* [llama (server-side) CLI Reference](https://llamastack.github.io/docs/references/llama_cli_reference): Guide for using the `llama` CLI to work with Llama models (download, study prompts), and building/starting a Llama Stack distribution.
* [llama (client-side) CLI Reference](https://llamastack.github.io/docs/references/llama_stack_client_cli_reference): Guide for using the `llama-stack-client` CLI, which allows you to query information about the distribution.
* Getting Started
* [Quick guide to start a Llama Stack server](https://llamastack.github.io/docs/getting_started/quickstart).
* [Jupyter notebook](./docs/getting_started.ipynb) to walk-through how to use simple text and vision inference llama_stack_client APIs
* The complete Llama Stack lesson [Colab notebook](https://colab.research.google.com/drive/1dtVmxotBsI4cGZQNsJRYPrLiDeT0Wnwt) of the new [Llama 3.2 course on Deeplearning.ai](https://learn.deeplearning.ai/courses/introducing-multimodal-llama-3-2/lesson/8/llama-stack).
* A [Zero-to-Hero Guide](https://github.com/meta-llama/llama-stack/tree/main/docs/zero_to_hero_guide) that guides you through all the key components of Llama Stack with code samples.
* [Contributing](CONTRIBUTING.md)
* [Adding a new API Provider](https://llamastack.github.io/docs/contributing/new_api_provider) to walk through how to add a new API provider.
### Llama Stack Client SDKs
Check out our client SDKs for connecting to a Llama Stack server in your preferred language; you can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [typescript](https://github.com/meta-llama/llama-stack-client-typescript), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) to quickly build your applications.
You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo.
## Star History
[](https://www.star-history.com/#meta-llama/llama-stack&Date)
Performance benchmarking is critical for understanding the overhead and characteristics of the Llama Stack abstraction layer compared to direct inference engines like vLLM.
### Why This Benchmark Suite Exists
**Performance Validation**: The Llama Stack provides a unified API layer across multiple inference providers, but this abstraction introduces potential overhead. This benchmark suite quantifies the performance impact by comparing:
- Llama Stack inference (with vLLM backend)
- Direct vLLM inference calls
- Both under identical Kubernetes deployment conditions
**Production Readiness Assessment**: Real-world deployments require understanding performance characteristics under load. This suite simulates concurrent user scenarios with configurable parameters (duration, concurrency, request patterns) to validate production readiness.
**Regression Detection (TODO)**: As the Llama Stack evolves, this benchmark provides automated regression detection for performance changes. CI/CD pipelines can leverage these benchmarks to catch performance degradations before production deployments.
**Resource Planning**: By measuring throughput, latency percentiles, and resource utilization patterns, teams can make informed decisions about:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 144.3 MB/s eta 0:00:00
Installing collected packages: uv
Successfully installed uv-0.8.19
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip is available: 24.0 -> 25.2
[notice] To update, run: pip install --upgrade pip
Using Python 3.11.13 environment at: /usr/local
Resolved 61 packages in 551ms
Downloading pillow (6.3MiB)
Downloading hf-xet (3.0MiB)
Downloading tokenizers (3.1MiB)
Downloading pygments (1.2MiB)
Downloading pandas (11.8MiB)
Downloading aiohttp (1.7MiB)
Downloading pydantic-core (1.9MiB)
Downloading numpy (16.2MiB)
Downloading transformers (11.1MiB)
Downloading pyarrow (40.8MiB)
Downloading pydantic-core
Downloading aiohttp
Downloading tokenizers
Downloading hf-xet
Downloading pygments
Downloading pillow
Downloading numpy
Downloading pandas
Downloading transformers
Downloading pyarrow
Prepared 61 packages in 1.23s
Installed 61 packages in 114ms
+ aiohappyeyeballs==2.6.1
+ aiohttp==3.12.15
+ aiosignal==1.4.0
+ annotated-types==0.7.0
+ anyio==4.10.0
+ attrs==25.3.0
+ certifi==2025.8.3
+ charset-normalizer==3.4.3
+ click==8.1.8
+ datasets==4.1.1
+ dill==0.4.0
+ filelock==3.19.1
+ frozenlist==1.7.0
+ fsspec==2025.9.0
+ ftfy==6.3.1
+ guidellm==0.3.0
+ h11==0.16.0
+ h2==4.3.0
+ hf-xet==1.1.10
+ hpack==4.1.0
+ httpcore==1.0.9
+ httpx==0.28.1
+ huggingface-hub==0.35.0
+ hyperframe==6.1.0
+ idna==3.10
+ loguru==0.7.3
+ markdown-it-py==4.0.0
+ mdurl==0.1.2
+ multidict==6.6.4
+ multiprocess==0.70.16
+ numpy==2.3.3
+ packaging==25.0
+ pandas==2.3.2
+ pillow==11.3.0
+ propcache==0.3.2
+ protobuf==6.32.1
+ pyarrow==21.0.0
+ pydantic==2.11.9
+ pydantic-core==2.33.2
+ pydantic-settings==2.10.1
+ pygments==2.19.2
+ python-dateutil==2.9.0.post0
+ python-dotenv==1.1.1
+ pytz==2025.2
+ pyyaml==6.0.2
+ regex==2025.9.18
+ requests==2.32.5
+ rich==14.1.0
+ safetensors==0.6.2
+ six==1.17.0
+ sniffio==1.3.1
+ tokenizers==0.22.1
+ tqdm==4.67.1
+ transformers==4.56.2
+ typing-extensions==4.15.0
+ typing-inspection==0.4.1
+ tzdata==2025.2
+ urllib3==2.5.0
+ wcwidth==0.2.14
+ xxhash==3.5.0
+ yarl==1.20.1
Using Python 3.11.13 environment at: /usr/local
Audited 1 package in 3ms
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
Creating backend...
Backend openai_http connected to http://llama-stack-benchmark-service:8323/v1/openai for model meta-llama/Llama-3.2-3B-Instruct.
Creating request loader...
Created loader with 1000 unique requests from prompt_tokens=512,output_tokens=256.
Using Python 3.11.13 environment at: /usr/local
Audited 1 package in 3ms
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
Creating backend...
Backend openai_http connected to http://vllm-server:8000 for model meta-llama/Llama-3.2-3B-Instruct.
Creating request loader...
Created loader with 1000 unique requests from prompt_tokens=512,output_tokens=256.
Here's a collection of comprehensive guides, examples, and resources for building AI applications with Llama Stack. For the complete documentation, visit our [Github page](https://llamastack.github.io/getting_started/quickstart).
## Render locally
From the llama-stack `docs/` directory, run the following commands to render the docs locally:
```bash
npm install
npm run gen-api-docs all
npm run build
npm run serve
```
You can then open the docs in your browser at http://localhost:3000.
## File Import System
This documentation uses `remark-code-import` to import files directly from the repository, eliminating copy-paste maintenance. Files are automatically embedded during build time.
### Importing Code Files
To import Python code (or any code files) with syntax highlighting, use this syntax in `.mdx` files:
The Llama Stack Evaluation flow allows you to run evaluations on your GenAI application datasets or pre-registered benchmarks.
We introduce a set of APIs in Llama Stack for running evaluations of LLM applications:
- `/datasetio` + `/datasets` API
- `/scoring` + `/scoring_functions` API
- `/eval` + `/benchmarks` API
This guide goes over the sets of APIs and developer experience flow of using Llama Stack to run evaluations for different use cases. Check out our Colab notebook on working examples with evaluations [here](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing).
The Evaluation APIs are associated with a set of Resources. Please visit the Resources section in our [Core Concepts](../concepts/index.mdx) guide for a better high-level understanding.
- **DatasetIO**: defines interface with datasets and data loaders.
- Associated with `Dataset` resource.
- **Scoring**: evaluate outputs of the system.
- Associated with `ScoringFunction` resource. We provide a suite of out-of-the-box scoring functions and also the ability for you to add custom evaluators. These scoring functions are the core part of defining an evaluation task to output evaluation metrics.
- **Eval**: generate outputs (via Inference or Agents) and perform scoring.
Llama Stack pre-registers several popular open-benchmarks to easily evaluate model performance via CLI.
The list of open-benchmarks we currently support:
- [MMLU-COT](https://arxiv.org/abs/2009.03300) (Measuring Massive Multitask Language Understanding): Benchmark designed to comprehensively evaluate the breadth and depth of a model's academic and professional understanding
- [GPQA-COT](https://arxiv.org/abs/2311.12022) (A Graduate-Level Google-Proof Q&A Benchmark): A challenging benchmark of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.
- [SimpleQA](https://openai.com/index/introducing-simpleqa/): Benchmark designed to assess models' ability to answer short, fact-seeking questions.
- [MMMU](https://arxiv.org/abs/2311.16502) (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI): Benchmark designed to evaluate multimodal models.
You can follow this [contributing guide](../references/evals_reference/index.mdx#open-benchmark-contributing-guide) to add more open-benchmarks to Llama Stack.
### Run evaluation on open-benchmarks via CLI
We have built-in functionality to run the supported open-benchmarks using the llama-stack-client CLI.
#### Spin up Llama Stack server
Spin up the Llama Stack server with the 'open-benchmark' template:
```
llama stack run llama_stack/distributions/open-benchmark/config.yaml
```
#### Run eval CLI
There are three required inputs to run a benchmark eval:
- `list of benchmark_ids`: The list of benchmark ids to run evaluation on
- `model-id`: The model id to evaluate on
- `output_dir`: Path to store the evaluation results
```
llama-stack-client eval run-benchmark <benchmark_id_1> [<benchmark_id_2> ...] \
--model_id <model id to evaluate on> \
--output_dir <directory to store the evaluation results>
```
You can run
```
llama-stack-client eval run-benchmark help
```
to see descriptions of all the flags that `eval run-benchmark` supports.
In the output log, you can find the path to the file that contains your evaluation results. Open that file to see your aggregate evaluation results.
## Usage Example
Here's a basic example of using the evaluation API:
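The exact request shape depends on the benchmarks and providers you have registered; the following is a minimal sketch, assuming a local server on the default port, where the benchmark id, model name, and sampling parameters are placeholders:
```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Kick off an evaluation job against a pre-registered benchmark
# (benchmark id and candidate model below are illustrative placeholders).
job = client.eval.run_eval(
    benchmark_id="meta-reference-mmlu",
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama-3.2-3B-Instruct",
            "sampling_params": {"max_tokens": 512},
        },
    },
)
print(job)
```
See the [Evaluation Reference](../references/evals_reference/index.mdx) for the exact request schema and how to poll job status and fetch results.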
## Best Practices
- **Choose appropriate providers**: Use Meta Reference for comprehensive evaluation, NVIDIA for platform-specific needs
- **Configure storage properly**: Ensure your key-value store configuration matches your performance requirements
- **Monitor evaluation progress**: Large evaluations can take time - implement proper monitoring
- **Use appropriate scoring functions**: Select scoring metrics that align with your evaluation goals
## What's Next?
- Check out our Colab notebook on working examples with running benchmark evaluations [here](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb#scrollTo=mxLCsP4MvFqP).
- Check out our [Building Applications - Evaluation](../building_applications/evals.mdx) guide for more details on how to use the Evaluation APIs to evaluate your applications.
- Check out our [Evaluation Reference](../references/evals_reference/index.mdx) for more details on the APIs.
- Explore the [Scoring](./scoring.mdx) documentation for available scoring functions.
Post-training in Llama Stack allows you to fine-tune models using various providers and frameworks. This section covers all available post-training providers and how to use them effectively.
- **HuggingFace SFTTrainer** (`inline::huggingface`) - Fine-tuning using HuggingFace ecosystem
- **TorchTune** (`inline::torchtune`) - Fine-tuning using Meta's TorchTune framework
- **NVIDIA** (`remote::nvidia`) - Fine-tuning using NVIDIA's platform
## HuggingFace SFTTrainer
[HuggingFace SFTTrainer](https://huggingface.co/docs/trl/en/sft_trainer) is an inline post-training provider for Llama Stack. It allows you to run supervised fine-tuning on a variety of models using many datasets.
### Features
- Simple access through the post_training API
- Fully integrated with Llama Stack
- GPU support, CPU support, and MPS support (MacOS Metal Performance Shaders)
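As a rough sketch of what a fine-tuning request can look like through the post_training API (the job id, model, and config dictionaries below are placeholders, and the parameter names are indicative only; consult the Post Training API reference for the exact schema):
```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Illustrative only: ids, model name, and config contents are placeholders.
job = client.post_training.supervised_fine_tune(
    job_uuid="sft-demo-1",
    model="meta-llama/Llama-3.2-3B-Instruct",
    algorithm_config={"type": "LoRA"},    # e.g. LoRA settings
    training_config={"n_epochs": 1},      # dataset, batch size, epochs, ...
    hyperparam_search_config={},
    logger_config={},
)
print(job)
```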
### Configuration
| Field | Type | Required | Default | Description |
## TorchTune
[TorchTune](https://github.com/pytorch/torchtune) is an inline post training provider for Llama Stack. It provides a simple and efficient way to fine-tune language models using PyTorch.
### Features
- Simple access through the post_training API
- Fully integrated with Llama Stack
- GPU support and single device capabilities
- Support for LoRA
### Configuration
| Field | Type | Required | Default | Description |
The Scoring API in Llama Stack allows you to evaluate outputs of your GenAI system using various scoring functions and metrics. This section covers all available scoring providers and their configuration.
## Overview
Llama Stack provides multiple scoring providers:
- **Basic** (`inline::basic`) - Simple evaluation metrics and scoring functions
- **Braintrust** (`inline::braintrust`) - Advanced evaluation using the Braintrust platform
- **LLM-as-Judge** (`inline::llm-as-judge`) - Uses language models to evaluate responses
The Scoring API is associated with `ScoringFunction` resources and provides a suite of out-of-the-box scoring functions. You can also add custom evaluators to meet specific evaluation needs.
## Basic Scoring
Basic scoring provider for simple evaluation metrics and scoring functions. This provider offers fundamental scoring capabilities without external dependencies.
### Configuration
No configuration required - this provider works out of the box.
## Braintrust
Braintrust scoring provider for evaluation and scoring using the [Braintrust platform](https://braintrustdata.com/). Braintrust provides advanced evaluation capabilities and experiment tracking.
### Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `openai_api_key` | `str \| None` | No | | The OpenAI API Key for LLM-powered evaluations |
### Sample Configuration
```yaml
openai_api_key: ${env.OPENAI_API_KEY:=}
```
### Features
- Advanced evaluation metrics
- Experiment tracking and comparison
- LLM-powered evaluation functions
- Integration with Braintrust's evaluation suite
- Detailed scoring analytics and insights
### Use Cases
- Production evaluation pipelines
- A/B testing of model versions
- Advanced scoring with custom metrics
- Detailed evaluation reporting and analysis
## LLM-as-Judge
LLM-as-judge scoring provider that uses language models to evaluate and score responses. This approach leverages the reasoning capabilities of large language models to assess quality, relevance, and other subjective metrics.
### Configuration
No configuration required - this provider works out of the box.
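The judge model, prompt template, and score-extraction regex are supplied as scoring-function parameters when calling the Scoring API. The snippet below is a sketch: the judge model, prompt, regex, and input rows are all placeholders.
```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

rows = [
    {
        "input_query": "What is the capital of France?",
        "generated_answer": "Paris is the capital of France.",
        "expected_answer": "Paris",
    }
]

scoring_params = {
    "llm-as-judge::base": {
        "type": "llm_as_judge",
        "judge_model": "meta-llama/Llama-3.3-70B-Instruct",  # placeholder judge model
        "prompt_template": "Rate the answer from 0 to 5. Reply as 'Score: <n>'.",
        "judge_score_regexes": [r"Score: (\d+)"],
    },
}

response = client.scoring.score(input_rows=rows, scoring_functions=scoring_params)
print(response.results)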
description: Legacy APIs that are being phased out
sidebar_label: Deprecated
sidebar_position: 1
---
# Deprecated APIs
This section contains APIs that are being phased out in favor of newer, more standardized implementations. These APIs are maintained for backward compatibility but are not recommended for new projects.
:::warning Deprecation Notice
These APIs are deprecated and will be removed in future versions. Please migrate to the recommended alternatives listed below.
:::
## Migration Guide
When using deprecated APIs, please refer to the migration guides provided for each API to understand how to transition to the supported alternatives.
## Deprecated API List
### Legacy Inference APIs
Some older inference endpoints that have been superseded by the standardized Inference API.
**Migration Path:** Use the [Inference API](../api/) instead.
### Legacy Vector Operations
Older vector database operations that have been replaced by the Vector IO API.
**Migration Path:** Use the [Vector IO API](../api/) instead.
### Legacy File Operations
Older file management endpoints that have been replaced by the Files API.
**Migration Path:** Use the [Files API](../api/) instead.
## Support Timeline
Deprecated APIs will be supported according to the following timeline:
- **Current Version**: Full support with deprecation warnings
- **Next Major Version**: Limited support with migration notices
- **Following Major Version**: Removal of deprecated APIs
## Getting Help
If you need assistance migrating from deprecated APIs:
1. Check the specific migration guides for each API
2. Review the [API Reference](../api/) for current alternatives
3. Consult the [Community Forums](https://github.com/llamastack/llama-stack/discussions) for migration support
4. Open an issue on GitHub for specific migration questions
## Contributing
If you find issues with deprecated APIs or have suggestions for improving the migration process, please contribute by:
1. Opening an issue describing the problem
2. Submitting a pull request with improvements
3. Updating migration documentation
For more information on contributing, see our [Contributing Guide](../contributing/).
description: APIs in development with limited support
sidebar_label: Experimental
sidebar_position: 1
---
# Experimental APIs
This section contains APIs that are currently in development and may have limited support or stability. These APIs are available for testing and feedback but should not be used in production environments.
:::warning Experimental Notice
These APIs are experimental and may change without notice. Use with caution and provide feedback to help improve them.
:::
## Current Experimental APIs
### Batch Inference API
Run inference on a dataset of inputs in batch mode for improved efficiency.
**Status:** In Development
**Provider Support:** Limited
**Use Case:** Large-scale inference operations
**Features:**
- Batch processing of multiple inputs
- Optimized resource utilization
- Progress tracking and monitoring
### Batch Agents API
Run agentic workflows on a dataset of inputs in batch mode.
**Status:** In Development
**Provider Support:** Limited
**Use Case:** Large-scale agent operations
**Features:**
- Batch agent execution
- Parallel processing capabilities
- Result aggregation and analysis
### Synthetic Data Generation API
Generate synthetic data for model development and testing.
**Status:** Early Development
**Provider Support:** Very Limited
**Use Case:** Training data augmentation
**Features:**
- Automated data generation
- Quality control mechanisms
- Customizable generation parameters
### Batches API (OpenAI-compatible)
OpenAI-compatible batch management for inference operations.
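Because the surface mirrors OpenAI's Batch API, existing OpenAI SDK code should largely carry over. A hedged sketch, where the base URL, API key, and input file id are placeholders for your deployment:
```python
from openai import OpenAI

# Point the standard OpenAI client at your Llama Stack server (URL/key are placeholders).
client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

batch = client.batches.create(
    input_file_id="file-abc123",        # a previously uploaded JSONL request file
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```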
### Prerequisites
- Llama Stack server running with experimental features enabled
- Appropriate provider configurations
- Understanding of API limitations
### Configuration
Experimental APIs may require special configuration flags or provider settings. Check the specific API documentation for setup requirements.
### Usage Guidelines
1. **Testing Only**: Use experimental APIs for testing and development only
2. **Monitor Changes**: Watch for updates and breaking changes
3. **Provide Feedback**: Report issues and suggest improvements
4. **Backup Data**: Always backup important data when using experimental features
## Feedback and Contribution
We encourage feedback on experimental APIs to help improve them:
### Reporting Issues
- Use GitHub issues with the "experimental" label
- Include detailed error messages and reproduction steps
- Specify the API version and provider being used
### Feature Requests
- Submit feature requests through GitHub discussions
- Provide use cases and expected behavior
- Consider contributing implementations
### Testing
- Test experimental APIs in your environment
- Report performance issues and optimization opportunities
- Share success stories and use cases
## Migration to Stable APIs
As experimental APIs mature, they will be moved to the stable API section. When this happens:
1. **Announcement**: We'll announce the promotion in release notes
2. **Migration Guide**: Detailed migration instructions will be provided
3. **Deprecation Timeline**: Experimental versions will be deprecated with notice
4. **Support**: Full support will be available for stable versions
## Provider Support
Experimental APIs may have limited provider support. Check the specific API documentation for:
- Supported providers
- Configuration requirements
- Known limitations
- Performance characteristics
## Roadmap
Experimental APIs are part of our ongoing development roadmap:
- **Q1 2024**: Batch Inference API stabilization
- **Q2 2024**: Batch Agents API improvements
- **Q3 2024**: Synthetic Data Generation API expansion
- **Q4 2024**: Batches API full OpenAI compatibility
For the latest updates, follow our [GitHub releases](https://github.com/llamastack/llama-stack/releases) and [roadmap discussions](https://github.com/llamastack/llama-stack/discussions).
description: OpenAI-compatible APIs and features in Llama Stack
sidebar_label: OpenAI Compatibility
sidebar_position: 1
---
# OpenAI API Compatibility
Llama Stack provides comprehensive OpenAI API compatibility, allowing you to use existing OpenAI API clients and tools with Llama Stack providers. This compatibility layer ensures seamless migration and interoperability.
## Overview
OpenAI API compatibility in Llama Stack includes:
- **OpenAI-compatible endpoints** for all major APIs
- **Request/response format compatibility** with OpenAI standards
- **Authentication and authorization** using OpenAI-style API keys
- **Error handling** with OpenAI-compatible error codes and messages
- **Rate limiting** and usage tracking compatible with OpenAI patterns
## Supported OpenAI APIs
### Chat Completions API
OpenAI-compatible chat completions for conversational AI applications.
**Endpoint:** `/v1/chat/completions`
**Compatibility:** Full OpenAI API compatibility
**Providers:** All inference providers
**Features:**
- Message-based conversations
- System prompts and user messages
- Function calling support
- Streaming responses
- Temperature and other parameter controls
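In practice this means the standard OpenAI Python client can talk to a Llama Stack server directly. A minimal sketch, assuming a local server, where the base URL, API key, and model name are placeholders:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")  # placeholder URL/key

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",  # placeholder model id
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Name three uses of embeddings."},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```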
### Completions API
OpenAI-compatible text completions for general text generation.
**Endpoint:** `/v1/completions`
**Compatibility:** Full OpenAI API compatibility
**Providers:** All inference providers
**Features:**
- Text completion generation
- Prompt engineering support
- Customizable parameters
- Batch processing capabilities
### Embeddings API
OpenAI-compatible embeddings for vector operations.
**Endpoint:** `/v1/embeddings`
**Compatibility:** Full OpenAI API compatibility
**Providers:** All embedding providers
**Features:**
- Text embedding generation
- Multiple embedding models
- Batch embedding processing
- Vector similarity operations
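The same OpenAI client style applies to embeddings. A short sketch, where the base URL and embedding model id are placeholders for whatever your providers expose:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")  # placeholder URL/key

emb = client.embeddings.create(
    model="nomic-ai/nomic-embed-text-v1.5",  # placeholder embedding model id
    input=["Llama Stack provides OpenAI-compatible embeddings."],
)
print(len(emb.data[0].embedding))  # embedding dimension
```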
### Files API
OpenAI-compatible file management for document processing.
**Endpoint:** `/v1/files`
**Compatibility:** Full OpenAI API compatibility
**Providers:** Local Filesystem, S3
**Features:**
- File upload and management
- Document processing
- File metadata tracking
- Secure file access
### Vector Store Files API
OpenAI-compatible vector store file operations for RAG applications.
For detailed code examples and implementation guides, see our [OpenAI Implementation Guide](../providers/openai.mdx).
## Known Limitations
### Responses API Limitations
The Responses API is still in active development. For detailed information about current limitations and implementation status, see our [OpenAI Responses API Limitations](../providers/openai_responses_limitations.mdx).
2. **Community Support**: Ask questions in GitHub discussions
3. **Issue Reporting**: Open GitHub issues for bugs
4. **Professional Support**: Contact support for enterprise issues
## Roadmap
Upcoming OpenAI compatibility features:
- **Enhanced batch processing** support
- **Advanced function calling** capabilities
- **Improved error handling** and diagnostics
- **Performance optimizations** for large-scale deployments
For the latest updates, follow our [GitHub releases](https://github.com/llamastack/llama-stack/releases) and [roadmap discussions](https://github.com/llamastack/llama-stack/discussions).
The Llama Stack provides a comprehensive set of APIs organized by stability level to help you choose the right endpoints for your use case.
## 🟢 Stable APIs
**Production-ready APIs with backward compatibility guarantees.**
These APIs are fully tested, documented, and stable. They follow semantic versioning principles and maintain backward compatibility within major versions. Recommended for production applications.
## 🟡 Experimental APIs
**Preview APIs that may change before becoming stable.**
These APIs include v1alpha and v1beta endpoints that are feature-complete but may undergo changes based on feedback. Great for exploring new capabilities and providing feedback.
## 🔴 Deprecated APIs
These APIs are deprecated and will be removed in future versions. They are provided for migration purposes and to help transition to newer, stable alternatives.
description: Complete reference for Llama Stack APIs
sidebar_label: Overview
sidebar_position: 1
---
# API Reference
Llama Stack provides a comprehensive set of APIs for building generative AI applications. All APIs follow OpenAI-compatible standards and can be used interchangeably across different providers.
## Core APIs
### Inference API
Run inference with Large Language Models (LLMs) and embedding models.
**Supported Providers:**
- Meta Reference (Single Node)
- Ollama (Single Node)
- Fireworks (Hosted)
- Together (Hosted)
- NVIDIA NIM (Hosted and Single Node)
- vLLM (Hosted and Single Node)
- TGI (Hosted and Single Node)
- AWS Bedrock (Hosted)
- Cerebras (Hosted)
- Groq (Hosted)
- SambaNova (Hosted)
- PyTorch ExecuTorch (On-device iOS, Android)
- OpenAI (Hosted)
- Anthropic (Hosted)
- Gemini (Hosted)
- WatsonX (Hosted)
### Agents API
Run multi-step agentic workflows with LLMs, including tool usage, memory (RAG), and complex reasoning.
**Supported Providers:**
- Meta Reference (Single Node)
- Fireworks (Hosted)
- Together (Hosted)
- PyTorch ExecuTorch (On-device iOS)
### Vector IO API
Perform operations on vector stores, including adding documents, searching, and deleting documents.
**Supported Providers:**
- FAISS (Single Node)
- SQLite-Vec (Single Node)
- Chroma (Hosted and Single Node)
- Milvus (Hosted and Single Node)
- Postgres (PGVector) (Hosted and Single Node)
- Weaviate (Hosted)
- Qdrant (Hosted and Single Node)
### Files API (OpenAI-compatible)
Manage file uploads, storage, and retrieval with OpenAI-compatible endpoints.
**Supported Providers:**
- Local Filesystem (Single Node)
- S3 (Hosted)
### Vector Store Files API (OpenAI-compatible)
Integrate file operations with vector stores for automatic document processing and search.
**Supported Providers:**
- FAISS (Single Node)
- SQLite-vec (Single Node)
- Milvus (Single Node)
- ChromaDB (Hosted and Single Node)
- Qdrant (Hosted and Single Node)
- Weaviate (Hosted)
- Postgres (PGVector) (Hosted and Single Node)
### Safety API
Apply safety policies to outputs at a systems level, not just model level.
**Supported Providers:**
- Llama Guard (Depends on Inference Provider)
- Prompt Guard (Single Node)
- Code Scanner (Single Node)
- AWS Bedrock (Hosted)
### Post Training API
Fine-tune models for specific use cases and domains.
**Supported Providers:**
- Meta Reference (Single Node)
- HuggingFace (Single Node)
- TorchTune (Single Node)
- NVIDIA NEMO (Hosted)
### Eval API
Generate outputs and perform scoring to evaluate system performance.
**Supported Providers:**
- Meta Reference (Single Node)
- NVIDIA NEMO (Hosted)
### Telemetry API
Collect telemetry data from the system for monitoring and observability.
**Supported Providers:**
- Meta Reference (Single Node)
### Tool Runtime API
Interact with various tools and protocols to extend LLM capabilities.
**Supported Providers:**
- Brave Search (Hosted)
- RAG Runtime (Single Node)
## API Compatibility
All Llama Stack APIs are designed to be OpenAI-compatible, allowing you to:
- Use existing OpenAI API clients and tools
- Migrate from OpenAI to other providers seamlessly
- Maintain consistent API contracts across different environments
## Getting Started
To get started with Llama Stack APIs:
1. **Choose a Distribution**: Select a pre-configured distribution that matches your environment
2. **Configure Providers**: Set up the providers you want to use for each API
3. **Start the Server**: Launch the Llama Stack server with your configuration
4. **Use the APIs**: Make requests to the API endpoints using your preferred client
For detailed setup instructions, see our [Getting Started Guide](../getting_started/quickstart).
## Provider Details
For complete provider compatibility and setup instructions, see our [Providers Documentation](../providers/).
## API Stability
Llama Stack APIs are organized by stability level:
- **[Stable APIs](./index.mdx)** - Production-ready APIs with full support
- **[Experimental APIs](../api-experimental/)** - APIs in development with limited support
- **[Deprecated APIs](../api-deprecated/)** - Legacy APIs being phased out
## OpenAI Integration
For specific OpenAI API compatibility features, see our [OpenAI Compatibility Guide](../api-openai/).
description: Build powerful AI applications with the Llama Stack agent framework
sidebar_label: Agents
sidebar_position: 3
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Agents
An Agent in Llama Stack is a powerful abstraction that allows you to build complex AI applications.
The Llama Stack agent framework is built on a modular architecture that allows for flexible and powerful AI applications. This document explains the key components and how they work together.
## Core Concepts
### 1. Agent Configuration
Agents are configured using the `AgentConfig` class, which includes:
- **Model**: The underlying LLM to power the agent
- **Instructions**: System prompt that defines the agent's behavior
- **Tools**: Capabilities the agent can use to interact with external systems
- **Safety Shields**: Guardrails to ensure responsible AI behavior
```python
from llama_stack_client import Agent

# Create the agent
agent = Agent(
    llama_stack_client,
    model="meta-llama/Llama-3-70b-chat",
    instructions="You are a helpful assistant that can use tools to answer questions.",
)
```
description: Understanding the internal processing flow of Llama Stack agents
sidebar_label: Agent Execution Loop
sidebar_position: 4
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Agent Execution Loop
Agents are the heart of Llama Stack applications. They combine inference, memory, safety, and tool usage into coherent workflows. At its core, an agent follows a sophisticated execution loop that enables multi-step reasoning, tool usage, and safety checks.
## Steps in the Agent Workflow
Each agent turn follows these key steps:
1. **Initial Safety Check**: The user's input is first screened through configured safety shields
2. **Context Retrieval**:
- If RAG is enabled, the agent can choose to query relevant documents from memory banks. You can use the `instructions` field to steer the agent.
- For new documents, they are first inserted into the memory bank.
- Retrieved context is provided to the LLM as a tool response in the message history.
3. **Inference Loop**: The agent enters its main execution loop:
- The LLM receives a user prompt (with previous tool outputs)
- The LLM generates a response, potentially with [tool calls](./tools)
- If tool calls are present:
- Tool inputs are safety-checked
- Tools are executed (e.g., web search, code execution)
- Tool responses are fed back to the LLM for synthesis
- The loop continues until:
- The LLM provides a final response without tool calls
- Maximum iterations are reached
- Token limit is exceeded
4. **Final Safety Check**: The agent's final response is screened through safety shields
## Execution Flow Diagram
```mermaid
sequenceDiagram
participant U as User
participant E as Executor
participant M as Memory Bank
participant L as LLM
participant T as Tools
participant S as Safety Shield
Note over U,S: Agent Turn Start
U->>S: 1. Submit Prompt
activate S
S->>E: Input Safety Check
deactivate S
loop Inference Loop
E->>L: 2.1 Augment with Context
L-->>E: 2.2 Response (with/without tool calls)
alt Has Tool Calls
E->>S: Check Tool Input
S->>T: 3.1 Execute Tool
T-->>E: 3.2 Tool Response
E->>L: 4.1 Tool Response
L-->>E: 4.2 Synthesized Response
end
opt Stop Conditions
Note over E: Break if:
Note over E: - No tool calls
Note over E: - Max iterations reached
Note over E: - Token limit exceeded
end
end
E->>S: Output Safety Check
S->>U: 5. Final Response
```
Each step in this process can be monitored and controlled through configurations.
## Agent Execution Example
Here's an example that demonstrates monitoring the agent's execution:
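A minimal sketch of streaming and printing the events of a turn with `AgentEventLogger`; the model id and tool group below are placeholders:
```python
from llama_stack_client import Agent, AgentEventLogger, LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

agent = Agent(
    client,
    model="meta-llama/Llama-3.2-3B-Instruct",  # placeholder model
    instructions="You are a helpful assistant.",
    tools=["builtin::websearch"],              # placeholder tool group
)

session_id = agent.create_session("monitoring-demo")
response = agent.create_turn(
    messages=[{"role": "user", "content": "Search for recent Llama Stack releases."}],
    session_id=session_id,
    stream=True,
)

# Print each step event (inference, tool execution, shield checks) as it streams back.
for log in AgentEventLogger().log(response):
    log.print()
```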
description: Evaluate LLM applications with Llama Stack's comprehensive evaluation framework
sidebar_label: Evaluations
sidebar_position: 7
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
This guide walks you through the process of evaluating an LLM application built using Llama Stack. For detailed API reference, check out the [Evaluation Reference](../references/evals_reference/) guide that covers the complete set of APIs and developer experience flow.
:::tip[Interactive Examples]
Check out our [Colab notebook](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing) for working examples with evaluations, or try the [Getting Started notebook](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb).
:::
## Application Evaluation Example
[Open In Colab](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)
Llama Stack offers a library of scoring functions and the `/scoring` API, allowing you to run evaluations on your pre-annotated AI application datasets.
In this example, we will show you how to:
1. **Build an Agent** with Llama Stack
2. **Query the agent's sessions, turns, and steps** to analyze execution
3. **Evaluate the results** using scoring functions
## Step-by-Step Evaluation Process
### 1. Building a Search Agent
First, let's create an agent that can search the web to answer questions:
```python
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger
```
description: Comprehensive guides for building AI applications with Llama Stack
sidebar_label: Overview
sidebar_position: 5
---
# AI Application Examples
Llama Stack provides all the building blocks needed to create sophisticated AI applications.
## Getting Started
The best way to get started is to look at this comprehensive notebook which walks through the various APIs (from basic inference, to RAG agents) and how to use them.
**📓 [Building AI Applications Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)**
## Core Topics
Here are the key topics that will help you build effective AI applications:
### 🤖 **Agent Development**
- **[Agent Framework](./agent.mdx)** - Understand the components and design patterns of the Llama Stack agent framework
- **[Agent Execution Loop](./agent_execution_loop.mdx)** - How agents process information, make decisions, and execute actions
- **[Agents vs Responses API](./responses_vs_agents.mdx)** - Learn when to use each API for different use cases
### 📚 **Knowledge Integration**
- **[RAG (Retrieval-Augmented Generation)](./rag.mdx)** - Enhance your agents with external knowledge through retrieval mechanisms
### 🛠️ **Capabilities & Extensions**
- **[Tools](./tools.mdx)** - Extend your agents' capabilities by integrating with external tools and APIs
### 📊 **Quality & Monitoring**
- **[Evaluations](./evals.mdx)** - Evaluate your agents' effectiveness and identify areas for improvement
- **[Telemetry](./telemetry.mdx)** - Monitor and analyze your agents' performance and behavior
- **[Safety](./safety.mdx)** - Implement guardrails and safety measures to ensure responsible AI behavior
## Application Patterns
### 🤖 **Conversational Agents**
Build intelligent chatbots and assistants that can:
- Maintain context across conversations
- Access external knowledge bases
- Execute actions through tool integrations
- Apply safety filters and guardrails
### 📖 **RAG Applications**
Create knowledge-augmented applications that:
- Retrieve relevant information from documents
- Generate contextually accurate responses
- Handle large knowledge bases efficiently
- Provide source attribution
### 🔧 **Tool-Enhanced Systems**
Develop applications that can:
- Search the web for real-time information
- Interact with databases and APIs
- Perform calculations and analysis
- Execute complex multi-step workflows
### 🛡️ **Enterprise Applications**
Build production-ready systems with:
- Comprehensive safety measures
- Performance monitoring and analytics
- Scalable deployment configurations
- Evaluation and quality assurance
## Next Steps
1. **📖 Start with the Notebook** - Work through the complete tutorial
2. **🎯 Choose Your Pattern** - Pick the application type that matches your needs
3. **🏗️ Build Your Foundation** - Set up your [providers](/docs/providers/) and [distributions](/docs/distributions/)
4. **🚀 Deploy & Monitor** - Use our [deployment guides](/docs/deploying/) for production
## Related Resources
- **[Getting Started](/docs/getting_started/quickstart)** - Basic setup and concepts
- **[Providers](/docs/providers/)** - Available AI service providers
description: Web-based admin interface and chat playground for Llama Stack
sidebar_label: Playground
sidebar_position: 10
---
# Admin UI & Chat Playground
The Llama Stack UI provides a comprehensive web-based admin interface for managing your Llama Stack server, with an integrated chat playground for interactive testing. This admin interface is the primary way to monitor, manage, and debug your Llama Stack applications.
## Quick Start
Launch the admin UI with:
```bash
npx llama-stack-ui
```
Then visit `http://localhost:8322` to access the interface.
## Admin Interface Features
The Llama Stack UI is organized into three main sections:
RAG enables your applications to reference and recall information from external documents. Llama Stack makes Agentic RAG available through OpenAI's Responses API.
Llama Stack supports various approaches for building RAG applications. The server provides two APIs (Responses and Chat Completions), plus a high-level client wrapper (Agent class):
#### Approach 1: Agent Class (Client-Side)
The **Agent class** is a high-level client wrapper around the Responses API with automatic tool execution and session management. Best for conversational agents and multi-turn RAG.
```python
from llama_stack_client import Agent, AgentEventLogger, LlamaStackClient
```
#### Approach 2: Responses API (Server-Side)
- No built-in session management (stateless by default)
- Best for: Single-turn queries, OpenAI-compatible applications
#### Approach 3: Chat Completions API
The **Chat Completions API** is a server-side API that gives you explicit control over retrieval and generation. Best for custom RAG pipelines and batch processing.
Doing great work is about more than just hard work and ambition; it involves combining several elements:
1. **Pursue What Excites You**: Engage in projects that are both ambitious and exciting to you. It's important to work on something you have a natural aptitude for and a deep interest in.
2. **Explore and Discover**: Great work often feels like a blend of discovery and creation. Focus on seeing possibilities and let ideas take their natural shape, rather than just executing a plan.
3. **Be Bold Yet Flexible**: Take bold steps in your work without over-planning. An adaptable approach that evolves with new ideas can often lead to breakthroughs.
4. **Work on Your Own Projects**: Develop a habit of working on projects of your own choosing, as these often lead to great achievements. These should be projects you find exciting and that challenge you intellectually.
5. **Be Earnest and Authentic**: Approach your work with earnestness and authenticity. Trying to impress others with affectation can be counterproductive, as genuine effort and intellectual honesty lead to better work outcomes.
6. **Build a Supportive Environment**: Work alongside great colleagues who inspire you and enhance your work. Surrounding yourself with motivating individuals creates a fertile environment for great work.
7. **Maintain High Morale**: High morale significantly impacts your ability to do great work. Stay optimistic and protect your mental well-being to maintain progress and momentum.
8. **Balance**: While hard work is essential, overworking can lead to diminishing returns. Balance periods of intensive work with rest to sustain productivity over time.
This approach shows that great work is less about following a strict formula and more about aligning your interests, ambition, and environment to foster creativity and innovation.
- **Vector Stores API**: OpenAI-compatible vector storage with automatic embedding model detection
- **Files API**: Document upload and processing using OpenAI's file format
- **Responses API**: Enhanced chat completions with agentic tool calling via file search
## Configuring Default Embedding Models
To enable automatic vector store creation without specifying embedding models, configure a default embedding model in your config.yaml like so:
```yaml
vector_stores:
  default_provider_id: faiss
  default_embedding_model:
    provider_id: sentence-transformers
    model_id: nomic-ai/nomic-embed-text-v1.5
```
With this configuration:
- `client.vector_stores.create()` works without requiring embedding model or provider parameters
- The system automatically uses the default vector store provider (`faiss`) when multiple providers are available
- The system automatically uses the default embedding model (`sentence-transformers/nomic-ai/nomic-embed-text-v1.5`) for any newly created vector store
- The `default_provider_id` specifies which vector storage backend to use
- The `default_embedding_model` specifies both the inference provider and model for embeddings
## Vector Store Operations
### Creating Vector Stores
You can create vector stores with automatic or explicit embedding model selection:
```python
# Automatic - uses the default configured embedding model and vector store provider
vs = client.vector_stores.create()

# Explicit - specify embedding model and/or provider when you need specific ones
vs = client.vector_stores.create(
    extra_body={
        "provider_id": "faiss",  # Optional: specify vector store provider
    }
)
```
description: Compare the Agents API and OpenAI Responses API for building AI applications with tool calling capabilities
sidebar_label: Agents vs Responses API
sidebar_position: 5
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Agents vs OpenAI Responses API
Llama Stack (LLS) provides two different APIs for building AI applications with tool calling capabilities: the **Agents API** and the **OpenAI Responses API**. While both enable AI systems to use tools and maintain full conversation history, they serve different use cases and have distinct characteristics.
:::note
**Note:** For simple and basic inferencing, you may want to use the [Chat Completions API](../providers/openai#chat-completions) directly, before progressing to Agents or Responses API.
:::
## Overview
### LLS Agents API
The Agents API is a full-featured, stateful system designed for complex, multi-turn conversations. It maintains conversation state through persistent sessions identified by a unique session ID. The API supports comprehensive agent lifecycle management, detailed execution tracking, and rich metadata about each interaction through a structured session/turn/step hierarchy. The API can orchestrate multiple tool calls within a single turn.
### OpenAI Responses API
The OpenAI Responses API is a full-featured, stateful system designed for complex, multi-turn conversations, with direct compatibility with OpenAI's conversational patterns enhanced by Llama Stack's tool calling capabilities. It maintains conversation state by chaining responses through a `previous_response_id`, allowing interactions to branch or continue from any prior point. Each response can perform multiple tool calls within a single turn.
### Key Differences
The LLS Agents API uses the Chat Completions API on the backend for inference as it's the industry standard for building AI applications and most LLM providers are compatible with this API. For a detailed comparison between Responses and Chat Completions, see [OpenAI's documentation](https://platform.openai.com/docs/guides/responses-vs-chat-completions).
Additionally, Agents let you specify input/output shields whereas Responses do not (though support is planned). Agents use a linear conversation model referenced by a single session ID. Responses, on the other hand, support branching, where each response can serve as a fork point, and conversations are tracked by the latest response ID. Responses also let you dynamically choose the model, vector store, files, MCP servers, and more on each inference call, enabling more complex workflows. Agents require a static configuration for these components at the start of the session.
Today the Agents and Responses APIs can be used independently depending on the use case. But, it is also productive to treat the APIs as complementary. It is not currently supported, but it is planned for the LLS Agents API to alternatively use the Responses API as its backend instead of the default Chat Completions API, i.e., enabling a combination of the safety features of Agents with the dynamic configuration and branching capabilities of Responses.
## Feature Comparison
| Feature | LLS Agents API | OpenAI Responses API |
|---------|------------|---------------------|
| **Conversation Management** | Linear persistent sessions | Can branch from any previous response ID |
print(f"Alternative web search: {response3.output_message.content}")
```
</TabItem>
</Tabs>
Both APIs demonstrate distinct strengths that make them valuable on their own for different scenarios. The Agents API excels in providing structured, safety-conscious workflows with persistent session management, while the Responses API offers flexibility through dynamic configuration and OpenAI compatible tool patterns.
## Use Case Examples
### 1. Research and Analysis with Safety Controls
**Best Choice: Agents API**
**Scenario:** You're building a research assistant for a financial institution that needs to analyze market data, execute code to process financial models, and search through internal compliance documents. The system must ensure all interactions are logged for regulatory compliance and protected by safety shields to prevent malicious code execution or data leaks.
**Why Agents API?** The Agents API provides persistent session management for iterative research workflows, built-in safety shields to protect against malicious code in financial models, and structured execution logs (session/turn/step) required for regulatory compliance. The static tool configuration ensures consistent access to your knowledge base and code interpreter throughout the entire research session.
### 2. Dynamic Information Gathering with Branching Exploration
**Best Choice: Responses API**
**Scenario:** You're building a competitive intelligence tool that helps businesses research market trends. Users need to dynamically switch between web search for current market data and file search through uploaded industry reports. They also want to branch conversations to explore different market segments simultaneously and experiment with different models for various analysis types.
**Why Responses API?** The Responses API's branching capability lets users explore multiple market segments from any research point. Dynamic per-call configuration allows switching between web search and file search as needed, while experimenting with different models (faster models for quick searches, more powerful models for deep analysis). The OpenAI-compatible tool patterns make integration straightforward.
### 3. OpenAI Migration with Advanced Tool Capabilities
**Best Choice: Responses API**
**Scenario:** You have an existing application built with OpenAI's Assistants API that uses file search and web search capabilities. You want to migrate to Llama Stack for better performance and cost control while maintaining the same tool calling patterns and adding new capabilities like dynamic vector store selection.
**Why Responses API?** The Responses API provides full OpenAI tool compatibility (`web_search`, `file_search`) with identical syntax, making migration seamless. The dynamic per-call configuration enables advanced features like switching vector stores per query or changing models based on query complexity - capabilities that extend beyond basic OpenAI functionality while maintaining compatibility.
### 4. Educational Programming Tutor
**Best Choice: Agents API**
**Scenario:** You're building a programming tutor that maintains student context across multiple sessions, safely executes code exercises, and tracks learning progress with audit trails for educators.
**Why Agents API?** Persistent sessions remember student progress across multiple interactions, safety shields prevent malicious code execution while allowing legitimate programming exercises, and structured execution logs help educators track learning patterns.
### 5. Advanced Software Debugging Assistant
**Best Choice: Agents API with Responses Backend**
**Scenario:** You're building a debugging assistant that helps developers troubleshoot complex issues. It needs to maintain context throughout a debugging session, safely execute diagnostic code, switch between different analysis tools dynamically, and branch conversations to explore multiple potential causes simultaneously.
**Why Agents + Responses?** The Agent provides safety shields for code execution and session management for the overall debugging workflow. The underlying Responses API enables dynamic model selection and flexible tool configuration per query, while branching lets you explore different theories (memory leak vs. concurrency issue) from the same debugging point and compare results.
:::info[Future Enhancement]
The ability to use Responses API as the backend for Agents is not yet implemented but is planned for a future release. Currently, Agents use Chat Completions API as their backend by default.
:::
## Decision Framework
Use this framework to choose the right API for your use case:
### Choose Agents API when:
- ✅ You need **safety shields** for input/output validation
- ✅ Your application requires **linear conversation flow** with persistent context
- ✅ You need **audit trails** and structured execution logs
- ✅ Your tool configuration is **static** throughout the session
- ✅ You're building **educational, financial, or enterprise** applications with compliance requirements
### Choose Responses API when:
- ✅ You need **conversation branching** to explore multiple paths
description: Implement safety measures and content moderation in Llama Stack applications
sidebar_label: Safety
sidebar_position: 9
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Safety Guardrails
Safety is a critical component of any AI application. Llama Stack provides a comprehensive Shield system that can be applied at multiple touchpoints to ensure responsible AI behavior and content moderation.
## Shield System Overview
The Shield system in Llama Stack provides:
- **Content filtering** for both input and output messages
- **Multi-touchpoint protection** across your application flow
- **Configurable safety policies** tailored to your use case
- **Integration with agents** for automated safety enforcement
description: Extend agent capabilities with external tools and function calling
sidebar_label: Tools
sidebar_position: 6
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Tools
Tools are functions that can be invoked by an agent to perform tasks. They are organized into tool groups and registered with specific providers. Each tool group represents a collection of related tools from a single provider. Tools are grouped so that shared state can be externalized: the tools in a group typically operate on the same state.
An example of this would be a "db_access" tool group that contains tools for interacting with a database. "list_tables", "query_table", "insert_row" could be examples of tools in this group.
Tools are treated like any other resource in Llama Stack, such as models: you can register them, configure providers for them, and so on.
When instantiating an agent, you can provide it a list of tool groups that it has access to. The agent retrieves the corresponding tool definitions for the specified tool groups and passes them along to the model.
Refer to the [Building AI Applications](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) notebook for more examples on how to use tools.
## Server-side vs. Client-side Tool Execution
Llama Stack allows you to use both server-side and client-side tools. With server-side tools, `agent.create_turn` executes the tool calls emitted by the model transparently, giving the user the final answer. If client-side tools are provided, the tool call is sent back to the user for execution and optional continuation using the `agent.resume_turn` method.
## Server-side Tools
Llama Stack provides built-in providers for some common tools. These include web search, math, and RAG capabilities.
### Web Search
You have three providers to execute the web search tool calls generated by a model: Brave Search, Bing Search, and Tavily Search.
To indicate that the web search tool calls should be executed by brave-search, you can point the "builtin::websearch" toolgroup to the "brave-search" provider.
```python
client.toolgroups.register(
    toolgroup_id="builtin::websearch",
    provider_id="brave-search",
    args={"max_results": 5},
)
```
The tool requires an API key which can be provided either in the configuration or through the request header `X-LlamaStack-Provider-Data`. The format of the header is:
```
{"<provider_name>_api_key": <your api key>}
```
### Math
The WolframAlpha tool provides access to computational knowledge through the WolframAlpha API.
```python
client.toolgroups.register(
    toolgroup_id="builtin::wolfram_alpha",
    provider_id="wolfram-alpha",
)
```
Example usage:
```python
result = client.tool_runtime.invoke_tool(
    tool_name="wolfram_alpha",
    args={"query": "solve x^2 + 2x + 1 = 0"},
)
```
### RAG
The RAG tool enables retrieval of context from various types of memory banks (vector, key-value, keyword, and graph).
:::note
By default, the Llama Stack config.yaml defines toolgroups for web search, WolframAlpha, and RAG, provided by the tavily-search, wolfram-alpha, and rag providers respectively.
:::
## Model Context Protocol (MCP)
[MCP](https://github.com/modelcontextprotocol) is an emerging, popular standard for tool discovery and execution. It is a protocol that allows tools to be dynamically discovered from an MCP endpoint and can be used to extend the agent's capabilities.
### Using Remote MCP Servers
You can find some popular remote MCP servers [here](https://github.com/jaw9c/awesome-remote-mcp-servers). You can register them as toolgroups in the same way as local providers.
Note that most of the more useful MCP servers need you to authenticate with them. Many of them use OAuth2.0 for authentication. You can provide the authorization token when creating the Agent:
## Adding Custom (Client-side) Tools
When you want to use tools other than the built-in ones, you just need to implement a Python function with a docstring. The content of the docstring is used to describe the tool and its parameters, and is passed along to the generative model.
```python
# Example tool definition
def my_tool(input: int) -> int:
    """
    Runs my awesome tool.
    :param input: some int parameter
    """
    return input * 2
```
:::tip[Documentation Best Practices]
We use Python docstrings to describe the tool and its parameters. It is important to document the tool and the parameters so that the model can use the tool correctly. It is recommended to experiment with different docstrings to see how they affect the model's behavior.
:::
Once defined, simply pass the tool to the agent config. `Agent` will take care of the rest (calling the model with the tool definition, executing the tool, and returning the result to the model for the next iteration).
```python
# Example agent config with client provided tools
agent = Agent(client, ..., tools=[my_tool])
```
Refer to [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/) for an example of how to use client provided tools.
## Tool Invocation
Tools can be invoked using the `invoke_tool` method:
```python
result = client.tool_runtime.invoke_tool(
    tool_name="web_search",
    kwargs={"query": "What is the capital of France?"},
)
```
The result contains:
- `content`: The tool's output
- `error_message`: Optional error message if the tool failed
- `error_code`: Optional error code if the tool failed
## Listing Available Tools
You can list all available tools or filter by tool group:
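A short sketch of listing tools with the Python client; the `toolgroup_id` filter parameter name is an assumption, so check the client reference for the exact signature:
```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# All registered tools
for tool in client.tools.list():
    print(tool.identifier)

# Only the tools in a specific group (filter parameter name assumed)
for tool in client.tools.list(toolgroup_id="builtin::websearch"):
    print(tool.identifier)
```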
description: Understanding API stability levels and versioning in Llama Stack
sidebar_label: API Stability
sidebar_position: 4
---
# Llama Stack API Stability Leveling
In order to provide a stable experience in Llama Stack, the various APIs need different stability levels indicating the level of support, backward compatibility, and overall production readiness.
## Different Levels
### v1alpha
- Little to no expectation of support between versions
- Breaking changes are permitted
- Datatypes and parameters can break
- Routes can be added and removed
#### Graduation Criteria
- an API can graduate from `v1alpha` to `v1beta` if the team has identified the extent of the non-optional routes and the shape of their parameters/return types for the API, e.g. `/v1/openai/chat/completions`. Optional types can change.
- CRUD must stay stable once in `v1beta`. This is a commitment to backward compatibility, guaranteeing that most code you write against the v1beta version will not break during future updates. We may make additive changes (like adding a new, optional field to a response), but we will not make breaking changes (like renaming an existing "modelName" field to "name", changing an ID's data type from an integer to a string, or altering an endpoint URL).
- for OpenAI APIs, a comparison to the OpenAI spec for the specific API can be done to ensure completeness.
### v1beta
- API routes remain consistent between versions
- Parameters and return types are not guaranteed to remain stable between versions
- API, besides minor fixes and adjustments, should be _almost_ v1. Changes should not be drastic.
#### Graduation Criteria
- an API can graduate from `v1beta` to `v1` if the API surface and datatypes are complete as identified by the team. The parameters and return types that are mandatory for each route are stable. All aspects of graduating from `v1alpha` to `v1beta` apply as well.
- Optional parameters, routes, or parts of the return type can be added after graduating to `v1`
### v1 (stable)
- Considered stable
- Backwards compatible between Z-streams
- Y-stream breaking changes must go through the proper approval and announcement process.
- Datatypes for a route and its return types cannot change between Z-streams
- Y-stream datatype changes should be sparing, unless the changes are additional net-new parameters
- Must have proper conformance testing as outlined in https://github.com/llamastack/llama-stack/issues/3237
### v2+ (Major Versions)
Introducing a new major version like `/v2` is a significant and disruptive event that should be treated as a last resort. It is reserved for essential changes to a stable `/v1` API that are fundamentally backward-incompatible and cannot be implemented through additive, non-breaking changes or breaking changes across X/Y-Stream releases (x.y.z).
If a `/v2` version is deemed absolutely necessary, it must adhere to the following protocol to ensure a sane and predictable transition for users:
#### Lifecycle Progression
A new major version must follow the same stability lifecycle as `/v1`. It will be introduced as `/v2alpha`, mature to `/v2beta`, and finally become stable as `/v2`.
#### Coexistence:
The new `/v2` API must be introduced alongside the existing `/v1` API and run in parallel. It must not replace the `/v1` API immediately.
#### Deprecation Policy:
When a `/v2` API is introduced, a clear and generous deprecation policy for the `/v1` API must be published simultaneously. This policy must outline the timeline for the eventual removal of the `/v1` API, giving users ample time to migrate.
### Deprecated APIs
Deprecated APIs are those that are no longer actively maintained or supported. Deprecated APIs are marked with the flag `deprecated = True` in the OpenAPI spec. These APIs will be removed in a future release.
### API Stability vs. Provider Stability
The leveling introduced in this document relates to the stability of the API and not specifically the providers within the API.
Providers can iterate as much as they want on functionality as long as they work within the bounds of an API. If they need to change the API, then the API should not be `/v1`, or those breaking changes can only happen on a y-stream release basis.
### Approval and Announcement Process for Breaking Changes
- **PR Labeling**: Any pull request that introduces a breaking API change must be clearly labeled with `breaking-change`.
- **PR Title/Commit**: Any pull request that introduces a breaking API change must contain `BREAKING CHANGE` in the title and commit footer. Alternatively, the commit can include `!`, e.g. `feat(api)!: title goes here`. This is outlined in the [conventional commits documentation](https://www.conventionalcommits.org/en/v1.0.0/#specification).
- **Maintainer Review**: At least one maintainer must explicitly acknowledge the breaking change during review by applying the `breaking-change` label. An approval must come with this label or with an acknowledgement that the label has already been applied.
- **Announcement**: Breaking changes require inclusion in release notes and, if applicable, a separate communication (e.g., Discord, GitHub Issues, or GitHub Discussions) prior to release.
If a PR has the proper approvals, labels, and commit/title hygiene, any failing API conformance tests will be bypassed.
## Enforcement
### Migration of API routes under `/v1alpha`, `/v1beta`, and `/v1`
Instead of placing every API under `/v1`, any API that is not fully stable or complete should go under `/v1alpha` or `/v1beta`. For example, at the time of this writing, `post_training` belongs here, as well as any OpenAI-compatible API whose surface does not exactly match the upstream OpenAI API it mimics.
This migration is crucial as we get Llama Stack in the hands of users who intend to productize various APIs. A clear view of what is stable and what is actively being developed will enable users to pick and choose various APIs to build their products on.
This migration will be a breaking change for any API moving out of `/v1`. Ideally, this should happen before 0.3.0 and especially 1.0.0.
### `x-stability` tags in the OpenAPI spec for oasdiff
`x-stability` tags allow tools like oasdiff to enforce different rules for different stability levels; the tag on each route should match the stability level implied by the route's version prefix. See [oasdiff stability](https://github.com/oasdiff/oasdiff/blob/main/docs/STABILITY.md).
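As an illustration, an operation in the generated OpenAPI spec might carry a stability annotation like the sketch below. The paths are placeholders, and the exact extension key and allowed values are defined by oasdiff (assumed here to be `x-stability-level` with levels such as `alpha`, `beta`, and `stable`); check the linked docs before relying on them.
```yaml
# Sketch only: placeholder paths; the extension key/values assume oasdiff's
# x-stability-level convention (draft/alpha/beta/stable).
paths:
  /v1/example-stable-route:
    get:
      x-stability-level: stable
  /v1alpha/example-alpha-route:
    get:
      x-stability-level: alpha
```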
### Testing
The testing of each stable API is already outlined in [issue #3237](https://github.com/llamastack/llama-stack/issues/3237) and is being worked on. These sorts of conformance tests should apply primarily to `/v1` APIs only, with `/v1alpha` and `/v1beta` having any tests the maintainers see fit as well as basic testing to ensure the routing works properly.
### New APIs going forward
Any newly introduced API should start as `/v1alpha`.
---
description: Understanding remote vs inline provider implementations
sidebar_label: API Providers
sidebar_position: 2
---
# API Providers
The goal of Llama Stack is to build an ecosystem where users can easily swap out different implementations for the same API. Providers come in two flavors:
- **Remote**: the provider runs as a separate service external to the Llama Stack codebase. Llama Stack contains a small amount of adapter code.
- **Inline**: the provider is fully specified and implemented within the Llama Stack codebase. It may be a simple wrapper around an existing library, or a full-fledged implementation within Llama Stack.
Most importantly, Llama Stack always strives to provide at least one fully inline provider for each API so you can iterate on a fully featured environment locally.
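For a concrete sense of the difference, here is a hedged sketch of how a remote and an inline provider might appear in a distribution's run configuration; the provider ids, types, and config fields are illustrative and vary by distribution and release.
```yaml
# Illustrative only; actual provider types and config fields depend on your
# installed distribution and Llama Stack version.
providers:
  inference:
    - provider_id: ollama                     # remote: a thin adapter calling an external service
      provider_type: remote::ollama
      config:
        url: http://localhost:11434
    - provider_id: sentence-transformers      # inline: runs inside the Llama Stack process
      provider_type: inline::sentence-transformers
      config: {}
```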
---
description: Understanding external APIs in Llama Stack
sidebar_label: External APIs
sidebar_position: 3
---
# External APIs
Llama Stack supports external APIs that live outside of the main codebase. This allows you to:
- Create and maintain your own APIs independently
- Share APIs with others without contributing to the main codebase
- Keep API-specific code separate from the core Llama Stack code
## Configuration
To enable external APIs, you need to configure the `external_apis_dir` in your Llama Stack configuration. This directory should contain your external API specifications:
```yaml
external_apis_dir: ~/.llama/apis.d/
```
## Directory Structure
The external APIs directory should follow this structure:
```
apis.d/
custom_api1.yaml
custom_api2.yaml
```
Each YAML file in these directories defines an API specification.
## API Specification
Here's an example of an external API specification for a weather API:
```yaml
module: weather
api_dependencies:
- inference
protocol: WeatherAPI
name: weather
pip_packages:
- llama-stack-api-weather
```
### API Specification Fields
- `module`: Python module containing the API implementation
- `protocol`: Name of the protocol class for the API
- `name`: Name of the API
- `pip_packages`: List of pip packages to install the API, typically a single package
## Required Implementation
External APIs must expose an `available_providers()` function in their module that returns a list of provider names:
```python
# llama_stack_api_weather/api.py
from llama_stack_api import Api, InlineProviderSpec, ProviderSpec
```
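The snippet above is truncated in this diff. As a rough sketch only (the provider id, package, and `ProviderSpec` field values below are hypothetical, and the return type is assumed to be a list of provider specs based on the imports rather than bare names), such a module might look like:
```python
# llama_stack_api_weather/api.py -- hedged sketch, not the actual file contents.
from llama_stack_api import Api, InlineProviderSpec, ProviderSpec


def available_providers() -> list[ProviderSpec]:
    # Hypothetical inline provider for the external "weather" API; field values
    # are placeholders -- consult llama_stack_api for the exact ProviderSpec fields.
    return [
        InlineProviderSpec(
            api=Api.weather,
            provider_type="inline::kaze",
            pip_packages=["llama-stack-provider-kaze"],
            module="llama_stack_provider_kaze",
            config_class="llama_stack_provider_kaze.KazeProviderConfig",
        ),
    ]
```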
---
description: Available REST APIs and planned capabilities in Llama Stack
sidebar_label: APIs
sidebar_position: 1
---
# APIs
A Llama Stack API is described as a collection of REST endpoints following OpenAI API standards. We currently support the following APIs:
- **Inference**: run inference with an LLM
- **Safety**: apply safety policies to the output at a system (not only model) level
- **Agents**: run multi-step agentic workflows with LLMs, including tool usage, memory (RAG), etc.
- **DatasetIO**: interface with datasets and data loaders
- **Scoring**: evaluate outputs of the system
- **Eval**: generate outputs (via Inference or Agents) and perform scoring
- **VectorIO**: perform operations on vector stores, such as adding documents, searching, and deleting documents
- **Files**: manage file uploads, storage, and retrieval
- **Post Training**: fine-tune a model
- **Tool Runtime**: interact with various tools and protocols
- **Responses**: generate responses from an LLM
We are working on adding a few more APIs to complete the application lifecycle. These will include:
- **Batch Inference**: run inference on a dataset of inputs
- **Batch Agents**: run agents on a dataset of inputs
- **Batches**: OpenAI-compatible batch management for inference
## OpenAI API Compatibility
We are working on adding OpenAI API compatibility to Llama Stack. This will allow you to use Llama Stack with OpenAI API clients and tools.
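For example, once a Llama Stack server is running, an off-the-shelf OpenAI client can be pointed at it. This is a hedged sketch: the base URL prefix, port (8321 is the usual default), and model id below are assumptions that depend on your server configuration.
```python
from openai import OpenAI

# Assumptions: a local Llama Stack server on the default port 8321 exposing its
# OpenAI-compatible routes under /v1; adjust base_url, api_key, and model to your setup.
client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

response = client.chat.completions.create(
    model="llama3.2:3b",  # hypothetical model id; use one registered with your server
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```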
### File Operations and Vector Store Integration
The Files API and Vector Store APIs work together through file operations, enabling automatic document processing and search. This integration implements the [OpenAI Vector Store Files API specification](https://platform.openai.com/docs/api-reference/vector-stores-files) and allows you to:
- Upload documents through the Files API
- Automatically process and chunk documents into searchable vectors
- Store processed content in vector databases based on the availability of [our providers](../../providers/index.mdx)
- Search through documents using natural language queries
For detailed information about this integration, see [File Operations and Vector Store Integration](../file_operations_vector_stores.md).
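A hedged sketch of that flow using an OpenAI-compatible client is shown below; the base URL, port, and the availability of the vector store search route depend on your server and client versions, so treat the exact calls as assumptions to verify against the linked specification.
```python
from io import BytesIO

from openai import OpenAI  # assumes a recent openai client with vector store support

# Assumptions: a Llama Stack server on localhost:8321 exposing OpenAI-compatible
# Files and Vector Store routes under /v1.
client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

# 1. Upload a document through the Files API.
uploaded = client.files.create(
    file=("notes.txt", BytesIO(b"Llama Stack integrates Files and Vector Stores.")),
    purpose="assistants",
)

# 2. Create a vector store and attach the file; the server chunks and embeds it.
store = client.vector_stores.create(name="docs")
client.vector_stores.files.create(vector_store_id=store.id, file_id=uploaded.id)

# 3. Search the processed content with a natural-language query.
# (In practice, wait for the file to finish processing before searching.)
results = client.vector_stores.search(vector_store_id=store.id, query="What does Llama Stack integrate?")
for hit in results.data:
    print(hit)
```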
Llama Stack addresses these challenges through a service-oriented, API-first approach:
**Develop Anywhere, Deploy Everywhere**
- Start locally with CPU-only setups
- Move to GPU acceleration when needed
- Deploy to cloud or edge without code changes
- Same APIs and developer experience everywhere
**Production-Ready Building Blocks**
- Pre-built safety guardrails and content filtering
- Built-in RAG and agent capabilities
- Comprehensive evaluation toolkit
- Full observability and monitoring
**True Provider Independence**
- Swap providers without application changes
- Mix and match best-in-class implementations
- Federation and fallback support
- No vendor lock-in
**Robust Ecosystem**
- Llama Stack is already integrated with distribution partners (cloud providers, hardware vendors, and AI-focused companies).
- Ecosystem offers tailored infrastructure, software, and services for deploying a variety of models.
## Our Philosophy
- **Service-Oriented**: REST APIs enforce clean interfaces and enable seamless transitions across different environments.
- **Composability**: Every component is independent but works together seamlessly
- **Production Ready**: Built for real-world applications, not just demos
- **Turnkey Solutions**: Easy-to-deploy, built-in solutions for popular deployment scenarios
With Llama Stack, you can focus on building your application while we handle the infrastructure complexity, essential capabilities, and provider integrations.