Add support for deleting individual chunks from vector stores
- Add abstract remove_chunk() method to EmbeddingIndex base class
- Implement chunk deletion for Faiss provider, SQLite Vec, Milvus,
PGVector
- Add placeholder implementations with NotImplementedError for
Chroma/Qdrant/Weaviate
- Integrate chunk deletion into OpenAI vector store file deletion flow
- Remove xfail from
`test_openai_vector_store_delete_file_removes_from_vector_store`
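For reference, a minimal sketch of the new abstract hook, assuming the `EmbeddingIndex` base class lives alongside the provider indexes (the exact signature may differ):

```python
from abc import ABC, abstractmethod


class EmbeddingIndex(ABC):
    @abstractmethod
    async def remove_chunk(self, chunk_id: str) -> None:
        """Delete a single chunk from the index.

        Providers without support (Chroma/Qdrant/Weaviate for now)
        raise NotImplementedError.
        """
        ...
```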
Closes: #2477
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
Enable Chroma inline unit tests and fix integration tests.
## Test Plan
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Avoid the error message:
```
INFO 2025-07-24 21:51:54,530 __main__:598 server: Received interrupt signal, shutting down gracefully...
ERROR 2025-07-24 21:51:54,692 asyncio:1826 uncategorized: Task was destroyed but it is pending!
task: <Task pending name='Task-15' coro=<refresh_registry() running at
/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/stack.py:356> wait_for=<Future pending cb=[Task.task_wakeup()]> cb=>
```
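A hedged sketch of the shutdown-side fix, assuming the `refresh_registry()` coroutine is kept as a task handle; illustrative, not the exact stack.py code:

```python
import asyncio


async def shutdown_stack(refresh_task: asyncio.Task) -> None:
    # Cancel the background registry-refresh task and wait for it to
    # unwind, so the loop never tears down a still-pending task.
    refresh_task.cancel()
    try:
        await refresh_task
    except asyncio.CancelledError:
        pass
```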
# What does this PR do?
Today, external providers are installed via the `external_providers_dir`
in the config. This necessitates users to understand the `ProviderSpec`
and set up their directories accordingly. This process splits up the
config for the stack across multiple files, directories, and formats.
Most (if not all) external providers today have a
[get_provider_spec](559cb18fbb/src/ramalama_stack/provider.py (L9))
method that sits unused. Utilizing this method rather than the
providers.d route allows for a much easier installation process for
external providers and limits the amount of extra configuration a
regular user has to do to get their stack off the ground.
To accomplish this and wire it throughout the build process, introduce
the concept of a `module` for users to specify for an external provider
at build time. To facilitate this, align the build and run specs to use
the `Provider` class rather than the stringified provider_type that
build currently uses.
For example, say this is in your build config:
```
- provider_id: ramalama
provider_type: remote::ramalama
module: ramalama_stack
```
During build (in the various `build_...` scripts), in addition to
installing any pip dependencies, we will also install this module and
use its `get_provider_spec` method to retrieve the `ProviderSpec` that
is currently specified using `providers.d`.
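For illustration, a hypothetical external provider module exposing `get_provider_spec`; the datatype names follow llama_stack's provider datatypes, but the fields shown are illustrative, not authoritative:

```python
# my_provider/provider.py -- hypothetical external provider package
from llama_stack.providers.datatypes import AdapterSpec, Api, remote_provider_spec


def get_provider_spec():
    # The same information providers.d used to carry as YAML,
    # now discoverable by importing the installed module.
    return remote_provider_spec(
        api=Api.inference,
        adapter=AdapterSpec(
            adapter_type="my_provider",
            pip_packages=["my-provider-sdk"],
            module="my_provider",
            config_class="my_provider.config.MyProviderConfig",
        ),
    )
```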
In production so far, providing instructions for installing external
providers for users has been difficult: they need to install the module
as a pre-req, create the providers.d directory, copy in the provider
spec, and also copy in the necessary build/run yaml files. Accessing an
external provider should be as easy as possible, and pointing to its
installable module aligns more with the rest of our build and dependency
management process.
For now, `external_providers_dir` still exists as an alternate, more
declarative method of using external providers.
## Test Plan
added an integration test installing an external provider from module
and more unit test coverage for `get_provider_registry`
(The warning in yellow is expected: the module is installed inside of
the build env, not where we are running the command.)
<img width="1119" height="400" alt="Screenshot 2025-07-24 at 11 30
48 AM"
src="https://github.com/user-attachments/assets/1efbaf45-b9e8-451a-bd63-264ed664706d"
/>
<img width="1154" height="618" alt="Screenshot 2025-07-24 at 11 31
14 AM"
src="https://github.com/user-attachments/assets/feb2b3ea-c5dd-418e-9662-9a3bd5dd6bdc"
/>
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Bumps [form-data](https://github.com/form-data/form-data) from 4.0.2 to
4.0.4.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/form-data/form-data/releases">form-data's
releases</a>.</em></p>
<blockquote>
<h2>v4.0.4</h2>
<h2><a
href="https://github.com/form-data/form-data/compare/v4.0.3...v4.0.4">v4.0.4</a>
- 2025-07-16</h2>
<h3>Commits</h3>
<ul>
<li>[meta] add <code>auto-changelog</code> <a
href="811f68282f"><code>811f682</code></a></li>
<li>[Tests] handle predict-v8-randomness failures in node < 17 and
node > 23 <a
href="1d11a76434"><code>1d11a76</code></a></li>
<li>[Fix] Switch to using <code>crypto</code> random for boundary values
<a
href="3d1723080e"><code>3d17230</code></a></li>
<li>[Tests] fix linting errors <a
href="5e340800b5"><code>5e34080</code></a></li>
<li>[meta] actually ensure the readme backup isn’t published <a
href="316c82ba93"><code>316c82b</code></a></li>
<li>[Dev Deps] update <code>@ljharb/eslint-config</code> <a
href="58c25d7640"><code>58c25d7</code></a></li>
<li>[meta] fix readme capitalization <a
href="2300ca1959"><code>2300ca1</code></a></li>
</ul>
<h2>v4.0.3</h2>
<h2><a
href="https://github.com/form-data/form-data/compare/v4.0.2...v4.0.3">v4.0.3</a>
- 2025-06-05</h2>
<h3>Fixed</h3>
<ul>
<li>[Fix] <code>append</code>: avoid a crash on nullish values <a
href="https://redirect.github.com/form-data/form-data/issues/577"><code>#577</code></a></li>
</ul>
<h3>Commits</h3>
<ul>
<li>[eslint] use a shared config <a
href="426ba9ac44"><code>426ba9a</code></a></li>
<li>[eslint] fix some spacing issues <a
href="20941917f0"><code>2094191</code></a></li>
<li>[Refactor] use <code>hasown</code> <a
href="81ab41b46f"><code>81ab41b</code></a></li>
<li>[Fix] validate boundary type in <code>setBoundary()</code> method <a
href="8d8e469309"><code>8d8e469</code></a></li>
<li>[Tests] add tests to check the behavior of <code>getBoundary</code>
with non-strings <a
href="837b8a1f75"><code>837b8a1</code></a></li>
<li>[Dev Deps] remove unused deps <a
href="870e4e6659"><code>870e4e6</code></a></li>
<li>[meta] remove local commit hooks <a
href="e6e83ccb54"><code>e6e83cc</code></a></li>
<li>[Dev Deps] update <code>eslint</code> <a
href="4066fd6f65"><code>4066fd6</code></a></li>
<li>[meta] fix scripts to use prepublishOnly <a
href="c4bbb13c0e"><code>c4bbb13</code></a></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="41996f5ac7"><code>41996f5</code></a>
v4.0.4</li>
<li><a
href="316c82ba93"><code>316c82b</code></a>
[meta] actually ensure the readme backup isn’t published</li>
<li><a
href="2300ca1959"><code>2300ca1</code></a>
[meta] fix readme capitalization</li>
<li><a
href="811f68282f"><code>811f682</code></a>
[meta] add <code>auto-changelog</code></li>
<li><a
href="5e340800b5"><code>5e34080</code></a>
[Tests] fix linting errors</li>
<li><a
href="1d11a76434"><code>1d11a76</code></a>
[Tests] handle predict-v8-randomness failures in node < 17 and node
> 23</li>
<li><a
href="58c25d7640"><code>58c25d7</code></a>
[Dev Deps] update <code>@ljharb/eslint-config</code></li>
<li><a
href="3d1723080e"><code>3d17230</code></a>
[Fix] Switch to using <code>crypto</code> random for boundary
values</li>
<li><a
href="d8d67dc8ac"><code>d8d67dc</code></a>
v4.0.3</li>
<li><a
href="e6e83ccb54"><code>e6e83cc</code></a>
[meta] remove local commit hooks</li>
<li>Additional commits viewable in <a
href="https://github.com/form-data/form-data/compare/v4.0.2...v4.0.4">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/meta-llama/llama-stack/network/alerts).
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
- Added ability to specify `required_scope` when declaring an API. This
is part of the `@webmethod` decorator.
- If auth is enabled, a user can access an API only if
`user.attributes['scope']` includes the `required_scope`
- We add `required_scope='telemetry.read'` to the telemetry read APIs.
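For example, a hedged sketch of an API declaration carrying the new scope; the route and method signature are illustrative, and the `webmethod` import path is assumed:

```python
from llama_stack.schema_utils import webmethod


class TelemetryAPI:
    # Callers need "telemetry.read" in user.attributes["scope"]
    # when auth is enabled; otherwise the server returns 403.
    @webmethod(route="/telemetry/traces", method="POST", required_scope="telemetry.read")
    async def query_traces(self) -> list[dict]:
        ...
```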
## Test Plan
CI with added tests
1. Enable server.auth with github token
2. Observe `client.telemetry.query_traces()` returns 403
# What does this PR do?
This PR adds support for the new Streamable HTTP transport for MCP, as
well as falling back to the SSE protocol if the Streamable HTTP
connection fails.
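A hedged sketch of the fallback logic; `connect_streamable_http` and `connect_sse` are hypothetical stand-ins for the real MCP client transports, not the SDK's actual API:

```python
# Hypothetical helper names; the real MCP SDK transports differ.
async def connect_mcp(url: str):
    try:
        return await connect_streamable_http(url)  # preferred transport
    except Exception:
        # Older MCP servers only speak SSE; fall back transparently.
        return await connect_sse(url)
```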
Closes #2542
## Test Plan
---------
Signed-off-by: Calum Murray <cmurray@redhat.com>
# What does this PR do?
Prototype on a new feature to allow new APIs to be plugged in Llama
Stack. Opened for early feedback on the approach and test appetite on
the functionality.
@ashwinb @raghotham open for early feedback, thanks!
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
currently `print` is being used with custom formatting to achieve
telemetry output in the console_span_processor
This causes telemetry not to show up in log files when using
`LLAMA_STACK_LOG_FILE`. During testing it looks like telemetry is not
being captured when it is
switch to using Rich formatting with the logger and then strip the
formatting off when a log file is being used so the formatting looks
normal
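A hedged sketch of the stripping step, assuming ANSI escape codes are what would otherwise land in the file handler; the regex-based formatter is illustrative:

```python
import logging
import re

# Matches ANSI color/style escape sequences emitted by Rich.
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")


class StripAnsiFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # Console handlers keep the colors; file handlers use this
        # formatter so log files stay plain text.
        return ANSI_RE.sub("", super().format(record))
```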
## Test Plan
before:
console:
<img width="967" height="127" alt="Screenshot 2025-07-21 at 4 02 15 PM"
src="https://github.com/user-attachments/assets/b09518cc-9d38-4970-9877-70e2c41fcbb5"
/>
log file (no telemetry):
```
2025-07-21 16:01:32,481 llama_stack.providers.remote.inference.ollama.ollama:117 inference: checking connectivity to Ollama at `http://localhost:11434`...
2025-07-21 16:01:34,779 opentelemetry.trace:537 uncategorized: Overriding of current TracerProvider is not allowed
2025-07-21 16:01:35,083 __main__:587 server: Listening on ['::', '0.0.0.0']:8321
2025-07-21 16:01:35,091 uvicorn.error:84 uncategorized: Started server process [68679]
2025-07-21 16:01:35,091 uvicorn.error:48 uncategorized: Waiting for application startup.
2025-07-21 16:01:35,092 __main__:163 server: Starting up
2025-07-21 16:01:35,092 uvicorn.error:62 uncategorized: Application startup complete.
2025-07-21 16:01:35,092 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
2025-07-21 16:01:37,167 uvicorn.access:473 uncategorized: 127.0.0.1:53145 - "POST /v1/openai/v1/chat/completions HTTP/1.1" 200
```
after:
console:
<img width="797" height="165" alt="Screenshot 2025-07-22 at 3 28 44 PM"
src="https://github.com/user-attachments/assets/44d40e3b-6502-439d-9ea5-38058b289962"
/>
log file:
```
2025-07-21 15:59:51,481 llama_stack.providers.remote.inference.ollama.ollama:117 inference: checking connectivity to Ollama at `http://localhost:11434`...
2025-07-21 15:59:53,801 opentelemetry.trace:537 uncategorized: Overriding of current TracerProvider is not allowed
2025-07-21 15:59:54,059 __main__:587 server: Listening on ['::', '0.0.0.0']:8321
2025-07-21 15:59:54,066 uvicorn.error:84 uncategorized: Started server process [68578]
2025-07-21 15:59:54,067 uvicorn.error:48 uncategorized: Waiting for application startup.
2025-07-21 15:59:54,067 __main__:163 server: Starting up
2025-07-21 15:59:54,067 uvicorn.error:62 uncategorized: Application startup complete.
2025-07-21 15:59:54,068 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
2025-07-21 15:59:55,381 [TELEMETRY] 19:59:55.381 /v1/openai/v1/chat/completions
2025-07-21 15:59:55,619 uvicorn.access:473 uncategorized: 127.0.0.1:53102 - "POST /v1/openai/v1/chat/completions HTTP/1.1" 200
2025-07-21 15:59:55,621 [TELEMETRY] 19:59:55.621 /v1/openai/v1/chat/completions [StatusCode.OK] (240.07ms)
2025-07-21 15:59:55,622 [TELEMETRY] 19:59:55.620 127.0.0.1:53102 - "POST /v1/openai/v1/chat/completions HTTP/1.1" 200
```
Signed-off-by: Charlie Doern <cdoern@redhat.com>
This flips #2823 and #2805 by making the Stack periodically query the
providers for models rather than the providers going behind its back
and calling "register" on the registry themselves. This also adds
support for model listing for all other providers via
`ModelRegistryHelper`.
Once this is done, we do not need to manually list or register models
via `run.yaml` and it will remove both noise and annoyance (setting
`INFERENCE_MODEL` environment variables, for example) from the new user
experience.
In addition, it adds a configuration variable `allowed_models` which can
be used to optionally restrict the set of models exposed from a
provider.
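A hedged sketch of the new knob, assuming it lands on the provider's config model; the name comes from this description, but the placement and types are illustrative:

```python
from pydantic import BaseModel


class ProviderConfig(BaseModel):
    # When set, only these model IDs are exposed from the provider;
    # None means "expose everything the provider lists".
    allowed_models: list[str] | None = None
```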
# What does this PR do?
Adds type guards in /distribution/inspect.py and ignores a valid-type
mypy error in library_client.py. This PR is part of issue #2647. I'm
rather unsure whether ignoring the valid-type error is correct in this
case. It appears that `args[0]` is interpreted as `[any]`, but I didn't
find any way to specify the type.
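For context, a minimal `TypeGuard` sketch of the pattern used (illustrative, not the exact code in inspect.py):

```python
from typing import Any, TypeGuard


def is_str_list(values: list[Any]) -> TypeGuard[list[str]]:
    # On the True branch, mypy narrows `values` to list[str].
    return all(isinstance(v, str) for v in values)
```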
# What does this PR do?
openai/models.py has backward-compat entries for litellm model names.
The starter template includes these in the list of registered models,
and their inclusion results in duplicate model registrations. The
backward compat is no longer necessary.
## Test Plan
CI
# What does this PR do?
This PR implements the openai compatible endpoints for chromadb
Closes #2462
## Test Plan
Ran ollama llama stack server and ran the command
`pytest -sv --stack-config=http://localhost:8321
tests/integration/vector_io/test_openai_vector_stores.py
--embedding-model all-MiniLM-L6-v2`
8 failed, 27 passed, 8 skipped, 1 xfailed.
The failures are related to the files API.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Co-authored-by: sarthakdeshpande <sarthak.deshpande@engati.com>
Co-authored-by: Francisco Javier Arceo <farceo@redhat.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
- Use printf to escape special characters (e.g. `<`, `>`)
- Apply escaping to pip_dependencies and special_pip_deps
Resolves shell interpretation of `>=` operators as redirections, which
was causing builds to ignore version constraints and create unexpected
files in the /app directory.
Closes: #2866
## Test Plan
Manually tested, will also be tested by existing CI
Signed-off-by: Derek Higgins <derekh@redhat.com>
- rm TEMP_DIR when build_container.sh succeeds
- prevents multiple temp directories with Containerfile being left in
/tmp
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
I noticed a few issues with my implementation of the search mode
validation for RagQuery.
This PR replaces the check for search mode in RagQuery with a Literal.
There were issues before with
```
TypeError: Object of type RAGSearchMode is not JSON serializable
```
When using
```
query_config = RAGQueryConfig(max_chunks=6, mode="vector").model_dump()
```
It also fixes a bug where, despite user input, "vector" was always the
search mode actually used.
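A hedged sketch of the change, assuming `RAGQueryConfig` is a pydantic model; the field set and defaults are illustrative:

```python
from typing import Literal

from pydantic import BaseModel


class RAGQueryConfig(BaseModel):
    max_chunks: int = 5
    # A Literal serializes as a plain string (unlike the old enum),
    # and pydantic rejects any value outside the allowed set.
    mode: Literal["vector", "keyword"] = "vector"
```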
## Test Plan
Verify that a chosen search mode works when using RAG query, or use the
agent config below:
```
agent = Agent(
client,
model=model_id,
instructions="You are a helpful assistant",
tools=[
{
"name": "builtin::rag/knowledge_search",
"args": {
"vector_db_ids": [vector_db_id],
"query_config": {
"mode": "keyword",
"max_chunks": 6
}
},
}
],
)
```
Running Unit Tests:
```
uv sync --extra dev
uv run pytest tests/unit/rag/test_rag_query.py -v
```
# What does this PR do?
Moving vector store and vector store files helper methods to
`openai_vector_store_mixin.py`
## Test Plan
The tests are already supported in the CI, which tests the inline
providers and the current integration tests.
Note that the `vector_index` fixture will test `milvus_vec_adapter`,
`faiss_vec_adapter`, and `sqlite_vec_adapter` in
`tests/unit/providers/vector_io/test_vector_io_openai_vector_stores.py`.
Additionally, the integration tests in `integration-vector-io-tests.yml`
run `tests/integration/vector_io` tests for the following providers:
```python
vector-io-provider: ["inline::faiss", "inline::sqlite-vec", "inline::milvus", "remote::chromadb", "remote::pgvector"]
```
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Add an `OpenAIMixin` for use by inference providers whose remote
endpoints support an OpenAI-compatible API.
Use is demonstrated by refactoring:
- OpenAIInferenceAdapter
- NVIDIAInferenceAdapter (adds embedding support)
- LlamaCompatInferenceAdapter
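A hedged sketch of the intended usage pattern; the import path, hook names, and adapter class are illustrative:

```python
# Import path illustrative; OpenAIMixin is the new helper from this PR.
from llama_stack.providers.utils.inference.openai_mixin import OpenAIMixin


class MyInferenceAdapter(OpenAIMixin):
    # The mixin is assumed to construct the OpenAI client from these
    # two hooks and serve the openai_* endpoints through it.
    def get_api_key(self) -> str:
        return "sk-..."

    def get_base_url(self) -> str:
        return "https://api.example.com/v1"
```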
## Test Plan
existing unit and integration tests
# What does this PR do?
https://github.com/meta-llama/llama-stack/pull/2716/ broke commands
like:
```
python -m llama_stack.distribution.server.server --config
llama_stack/templates/starter/run.yaml
```
And will fail with:
```
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py", line 626, in <module>
main()
File "/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py", line 402, in main
config_file = resolve_config_or_template(args.config, Mode.RUN)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/utils/config_resolution.py", line 43, in resolve_config_or_template
config_path = Path(config_or_template)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.8/Frameworks/Python.framework/Versions/3.12/lib/python3.12/pathlib.py", line 1162, in __init__
super().__init__(*args)
File "/opt/homebrew/Cellar/python@3.12/3.12.8/Frameworks/Python.framework/Versions/3.12/lib/python3.12/pathlib.py", line 373, in __init__
raise TypeError(
TypeError: argument should be a str or an os.PathLike object where __fspath__ returns a str, not 'NoneType'
```
The traceback complains that no positional argument is present. We now
honour the deprecation until `--config` and `--template` are removed
completely.
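A hedged sketch of honouring both forms, assuming an argparse setup with an optional positional argument; the flag handling is illustrative:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("config_or_template", nargs="?", default=None)
parser.add_argument("--config", default=None, help="(deprecated) use the positional form")
args = parser.parse_args()

# Honour the deprecated flag rather than handing None to Path().
config = args.config_or_template or args.config
```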
## Test Plan
Both ` python -m llama_stack.distribution.server.server --config
llama_stack/templates/starter/run.yaml` and ` python -m
llama_stack.distribution.server.server
llama_stack/templates/starter/run.yaml` should run the server. Same for
`--template starter`.
Signed-off-by: Sébastien Han <seb@redhat.com>
- Remove --no-cache flags from uv pip install commands to enable caching
- Mount host uv cache directory to container for persistent caching
- Set UV_LINK_MODE=copy to prevent uv using hardlinks
- When building the starter image:
  - Build time reduced from ~4:45 to ~3:05 on subsequent builds (environment specific)
  - Eliminates re-downloading of 3G+ of data on each build
  - Cache size: ~6.2G (when building the starter image)
Fixes excessive data downloads during distro container builds.
Signed-off-by: Derek Higgins <derekh@redhat.com>
This PR updates model registration and lookup behavior to be slightly
more general / flexible. See
https://github.com/meta-llama/llama-stack/issues/2843 for more details.
Note that this change is backwards compatible given the design of the
`lookup_model()` method.
## Test Plan
Added unit tests
# What does this PR do?
This PR fixes flaky telemetry tests
See https://github.com/meta-llama/llama-stack/pull/2814
## Test Plan
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
Makes `name` optional in `openai_create_vector_store`.
Closes https://github.com/meta-llama/llama-stack/issues/2706
## Test Plan
CI and unit tests
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Ensures that session turns retrieved from the agent persistence layer
are sorted by their `started_at` timestamp, as the key-value store does
not guarantee order.
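The fix boils down to an explicit sort after fetching the turns, along the lines of this sketch (attribute name taken from the description above):

```python
# Key-value stores return turns in arbitrary order; sort explicitly.
turns.sort(key=lambda turn: turn.started_at)
```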
Closes #2852
## Test Plan
- [ ] Add unit tests
# What does this PR do?
Minor update of the pgvector doc, changing 'faiss' to 'pgvector'.
## Test Plan
# What does this PR do?
Refactors the vector store routing logic by moving OpenAI-compatible
vector store operations from the `VectorIORouter` to the
`VectorDBsRoutingTable`.
Closes https://github.com/meta-llama/llama-stack/issues/2761
## Test Plan
Added unit tests to cover new routing logic and ACL checks.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Part of #2696
## Test Plan
Run `llama stack run starter`
Error:
```
myenv ❯ llama stack run starters
WARNING 2025-07-10 12:12:43,052 llama_stack.cli.stack.run:82 server: Conda detected. Using conda environment myenv for the run.
usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--env KEY=VALUE]
[--image-type {conda,venv}] [--enable-ui]
[config | template]
llama stack run: error: Could not resolve config or template 'starters'.
Tried the following locations:
1. As file path: /Users/erichuang/projects/llama-stack-git/starters
2. As template: /Users/erichuang/projects/llama-stack-git/llama_stack/templates/starters/run.yaml
3. As built distribution: (/Users/erichuang/.llama/distributions/llamastack-starters/starters-run.yaml, /Users/erichuang/.llama/distributions/starters/starters-run.yaml)
Available templates: dell, test-env, vllm-gpu, test-template, cerebras, openai-api-verification, sambanova, passthrough, direct-config, together, openai, fireworks, meta-reference-gpu, __pycache__, dev, ollama, watsonx, remote-vllm, llama_api, groq, dummy, oracle, nvidia, ci-tests, postgres-demo, test-stack, bedrock, starter, hf-serverless, hf-endpoint, tgi, open-benchmark, verification
Did you mean one of these templates?
- starter
- together
- postgres-demo
```
# What does this PR do?
After https://github.com/meta-llama/llama-stack/pull/2818, SIGINT will
print a stack trace. This is because uvicorn re-raises SIGINT, and
Python's internal signal handler (which handles SIGINT by default)
converts it into a KeyboardInterrupt exception. We now simply catch the
exception to get a clean exit; this does not change the behavior on
SIGINT.
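A hedged sketch of the catch, assuming `server` is the `uvicorn.Server` driven from our own event loop:

```python
try:
    loop.run_until_complete(server.serve())
except KeyboardInterrupt:
    # uvicorn re-raises SIGINT after its graceful shutdown; swallow it
    # so the process exits cleanly with no stack trace.
    pass
```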
## Test Plan
Run the server, hit Ctrl+C or `kill -2 <server pid>` and expect a clean
exit with no stack trace.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This PR adds a `provider_id` field to the `VectorDBInput` class.
Fixes https://github.com/meta-llama/llama-stack/issues/2819
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
Just like #2805 but for vLLM.
We also make the `VLLM_URL` env variable optional (not required) -- if
not specified, the provider silently sits idle and yells eventually if
someone tries to call a completion on it. This is done so as to allow
this provider to be present in the `starter` distribution.
## Test Plan
Set up vLLM, copy the starter template and set `{ refresh_models: true,
refresh_models_interval: 10 }` for the vllm provider and then run:
```
ENABLE_VLLM=vllm VLLM_URL=http://localhost:8000/v1 \
uv run llama stack run --image-type venv /tmp/starter.yaml
```
Verify that `llama-stack-client models list` brings up the model
correctly from vLLM.
Inline _inference_ providers haven't proved to be very useful -- they
are rarely used. And for good reason -- it is almost never a good idea
to include a complex (distributed) inference engine bundled into a
distributed stateful front-end server serving many other things.
Responsibility should be split properly.
See Discord discussion:
1395849853
For self-hosted providers like Ollama (or vLLM), the backing server is
running a set of models. That server should be treated as the source of
truth and the Stack registry should just be a cache for those models. Of
course, in production environments, you may not want this (because you
know what model you are running statically) hence there's a config
boolean to control this behavior.
_This is part of a series of PRs aimed at removing the requirement of
needing to set `INFERENCE_MODEL` env variables for running Llama Stack
server._
## Test Plan
Copy and modify the starter.yaml template / config and enable
`refresh_models: true, refresh_models_interval: 10` for the ollama
provider. Then, run:
```
LLAMA_STACK_LOGGING=all=debug \
ENABLE_OLLAMA=ollama uv run llama stack run --image-type venv /tmp/starter.yaml
```
See a gargantuan amount of logs, but verify that the provider is
periodically refreshing models. Stop and prune a model from ollama
server, restart the server. Verify that the model goes away when I call
`uv run llama-stack-client models list`
# What does this PR do?
This PR fixes the `DPOAlignmentConfig` schema to use the correct Direct
Preference Optimization (DPO) parameters.
The current schema incorrectly uses PPO-inspired parameters
(`reward_scale`, `reward_clip`, `epsilon`, `gamma`) that are not part of
the DPO algorithm. This PR updates it to use the standard DPO
parameters:
- `beta`: The KL divergence coefficient that controls deviation from the
reference model
- `loss_type`: The type of DPO loss function (sigmoid, hinge, ipo,
kto_pair)
These parameters align with standard DPO implementations like
HuggingFace's TRL library.
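A hedged sketch of the updated schema; the types and default are illustrative:

```python
from typing import Literal

from pydantic import BaseModel


class DPOAlignmentConfig(BaseModel):
    # KL-divergence coefficient controlling drift from the reference model.
    beta: float
    # Standard DPO loss variants, mirroring TRL's options.
    loss_type: Literal["sigmoid", "hinge", "ipo", "kto_pair"] = "sigmoid"
```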
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-43-83.ec2.internal>
When we call `construct_stack()`, providers are instantiated and
`initialize()` is called. This call can end up doing _anything_ at all
-- specifically, providers are free to create long running background
tasks as part of this. If we wrapped this within a `asyncio.run()` as in
the current code, these tasks get canceled when the stack construction
finishes. This is not correct. The PR addresses the issue by creating a
persistent event loop which is used for both the stack as well as for
running the uvicorn server. In other words, the lifetime of the
providers (and downstream async code) is now the same as the lifetime of
the uvicorn server.
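A hedged sketch of the shape of the fix; the uvicorn API is as documented, while `construct_stack`, `config`, and `app` are assumed from the server code:

```python
import asyncio

import uvicorn

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

# Providers (and any background tasks they spawn) start on this loop...
impls = loop.run_until_complete(construct_stack(config))

# ...and uvicorn runs on the same loop, so provider tasks live exactly
# as long as the server instead of dying when construction returns.
server = uvicorn.Server(uvicorn.Config(app, host="0.0.0.0", port=8321))
loop.run_until_complete(server.serve())
```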
## Test Plan
This should not affect any current code since we don't have background
tasks created right now. However,
https://github.com/meta-llama/llama-stack/pull/2805 will start using
this functionality.
# What does this PR do?
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
## Test Plan
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
## Test Plan
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
If I am running `uv run llama stack run --image-type venv`, it should
not be saying "Conda detected" to me, because I am pretty clearly
telling it I need venv. The root cause is the offending line.
# What does this PR do?
Lets users register models available at
https://integrate.api.nvidia.com/v1/models that aren't already in
llama_stack/providers/remote/inference/nvidia/models.py.
## Test Plan
1. run the nvidia distro
2. register a model from https://integrate.api.nvidia.com/v1/models
that isn't already known; as of this writing,
nvidia/llama-3.1-nemotron-ultra-253b-v1 is a good example
3. perform inference w/ the model
- POST /v1/models accepts an optional provider_model_id
- The ModelsRoutingTable.register_model handler ensures it is non-None,
providing a default
Usage of Model.provider_model_id will no longer need to detect None.
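A hedged sketch of the resulting client call; the client API shape is assumed from llama-stack-client:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# provider_model_id may be omitted; the routing table now defaults it
# to the model_id.
client.models.register(
    model_id="nvidia/llama-3.1-nemotron-ultra-253b-v1",
    provider_id="nvidia",
)
```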
Move sentence-transformers to be the first embedding model in the list
of models. This ensures it will always be the default and is more
consistent than having the default change based on which env variables
are available.
Closes: #2702
## Test Plan
Manually verified
Signed-off-by: Derek Higgins <derekh@redhat.com>