# What does this PR do?
This commit significantly improves the environment variable substitution
functionality in Llama Stack configuration files:
* The version field in configuration files has been changed from string
to integer type for better type consistency across build and run
configurations.
* The environment variable substitution system for ${env.FOO:} was fixed
and now properly returns an error.
* The environment variable substitution system for ${env.FOO+} now returns
None instead of an empty string, which better matches the type annotations
in config fields.
* The system includes automatic type conversion for boolean, integer,
and float values.
* The error messages have been enhanced to provide clearer guidance when
environment variables are missing, including suggestions for using
default values or conditional syntax.
* Comprehensive documentation has been added to the configuration guide
explaining all supported syntax patterns, best practices, and runtime
override capabilities.
* Multiple provider configurations have been updated to use the new
conditional syntax for optional API keys, making the system more
flexible for different deployment scenarios. The telemetry configuration
has been improved to properly handle optional endpoints with appropriate
validation, ensuring that required endpoints are specified when their
corresponding sinks are enabled.
* There were many instances of ${env.NVIDIA_API_KEY:} that should have
caused the code to fail. However, due to a bug, the distro server was
still being started, and early validation wasn’t triggered. As a result,
failures were likely being handled downstream by the providers. I’ve
maintained similar behavior by using ${env.NVIDIA_API_KEY:+}, though I
believe this is incorrect for many configurations. I’ll leave it to each
provider to correct it as needed.
* Environment variable substitution now uses the same syntax as Bash
parameter expansion.
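As a rough illustration of the behaviour described above, here is a minimal sketch of Bash-style substitution with type conversion. It is not the code from this PR: the regex, the `substitute` and `_convert` helpers, and the `:=` default form are assumptions made for the example, and it only handles values that consist of a single reference.

```python
import os
import re

# Hypothetical pattern covering ${env.NAME}, ${env.NAME:=default} and ${env.NAME:+value}.
_ENV_REF = re.compile(
    r"\$\{env\.(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::(?P<op>[=+]?)(?P<arg>[^}]*))?\}"
)


def _convert(value):
    """Convert 'true'/'false', integers and floats; leave other strings alone."""
    lowered = value.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def substitute(text, env=None):
    """Resolve a config value that consists of a single ${env...} reference."""
    env = os.environ if env is None else env
    match = _ENV_REF.fullmatch(text)
    if match is None:
        return text  # not an env reference, keep the literal value
    name, op, arg = match["name"], match["op"], match["arg"]
    current = env.get(name)
    if op == "":
        # A bare ':' with no operator, e.g. ${env.FOO:}, is rejected.
        raise ValueError(f"malformed reference {text!r}; use ':=' or ':+'")
    if op == "=":
        # Default form: fall back to the default when unset or empty.
        return _convert(current if current else arg)
    if op == "+":
        # Conditional form: the value only when the variable is set, else None.
        return _convert(arg) if current and arg else None
    if current is None:
        raise ValueError(f"{name} is not set; consider ${{env.{name}:=default}}")
    return _convert(current)
```

With this sketch, `substitute("${env.PORT:=8321}", {})` yields the integer `8321`, while `substitute("${env.API_KEY:+}", {})` yields `None`.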
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
* Given that our API packages use `import *` in `__init__.py`, we don't
need to do `from llama_stack.apis.models.models` but simply
`from llama_stack.apis.models`. The decision to use `import *` is
debatable and should probably be revisited at some point.
* Remove unneeded Ruff F401 rule
* Consolidate Ruff F403 rule in the pyproject
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Adding `ChunkMetadata` so we can properly delete embeddings later.
More specifically, this PR refactors and extends the chunk metadata
handling in the vector database and introduces a distinction between
metadata used for model context and backend-only metadata required for
chunk management, storage, and retrieval. It also improves chunk ID
generation and propagation throughout the stack, enhances test coverage,
and adds new utility modules.
```python
class ChunkMetadata(BaseModel):
"""
`ChunkMetadata` is backend metadata for a `Chunk` that is used to store additional information about the chunk that
will NOT be inserted into the context during inference, but is required for backend functionality.
Use `metadata` in `Chunk` for metadata that will be used during inference.
"""
document_id: str | None = None
chunk_id: str | None = None
source: str | None = None
created_timestamp: int | None = None
updated_timestamp: int | None = None
chunk_window: str | None = None
chunk_tokenizer: str | None = None
chunk_embedding_model: str | None = None
chunk_embedding_dimension: int | None = None
content_token_count: int | None = None
metadata_token_count: int | None = None
```
Eventually we can migrate the document_id out of the `metadata` field.
I've introduced the changes so that `ChunkMetadata` is backwards
compatible with `metadata`.
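To make the distinction concrete, here is a minimal sketch of how context rendering can ignore the backend-only fields. The `ChunkSketch` and `ChunkMetadataSketch` dataclasses, the `chunk_metadata` attribute name, and `render_context` are stand-ins invented for this example, not the classes from this PR.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChunkMetadataSketch:
    """Backend-only fields; a trimmed stand-in for the ChunkMetadata model above."""
    document_id: Optional[str] = None
    chunk_id: Optional[str] = None
    source: Optional[str] = None


@dataclass
class ChunkSketch:
    content: str
    # Metadata that may be shown to the model during inference.
    metadata: dict = field(default_factory=dict)
    # Hypothetical attribute name for the backend-only metadata.
    chunk_metadata: Optional[ChunkMetadataSketch] = None


def render_context(chunk):
    """Build the text handed to the model from content and `metadata` only."""
    lines = [chunk.content]
    lines += [f"{key}: {value}" for key, value in sorted(chunk.metadata.items())]
    return "\n".join(lines)
```

The backend fields stay available for deletion and bookkeeping, but nothing in `chunk_metadata` reaches the prompt.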
<!-- If resolving an issue, uncomment and update the line below -->
Closes https://github.com/meta-llama/llama-stack/issues/2501
## Test Plan
Added unit tests
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Our starter distro required Ollama to be running (and a large list of
models available in that Ollama) to successfully start. This adjusts
things so that Ollama does not have to be running to use the starter
template / distro.
To accomplish this, a few changes were needed:
* The Ollama provider can now be configured to either raise an Exception
or just log a warning when it cannot reach the Ollama server on
startup. The default is to raise an exception (same as previous
behavior), but in the starter template we adjust this to just log a
warning so that we can bring the stack up without needing a running
Ollama server.
* The starter template no longer specifies a default list of models for
Ollama, as any models specified there need to actually be pulled and
available in Ollama. Instead, it adds a new
`OLLAMA_INFERENCE_MODEL` environment variable where users can provide an
optional model to register with the Ollama provider on startup.
Additional models can also be registered via the typical
`models.register(...)` at runtime.
* The vLLM template was adjusted to also allow an optional
`VLLM_INFERENCE_MODEL` specified on startup, so that the behavior
between vLLM and Ollama was consistent here to make it easy to get up
and running quickly.
* The default vector store was changed from sqlite-vec to faiss.
sqlite-vec can be enabled via setting the `ENABLE_SQLITE_VEC` environment
variable, like we do for chromadb and pgvector. This is due to
sqlite-vec not shipping proper arm64 binaries, like we previously fixed
in #1530 for the ollama distribution.
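The raise-or-warn behaviour from the first bullet can be sketched roughly as follows. The `check_ollama` function, the `ping` callable, and the `raise_on_connect_error` flag name are assumptions for illustration; the actual config field in the provider may be named differently.

```python
import logging

logger = logging.getLogger("ollama_sketch")


def check_ollama(ping, raise_on_connect_error=True):
    """Return True if the server answered; otherwise raise or warn depending on the flag."""
    try:
        ping()  # hypothetical callable that contacts the Ollama server
    except ConnectionError as exc:
        if raise_on_connect_error:
            # Default: fail startup, matching the previous behaviour.
            raise RuntimeError("Ollama server is not reachable") from exc
        # Relaxed mode: log and let the stack come up anyway.
        logger.warning("Ollama server is not reachable, continuing: %s", exc)
        return False
    return True
```

A template that should start without Ollama would pass `raise_on_connect_error=False`; everything else keeps the strict default.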
## Test Plan
With this change, the following scenarios now work with the starter
template that did not before:
* no Ollama running
* Ollama running but not all of the Llama models pulled locally
* Ollama running with a custom model registered on startup
* vLLM running with a custom model registered on startup
* running the starter template on linux/arm64, like when running
containers on Mac without rosetta emulation
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Add search_mode parameter (vector/keyword/hybrid) to
openai_search_vector_store method. Fixes OpenAPI
code generation by using str instead of Literal type.
Closes: #2459
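Since the parameter is a plain `str` rather than a `Literal`, the allowed values need a runtime check somewhere. A minimal sketch of such a check, assuming a hypothetical `validate_search_mode` helper and a `"vector"` default (neither is confirmed by this PR):

```python
VALID_SEARCH_MODES = ("vector", "keyword", "hybrid")


def validate_search_mode(search_mode="vector"):
    """Reject anything outside the three supported modes."""
    if search_mode not in VALID_SEARCH_MODES:
        allowed = ", ".join(VALID_SEARCH_MODES)
        raise ValueError(f"search_mode must be one of: {allowed}; got {search_mode!r}")
    return search_mode
```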
## Test Plan
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
feat: Add Gemini 2.0 and 2.5 models
This commit expands the set of known Gemini models by introducing:
- `gemini/gemini-2.0-flash`
- `gemini/gemini-2.5-flash`
- `gemini/gemini-2.5-pro`
These new models are added to `LLM_MODEL_IDS` for broader compatibility
and updated in `run.yaml` to allow for their immediate use in starter
configurations.
Signed-off-by: Eran Cohen <eranco@redhat.com>
# What does this PR do?
This adds the ability to list, retrieve, update, and delete Vector Store
Files. It implements these new APIs for the faiss and sqlite-vec
providers, since those are the two that also have the rest of the vector
store files implementation.
Closes #2445
## Test Plan
### test_openai_vector_stores Integration Tests
There are a number of new integration tests added, which I ran for each
provider as outlined below.
faiss (from ollama distro):
```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack run llama_stack/templates/ollama/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
pytest -sv tests/integration/vector_io/test_openai_vector_stores.py \
--embedding-model=all-MiniLM-L6-v2
```
sqlite-vec (from starter distro):
```
llama stack run llama_stack/templates/starter/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
pytest -sv tests/integration/vector_io/test_openai_vector_stores.py \
--embedding-model=all-MiniLM-L6-v2
```
### file_search verification tests
I also ensured the file_search verification tests continue to work, both
for faiss and sqlite-vec.
faiss (ollama distro):
```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack run llama_stack/templates/ollama/run.yaml
pytest -sv tests/verifications/openai_api/test_responses.py \
-k'file_search' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model=meta-llama/Llama-3.2-3B-Instruct
```
sqlite-vec (starter distro):
```
llama stack run llama_stack/templates/starter/run.yaml
pytest -sv tests/verifications/openai_api/test_responses.py \
-k'file_search' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model=together/meta-llama/Llama-3.2-3B-Instruct-Turbo
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Move to use vector_stores.search for file search tool in Responses,
which supports filters.
Closes #2435
## Test Plan
Added e2e test with filters.
llama stack run llama_stack/templates/fireworks/run.yaml
pytest -sv tests/verifications/openai_api/test_responses.py \
-k 'file_search and filters' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model=meta-llama/Llama-3.3-70B-Instruct
# What does this PR do?
- Implement OpenAI-compatible embeddings endpoint in vLLM provider
- Support both float and base64 encoding formats
- Add proper error handling and response formatting
Closes #2447
## Test Plan
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
# What does this PR do?
This PR contains two sets of notebooks that serve as reference material
for developers getting started with Llama Stack using the NVIDIA
Provider. Developers should be able to execute these notebooks
end-to-end, pointing to their NeMo Microservices deployment.
1. `beginner_e2e/`: Notebook that walks through a beginner end-to-end
workflow that covers creating datasets, running inference, customizing
and evaluating models, and running safety checks.
2. `tool_calling/`: Notebook that is ported over from the [Data Flywheel
& Tool Calling
notebook](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/nemo/data-flywheel)
that is referenced in the NeMo Microservices docs. I updated the
notebook to use the Llama Stack client wherever possible, and added
relevant instructions.
## Test Plan
- Both notebook folders contain READMEs with pre-requisites. To manually
test these notebooks, you'll need to have a deployment of the NeMo
Microservices Platform and update the `config.py` file with your
deployment's information.
- I've run through these notebooks manually end-to-end to verify each
step works.
---------
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
Code completion apps need "fill in the middle" capabilities.
Added option of `suffix` to `openai_completion` to enable this.
Updated ollama provider to showcase the same.
### Test Plan
```
pytest -sv --stack-config="inference=ollama" tests/integration/inference/test_openai_completion.py --text-model qwen2.5-coder:1.5b -k test_openai_completion_non_streaming_suffix
```
### OpenAI Sample script
```
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8321/v1/openai/v1")
response = client.completions.create(
model="qwen2.5-coder:1.5b",
prompt="The capital of ",
suffix="is Paris.",
max_tokens=10,
)
print(response.choices[0].text)
```
### Output
```
France is ____.
To answer this question, we
```
# What does this PR do?
Add support for hybrid search mode in SQLite-vec provider, which combines
keyword and vector search for better results. The implementation:
- Adds hybrid search mode as a new option alongside vector and keyword
  search
- Implements `query_hybrid` method in `SQLiteVecIndex` that:
  - First performs keyword search to get candidate matches
  - Then applies vector similarity search on those candidates
- Updates documentation to reflect the new search mode

This change improves search quality by leveraging both semantic similarity
and keyword matching, while maintaining backward compatibility with existing
vector and keyword search modes.
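The two-stage flow described above (keyword search for candidates, then vector similarity on those candidates) can be sketched in plain Python. This is an in-memory illustration only: the substring keyword match, the dict-based chunk layout, and this `query_hybrid` function are assumptions for the example and do not mirror the SQLite queries in the provider.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def query_hybrid(chunks, query_text, query_embedding, k=3):
    """Stage 1: keep chunks containing any query term. Stage 2: rank them by similarity."""
    terms = query_text.lower().split()
    candidates = [
        chunk for chunk in chunks
        if any(term in chunk["content"].lower() for term in terms)
    ]
    scored = [
        (cosine_similarity(chunk["embedding"], query_embedding), chunk["content"])
        for chunk in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Chunks that are semantically close but contain none of the query terms are dropped in stage 1, which is the main trade-off of filtering before ranking.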
## Test Plan
```
pytest tests/unit/providers/vector_io/test_sqlite_vec.py -v -s --tb=short
/Users/vnarsing/miniconda3/envs/stack-client/lib/python3.10/site-packages/pytest_asyncio/plugin.py:217: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
=============================================================================================== test session starts ===============================================================================================
platform darwin -- Python 3.10.16, pytest-8.3.5, pluggy-1.5.0 -- /Users/vnarsing/miniconda3/envs/stack-client/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.16', 'Platform': 'macOS-14.7.6-arm64-arm-64bit', 'Packages': {'pytest': '8.3.5', 'pluggy': '1.5.0'}, 'Plugins': {'html': '4.1.1', 'json-report': '1.5.0', 'timeout': '2.4.0', 'metadata': '3.1.1', 'anyio': '4.8.0', 'asyncio': '0.26.0', 'nbval': '0.11.0', 'cov': '6.1.1'}}
rootdir: /Users/vnarsing/go/src/github/meta-llama/llama-stack
configfile: pyproject.toml
plugins: html-4.1.1, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, anyio-4.8.0, asyncio-0.26.0, nbval-0.11.0, cov-6.1.1
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 10 items
tests/unit/providers/vector_io/test_sqlite_vec.py::test_add_chunks PASSED
tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_vector PASSED
tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_full_text_search PASSED
tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_hybrid PASSED
tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_full_text_search_k_greater_than_results PASSED
tests/unit/providers/vector_io/test_sqlite_vec.py::test_chunk_id_conflict PASSED
tests/unit/providers/vector_io/test_sqlite_vec.py::test_generate_chunk_id PASSED
tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_hybrid_no_keyword_matches PASSED
tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_hybrid_score_threshold PASSED
tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_hybrid_different_embedding PASSED
```
---------
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
# What does this PR do?
This is an initial working prototype of wiring up the `file_search`
builtin tool for the Responses API to our existing rag knowledge search
tool.
This is me seeing what I could pull together on top of the bits we
already have merged. This may not be the ideal way to implement this,
and things like how I shuffle the vector store ids from the original
response API tool request to the actual tool execution feel a bit hacky
(grep for `tool_kwargs["vector_db_ids"]` in `_execute_tool_call` to see
what I mean).
## Test Plan
I stubbed in some new tests to exercise this using text and pdf
documents.
Note that this is currently under tests/verification only because it
sometimes flakes with tool calling of the small Llama-3.2-3B model we
run in CI (and that I use as an example below). We'd want to make the
test a bit more robust in some way if we moved this over to
tests/integration and ran it in CI.
### OpenAI SaaS (to verify test correctness)
```
pytest -sv tests/verifications/openai_api/test_responses.py \
-k 'file_search' \
--base-url=https://api.openai.com/v1 \
--model=gpt-4o
```
### Fireworks with faiss vector store
```
llama stack run llama_stack/templates/fireworks/run.yaml
pytest -sv tests/verifications/openai_api/test_responses.py \
-k 'file_search' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model=meta-llama/Llama-3.3-70B-Instruct
```
### Ollama with faiss vector store
This sometimes flakes on Ollama because the quantized small model
doesn't always choose to call the tool to answer the user's question.
But, it often works.
```
ollama run llama3.2:3b
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack run ./llama_stack/templates/ollama/run.yaml \
--image-type venv \
--env OLLAMA_URL="http://0.0.0.0:11434"
pytest -sv tests/verifications/openai_api/test_responses.py \
-k'file_search' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model=meta-llama/Llama-3.2-3B-Instruct
```
### OpenAI provider with sqlite-vec vector store
```
llama stack run ./llama_stack/templates/starter/run.yaml --image-type venv
pytest -sv tests/verifications/openai_api/test_responses.py \
-k 'file_search' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model=openai/gpt-4o-mini
```
### Ensure existing vector store integration tests still pass
```
ollama run llama3.2:3b
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack run ./llama_stack/templates/ollama/run.yaml \
--image-type venv \
--env OLLAMA_URL="http://0.0.0.0:11434"
LLAMA_STACK_CONFIG=http://localhost:8321 \
pytest -sv tests/integration/vector_io \
--text-model "meta-llama/Llama-3.2-3B-Instruct" \
--embedding-model=all-MiniLM-L6-v2
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This PR adds OpenAI compatibility for Ollama embeddings. Closes
https://github.com/meta-llama/llama-stack/issues/2428
Summary of changes:
- `llama_stack/providers/remote/inference/ollama/ollama.py`
- Implements the OpenAI embeddings endpoint for Ollama, replacing the
NotImplementedError with a full function that validates the model,
prepares parameters, calls the client, encodes embedding data
(optionally in base64), and returns a correctly structured response.
- Updates import statements to include the new embedding response
utilities.
- `llama_stack/providers/utils/inference/litellm_openai_mixin.py`
- Refactors the embedding data encoding logic to use a new shared
utility (`b64_encode_openai_embeddings_response`) instead of inline
base64 encoding and packing logic.
- Cleans up imports accordingly.
- `llama_stack/providers/utils/inference/openai_compat.py`
- Adds `b64_encode_openai_embeddings_response` to handle encoding OpenAI
embedding outputs (including base64 support) in a reusable way.
- Adds `prepare_openai_embeddings_params` utility for standardizing
embedding parameter preparation.
- Updates imports to include the new embedding data class.
- `tests/integration/inference/test_openai_embeddings.py`
- Removes `"remote::ollama"` from the list of providers that skip OpenAI
embeddings tests, since support is now implemented.
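For readers unfamiliar with the base64 encoding format, here is a minimal sketch of the packing step: the floats are packed as little-endian float32 values and the bytes are base64-encoded. The `encode_embedding` and `decode_embedding` names are made up for this example and are not the `b64_encode_openai_embeddings_response` utility itself.

```python
import base64
import struct


def encode_embedding(vector):
    """Pack floats as little-endian float32 and return a base64 string."""
    packed = struct.pack(f"<{len(vector)}f", *vector)
    return base64.b64encode(packed).decode("ascii")


def decode_embedding(encoded):
    """Inverse of encode_embedding: base64 string back to a list of floats."""
    raw = base64.b64decode(encoded)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))
```

Values that are not exactly representable as float32 come back slightly rounded, so a round trip is only exact for values such as 0.5 or 2.0.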
## Note
There was one minor issue, which required me to override the
`OpenAIEmbeddingsResponse.model` name with
`self._get_model(model).identifier` name, which is very unsatisfying.
## Test Plan
Unit Tests and integration tests
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Updated the `search` functionality return response to match openai.
## Test Plan
```
pytest -sv --stack-config=http://localhost:8321 tests/integration/vector_io/test_openai_vector_stores.py --embedding-model all-MiniLM-L6-v2
```
# What does this PR do?
Fixes the weaviate provider's `query_vector` function for when the distance
between the query embedding and an embedding within the vector db is 0
(identical vectors). Catches `ZeroDivisionError` and then sets `score`
to infinity, which represents maximum similarity.
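A minimal sketch of the guard, assuming the score is computed as the reciprocal of the distance (the exact scoring expression in the provider is not shown here, and `distance_to_score` is a hypothetical helper name):

```python
def distance_to_score(distance):
    """Turn a distance into a similarity score; identical vectors score infinity."""
    try:
        return 1.0 / distance
    except ZeroDivisionError:
        # Distance 0 means the vectors are identical: treat as maximum similarity.
        return float("inf")
```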
Closes [#2381]
## Test Plan
Check out this PR and execute the code below; it no longer raises a
`ZeroDivisionError` exception.
```
from llama_stack_client import LlamaStackClient
base_url = "http://localhost:8321"
client = LlamaStackClient(base_url=base_url)
models = client.models.list()
embedding_model = (
em := next(m for m in models if m.model_type == "embedding")
).identifier
embedding_dimension = 384
_ = client.vector_dbs.register(
vector_db_id="foo_db",
embedding_model=embedding_model,
embedding_dimension=embedding_dimension,
provider_id="weaviate",
)
chunk = {
"content": "foo",
"mime_type": "text/plain",
"metadata": {
"document_id": "foo-id"
}
}
client.vector_io.insert(vector_db_id="foo_db", chunks=[chunk])
client.vector_io.query(vector_db_id="foo_db", query="foo")
```
Extracts common OpenAI vector-store code into its own mixin so that all
providers can share the same core logic.
This also makes it easy for Llama Stack to support both vector-stores
and Llama Stack APIs in the interim so that both share the same
underlying vector-dbs.
Each provider contains storage specific logic to `create / edit / delete
/ list` vector dbs while the plumbing logic is standardized in the
common code.
Ensured that this works well with both faiss and sqlite-vec.
### Test Plan
```
llama stack run starter
pytest -sv --stack-config http://localhost:8321 tests/integration/vector_io/test_openai_vector_stores.py --embedding-model all-MiniLM-L6-v2
```
Adding OpenAI compat `/v1/vector-store` apis.
This PR implements the `faiss` provider with followup PRs coming up for
other providers.
Added routes to create, update, delete, list vector stores.
Also added route to search a vector store
Inserting into vector stores is missing and will be a follow up diff.
### Test Plan
- Added new integration test for testing the faiss provider
```
pytest -sv --stack-config http://localhost:8321 tests/integration/vector_io/test_openai_vector_stores.py --embedding-model all-MiniLM-L6-v2
```
# What does this PR do?
Fixes the pgvector provider's `query_vector` function for when the distance
between the query embedding and an embedding within the vector db is 0
(identical vectors). Catches `ZeroDivisionError` and then sets `score`
to infinity, which represents maximum similarity.
Closes [#2381]
## Test Plan
Check out this PR and execute the code below; it no longer raises a
`ZeroDivisionError` exception.
```
from llama_stack_client import LlamaStackClient
base_url = "http://localhost:8321"
client = LlamaStackClient(base_url=base_url)
models = client.models.list()
embedding_model = (
em := next(m for m in models if m.model_type == "embedding")
).identifier
embedding_dimension = 384
_ = client.vector_dbs.register(
vector_db_id="foo_db",
embedding_model=embedding_model,
embedding_dimension=embedding_dimension,
provider_id="pgvector",
)
chunk = {
"content": "foo",
"mime_type": "text/plain",
"metadata": {
"document_id": "foo-id"
}
}
client.vector_io.insert(vector_db_id="foo_db", chunks=[chunk])
client.vector_io.query(vector_db_id="foo_db", query="foo")
```
# What does this PR do?
Adds a health status check for the remote vLLM provider.
## Test Plan
The PR includes a unit test for the added health check implementation.
The non-streaming version is just a small layer on top of the streaming
version - just pluck off the final `response.completed` event and return
that as the response!
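The layering can be sketched like this. The `stream_response` generator is a toy stand-in for the real streaming implementation, and `create_response` is a hypothetical name; only the `response.completed` event type comes from the description above.

```python
import asyncio


async def stream_response(prompt):
    """Toy streaming implementation that emits events and ends with response.completed."""
    yield {"type": "response.created"}
    yield {"type": "response.output_text.delta", "delta": prompt.upper()}
    yield {"type": "response.completed", "response": {"output_text": prompt.upper()}}


async def create_response(prompt):
    """Non-streaming variant: drain the stream and return the final response."""
    final = None
    async for event in stream_response(prompt):
        if event["type"] == "response.completed":
            final = event["response"]
    if final is None:
        raise RuntimeError("stream ended without a response.completed event")
    return final
```

Calling `asyncio.run(create_response("hi"))` drains the stream and returns only the payload of the final event.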
This PR also includes a couple other changes which I ended up making
while working on it on a flight:
- changes to `ollama` so it does not pull embedding models
unconditionally
- a small fix to library client to make the stream and non-stream cases
a bit more symmetric
# What does this PR do?
The chat completion ids generated by Ollama are not unique enough to use
with stored chat completions as they rely on only 3 numbers of
randomness to give unique values - ie `chatcmpl-373`. This causes
frequent collisions in id values of chat completions in Ollama, which
creates issues in our SQL storage of chat completions by id where it
expects ids to actually be unique.
So, this adjusts Ollama responses to use uuids as unique ids. This does
mean we're replacing the ids generated natively by Ollama. If we don't
wish to do this, we'll either need to relax the unique constraint on our
chat completions id field in the inference storage or convince Ollama
upstream to use something closer to uuid values here.
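A minimal sketch of the id replacement, assuming responses are handled as plain dicts and keeping the `chatcmpl-` prefix (the `with_unique_id` helper is hypothetical):

```python
import uuid


def with_unique_id(response):
    """Return a copy of the response whose id is replaced by a uuid-based one."""
    patched = dict(response)
    patched["id"] = f"chatcmpl-{uuid.uuid4()}"
    return patched
```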
Closes #2315
## Test Plan
I tested by running the openai completion / chat completion integration
tests in a loop. Without this change, I regularly get unique id
collisions. With this change, I do not. We sometimes see flakes from
these unique id collisions in our CI tests, and this will resolve those.
```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack run llama_stack/templates/ollama/run.yaml
while true; do \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
pytest -s -v \
tests/integration/inference/test_openai_completion.py \
--stack-config=http://localhost:8321 \
--text-model="meta-llama/Llama-3.2-3B-Instruct"; \
done
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
fixes provider to use stream var correctly
Before
```
curl --request POST \
--url http://localhost:8321/v1/openai/v1/chat/completions \
--header 'content-type: application/json' \
--data '{
"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
"messages": [
{
"role": "user",
"content": "Who are you?"
}
]
}'
{"detail":"Internal server error: An unexpected error occurred."}
```
After
```
curl --request POST \
--url http://localhost:8321/v1/openai/v1/chat/completions \
--header 'content-type: application/json' \
--data '{
"model": "accounts/fireworks/models/llama4-scout-instruct-basic",
"messages": [
{
"role": "user",
"content": "Who are you?"
}
]
}'
{"id":"chatcmpl-97978538-271d-4c73-8d4d-c509bfb6c87e","choices":[{"message":{"role":"assistant","content":"I'm an AI assistant designed by Meta. I'm here to answer your questions, share interesting ideas and maybe even surprise you with a fresh perspective. What's on your mind?","name":null,"tool_calls":null},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","created":1748896403,"model":"accounts/fireworks/models/llama4-scout-instruct-basic"}%
```
# What does this PR do?
Adds a new OpenAI-compatible endpoint for the embeddings API:
`/openai/v1/embeddings`
Added providers for OpenAI, LiteLLM and SentenceTransformer.
## Test Plan
```
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -sv tests/integration/inference/test_openai_embeddings.py --embedding-model all-MiniLM-L6-v2,text-embedding-3-small,gemini/text-embedding-004
```
# What does this PR do?
Updates sambanova inference to use strict as false in json_schema
structured output
## Test Plan
pytest -s -v tests/integration/inference/test_text_inference.py
--stack-config=sambanova
--text-model=sambanova/Meta-Llama-3.3-70B-Instruct
# What does this PR do?
Handles the case where the vllm config `tls_verify` is set to `false` or
`true`.
Closes: https://github.com/meta-llama/llama-stack/issues/2283
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The `tls_verify` can now receive a path to a certificate file if the
endpoint requires it.
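One way to handle a setting that can be either a boolean or a certificate path is to normalize it up front. This is a sketch under assumptions: `resolve_tls_verify` is a hypothetical helper, and treating the strings "true" and "false" as booleans is an illustrative choice rather than the provider's confirmed behaviour.

```python
from pathlib import Path


def resolve_tls_verify(value):
    """Normalize tls_verify to a bool or to the path of an existing certificate file."""
    if isinstance(value, bool):
        return value
    text = str(value).strip()
    if text.lower() in ("true", "false"):
        return text.lower() == "true"
    path = Path(text)
    if not path.is_file():
        raise ValueError(f"TLS certificate file not found: {path}")
    return str(path)
```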
Signed-off-by: Sébastien Han <seb@redhat.com>
When registering an MCP endpoint, we cannot list tools (like we used to)
since the MCP endpoint may be behind an auth wall. Registration can
happen much sooner (via run.yaml).
Instead, we do listing only when the _user_ actually calls listing.
Furthermore, we cache the list in-memory in the server. Currently, the
cache is not invalidated -- we may want to periodically re-list for MCP
servers. Note that users must call `list_tools` before calling
`invoke_tool` -- we rely on this critically.
This will enable us to list MCP servers in run.yaml
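The register-now, list-later behaviour can be sketched as a small cache. The `LazyToolRegistry` class and its `fetch_tools` callable are invented for this example; the real provider talks to MCP servers and has a different interface.

```python
class LazyToolRegistry:
    """Registers endpoints without contacting them; lists tools on demand and caches them."""

    def __init__(self, fetch_tools):
        self._fetch_tools = fetch_tools  # hypothetical callable: endpoint -> list of tool names
        self._endpoints = set()
        self._cache = {}  # endpoint -> tool names; never invalidated in this sketch

    def register(self, endpoint):
        # No network call here, so registration works even behind an auth wall.
        self._endpoints.add(endpoint)

    def list_tools(self, endpoint):
        if endpoint not in self._endpoints:
            raise LookupError(f"unknown endpoint: {endpoint}")
        if endpoint not in self._cache:
            self._cache[endpoint] = list(self._fetch_tools(endpoint))
        return self._cache[endpoint]

    def invoke_tool(self, endpoint, tool_name):
        # Relies on the cache filled by list_tools, hence the required call order.
        if tool_name not in self._cache.get(endpoint, []):
            raise LookupError(f"{tool_name} is not known; call list_tools() first")
        return f"invoked {tool_name} on {endpoint}"
```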
## Test Plan
Existing tests, updated tests accordingly.
The most interesting MCP servers are those with an authorization wall in
front of them. This PR uses the existing `provider_data` mechanism of
passing provider API keys for passing MCP access tokens (in fact,
arbitrary headers in the style of the OpenAI Responses API) from the
client through to the MCP server.
```
class MCPProviderDataValidator(BaseModel):
# mcp_endpoint => list of headers to send
mcp_headers: dict[str, list[str]] | None = None
```
Note how we must stuff the headers for all MCP endpoints into a single
"MCPProviderDataValidator". Unlike existing providers (e.g., Together
and Fireworks for inference) where we could name the provider api keys
clearly (`together_api_key`, `fireworks_api_key`), we cannot name these
keys for MCP. We have a single generic MCP provider which can serve
multiple "toolgroups". So we use a dict to combine all the headers for
all MCP endpoints you may want to use in an agentic call.
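As an illustration of how the combined dict might be consumed, the sketch below picks the headers for one endpoint. It assumes each list entry is a `"Name: value"` string, which is a guess about the format; `headers_for_endpoint` and the example URLs are hypothetical.

```python
def headers_for_endpoint(mcp_headers, endpoint):
    """Return the headers configured for one MCP endpoint as a name -> value dict."""
    headers = {}
    for line in (mcp_headers or {}).get(endpoint, []):
        # Assumed format: "Name: value". Split on the first colon only.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


provider_data = {
    "mcp_headers": {
        "http://mcp-a.example/sse": ["Authorization: Bearer token-a"],
        "http://mcp-b.example/sse": ["Authorization: Bearer token-b", "X-Team: search"],
    }
}
```

Each toolgroup's endpoint only ever sees its own entry, even though all tokens travel in a single provider data payload.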
## Test Plan
See the added integration test for usage.
# What does this PR do?
Since https://github.com/meta-llama/llama-stack/pull/2193 switched to
openai sdk, we need to strip 'openai/' from the model_id
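The prefix handling amounts to something like the following sketch (the `to_openai_model_name` helper is a hypothetical name; `str.removeprefix` requires Python 3.9 or later):

```python
def to_openai_model_name(model_id, prefix="openai/"):
    """Drop the provider prefix so the OpenAI SDK receives the bare model name."""
    return model_id.removeprefix(prefix)
```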
## Test Plan
start server with openai provider and send a chat completion call
# What does this PR do?
Includes a SambaNova safety adaptor that uses the Meta-Llama-Guard-3-8B
model served on SambaNova Cloud.
Also makes minor updates to the SambaNova docs.
## Test Plan
pytest -s -v tests/integration/safety/test_safety.py
--stack-config=sambanova --safety-shield=sambanova/Meta-Llama-Guard-3-8B
# What does this PR do?
This PR introduces support for keyword-based FTS5 search with BM25
relevance scoring. It changes the existing EmbeddingIndex base class to
support `search_mode` and `query_str` parameters, which can be used for
keyword-based search implementations.
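For readers new to FTS5, here is a standalone sketch of a BM25-ranked keyword query using Python's `sqlite3` module. It assumes the SQLite build includes the FTS5 extension; the `chunks_fts` table name, its columns, and the `keyword_search` function are chosen for this example and do not mirror the provider's schema.

```python
import sqlite3


def keyword_search(rows, query, k=3):
    """Index (id, content) rows in an in-memory FTS5 table and return the top-k BM25 matches."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(id UNINDEXED, content)")
        conn.executemany("INSERT INTO chunks_fts (id, content) VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT id, content, bm25(chunks_fts) AS score FROM chunks_fts "
            "WHERE chunks_fts MATCH ? ORDER BY score LIMIT ?",
            (query, k),
        )
        return cursor.fetchall()
    finally:
        conn.close()


rows = [
    (f"d{doc}-s{idx}", f"Sentence {idx} from document {doc}")
    for doc in range(3)
    for idx in range(3)
]
hits = keyword_search(rows, '"Sentence 1"')
```

SQLite's `bm25()` returns smaller values for better matches, which is why the query sorts in ascending order.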
## Test Plan
run
```
pytest llama_stack/providers/tests/vector_io/test_sqlite_vec.py -v -s --tb=short --disable-warnings --asyncio-mode=auto
```
Output:
```
pytest llama_stack/providers/tests/vector_io/test_sqlite_vec.py -v -s --tb=short --disable-warnings --asyncio-mode=auto
/Users/vnarsing/miniconda3/envs/stack-client/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
====================================================== test session starts =======================================================
platform darwin -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /Users/vnarsing/miniconda3/envs/stack-client/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.16', 'Platform': 'macOS-14.7.4-arm64-arm-64bit', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.3', 'anyio': '4.8.0'}}
rootdir: /Users/vnarsing/go/src/github/meta-llama/llama-stack
configfile: pyproject.toml
plugins: html-4.1.1, metadata-3.1.1, asyncio-0.25.3, anyio-4.8.0
asyncio: mode=auto, asyncio_default_fixture_loop_scope=None
collected 7 items
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_add_chunks PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_query_chunks_vector PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_query_chunks_fts PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_chunk_id_conflict PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_register_vector_db PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_unregister_vector_db PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_generate_chunk_id PASSED
```
For reference, with this implementation, the FTS table looks like this:
```
Chunk ID: 9fbc39ce-c729-64a2-260f-c5ec9bb2a33e, Content: Sentence 0 from document 0
Chunk ID: 94062914-3e23-44cf-1e50-9e25821ba882, Content: Sentence 1 from document 0
Chunk ID: e6cfd559-4641-33ba-6ce1-7038226495eb, Content: Sentence 2 from document 0
Chunk ID: 1383af9b-f1f0-f417-4de5-65fe9456cc20, Content: Sentence 3 from document 0
Chunk ID: 2db19b1a-de14-353b-f4e1-085e8463361c, Content: Sentence 4 from document 0
Chunk ID: 9faf986a-f028-7714-068a-1c795e8f2598, Content: Sentence 5 from document 0
Chunk ID: ef593ead-5a4a-392f-7ad8-471a50f033e8, Content: Sentence 6 from document 0
Chunk ID: e161950f-021f-7300-4d05-3166738b94cf, Content: Sentence 7 from document 0
Chunk ID: 90610fc4-67c1-e740-f043-709c5978867a, Content: Sentence 8 from document 0
Chunk ID: 97712879-6fff-98ad-0558-e9f42e6b81d3, Content: Sentence 9 from document 0
Chunk ID: aea70411-51df-61ba-d2f0-cb2b5972c210, Content: Sentence 0 from document 1
Chunk ID: b678a463-7b84-92b8-abb2-27e9a1977e3c, Content: Sentence 1 from document 1
Chunk ID: 27bd63da-909c-1606-a109-75bdb9479882, Content: Sentence 2 from document 1
Chunk ID: a2ad49ad-f9be-5372-e0c7-7b0221d0b53e, Content: Sentence 3 from document 1
Chunk ID: cac53bcd-1965-082a-c0f4-ceee7323fc70, Content: Sentence 4 from document 1
```
Query results:
Result 1: Sentence 5 from document 0
Result 2: Sentence 5 from document 1
Result 3: Sentence 5 from document 2
---------
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
# What does this PR do?
When launching a fine-tuning job, an upcoming version of NeMo Customizer
will expect the `config` name to be formatted as
`namespace/name@version`. Here, `config` is a reference to a model +
additional metadata. There could be multiple `config`s that reference
the same base model.
This PR updates NVIDIA's `supervised_fine_tune` to simply pass the
`model` param as-is to NeMo Customizer. Currently, it expects a
specific, allowlisted llama model (i.e. `meta/Llama3.1-8B-Instruct`) and
converts it to the provider format (`meta/llama-3.1-8b-instruct`).
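To make the `namespace/name@version` shape concrete, here is a minimal sketch that splits such a string into its parts. It is illustration only: this PR passes the `model` string through untouched and does no parsing, and `ConfigRef` and `parse_config_ref` are hypothetical names, not part of Llama Stack or NeMo Customizer.

```python
from typing import NamedTuple


class ConfigRef(NamedTuple):
    """Hypothetical container for the three parts of a config reference."""

    namespace: str
    name: str
    version: str


def parse_config_ref(ref: str) -> ConfigRef:
    """Split 'namespace/name@version' into its parts (illustrative only)."""
    namespace, slash, rest = ref.partition("/")
    name, at, version = rest.partition("@")
    if not (slash and at and namespace and name and version):
        raise ValueError(f"expected namespace/name@version, got {ref!r}")
    return ConfigRef(namespace, name, version)


# The value used in the test plan of this PR.
ref = parse_config_ref("meta/llama-3.2-1b-instruct@v1.0.0+A100")
```

Because several configs can point at the same base model, the full string (including the version suffix) is what identifies the config, which is why an allowlist keyed on the model name alone no longer fits.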
## Test Plan
From a notebook, I built an image with my changes:
```
!llama stack build --template nvidia --image-type venv
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
client = LlamaStackAsLibraryClient("nvidia")
client.initialize()
```
And could successfully launch a job:
```
response = client.post_training.supervised_fine_tune(
    job_uuid="",
    model="meta/llama-3.2-1b-instruct@v1.0.0+A100",  # Model passed as-is to Customizer
    ...
)
job_id = response.job_uuid
print(f"Created job with ID: {job_id}")
# Output:
# Created job with ID: cust-Jm4oGmbwcvoufaLU4XkrRU
```
---------
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
This PR introduces APIs to retrieve past chat completion requests, which
will be used in the LS UI.
Our current `Telemetry` is ill-suited for this purpose as it's untyped
so we'd need to filter by obscure attribute names, making it brittle.
Since these APIs are 'provided by stack' and don't need to be
implemented by inference providers, we introduce a new InferenceProvider
class, containing the existing inference protocol, which is implemented
by inference providers.
The APIs are OpenAI-compliant, with an additional `input_messages`
field.
## Test Plan
This PR just adds the APIs and marks them `provided_by_stack`.
Start the stack server -> it doesn't crash.
# What does this PR do?
Fixes #2121
This implementation splits responsibility between the litellm and openai
libraries:
| Inference Method | Implementation Source |
|----------------------------|--------------------------|
| completion | LiteLLMOpenAIMixin |
| chat_completion | LiteLLMOpenAIMixin |
| embedding | LiteLLMOpenAIMixin |
| batch_completion | LiteLLMOpenAIMixin |
| batch_chat_completion | LiteLLMOpenAIMixin |
| openai_completion | AsyncOpenAI |
| openai_chat_completion | AsyncOpenAI |
## Test Plan
Smoke test with:
```
$ OPENAI_API_KEY=$LLAMA_API_KEY OPENAI_BASE_URL=https://api.llama.com/compat/v1 llama stack build --image-type conda --image-name openai --providers inference=remote::openai --run
$ llama-stack-client models register Llama-4-Scout-17B-16E-Instruct-FP8
$ curl "http://localhost:8321/v1/openai/v1/chat/completions" -H "Content-Type: application/json" \
  -d '{
"model": "Llama-4-Scout-17B-16E-Instruct-FP8",
"messages": [
{"role": "user", "content": "Hello Llama! Can you give me a quick intro?"}
]
}'
{"id":"AmPwrrkc5JgVjejPdIPrpT2","choices":[{"finish_reason":"stop","index":0,"logprobs":{"content":null,"refusal":null},"message":{"content":"Hello! I'm Llama, a Meta-designed model that adapts to your conversational style. Whether you need quick answers, deep dives into ideas, or just want to vent, joke, or brainstorm—I'm here for it. What’s on your mind?","refusal":"","role":"assistant","annotations":null,"audio":null,"function_call":null,"tool_calls":null,"id":"AmPwrrkc5JgVjejPdIPrpT2"}}],"created":1747410061,"model":"Llama-4-Scout-17B-16E-Instruct-FP8","object":"chat.completions","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":54,"prompt_tokens":22,"total_tokens":76,"completion_tokens_details":null,"prompt_tokens_details":null}}
```
and run the full test suite.
# What does this PR do?
This fixes an issue in how we used the tool_call_buf from streaming tool
calls in the remote-vllm provider where it would end up concatenating
parameters from multiple different tool call results instead of
aggregating the results from each tool call separately.
It also fixes an issue found while digging into that where we were
accidentally mixing the json string form of tool call parameters with
the string representation of the python form, which meant we'd end up
with single quotes in what should be double-quoted json strings.
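A minimal sketch of both problems, assuming a simplified delta shape (a dict with `index`, optional `name`, and an `arguments` fragment). The shape and the `aggregate_tool_calls` helper are hypothetical and are not the provider's actual code; the point is only to show per-call buffering and the difference between `json.dumps` and `str` on a dict.

```python
import json
from collections import defaultdict


def aggregate_tool_calls(deltas):
    """Merge streamed fragments into one entry per tool call index."""
    names = {}
    arg_buffers = defaultdict(str)  # one argument buffer per tool call
    for delta in deltas:
        idx = delta["index"]
        if delta.get("name"):
            names[idx] = delta["name"]
        arg_buffers[idx] += delta.get("arguments", "")
    return [
        {"name": names.get(idx, ""), "arguments": json.loads(arg_buffers[idx])}
        for idx in sorted(arg_buffers)
    ]


# Two tool calls streamed in fragments; a single shared buffer would
# concatenate their arguments into one invalid JSON string.
deltas = [
    {"index": 0, "name": "get_weather", "arguments": '{"city": '},
    {"index": 0, "arguments": '"Paris"}'},
    {"index": 1, "name": "get_time", "arguments": '{"tz": "UTC"}'},
]
calls = aggregate_tool_calls(deltas)

# JSON form vs. Python repr of the same parameters.
args = {"city": "Paris"}
as_json = json.dumps(args)  # '{"city": "Paris"}'
as_repr = str(args)         # "{'city': 'Paris'}", not valid JSON
```

Keying the buffers by tool call index keeps each call's arguments separate, and serializing with `json.dumps` everywhere avoids the single-quoted Python form leaking into what should be JSON.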
Closes #1120
## Test Plan
The following tests are now passing 100% for the remote-vllm provider,
where some of the test_text_inference were failing before this change:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/inference/test_text_inference.py --text-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/inference/test_vision_inference.py --vision-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
```
All but one of the agent tests are passing (including the multi-tool
one). See the PR at https://github.com/vllm-project/vllm/pull/17917 and
a gist at
https://gist.github.com/bbrowning/4734240ce96b4264340caa9584e47c9e for
changes needed there, which will have to get made upstream in vLLM.
Agent tests:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/agents/test_agents.py --text-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
note: the openai provider exposes the litellm specific model names to
the user. this change is compatible with that. the litellm names should
be deprecated.
# What does this PR do?
Closes #2113.
Closes #1783.
Fixes a bug in handling the end of tool execution request stream where
no `finish_reason` is provided by the model.
## Test Plan
1. Ran existing unit tests
2. Added a dedicated test verifying correct behavior in this edge case
3. Ran the code snapshot from #2113
# What does this PR do?
Closes #2111.
Fixes an error causing Llama Stack to just return `<tool_call>` and
complete the turn without actually executing the tool. See the issue
description for more detail.
## Test Plan
1) Ran existing unit tests
2) Added a dedicated test verifying correct behavior in this edge case
3) Ran the code snapshot from #2111
# What does this PR do?
The ollama provider was using an older variant of the code to convert
incoming parameters from the OpenAI API completions and chat completion
endpoints into requests that get sent to the backend provider over its
own OpenAI client. This updates it to use the common
`prepare_openai_completion_params` method used elsewhere, which takes
care of removing stray `None` values even for nested structures.
Without this, some other parameters, even if they have values of `None`,
make their way to ollama and actually influence its inference output as
opposed to when those parameters are not sent at all.
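To illustrate the idea of dropping `None` values at every nesting level, here is a rough sketch. It is not the actual body of `prepare_openai_completion_params`; `strip_none` is a hypothetical helper and the parameter names in the example are only placeholders.

```python
from typing import Any


def strip_none(value: Any) -> Any:
    """Recursively drop dict keys whose value is None."""
    if isinstance(value, dict):
        return {k: strip_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_none(v) for v in value]
    return value


params = {
    "model": "m",
    "temperature": None,  # top-level None
    "stream_options": {"include_usage": True, "extra": None},  # nested None
    "messages": [{"role": "user", "content": "hi", "name": None}],
}
cleaned = strip_none(params)
```

With the `None` entries removed, the backend sees the same request it would have received if the caller had never set those parameters.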
## Test Plan
This passes tests/integration/inference/test_openai_completion.py and
fixes the issue found in #2098, which was tested via manual curl
requests crafted a particular way.
Closes #2098
Signed-off-by: Ben Browning <bbrownin@redhat.com>
```
$ INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
CHROMADB_URL=http://localhost:8000 \
llama stack build --image-type conda --image-name llama \
--providers vector_io=remote::chromadb,inference=remote::ollama \
--run
...
File ".../llama_stack/providers/remote/vector_io/chroma/chroma.py", line 31, in <module>
ChromaClientType = chromadb.AsyncHttpClient | chromadb.PersistentClient
TypeError: unsupported operand type(s) for |: 'function' and 'function'
```
Issue: `AsyncHttpClient` and `PersistentClient` are functions that return
the `AsyncClientAPI` and `ClientAPI` types, respectively. `|` cannot be
used to construct a type from functions.
Previously the code was `Union[AsyncHttpClient, PersistentClient]`, which
did not trigger an error.
# What does this PR do?
Closes #2135