# What does this PR do?
Adds a new OpenAI-compatible endpoint for the embeddings API:
`/openai/v1/embeddings`
Added providers for OpenAI, LiteLLM and SentenceTransformer.
## Test Plan
```
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -sv tests/integration/inference/test_openai_embeddings.py --embedding-model all-MiniLM-L6-v2,text-embedding-3-small,gemini/text-embedding-004
```
# What does this PR do?
This adds a check to ensure we don't attempt to concatenate `None + str`
or `str + None` when building up our arguments for streaming tool calls
in the Responses API.
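As a rough illustration of the guard (not the code from this PR), accumulating streamed argument fragments could look like the sketch below. The helper name `append_arguments` and the sample fragments are hypothetical.

```python
from typing import Optional


def append_arguments(accumulated: Optional[str], delta: Optional[str]) -> Optional[str]:
    """Merge a streamed tool-call argument fragment into the running total.

    Either side may be None (a chunk can carry no argument text), so we
    only concatenate when both sides are real strings.
    """
    if delta is None:
        return accumulated
    if accumulated is None:
        return delta
    return accumulated + delta


# Hypothetical stream of argument fragments, including empty (None) chunks.
fragments = [None, '{"city": ', None, '"Paris"}']

arguments: Optional[str] = None
for fragment in fragments:
    arguments = append_arguments(arguments, fragment)

print(arguments)
```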
## Test Plan
All existing tests pass with this change.
Unit tests:
```
python -m pytest -s -v \
tests/unit/providers/agents/meta_reference/test_openai_responses.py
```
Integration tests:
```
llama stack run llama_stack/templates/together/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
python -m pytest -s -v \
tests/integration/agents/test_openai_responses.py \
--text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Verification tests:
```
llama stack run llama_stack/templates/together/run.yaml
pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Additionally, the manual example using Codex CLI from #2325 now succeeds
instead of throwing a 500 error.
Closes #2325
Signed-off-by: Ben Browning <bbrownin@redhat.com>
We must store the full (re-hydrated) input, not just the original input,
in the Response object. This is not very space efficient, and we should
likely find a better storage scheme that stores only unique entries in
the database and re-hydrates them efficiently later. That can be done
safely in a follow-up.
Closes https://github.com/meta-llama/llama-stack/issues/2299
## Test Plan
Unit test
# What does this PR do?
Previously, prompt guard was hard-coded to require CUDA, which prevented
it from being used on an instance without CUDA support.
This PR allows prompt guard to be configured to use either CPU or CUDA.
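A minimal sketch of this kind of device selection is shown below. It is not the provider's actual code: the function name `resolve_device` and the `"auto"` option are assumptions for illustration. In a real setup the `cuda_available` flag would typically come from `torch.cuda.is_available()`.

```python
def resolve_device(requested: str, cuda_available: bool) -> str:
    """Return the device to load the model on.

    `requested` is a hypothetical config value: "cpu", "cuda", or "auto".
    """
    requested = requested.lower()
    if requested == "auto":
        # Prefer the GPU when one is present, otherwise fall back to CPU.
        return "cuda" if cuda_available else "cpu"
    if requested not in ("cpu", "cuda"):
        raise ValueError(f"unsupported device: {requested!r}")
    if requested == "cuda" and not cuda_available:
        raise ValueError("cuda was requested but no CUDA device is available")
    return requested


print(resolve_device("auto", cuda_available=False))
```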
[//]: # (If resolving an issue, uncomment and update the line below)
Closes [#2133](https://github.com/meta-llama/llama-stack/issues/2133)
## Test Plan (Edited after incorporating suggestion)
1) Started the stack configured with prompt guard on a system without a
GPU and validated that prompt guard could be used through the APIs.
2) Validated on a system with a GPU (but without Llama Stack) that the
Python code selecting between CPU and CUDA support returned the right
value when a CUDA device was available.
3) Ran the unit tests as per
https://github.com/meta-llama/llama-stack/blob/main/tests/unit/README.md
[//]: # (## Documentation)
---------
Signed-off-by: Michael Dawson <mdawson@devrus.com>
This adds initial streaming support to the Responses API.
This PR makes sure that the _first_ inference call made to chat
completions streams out.
There's more to be done:
- tool call output tokens need to stream out when possible
- we need to loop through multiple rounds of inference and they all need
to stream out.
## Test Plan
Added a test. Executed as:
```
FIREWORKS_API_KEY=... \
pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
--provider=stack:fireworks --model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Then, started a llama stack fireworks distro and tested against it like
this:
```
OPENAI_API_KEY=blah \
pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
--base-url http://localhost:8321/v1/openai/v1 \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
When registering an MCP endpoint, we cannot list tools (like we used to)
since the MCP endpoint may be behind an auth wall. Registration can
happen much sooner (via run.yaml).
Instead, we list tools only when the _user_ actually calls listing.
Furthermore, we cache the list in memory in the server. Currently, the
cache is never invalidated; we may want to periodically re-list for MCP
servers. Note that users must call `list_tools` before calling
`invoke_tool`; the implementation relies on this ordering.
This will enable us to list MCP servers in run.yaml.
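The lazy-listing idea can be sketched as follows. This is an illustration under stated assumptions, not the server code: `LazyToolRegistry`, `fake_fetch`, and the endpoint URL are all hypothetical, and the fetch callback stands in for the real MCP listing call.

```python
from typing import Callable, Dict, List, Set


class LazyToolRegistry:
    """Register MCP endpoints eagerly, list their tools lazily, cache the result."""

    def __init__(self, fetch_tools: Callable[[str], List[str]]) -> None:
        # `fetch_tools` is a hypothetical stand-in for the real (possibly
        # authenticated) MCP tool-listing call.
        self._fetch_tools = fetch_tools
        self._endpoints: Set[str] = set()
        self._tools: Dict[str, List[str]] = {}

    def register(self, endpoint: str) -> None:
        # No network access here, so registration works before any
        # credentials are available (for example from run.yaml).
        self._endpoints.add(endpoint)

    def list_tools(self, endpoint: str) -> List[str]:
        if endpoint not in self._endpoints:
            raise KeyError(f"unknown endpoint: {endpoint}")
        if endpoint not in self._tools:
            # First user-initiated listing: fetch once and keep in memory.
            self._tools[endpoint] = self._fetch_tools(endpoint)
        return self._tools[endpoint]

    def has_tool(self, endpoint: str, tool_name: str) -> bool:
        # Only consults the cache, so list_tools must have run first.
        return tool_name in self._tools.get(endpoint, [])


fetch_calls: List[str] = []


def fake_fetch(endpoint: str) -> List[str]:
    fetch_calls.append(endpoint)
    return ["search", "fetch_page"]


registry = LazyToolRegistry(fake_fetch)
registry.register("http://localhost:9000/sse")  # hypothetical endpoint
known_before = registry.has_tool("http://localhost:9000/sse", "search")
registry.list_tools("http://localhost:9000/sse")
registry.list_tools("http://localhost:9000/sse")  # served from the cache
known_after = registry.has_tool("http://localhost:9000/sse", "search")
```

Because the cache is never invalidated in this sketch, a tool added on the MCP server after the first listing would stay invisible until the process restarts, which mirrors the limitation noted above.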
## Test Plan
Existing tests, updated tests accordingly.
# What does this PR do?
This is not part of the official OpenAI API, but we'll use this for the
logs UI.
In order to support more filtering options, I'm adopting the newly
introduced sql store in place of the kv store.
## Test Plan
Added integration/unit tests.
# What does this PR do?
This PR introduces support for keyword-based FTS5 search with BM25
relevance scoring. It changes the existing EmbeddingIndex base class to
accept `search_mode` and `query_str` parameters, which can be used by
keyword-based search implementations.
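For readers unfamiliar with FTS5, the sketch below shows a keyword query ranked by BM25 using Python's built-in `sqlite3` module. The table name `chunks_fts`, its columns, and the `keyword_search` helper are hypothetical and are not taken from this PR; it also assumes the local SQLite build has FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and column names; requires an SQLite build with FTS5.
conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(chunk_id UNINDEXED, content)")
rows = [
    ("c0", "Sentence 0 from document 0"),
    ("c5", "Sentence 5 from document 0"),
    ("d5", "Sentence 5 from document 1"),
]
conn.executemany("INSERT INTO chunks_fts (chunk_id, content) VALUES (?, ?)", rows)


def keyword_search(query_str: str, k: int = 3) -> list:
    """Return up to k (chunk_id, content, score) rows ranked by BM25.

    FTS5's bm25() returns lower (more negative) values for better matches,
    so ascending order puts the most relevant rows first.
    """
    sql = (
        "SELECT chunk_id, content, bm25(chunks_fts) AS score "
        "FROM chunks_fts WHERE chunks_fts MATCH ? "
        "ORDER BY score LIMIT ?"
    )
    return conn.execute(sql, (query_str, k)).fetchall()


results = keyword_search("Sentence 5")
for chunk_id, content, score in results:
    print(chunk_id, content, score)
```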
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
run
```
pytest llama_stack/providers/tests/vector_io/test_sqlite_vec.py -v -s --tb=short --disable-warnings --asyncio-mode=auto
```
Output:
```
pytest llama_stack/providers/tests/vector_io/test_sqlite_vec.py -v -s --tb=short --disable-warnings --asyncio-mode=auto
/Users/vnarsing/miniconda3/envs/stack-client/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
====================================================== test session starts =======================================================
platform darwin -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /Users/vnarsing/miniconda3/envs/stack-client/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.16', 'Platform': 'macOS-14.7.4-arm64-arm-64bit', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.3', 'anyio': '4.8.0'}}
rootdir: /Users/vnarsing/go/src/github/meta-llama/llama-stack
configfile: pyproject.toml
plugins: html-4.1.1, metadata-3.1.1, asyncio-0.25.3, anyio-4.8.0
asyncio: mode=auto, asyncio_default_fixture_loop_scope=None
collected 7 items
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_add_chunks PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_query_chunks_vector PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_query_chunks_fts PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_chunk_id_conflict PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_register_vector_db PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_unregister_vector_db PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_generate_chunk_id PASSED
```
For reference, with the implementation, the fts table looks like below:
```
Chunk ID: 9fbc39ce-c729-64a2-260f-c5ec9bb2a33e, Content: Sentence 0 from document 0
Chunk ID: 94062914-3e23-44cf-1e50-9e25821ba882, Content: Sentence 1 from document 0
Chunk ID: e6cfd559-4641-33ba-6ce1-7038226495eb, Content: Sentence 2 from document 0
Chunk ID: 1383af9b-f1f0-f417-4de5-65fe9456cc20, Content: Sentence 3 from document 0
Chunk ID: 2db19b1a-de14-353b-f4e1-085e8463361c, Content: Sentence 4 from document 0
Chunk ID: 9faf986a-f028-7714-068a-1c795e8f2598, Content: Sentence 5 from document 0
Chunk ID: ef593ead-5a4a-392f-7ad8-471a50f033e8, Content: Sentence 6 from document 0
Chunk ID: e161950f-021f-7300-4d05-3166738b94cf, Content: Sentence 7 from document 0
Chunk ID: 90610fc4-67c1-e740-f043-709c5978867a, Content: Sentence 8 from document 0
Chunk ID: 97712879-6fff-98ad-0558-e9f42e6b81d3, Content: Sentence 9 from document 0
Chunk ID: aea70411-51df-61ba-d2f0-cb2b5972c210, Content: Sentence 0 from document 1
Chunk ID: b678a463-7b84-92b8-abb2-27e9a1977e3c, Content: Sentence 1 from document 1
Chunk ID: 27bd63da-909c-1606-a109-75bdb9479882, Content: Sentence 2 from document 1
Chunk ID: a2ad49ad-f9be-5372-e0c7-7b0221d0b53e, Content: Sentence 3 from document 1
Chunk ID: cac53bcd-1965-082a-c0f4-ceee7323fc70, Content: Sentence 4 from document 1
```
Query results:
Result 1: Sentence 5 from document 0
Result 2: Sentence 5 from document 1
Result 3: Sentence 5 from document 2
[//]: # (## Documentation)
---------
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
# What does this PR do?
Add support for "instructions" to the responses API. Instructions
provide a way to swap out system (or developer) messages in new
responses.
## Test Plan
unit tests added
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
If a user has previously serialized data into their vector store without
`metadata_token_count` in the chunk, the `query` method fails with a
server error. This fixes that edge case by returning 0 when the key is
missing. This solution is suboptimal, but I think it's better to
understate the token size than to recalculate it and add unnecessary
complexity to the retrieval code.
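The fallback amounts to a dictionary lookup with a default. A minimal sketch, assuming chunk metadata is a plain dict (the helper name is hypothetical):

```python
def metadata_token_count(chunk_metadata: dict) -> int:
    """Return the stored metadata token count, or 0 for chunks written
    before the key existed."""
    return chunk_metadata.get("metadata_token_count", 0)


old_chunk = {"document_id": "doc-1", "token_count": 12}
new_chunk = {"document_id": "doc-2", "token_count": 12, "metadata_token_count": 4}

total = sum(metadata_token_count(m) for m in (old_chunk, new_chunk))
print(total)
```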
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
This extracts the W3C trace context headers (traceparent and tracestate)
from incoming requests, stuffs them as attributes on the spans we
create, and uses them within the tracing provider implementation to
actually wrap our spans in the proper context.
What this means in practice is that when a client (such as an OpenAI
client) is instrumented to create these traces, we'll continue that
distributed trace within Llama Stack as opposed to creating our own root
span that breaks the distributed trace between client and server.
It's slightly awkward to do this in Llama Stack because our Tracing API
knows nothing about OpenTelemetry, W3C trace headers, etc. Only the
specific provider implementation has that knowledge. That's why the
trace headers get extracted in the server code but are not actually used
until the provider implementation forms the proper context.
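To make the header handling concrete, here is a small sketch of pulling the two W3C headers out of a request and splitting a version-00 `traceparent` value into its fields. The function names are hypothetical and this is not the server or provider code from this PR.

```python
from typing import Dict, Mapping, Optional

TRACE_HEADERS = ("traceparent", "tracestate")
HEX_DIGITS = set("0123456789abcdef")


def extract_trace_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Pick out the W3C trace context headers (header names are case-insensitive)."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


def parse_traceparent(value: str) -> Optional[Dict[str, str]]:
    """Split a version-00 traceparent: version-traceid-parentid-flags."""
    parts = value.strip().split("-")
    if len(parts) != 4:
        return None
    version, trace_id, parent_id, flags = parts
    expected_lengths = ((version, 2), (trace_id, 32), (parent_id, 16), (flags, 2))
    for field, length in expected_lengths:
        if len(field) != length or not set(field) <= HEX_DIGITS:
            return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
    }


incoming = {
    "Content-Type": "application/json",
    "Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
}
trace_attrs = extract_trace_headers(incoming)
parsed = parse_traceparent(trace_attrs["traceparent"])
print(parsed)
```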
This also centralizes how we were adding the `__root__` and
`__root_span__` attributes, as those two were being added in different
parts of the code instead of from a single place.
Closes #2097
## Test Plan
This was tested manually using the helpful scripts from #2097. I
verified that Llama Stack properly joined the client's span when the
client was instrumented for distributed tracing, and that Llama Stack
properly started its own root span when the incoming request was not
part of an existing trace.
Here's an example of the joined spans:

Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This PR introduces APIs to retrieve past chat completion requests, which
will be used in the LS UI.
Our current `Telemetry` is ill-suited for this purpose as it's untyped
so we'd need to filter by obscure attribute names, making it brittle.
Since these APIs are 'provided by stack' and don't need to be
implemented by inference providers, we introduce a new InferenceProvider
class, containing the existing inference protocol, which is implemented
by inference providers.
The APIs are OpenAI-compliant, with an additional `input_messages`
field.
## Test Plan
This PR just adds the APIs and marks them provided_by_stack.
Start stack server -> doesn't crash
# What does this PR do?
Adds an inline HF SFTTrainer provider. Alongside torchtune, this is a
very popular option for running training jobs. The config allows a user
to specify key fields such as the model, chat_template, device, etc.
The provider comes with one recipe, `finetune_single_device`, which works
both with and without LoRA.
Any model that is a valid HF identifier can be given, and the model will
be pulled.
This has been tested so far with CPU and MPS device types, but should be
compatible with CUDA out of the box.
The provider processes the given dataset into the proper format,
establishes the steps per epoch, steps per save, and steps per eval,
sets a sane SFTConfig, and runs n_epochs of training.
If checkpoint_dir is None, no model is saved. If there is a checkpoint
dir, a model is saved every `save_steps` and at the end of training.
## Test Plan
Re-enabled the post_training integration test suite with a single test
that loads the simpleqa dataset
(https://huggingface.co/datasets/llamastack/simpleqa) and a tiny granite
model (https://huggingface.co/ibm-granite/granite-3.3-2b-instruct). The
test now uses the llama stack client and the proper post_training API,
and runs one step with a batch_size of 1. This test runs on CPU on the
Ubuntu runner, so it needs a small batch and a single step.
[//]: # (## Documentation)
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
This PR allows users to customize the template used for chunks when
they are inserted into the context. Additionally, this enables metadata
injection into the context of an LLM for RAG. This makes a naive and
crude assumption that each chunk should include the metadata, which is
obviously redundant when multiple chunks are returned from the same
document. Removing that duplication would require much more significant
changes, so this is a reasonable first step that unblocks users
requesting this enhancement in
https://github.com/meta-llama/llama-stack/issues/1767.
In the future, this can be extended to support citations.
List of Changes:
- `llama_stack/apis/tools/rag_tool.py`
- Added `chunk_template` field in `RAGQueryConfig`.
- Added `field_validator` to validate the `chunk_template` field in
`RAGQueryConfig`.
- Ensured the `chunk_template` field includes placeholders `{index}` and
`{chunk.content}`.
- Updated the `query` method to use the `chunk_template` for formatting
chunk text content.
- `llama_stack/providers/inline/tool_runtime/rag/memory.py`
- Modified the `insert` method to pass `doc.metadata` for chunk
creation.
- Enhanced the `query` method to format results using `chunk_template`
and exclude unnecessary metadata fields like `token_count`.
- `llama_stack/providers/utils/memory/vector_store.py`
- Updated `make_overlapped_chunks` to include metadata serialization and
token count for both content and metadata.
- Added error handling for metadata serialization issues.
- `pyproject.toml`
- Added `pydantic.field_validator` as a recognized `classmethod`
decorator in the linting configuration.
- `tests/integration/tool_runtime/test_rag_tool.py`
- Refactored test assertions to separate `assert_valid_chunk_response`
and `assert_valid_text_response`.
- Added integration tests to validate `chunk_template` functionality
with and without metadata inclusion.
- Included a test case to ensure `chunk_template` validation errors are
raised appropriately.
- `tests/unit/rag/test_vector_store.py`
- Added unit tests for `make_overlapped_chunks`, verifying chunk
creation with overlapping tokens and metadata integrity.
- Added tests to handle metadata serialization errors, ensuring proper
exception handling.
- `docs/_static/llama-stack-spec.html`
- Added a new `chunk_template` field of type `string` with a default
template for formatting retrieved chunks in RAGQueryConfig.
- Updated the `required` fields to include `chunk_template`.
- `docs/_static/llama-stack-spec.yaml`
- Introduced `chunk_template` field with a default value for
RAGQueryConfig.
- Updated the required configuration list to include `chunk_template`.
- `docs/source/building_applications/rag.md`
- Documented the `chunk_template` configuration, explaining how to
customize metadata formatting in RAG queries.
- Added examples demonstrating the usage of the `chunk_template` field
in RAG tool queries.
- Highlighted default values for `RAG` agent configurations.
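The placeholder check and the formatting step described in the list above can be sketched roughly as follows. This is not the code from `rag_tool.py` or `memory.py`: the `Chunk` dataclass, the function name, and the `{metadata}` placeholder are assumptions for illustration; only `{index}` and `{chunk.content}` are named in this PR description.

```python
from dataclasses import dataclass, field

REQUIRED_PLACEHOLDERS = ("{index}", "{chunk.content}")


@dataclass
class Chunk:
    # Hypothetical stand-in for the real chunk type.
    content: str
    metadata: dict = field(default_factory=dict)


def validate_chunk_template(template: str) -> str:
    """Reject templates that omit a required placeholder."""
    missing = [p for p in REQUIRED_PLACEHOLDERS if p not in template]
    if missing:
        raise ValueError(f"chunk_template is missing placeholders: {missing}")
    return template


# `{metadata}` is an assumed extra placeholder, used here to show metadata injection.
template = validate_chunk_template(
    "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n"
)
chunk = Chunk(content="Do great work.", metadata={"author": "Paul Graham"})

# str.format resolves {chunk.content} through attribute access on `chunk`.
text = template.format(index=1, chunk=chunk, metadata=chunk.metadata)
print(text)
```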
# Resolves https://github.com/meta-llama/llama-stack/issues/1767
## Test Plan
Updated both `test_vector_store.py` and `test_rag_tool.py` and tested
end-to-end with a script.
I also tested the quickstart to enable this and specified this metadata:
```python
document = RAGDocument(
document_id="document_1",
content=source,
mime_type="text/html",
metadata={"author": "Paul Graham", "title": "How to do great work"},
)
```
Which produced the output below:

This highlights the usefulness of the additional metadata. Notice how
the metadata is redundant for different chunks of the same document. I
think we can update that in a subsequent PR.
# Documentation
I've added a brief comment about this in the documentation to outline
this to users and updated the API documentation.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
In the Responses API, we convert incoming response requests to chat
completion requests. When streaming the resulting chunks of those chat
completion requests, inference providers that use OpenAI clients will
often return a `type=None` value in the tool call parts of the response.
This causes issues when we try to dump and load that response into our
pydantic model, because type cannot be None in the Responses API model
we're loading these into.
So, strip the "type" field, if present, off those chat completion tool
call results before dumping and loading them as our typed pydantic
models, which will apply our default value for that type field.
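A minimal sketch of the idea, using a dataclass as a hypothetical stand-in for the typed pydantic model (the field names and the `"function"` default are assumptions for illustration):

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    # Hypothetical stand-in for the typed model; `type` has a default
    # and must not be None.
    id: str
    name: str
    arguments: str
    type: str = "function"


def strip_type(tool_call: dict) -> dict:
    """Drop the "type" key, if present, so the model's default applies."""
    return {key: value for key, value in tool_call.items() if key != "type"}


# Shape of a streamed tool call as some OpenAI-client-based providers return it.
raw = {"id": "call_1", "name": "get_weather", "arguments": "{}", "type": None}

tool_call = ToolCall(**strip_type(raw))
print(tool_call.type)
```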
## Test Plan
This was found via manual testing of the Responses API with codex, where
I was getting errors in some tool call situations. I added a unit test
to simulate this scenario and verify the fix, as well as manual codex
testing to verify the fix.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This is a combination of what was previously 3 separate PRs - #2069,
#2075, and #2083. It turns out all 3 of those are needed to land a
working function calling Responses implementation. The web search
builtin tool was already working, but this wires in support for custom
function calling.
I ended up combining all three into one PR because they all had lots of
merge conflicts, both with each other but also with #1806 that just
landed. And, because landing any of them individually would have only
left a partially working implementation merged.
The new things added here are:
* Storing of input items from previous responses and restoring of those
input items when adding previous responses to the conversation state
* Handling of multiple input item messages roles, not just "user"
messages.
* Support for custom tools passed into the Responses API to enable
function calling outside of just the builtin websearch tool.
Closes #2074
Closes #2080
## Test Plan
### Unit Tests
Several new unit tests were added, and they all pass. Ran via:
```
python -m pytest -s -v tests/unit/providers/agents/meta_reference/test_openai_responses.py
```
### Responses API Verification Tests
I ran our verification run.yaml against multiple providers to ensure we
were getting a decent pass rate. Specifically, I ensured the new custom
tool verification test passed across multiple providers and that the
multi-turn examples passed across at least some of the providers (some
providers struggle with the multi-turn workflows still).
Running the stack setup for verification testing:
```
llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml
```
Together, passing 100% as an example:
```
pytest -s -v 'tests/verifications/openai_api/test_responses.py' --provider=together-llama-stack
```
## Documentation
We will need to start documenting the OpenAI APIs, but for now the
Responses stuff is still rapidly evolving so delaying that.
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
This PR fixes the behavior of the `/tool-runtime/rag-tool/query`
endpoint when invoked with an empty `vector_db_ids` parameter.
As of now, it simply returns an empty result, which leads to a
misleading error message from the server and makes it difficult and
time-consuming to detect the problem with the input parameter.
The proposed fix is to return an indicative error message in this case.
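In outline, the check is an early validation before any retrieval happens. The sketch below is illustrative only: the function name and signature are hypothetical, while the error text matches the patched output shown in the test plan.

```python
from typing import List


def validate_vector_db_ids(vector_db_ids: List[str]) -> List[str]:
    """Fail fast with a descriptive error instead of returning an empty result."""
    if not vector_db_ids:
        raise ValueError(
            "No vector DBs were provided to the RAG tool. Please provide at least one DB."
        )
    return vector_db_ids


try:
    validate_vector_db_ids([])
    error_message = ""
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```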
## Test Plan
Running the following script:
```
agent = Agent(
client,
model=MODEL_ID,
instructions=SYSTEM_PROMPT,
tools=[
dict(
name="builtin::rag/knowledge_search",
args={
"vector_db_ids": [],
},
)
],
)
response = agent.create_turn(
messages=[
{
"role": "user",
"content": "How to install OpenShift?",
}
],
session_id=agent.create_session(f"rag-session")
)
```
results in the following error message in the non-patched version:
```
{"type": "function", "name": "knowledge_search", "parameters": {"query": "installing OpenShift"}}400: Invalid value: Tool call result (id: 494b8020-90bb-449b-aa76-10960d6b2cc2, name: knowledge_search) does not have any content
```
and in the following one in the patched version:
```
{"type": "function", "name": "knowledge_search", "parameters": {"query": "installing OpenShift"}}400: Invalid value: No vector DBs were provided to the RAG tool. Please provide at least one DB.
```
# What does this PR do?
Adds the API to query metrics from telemetry.
## Test Plan
llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml
---------
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Revert a change that by mistake forced efficiency_config on torchtune
provider users.
```
fix: Don't require efficiency_config for torchtune
It was enforced by mistake when
0751a960a5 merged.
Other asserts made sense in that the code was written, potentially, to
always expect a non-None value. But not efficiency_config.
```
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
The builtin implementation of code interpreter is not robust and has a
really weak sandboxing shell (the `bubblewrap` container). Given the
availability of better MCP code interpreter servers coming up, we should
use them instead of baking an implementation into the Stack and
expanding the vulnerability surface to the rest of the Stack.
This PR only does the removal. We will add examples of how to
integrate with MCPs in subsequent ones.
## Test Plan
Existing tests.
# What does this PR do?
The goal of this PR is code base modernization.
Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)
Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
changes worth noting are under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py`, where reflection code was updated
to deal with "newer" types.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
Add support for the `temperature` parameter to the Responses API.
## Test Plan
Manually tested simple case
unit tests added for simple case and tool calls
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Replaced `${env.OTEL_SERVICE_NAME:\u200B}` and similar variants with
properly formatted `${env.OTEL_SERVICE_NAME:}` across all YAML templates
and TelemetryConfig. This prevents silent parsing issues and ensures
consistent environment variable resolution.
Slipped in https://github.com/meta-llama/llama-stack/pull/2058
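A zero-width space is invisible in most editors, so a small script is a practical way to find and remove it. The sketch below is a generic helper with hypothetical names, not part of this PR:

```python
ZERO_WIDTH_SPACE = "\u200b"


def has_zero_width_space(text: str) -> bool:
    """Report whether the text contains an invisible U+200B character."""
    return ZERO_WIDTH_SPACE in text


def strip_zero_width_space(text: str) -> str:
    """Remove every U+200B character from the text."""
    return text.replace(ZERO_WIDTH_SPACE, "")


broken = "service_name: ${env.OTEL_SERVICE_NAME:\u200b}"
fixed = strip_zero_width_space(broken)

print(has_zero_width_space(broken), has_zero_width_space(fixed))
```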
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The telemetry provider config is the only one that uses the env var
`SQLITE_DB_PATH` to point to persistent data in the respective
templates, whereas usually `SQLITE_STORE_DIR` is used.
This PR modifies the `sqlite_db_path` in various telemetry configuration
files to use the environment variable `SQLITE_STORE_DIR` instead of
`SQLITE_DB_PATH`. This change ensures that _only_ the SQLITE_STORE_DIR
needs to be set to point to a different persistence location for
providers.
All references to `SQLITE_DB_PATH` have been removed.
Another improvement could be to move `sqlite_db_path` to `db_path` in
the telemetry provider config, to align with the other provider
configurations. That could be done by another PR (if wanted).
# What does this PR do?
This provides an initial [OpenAI Responses
API](https://platform.openai.com/docs/api-reference/responses)
implementation. The API is not yet complete, and this is more a
proof-of-concept to show how we can store responses in our key-value
stores and use them to support the Responses API concepts like
`previous_response_id`.
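The storage idea can be sketched with an in-memory dict standing in for the key-value store. Everything here is hypothetical (class name, record layout, ID format); it only illustrates how a stored response lets a later request rebuild the conversation through `previous_response_id`.

```python
import uuid
from typing import Dict, List, Optional


class ResponseStore:
    """In-memory stand-in for a key-value store of responses."""

    def __init__(self) -> None:
        self._kv: Dict[str, dict] = {}

    def create(
        self,
        input_messages: List[dict],
        output_message: dict,
        previous_response_id: Optional[str] = None,
    ) -> str:
        history: List[dict] = []
        if previous_response_id is not None:
            # Re-hydrate the earlier turns from the stored response.
            previous = self._kv[previous_response_id]
            history = previous["input"] + [previous["output"]]
        response_id = f"resp-{uuid.uuid4()}"
        # Store the full (re-hydrated) input so the chain can continue.
        self._kv[response_id] = {
            "input": history + input_messages,
            "output": output_message,
        }
        return response_id

    def conversation(self, response_id: str) -> List[dict]:
        record = self._kv[response_id]
        return record["input"] + [record["output"]]


store = ResponseStore()
first_id = store.create(
    [{"role": "user", "content": "What is the capital of France?"}],
    {"role": "assistant", "content": "Paris."},
)
second_id = store.create(
    [{"role": "user", "content": "And its population?"}],
    {"role": "assistant", "content": "About two million in the city proper."},
    previous_response_id=first_id,
)
messages = store.conversation(second_id)
print(len(messages))
```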
## Test Plan
I've added a new
`tests/integration/openai_responses/test_openai_responses.py` as part of
a test-driven development for this new API. I'm only testing this
locally with the remote-vllm provider for now, but it should work with
any of our inference providers since the only API it requires out of the
inference provider is the `openai_chat_completion` endpoint.
```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack build --template remote-vllm --image-type venv --run
```
```
LLAMA_STACK_CONFIG="http://localhost:8321" \
python -m pytest -v \
tests/integration/openai_responses/test_openai_responses.py \
--text-model "meta-llama/Llama-3.2-3B-Instruct"
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
This PR addresses the content dominance problem that frequently arises
with multiple models when executing queries with the RAG tool. When the
retrieved content is too large, it disproportionately influences the
generation process, causing the model to ignore the original question
and to provide meaningless comments on the retrieved information
instead.
This situation is especially common with agentic RAG, which is the
standard way of doing RAG in Llama Stack, since directly manipulating
the prompt combining the query with the retrieved content is not
possible.
This PR appends a grounding message to the results returned by the
knowledge search tool, reminding the model about the original query and
the purpose of the inference call. This makes the problem significantly
less likely to occur.
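A rough sketch of the approach is below. The function name and the exact wording of the grounding message are hypothetical; the PR's actual text may differ.

```python
from typing import List


def build_tool_result(query: str, chunks: List[str]) -> str:
    """Join retrieved chunks and append a reminder of the original query."""
    lines = [f"Retrieved {len(chunks)} chunks:"]
    for index, chunk in enumerate(chunks, start=1):
        lines.append(f"Result {index}: {chunk}")
    # Hypothetical grounding text; it restates the query after the
    # (possibly very long) retrieved content.
    lines.append(
        f'The above results were retrieved to help answer the query: "{query}". '
        "Use them as supporting information only when answering it."
    )
    return "\n".join(lines)


result = build_tool_result(
    "How to install OpenShift?",
    ["OpenShift can be installed with the installer.", "Prerequisites include..."],
)
print(result)
```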
## Test Plan
Running the following script before the fix demonstrates the content
dominance problem, where the model insists on commenting on the retrieved
content and refuses to address the question.
Running the script after the fix results in getting the correct answer.
```
import os
import uuid
from llama_stack_client import Agent, AgentEventLogger, RAGDocument, LlamaStackClient
# the server endpoint
LLAMA_STACK_SERVER_URL = "http://localhost:8321"
# inference settings
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
SYSTEM_PROMPT = "You are a helpful assistant. "
# RAG settings
VECTOR_DB_EMBEDDING_MODEL = "all-MiniLM-L6-v2"
VECTOR_DB_EMBEDDING_DIMENSION = 384
VECTOR_DB_CHUNK_SIZE = 512
# initialize the server connection
client = LlamaStackClient(base_url=os.environ.get("LLAMA_STACK_ENDPOINT", LLAMA_STACK_SERVER_URL))
# init the RAG retrieval parameters
vector_db_id = f"test_vector_db_{uuid.uuid4()}"
vector_providers = [
provider for provider in client.providers.list() if provider.api == "vector_io"
]
vector_provider_to_use = vector_providers[0]
# define and register the document collection to be used
client.vector_dbs.register(
vector_db_id=vector_db_id,
embedding_model=VECTOR_DB_EMBEDDING_MODEL,
embedding_dimension=VECTOR_DB_EMBEDDING_DIMENSION,
provider_id=vector_provider_to_use.provider_id,
)
# ingest the documents into the newly created document collection
urls = [
("https://www.openshift.guide/openshift-guide-screen.pdf", "application/pdf"),
]
documents = [
RAGDocument(
document_id=f"num-{i}",
content=url,
mime_type=url_type,
metadata={},
)
for i, (url, url_type) in enumerate(urls)
]
client.tool_runtime.rag_tool.insert(
documents=documents,
vector_db_id=vector_db_id,
chunk_size_in_tokens=VECTOR_DB_CHUNK_SIZE,
)
queries = [
"How to install OpenShift?",
]
# initializing the agent
agent = Agent(
client,
model=MODEL_ID,
instructions=SYSTEM_PROMPT,
# we make our agent aware of the RAG tool by including builtin::rag/knowledge_search in the list of tools
tools=[
dict(
name="builtin::rag/knowledge_search",
args={
"vector_db_ids": [vector_db_id], # list of IDs of document collections to consider during retrieval
},
)
],
)
for prompt in queries:
    print(f"User> {prompt}")
    # create a new turn with a new session ID for each prompt
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=agent.create_session(f"rag-session_{uuid.uuid4()}"),
    )
    # print the response, including tool calls output
    for log in AgentEventLogger().log(response):
        print(log.content, end='')
```
Test plan:
python tests/verifications/generate_report.py --providers fireworks,together,llama_meta_ref,openai
Co-authored-by: Eric Huang <erichuang@fb.com>
# What does this PR do?
Allow users to name an agent and use the name in telemetry instead of
relying on randomly generated agent_ids. This improves the developer
experience by making it easier to find specific agents in telemetry
logs.
Closes #1832
## Test Plan
- Added tests to verify the agent name is properly stored and retrieved
- Ran `uv run -- pytest -v
tests/integration/telemetry/test_telemetry.py::test_agent_name_filtering`
from the root of the project and made sure the tests pass
- Ran `uv run -- pytest -v
tests/integration/telemetry/test_telemetry.py::test_agent_query_spans`
to verify existing code without agent names still works correctly
## Use Example
```
agent = Agent(
llama_stack_client,
model=text_model_id,
name="CustomerSupportAgent", # New parameter
instructions="You are a helpful customer support assistant"
)
session_id = agent.create_session(f"test-session-{uuid4()}")
```
## Implementation Notes
- Agent names are optional string parameters with no additional
validation
- Names are not required to be unique - multiple agents can have the
same name
- The agent_id remains the unique identifier for an agent
---------
Co-authored-by: raghotham <raghotham@gmail.com>
# What does this PR do?
Training jobs now execute in a separate thread, and training requests
return the job ID before the job completes. (This fixes API timeouts for
any jobs that take longer than a minute.)
Note: the scheduler code is meant to be spun out in the future into a
common provider service that can be reused for different APIs and
providers. It is also expected to back the /jobs API proposed here:
https://github.com/meta-llama/llama-stack/discussions/1238
Hence its somewhat generalized form which is expected to simplify its
adoption elsewhere in the future.
Note: this patch doesn't attempt to implement missing APIs (e.g. cancel
or job removal). This work will belong to follow-up PRs.
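The pattern of returning a job ID immediately while the work runs in a background thread can be sketched as follows. This is an illustration only, with hypothetical names and status strings; it is not the scheduler module added by this PR.

```python
import threading
import uuid
from typing import Callable, Dict, Optional


class Scheduler:
    """Run each job in its own thread and hand back the job ID right away."""

    def __init__(self) -> None:
        self._status: Dict[str, str] = {}
        self._threads: Dict[str, threading.Thread] = {}
        self._lock = threading.Lock()

    def _set_status(self, job_id: str, status: str) -> None:
        with self._lock:
            self._status[job_id] = status

    def schedule(self, work: Callable[[], object]) -> str:
        job_id = str(uuid.uuid4())
        self._set_status(job_id, "scheduled")

        def run() -> None:
            self._set_status(job_id, "running")
            try:
                work()
                self._set_status(job_id, "completed")
            except Exception:
                self._set_status(job_id, "failed")

        thread = threading.Thread(target=run, daemon=True)
        self._threads[job_id] = thread
        thread.start()
        # Return without waiting for the job to finish.
        return job_id

    def status(self, job_id: str) -> str:
        with self._lock:
            return self._status[job_id]

    def wait(self, job_id: str, timeout: Optional[float] = None) -> None:
        self._threads[job_id].join(timeout)


# A job that blocks until released stands in for a long training run.
release = threading.Event()
scheduler = Scheduler()
job_id = scheduler.schedule(lambda: release.wait(timeout=5))
status_while_running = scheduler.status(job_id)
release.set()
scheduler.wait(job_id)
final_status = scheduler.status(job_id)
print(status_while_running, final_status)
```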
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
Added unit tests for the scheduler module. For the API coverage, did
manual testing and was able to run a training cycle on GPU. The initial
call returned job ID before the training completed, as (now) expected.
Artifacts are returned as expected.
```
JobArtifactsResponse(checkpoints=[{'identifier': 'meta-llama/Llama-3.2-3B-Instruct-sft-0', 'created_at': '2025-03-07T22:45:19.892714', 'epoch': 0, 'post_training_job_id': 'test-job2ee77104-2fd3-4a4e-84cf-f83f8b8f1f50', 'path': '/home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0', 'training_metrics': None}], job_uuid='test-job2ee77104-2fd3-4a4e-84cf-f83f8b8f1f50')
```
The integration test is currently disabled for the provider. I will look
into how it can be enabled in a different PR / issue context.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
TLDR: Changes needed to get 100% passing tests for OpenAI API
verification tests when run against Llama Stack with the `together`,
`fireworks`, and `openai` providers. And `groq` is better than before,
at 88% passing.
This cleans up the OpenAI API support for image message types
(specifically `image_url` types) and handling of the `response_format`
chat completion parameter. Both of these required a few more Pydantic
model definitions in our Inference API, just to move from the
not-quite-right stubs I had in place to something fleshed out to match
the actual OpenAI API specs.
As part of testing this, I also found and fixed a bug in the litellm
implementation of openai_completion and openai_chat_completion, so the
providers based on those should actually be working now.
The method `prepare_openai_completion_params` in
`llama_stack/providers/utils/inference/openai_compat.py` was improved to
actually recursively clean up input parameters, including handling of
lists, dicts, and dumping of Pydantic models to dicts. These changes
were required to get to 100% passing tests on the OpenAI API
verification against the `openai` provider.
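To make the recursive cleanup concrete, here is a minimal synchronous
sketch of the idea. It is not the actual body of
`prepare_openai_completion_params`: the function names are hypothetical,
dropping top-level `None` values is an assumption about what "clean up"
means, and Pydantic models are detected by duck typing on `model_dump`
so the example runs without Pydantic installed.

```python
from typing import Any, Dict


def _prepare_value(value: Any) -> Any:
    """Recursively convert lists, dicts, and model-like objects to plain data."""
    if isinstance(value, list):
        return [_prepare_value(v) for v in value]
    if isinstance(value, dict):
        return {k: _prepare_value(v) for k, v in value.items()}
    # Assumption: anything with a callable model_dump() is a Pydantic v2 model.
    model_dump = getattr(value, "model_dump", None)
    if callable(model_dump):
        # Recurse into the dumped dict in case it still holds nested objects.
        return _prepare_value(model_dump(exclude_none=True))
    return value


def prepare_params_sketch(**params: Any) -> Dict[str, Any]:
    """Drop unset (None) parameters and normalize the rest."""
    return {k: _prepare_value(v) for k, v in params.items() if v is not None}
```

With this shape, a message list containing nested content objects comes
out as plain dicts and lists that can be passed straight to an
OpenAI-style client.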
With the above, the together.ai provider was passing as well as it is
without Llama Stack. But, since we have Llama Stack in the middle, I
took the opportunity to clean up the together.ai provider so that it now
also passes the OpenAI API spec tests we have at 100%. That means
together.ai is now passing our verification test better when using an
OpenAI client talking to Llama Stack than it is when hitting together.ai
directly, without Llama Stack in the middle.
And, another round of work for Fireworks to improve translation of
incoming OpenAI chat completion requests to Llama Stack chat completion
requests gets the fireworks provider passing at 100%. The server-side
fireworks.ai tool calling support with OpenAI chat completions and Llama
4 models isn't great yet, but by pointing the OpenAI clients at Llama
Stack's API we can clean things up and get everything working as
expected for Llama 4 models.
## Test Plan
### OpenAI API Verification Tests
I ran the OpenAI API verification tests as below and 100% of the tests
passed.
To reproduce, start a Llama Stack server that runs the `openai` provider
with the `gpt-4o` and `gpt-4o-mini` models deployed. There is no
template set up to do this out of the box, so I added
`tests/verifications/openai-api-verification-run.yaml` for it.
First, ensure you have the necessary API key environment variables set:
```
export TOGETHER_API_KEY="..."
export FIREWORKS_API_KEY="..."
export OPENAI_API_KEY="..."
```
Then, run a Llama Stack server that serves up all these providers:
```
llama stack run \
--image-type venv \
tests/verifications/openai-api-verification-run.yaml
```
Finally, generate a new verification report against all these providers,
both with and without the Llama Stack server in the middle.
```
python tests/verifications/generate_report.py \
--run-tests \
--provider \
together \
fireworks \
groq \
openai \
together-llama-stack \
fireworks-llama-stack \
groq-llama-stack \
openai-llama-stack
```
You'll see that most of the configurations with Llama Stack in the
middle now pass at 100%, even though some of them do not pass at 100%
when hitting the backend provider's API directly with an OpenAI client.
### OpenAI Completion Integration Tests with vLLM
I also ran the smaller `test_openai_completion.py` test suite (which is
not yet merged with the verification tests) against several providers,
since I had to adjust the method signature of `openai_chat_completion`
a bit and thus had to touch many of these providers to match. Here are
the tests I ran, all passing:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
```
In another terminal:
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```
### OpenAI Completion Integration Tests with ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
```
In another terminal:
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```
### OpenAI Completion Integration Tests with together.ai
```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" llama stack build --template together --image-type venv --run
```
In another terminal:
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct-Turbo"
```
### OpenAI Completion Integration Tests with fireworks.ai
```
INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" llama stack build --template fireworks --image-type venv --run
```
In another terminal:
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.1-8B-Instruct"
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This PR adds two methods to the Inference API:
- `batch_completion`
- `batch_chat_completion`
The motivation is for evaluations targeting a local inference engine
(like meta-reference or vllm) where batch APIs provide for a substantial
amount of acceleration.
Why did I not add this to `Api.batch_inference` though? That just
resulted in a _lot_ more book-keeping given the structure of Llama
Stack. Had I done that, I would have needed to create a notion of a
"batch model" resource, set up routing based on that, etc. This does
not sound ideal.
So what's the future of the batch inference API? I am not sure. Maybe we
can keep it for true _asynchronous_ execution. So you can submit
requests, and it can return a Job instance, etc.
## Test Plan
Run meta-reference-gpu using:
```bash
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000
export MODEL_PARALLEL_SIZE=4
export MAX_BATCH_SIZE=32
export MAX_SEQ_LEN=6144
LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu
```
Then run the batch inference test case.
# What does this PR do?
Today, `supervised_fine_tune` itself and the `TrainingConfig` class have
a number of required fields that a provider implementation might not
need. For example, if a provider wants to handle hyperparameters in its
own configuration, along with any dataset retrieval, optimizer, or LoRA
config, a user still needs to pass in a virtually empty `DataConfig`,
`OptimizerConfig`, and `AlgorithmConfig` in some cases.
Many of these fields are intended specifically for Llama models and for
knobs that customize inline providers.
Adding remote post_training providers will require loosening these
arguments, or forcing users to pass in empty objects to satisfy the
pydantic models.
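The effect of loosening these arguments can be sketched as follows. This
is an illustration only: it uses standard-library dataclasses rather
than the real Pydantic models, and every class, field, and function name
below is a hypothetical stand-in rather than the actual Llama Stack
definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataConfigSketch:
    dataset_id: str
    batch_size: int


@dataclass
class OptimizerConfigSketch:
    lr: float


@dataclass
class TrainingConfigSketch:
    # Still required in this sketch.
    n_epochs: int
    # Optional: a remote provider may take these from its own configuration.
    data_config: Optional[DataConfigSketch] = None
    optimizer_config: Optional[OptimizerConfigSketch] = None


def resolve_lr(config: TrainingConfigSketch, provider_default: float) -> float:
    """Prefer the user's optimizer settings, else fall back to the provider's."""
    if config.optimizer_config is not None:
        return config.optimizer_config.lr
    return provider_default
```

With optional fields, a caller targeting a remote provider can write
`TrainingConfigSketch(n_epochs=1)` instead of constructing empty
placeholder objects.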
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
This stubs in some OpenAI server-side compatibility with three new
endpoints:
* `/v1/openai/v1/models`
* `/v1/openai/v1/completions`
* `/v1/openai/v1/chat/completions`
This gives common inference apps using OpenAI clients the ability to
talk to Llama Stack using an endpoint like
http://localhost:8321/v1/openai/v1 .
The two "v1" instances in there aren't awesome, but the thinking is that
Llama Stack's API is v1 and our OpenAI compatibility layer is
compatible with OpenAI V1. Also, some OpenAI clients implicitly assume
the URL ends with "v1", so this gives maximum compatibility.
The openai models endpoint is implemented in the routing layer, and just
returns all the models Llama Stack knows about.
The following providers should be working with the new OpenAI
completions and chat/completions API:
* remote::anthropic (untested)
* remote::cerebras-openai-compat (untested)
* remote::fireworks (tested)
* remote::fireworks-openai-compat (untested)
* remote::gemini (untested)
* remote::groq-openai-compat (untested)
* remote::nvidia (tested)
* remote::ollama (tested)
* remote::openai (untested)
* remote::passthrough (untested)
* remote::sambanova-openai-compat (untested)
* remote::together (tested)
* remote::together-openai-compat (untested)
* remote::vllm (tested)
The goal is to support this for every inference provider - proxying
directly to the provider's OpenAI endpoint for OpenAI-compatible
providers. For providers that don't have an OpenAI-compatible API, we'll
add a mixin to translate incoming OpenAI requests to Llama Stack
inference requests and translate the Llama Stack inference responses to
OpenAI responses.
This is related to #1817 but is a bit larger in scope than just chat
completions, as I have real use-cases that need the older completions
API as well.
## Test Plan
### vLLM
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```
### ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```
## Documentation
Run a Llama Stack distribution that uses one of the providers mentioned
in the list above. Then, use your favorite OpenAI client to send
completion or chat completion requests with the base_url set to
http://localhost:8321/v1/openai/v1 . Replace "localhost:8321" with the
host and port of your Llama Stack server, if different.
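As a hedged example of what such a request looks like on the wire, the
sketch below builds a chat completion request against the base URL above
using only the Python standard library. The helper names
`build_chat_request` and `send` are made up for this example, and the
model name is just the one used in the test plan. With the official
`openai` Python package, the equivalent is typically to pass the same
`base_url` when constructing the client.

```python
import json
import urllib.request
from typing import Any, Dict

# Assumption: a Llama Stack server is listening on the default local port.
BASE_URL = "http://localhost:8321/v1/openai/v1"


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> Dict[str, Any]:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (requires a running server):
#   req = build_chat_request(BASE_URL, "meta-llama/Llama-3.2-3B-Instruct", "Hello")
#   print(send(req))
```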
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Move around bits. This makes the copies from llama-models _much_ easier
to maintain and ensures we don't entangle meta-reference specific
tidbits into llama-models code even by accident.
Also, kills the meta-reference-quantized-gpu distro and rolls
quantization deps into meta-reference-gpu.
## Test Plan
```
LLAMA_MODELS_DEBUG=1 \
with-proxy llama stack run meta-reference-gpu \
--env INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct \
--env INFERENCE_CHECKPOINT_DIR=<DIR> \
--env MODEL_PARALLEL_SIZE=4 \
--env QUANTIZATION_TYPE=fp8_mixed
```
Start a server with and without quantization. Point integration tests to
it using:
```
pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config http://localhost:8321 --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Running full Tool Calling required some updates to work e2e.
- Remove `python_start` and `python_end` tags
- Tool Call messages and Tool Response messages should end with
`<|eom|>`
- System prompt needed updates
```
You are a helpful assisant who can can answer general questions or invoke tools when necessary.
In addition to tool calls, you should also augment your responses by using the tool outputs.
```
### Test Plan
- Start server with meta-reference
```
LLAMA_STACK_DISABLE_VERSION_CHECK=1 LLAMA_MODELS_DEBUG=1 INFERENCE_MODEL=meta-llama/$MODEL llama stack run meta-reference-gpu
```
- Added **NEW** tests with 5 test cases for multi-turn tool calls
```
pytest -s -v --stack-config http://localhost:8321 tests/integration/inference/test_text_inference.py --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
- Also verified all vision and agent tests pass