# What does this PR do?
Adds a new endpoint that is compatible with OpenAI for embeddings api.
`/openai/v1/embeddings`
Added providers for OpenAI, LiteLLM and SentenceTransformer.
## Test Plan
```
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -sv tests/integration/inference/test_openai_embeddings.py --embedding-model all-MiniLM-L6-v2,text-embedding-3-small,gemini/text-embedding-004
```
# What does this PR do?
* Added support postgresql inference store
* Added 'oracle' template that demos how to config postgresql stores
(except for telemetry, which is not supported currently)
## Test Plan
llama stack build --template oracle --image-type conda --run
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -s -v tests/integration/
--text-model accounts/fireworks/models/llama-v3p3-70b-instruct -k
'inference_store'
# What does this PR do?
Fix a bug in openai_compat where choices are not indexed correctly.
## Test Plan
Added a new test.
Rerun the failed inference_store tests:
llama stack run fireworks --image-type conda
pytest -s -v tests/integration/ --stack-config http://localhost:8321 -k
'test_inference_store' --text-model meta-llama/Llama-3.3-70B-Instruct
--count 10
# What does this PR do?
Changed the test to not require tool_call in output, but still keeping
the tools params there as a smoke test.
## Test Plan
Used llama3.3 from fireworks (same as CI)
<img width="1433" alt="image"
src="https://github.com/user-attachments/assets/1e5fca98-9b4f-402e-a0bc-d9f910f2c207"
/>
Run with ollama distro and 3b model.
When registering a MCP endpoint, we cannot list tools (like we used to)
since the MCP endpoint may be behind an auth wall. Registration can
happen much sooner (via run.yaml).
Instead, we do listing only when the _user_ actually calls listing.
Furthermore, we cache the list in-memory in the server. Currently, the
cache is not invalidated -- we may want to periodically re-list for MCP
servers. Note that they must call `list_tools` before calling
`invoke_tool` -- we use this critically.
This will enable us to list MCP servers in run.yaml
## Test Plan
Existing tests, updated tests accordingly.
The test depends on llama's tool calling ability. In the CI, we run with
a small ollama model.
The fix might be to check for either message or function_call because
the model is flaky and we aren't really testing that behavior?
# What does this PR do?
This is not part of the official OpenAI API, but we'll use this for the
logs UI.
In order to support more filtering options, I'm adopting the newly
introduced sql store in in place of the kv store.
## Test Plan
Added integration/unit tests.
The most interesting MCP servers are those with an authorization wall in
front of them. This PR uses the existing `provider_data` mechanism of
passing provider API keys for passing MCP access tokens (in fact,
arbitrary headers in the style of the OpenAI Responses API) from the
client through to the MCP server.
```
class MCPProviderDataValidator(BaseModel):
# mcp_endpoint => list of headers to send
mcp_headers: dict[str, list[str]] | None = None
```
Note how we must stuff the headers for all MCP endpoints into a single
"MCPProviderDataValidator". Unlike existing providers (e.g., Together
and Fireworks for inference) where we could name the provider api keys
clearly (`together_api_key`, `fireworks_api_key`), we cannot name these
keys for MCP. We have a single generic MCP provider which can serve
multiple "toolgroups". So we use a dict to combine all the headers for
all MCP endpoints you may want to use in an agentic call.
## Test Plan
See the added integration test for usage.
# What does this PR do?
* Provide sqlite implementation of the APIs introduced in
https://github.com/meta-llama/llama-stack/pull/2145.
* Introduced a SqlStore API: llama_stack/providers/utils/sqlstore/api.py
and the first Sqlite implementation
* Pagination support will be added in a future PR.
## Test Plan
Unit test on sql store:
<img width="1005" alt="image"
src="https://github.com/user-attachments/assets/9b8b7ec8-632b-4667-8127-5583426b2e29"
/>
Integration test:
```
INFERENCE_MODEL="llama3.2:3b-instruct-fp16" llama stack build --template ollama --image-type conda --run
```
```
LLAMA_STACK_CONFIG=http://localhost:5001 INFERENCE_MODEL="llama3.2:3b-instruct-fp16" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-fp16" -k 'inference_store and openai'
```
# What does this PR do?
adds an inline HF SFTTrainer provider. Alongside touchtune -- this is a
super popular option for running training jobs. The config allows a user
to specify some key fields such as a model, chat_template, device, etc
the provider comes with one recipe `finetune_single_device` which works
both with and without LoRA.
any model that is a valid HF identifier can be given and the model will
be pulled.
this has been tested so far with CPU and MPS device types, but should be
compatible with CUDA out of the box
The provider processes the given dataset into the proper format,
establishes the various steps per epoch, steps per save, steps per eval,
sets a sane SFTConfig, and runs n_epochs of training
if checkpoint_dir is none, no model is saved. If there is a checkpoint
dir, a model is saved every `save_steps` and at the end of training.
## Test Plan
re-enabled post_training integration test suite with a singular test
that loads the simpleqa dataset:
https://huggingface.co/datasets/llamastack/simpleqa and a tiny granite
model: https://huggingface.co/ibm-granite/granite-3.3-2b-instruct. The
test now uses the llama stack client and the proper post_training API
runs one step with a batch_size of 1. This test runs on CPU on the
Ubuntu runner so it needs to be a small batch and a single step.
[//]: # (## Documentation)
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
## Test Plan
LLAMA_STACK_CONFIG=http://localhost:8321 pytest
tests/integration/tool_runtime/test_rag_tool.py --embedding-model
text-embedding-3-small
# What does this PR do?
This fixes an issue in how we used the tool_call_buf from streaming tool
calls in the remote-vllm provider where it would end up concatenating
parameters from multiple different tool call results instead of
aggregating the results from each tool call separately.
It also fixes an issue found while digging into that where we were
accidentally mixing the json string form of tool call parameters with
the string representation of the python form, which mean we'd end up
with single quotes in what should be double-quoted json strings.
Closes#1120
## Test Plan
The following tests are now passing 100% for the remote-vllm provider,
where some of the test_text_inference were failing before this change:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/inference/test_text_inference.py --text-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/inference/test_vision_inference.py --vision-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
```
All but one of the agent tests are passing (including the multi-tool
one). See the PR at https://github.com/vllm-project/vllm/pull/17917 and
a gist at
https://gist.github.com/bbrowning/4734240ce96b4264340caa9584e47c9e for
changes needed there, which will have to get made upstream in vLLM.
Agent tests:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/agents/test_agents.py --text-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
````
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
his PR allows users to customize the template used for chunks when
inserted into the context. Additionally, this enables metadata injection
into the context of an LLM for RAG. This makes a naive and crude
assumption that each chunk should include the metadata, this is
obviously redundant when multiple chunks are returned from the same
document. In order to remove any sort of duplication of chunks, we'd
have to make much more significant changes so this is a reasonable first
step that unblocks users requesting this enhancement in
https://github.com/meta-llama/llama-stack/issues/1767.
In the future, this can be extended to support citations.
List of Changes:
- `llama_stack/apis/tools/rag_tool.py`
- Added `chunk_template` field in `RAGQueryConfig`.
- Added `field_validator` to validate the `chunk_template` field in
`RAGQueryConfig`.
- Ensured the `chunk_template` field includes placeholders `{index}` and
`{chunk.content}`.
- Updated the `query` method to use the `chunk_template` for formatting
chunk text content.
- `llama_stack/providers/inline/tool_runtime/rag/memory.py`
- Modified the `insert` method to pass `doc.metadata` for chunk
creation.
- Enhanced the `query` method to format results using `chunk_template`
and exclude unnecessary metadata fields like `token_count`.
- `llama_stack/providers/utils/memory/vector_store.py`
- Updated `make_overlapped_chunks` to include metadata serialization and
token count for both content and metadata.
- Added error handling for metadata serialization issues.
- `pyproject.toml`
- Added `pydantic.field_validator` as a recognized `classmethod`
decorator in the linting configuration.
- `tests/integration/tool_runtime/test_rag_tool.py`
- Refactored test assertions to separate `assert_valid_chunk_response`
and `assert_valid_text_response`.
- Added integration tests to validate `chunk_template` functionality
with and without metadata inclusion.
- Included a test case to ensure `chunk_template` validation errors are
raised appropriately.
- `tests/unit/rag/test_vector_store.py`
- Added unit tests for `make_overlapped_chunks`, verifying chunk
creation with overlapping tokens and metadata integrity.
- Added tests to handle metadata serialization errors, ensuring proper
exception handling.
- `docs/_static/llama-stack-spec.html`
- Added a new `chunk_template` field of type `string` with a default
template for formatting retrieved chunks in RAGQueryConfig.
- Updated the `required` fields to include `chunk_template`.
- `docs/_static/llama-stack-spec.yaml`
- Introduced `chunk_template` field with a default value for
RAGQueryConfig.
- Updated the required configuration list to include `chunk_template`.
- `docs/source/building_applications/rag.md`
- Documented the `chunk_template` configuration, explaining how to
customize metadata formatting in RAG queries.
- Added examples demonstrating the usage of the `chunk_template` field
in RAG tool queries.
- Highlighted default values for `RAG` agent configurations.
# Resolves https://github.com/meta-llama/llama-stack/issues/1767
## Test Plan
Updated both `test_vector_store.py` and `test_rag_tool.py` and tested
end-to-end with a script.
I also tested the quickstart to enable this and specified this metadata:
```python
document = RAGDocument(
document_id="document_1",
content=source,
mime_type="text/html",
metadata={"author": "Paul Graham", "title": "How to do great work"},
)
```
Which produced the output below:

This highlights the usefulness of the additional metadata. Notice how
the metadata is redundant for different chunks of the same document. I
think we can update that in a subsequent PR.
# Documentation
I've added a brief comment about this in the documentation to outline
this to users and updated the API documentation.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
reduces duplication and centralizes information to be easier to find for
contributors
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
While adding other tests, I came across this and wasn’t sure how useful
it is. It doesn’t seem to be exercised anywhere in CI, but I figured I’d
fix it anyway. Happy to remove it if preferred. :)
## Test Plan
Run:
```
uv run pytest tests/integration/inference --stack-config=ollama --report=test_report.md -v --text-model="llama3.2:3b" --embedding-model=all-MiniLM-L6-v2
```
Look at the produced `test_report.md`.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
switch sambanova inference adaptor to LiteLLM usage to simplify
integration and solve issues with current adaptor when streaming and
tool calling, models and templates updated
## Test Plan
pytest -s -v tests/integration/inference/test_text_inference.py
--stack-config=sambanova
--text-model=sambanova/Meta-Llama-3.3-70B-Instruct
pytest -s -v tests/integration/inference/test_vision_inference.py
--stack-config=sambanova
--vision-model=sambanova/Llama-3.2-11B-Vision-Instruct
# What does this PR do?
**Fixes** #1959
HuggingFace provides several loading paths that the datasets library can
use. My theory on why the test would previously fail intermittently is
because when calling `load_dataset(...)`, it may be trying several
options such as local cache, Hugging Face Hub, or a dataset script, or
other. There's one of these options that seem to work inconsistently in
the CI.
The HuggingFace datasets library relies on the `transformers` package to
load certain datasets such as `llamastack/simpleqa`, and by adding the
package, we can see the dataset is loaded consistently via the Hugging
Face Hub.
Please see PR in my fork demonstrating over 7 consecutive passes:
https://github.com/ChristianZaccaria/llama-stack/pull/1
**Some References:**
- https://github.com/huggingface/transformers/issues/8690
- https://huggingface.co/docs/datasets/en/loading
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
# What does this PR do?
We've disabled it for a while given that this hasn't worked as well as
expected given the frequent changes of llama_stack_client and how this
requires both repos to be in sync.
## Test Plan
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
The builtin implementation of code interpreter is not robust and has a
really weak sandboxing shell (the `bubblewrap` container). Given the
availability of better MCP code interpreter servers coming up, we should
use them instead of baking an implementation into the Stack and
expanding the vulnerability surface to the rest of the Stack.
This PR only does the removal. We will add examples with how to
integrate with MCPs in subsequent ones.
## Test Plan
Existing tests.
# What does this PR do?
The goal of this PR is code base modernization.
Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)
Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worth of note can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py` where reflection code was updated
to deal with "newer" types.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
This provides an initial [OpenAI Responses
API](https://platform.openai.com/docs/api-reference/responses)
implementation. The API is not yet complete, and this is more a
proof-of-concept to show how we can store responses in our key-value
stores and use them to support the Responses API concepts like
`previous_response_id`.
## Test Plan
I've added a new
`tests/integration/openai_responses/test_openai_responses.py` as part of
a test-driven development for this new API. I'm only testing this
locally with the remote-vllm provider for now, but it should work with
any of our inference providers since the only API it requires out of the
inference provider is the `openai_chat_completion` endpoint.
```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack build --template remote-vllm --image-type venv --run
```
```
LLAMA_STACK_CONFIG="http://localhost:8321" \
python -m pytest -v \
tests/integration/openai_responses/test_openai_responses.py \
--text-model "meta-llama/Llama-3.2-3B-Instruct"
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Implemetation of NeMO Datastore register, unregister API.
Open Issues:
- provider_id gets set to `localfs` in client.datasets.register() as it
is specified in routing_tables.py: DatasetsRoutingTable
see: #1860
Currently I have passed `"provider_id":"nvidia"` in metadata and have
parsed that in `DatasetsRoutingTable`
(Not the best approach, but just a quick workaround to make it work for
now.)
## Test Plan
- Unit test cases: `pytest
tests/unit/providers/nvidia/test_datastore.py`
```bash
========================================================== test session starts ===========================================================
platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0, asyncio-0.26.0, nbval-0.11.0, metadata-3.1.1, html-4.1.1, cov-6.1.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 2 items
tests/unit/providers/nvidia/test_datastore.py .. [100%]
============================================================ warnings summary ============================================================
====================================================== 2 passed, 1 warning in 0.84s ======================================================
```
cc: @dglogo, @mattf, @yanxi0830
# What does this PR do?
Allow users to name an agent and use the name in telemetry instead of
relying on randomly generated agent_ids. This improves the developer
experience by making it easier to find specific agents in telemetry
logs.
Closes#1832
## Test Plan
- Added tests to verify the agent name is properly stored and retrieved
- Ran `uv run -- pytest -v
tests/integration/telemetry/test_telemetry.py::test_agent_name_filtering`
from the root of the project and made sure the tests pass
- Ran `uv run -- pytest -v
tests/integration/telemetry/test_telemetry.py::test_agent_query_spans`
to verify existing code without agent names still works correctly
## Use Example
```
agent = Agent(
llama_stack_client,
model=text_model_id,
name="CustomerSupportAgent", # New parameter
instructions="You are a helpful customer support assistant"
)
session_id = agent.create_session(f"test-session-{uuid4()}")
```
## Implementation Notes
- Agent names are optional string parameters with no additional
validation
- Names are not required to be unique - multiple agents can have the
same name
- The agent_id remains the unique identifier for an agent
---------
Co-authored-by: raghotham <raghotham@gmail.com>
# What does this PR do?
TLDR: Changes needed to get 100% passing tests for OpenAI API
verification tests when run against Llama Stack with the `together`,
`fireworks`, and `openai` providers. And `groq` is better than before,
at 88% passing.
This cleans up the OpenAI API support for image message types
(specifically `image_url` types) and handling of the `response_format`
chat completion parameter. Both of these required a few more Pydantic
model definitions in our Inference API, just to move from the
not-quite-right stubs I had in place to something fleshed out to match
the actual OpenAI API specs.
As part of testing this, I also found and fixed a bug in the litellm
implementation of openai_completion and openai_chat_completion, so the
providers based on those should actually be working now.
The method `prepare_openai_completion_params` in
`llama_stack/providers/utils/inference/openai_compat.py` was improved to
actually recursively clean up input parameters, including handling of
lists, dicts, and dumping of Pydantic models to dicts. These changes
were required to get to 100% passing tests on the OpenAI API
verification against the `openai` provider.
With the above, the together.ai provider was passing as well as it is
without Llama Stack. But, since we have Llama Stack in the middle, I
took the opportunity to clean up the together.ai provider so that it now
also passes the OpenAI API spec tests we have at 100%. That means
together.ai is now passing our verification test better when using an
OpenAI client talking to Llama Stack than it is when hitting together.ai
directly, without Llama Stack in the middle.
And, another round of work for Fireworks to improve translation of
incoming OpenAI chat completion requests to Llama Stack chat completion
requests gets the fireworks provider passing at 100%. The server-side
fireworks.ai tool calling support with OpenAI chat completions and Llama
4 models isn't great yet, but by pointing the OpenAI clients at Llama
Stack's API we can clean things up and get everything working as
expected for Llama 4 models.
## Test Plan
### OpenAI API Verification Tests
I ran the OpenAI API verification tests as below and 100% of the tests
passed.
First, start a Llama Stack server that runs the `openai` provider with
the `gpt-4o` and `gpt-4o-mini` models deployed. There's not a template
setup to do this out of the box, so I added a
`tests/verifications/openai-api-verification-run.yaml` to do this.
First, ensure you have the necessary API key environment variables set:
```
export TOGETHER_API_KEY="..."
export FIREWORKS_API_KEY="..."
export OPENAI_API_KEY="..."
```
Then, run a Llama Stack server that serves up all these providers:
```
llama stack run \
--image-type venv \
tests/verifications/openai-api-verification-run.yaml
```
Finally, generate a new verification report against all these providers,
both with and without the Llama Stack server in the middle.
```
python tests/verifications/generate_report.py \
--run-tests \
--provider \
together \
fireworks \
groq \
openai \
together-llama-stack \
fireworks-llama-stack \
groq-llama-stack \
openai-llama-stack
```
You'll see that most of the configurations with Llama Stack in the
middle now pass at 100%, even though some of them do not pass at 100%
when hitting the backend provider's API directly with an OpenAI client.
### OpenAI Completion Integration Tests with vLLM:
I also ran the smaller `test_openai_completion.py` test suite (that's
not yet merged with the verification tests) on multiple of the
providers, since I had to adjust the method signature of
openai_chat_completion a bit and thus had to touch lots of these
providers to match. Here's the tests I ran there, all passing:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```
### OpenAI Completion Integration Tests with ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```
### OpenAI Completion Integration Tests with together.ai
```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" llama stack build --template together --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct-Turbo"
```
### OpenAI Completion Integration Tests with fireworks.ai
```
INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" llama stack build --template fireworks --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.1-8B-Instruct"
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This PR adds two methods to the Inference API:
- `batch_completion`
- `batch_chat_completion`
The motivation is for evaluations targeting a local inference engine
(like meta-reference or vllm) where batch APIs provide for a substantial
amount of acceleration.
Why did I not add this to `Api.batch_inference` though? That just
resulted in a _lot_ more book-keeping given the structure of Llama
Stack. Had I done that, I would have needed to create a notion of a
"batch model" resource, setup routing based on that, etc. This does not
sound ideal.
So what's the future of the batch inference API? I am not sure. Maybe we
can keep it for true _asynchronous_ execution. So you can submit
requests, and it can return a Job instance, etc.
## Test Plan
Run meta-reference-gpu using:
```bash
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000
export MODEL_PARALLEL_SIZE=4
export MAX_BATCH_SIZE=32
export MAX_SEQ_LEN=6144
LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu
```
Then run the batch inference test case.
# What does this PR do?
This stubs in some OpenAI server-side compatibility with three new
endpoints:
/v1/openai/v1/models
/v1/openai/v1/completions
/v1/openai/v1/chat/completions
This gives common inference apps using OpenAI clients the ability to
talk to Llama Stack using an endpoint like
http://localhost:8321/v1/openai/v1 .
The two "v1" instances in there isn't awesome, but the thinking is that
Llama Stack's API is v1 and then our OpenAI compatibility layer is
compatible with OpenAI V1. And, some OpenAI clients implicitly assume
the URL ends with "v1", so this gives maximum compatibility.
The openai models endpoint is implemented in the routing layer, and just
returns all the models Llama Stack knows about.
The following providers should be working with the new OpenAI
completions and chat/completions API:
* remote::anthropic (untested)
* remote::cerebras-openai-compat (untested)
* remote::fireworks (tested)
* remote::fireworks-openai-compat (untested)
* remote::gemini (untested)
* remote::groq-openai-compat (untested)
* remote::nvidia (tested)
* remote::ollama (tested)
* remote::openai (untested)
* remote::passthrough (untested)
* remote::sambanova-openai-compat (untested)
* remote::together (tested)
* remote::together-openai-compat (untested)
* remote::vllm (tested)
The goal to support this for every inference provider - proxying
directly to the provider's OpenAI endpoint for OpenAI-compatible
providers. For providers that don't have an OpenAI-compatible API, we'll
add a mixin to translate incoming OpenAI requests to Llama Stack
inference requests and translate the Llama Stack inference responses to
OpenAI responses.
This is related to #1817 but is a bit larger in scope than just chat
completions, as I have real use-cases that need the older completions
API as well.
## Test Plan
### vLLM
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```
### ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```
## Documentation
Run a Llama Stack distribution that uses one of the providers mentioned
in the list above. Then, use your favorite OpenAI client to send
completion or chat completion requests with the base_url set to
http://localhost:8321/v1/openai/v1 . Replace "localhost:8321" with the
host and port of your Llama Stack server, if different.
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Fixes issue #1537 that causes "500 Internal Server Error" when
unregistering a toolgroup
# (Closes#1537 )
## Test Plan
```console
$ pytest -s -v tests/integration/tool_runtime/test_registration.py --stack-config=ollama --env INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
INFO 2025-03-14 21:15:03,999 tests.integration.conftest:41 tests: Setting DISABLE_CODE_SANDBOX=1 for macOS
/opt/homebrew/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
===================================================== test session starts =====================================================
platform darwin -- Python 3.10.16, pytest-8.3.5, pluggy-1.5.0 -- /opt/homebrew/opt/python@3.10/bin/python3.10
cachedir: .pytest_cache
rootdir: /Users/paolo/Projects/aiplatform/llama-stack
configfile: pyproject.toml
plugins: asyncio-0.25.3, anyio-4.8.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 1 item
tests/integration/tool_runtime/test_registration.py::test_register_and_unregister_toolgroup[None-None-None-None-None] INFO 2025-03-14 21:15:04,478 llama_stack.providers.remote.inference.ollama.ollama:75 inference: checking
connectivity to Ollama at `http://localhost:11434`...
INFO 2025-03-14 21:15:05,350 llama_stack.providers.remote.inference.ollama.ollama:294 inference: Pulling embedding
model `all-minilm:latest` if necessary...
INFO: Started server process [78391]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 127.0.0.1:57424 - "GET /sse HTTP/1.1" 200 OK
INFO: 127.0.0.1:57434 - "GET /sse HTTP/1.1" 200 OK
INFO 2025-03-14 21:15:16,129 mcp.client.sse:51 uncategorized: Connecting to SSE endpoint: http://localhost:8000/sse
INFO: 127.0.0.1:57445 - "GET /sse HTTP/1.1" 200 OK
INFO 2025-03-14 21:15:16,146 mcp.client.sse:71 uncategorized: Received endpoint URL:
http://localhost:8000/messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b
INFO 2025-03-14 21:15:16,147 mcp.client.sse:140 uncategorized: Starting post writer with endpoint URL:
http://localhost:8000/messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b
INFO: 127.0.0.1:57447 - "POST /messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b HTTP/1.1" 202 Accepted
INFO: 127.0.0.1:57447 - "POST /messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b HTTP/1.1" 202 Accepted
INFO: 127.0.0.1:57447 - "POST /messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b HTTP/1.1" 202 Accepted
INFO 2025-03-14 21:15:16,155 mcp.server.lowlevel.server:535 uncategorized: Processing request of type
ListToolsRequest
PASSED
=============================================== 1 passed, 4 warnings in 12.17s ================================================
```
---------
Signed-off-by: Paolo Dettori <dettori@us.ibm.com>
# What does this PR do?
## Test Plan
export MODEL=accounts/fireworks/models/llama4-scout-instruct-basic;
LLAMA_STACK_CONFIG=verification pytest -s -v tests/integration/inference
--vision-model $MODEL --text-model $MODEL
# What does this PR do?
Move around bits. This makes the copies from llama-models _much_ easier
to maintain and ensures we don't entangle meta-reference specific
tidbits into llama-models code even by accident.
Also, kills the meta-reference-quantized-gpu distro and rolls
quantization deps into meta-reference-gpu.
## Test Plan
```
LLAMA_MODELS_DEBUG=1 \
with-proxy llama stack run meta-reference-gpu \
--env INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct \
--env INFERENCE_CHECKPOINT_DIR=<DIR> \
--env MODEL_PARALLEL_SIZE=4 \
--env QUANTIZATION_TYPE=fp8_mixed
```
Start a server with and without quantization. Point integration tests to
it using:
```
pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config http://localhost:8321 --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Running full Tool Calling required some updates to work e2e.
- Remove `python_start` and `python_end` tags
- Tool Call messages and Tool Resposne messages should end with
`<|eom|>`
- System prompt needed updates
```
You are a helpful assisant who can can answer general questions or invoke tools when necessary.
In addition to tool calls, you should also augment your responses by using the tool outputs.
```
### Test Plan
- Start server with meta-reference
```
LLAMA_STACK_DISABLE_VERSION_CHECK=1 LLAMA_MODELS_DEBUG=1 INFERENCE_MODEL=meta-llama/$MODEL llama stack run meta-reference-gpu
```
- Added **NEW** tests with 5 test cases for multi-turn tool calls
```
pytest -s -v --stack-config http://localhost:8321 tests/integration/inference/test_text_inference.py --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
- Also verified all vision and agent tests pass
Previously, the integration tests started the server, but never really
used it because `--stack-config=ollama` uses the ollama template and the
inline "llama stack as library" client, not the HTTP client.
This PR makes sure we test it both ways.
We also add agents tests to the mix.
## Test Plan
GitHub
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
* Added `--text-model` in example command.
* Added link to integration tests instruction and a note on specifying
models.
This is to avoid confusion when all tests are skipped because no model
is provided.
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>