# What does this PR do?
Adds a new endpoint that is compatible with OpenAI for embeddings api.
`/openai/v1/embeddings`
Added providers for OpenAI, LiteLLM and SentenceTransformer.
## Test Plan
```
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -sv tests/integration/inference/test_openai_embeddings.py --embedding-model all-MiniLM-L6-v2,text-embedding-3-small,gemini/text-embedding-004
```
# What does this PR do?
This adds a check to ensure we don't attempt to concatenate `None + str`
or `str + None` when building up our arguments for streaming tool calls
in the Responses API.
## Test Plan
All existing tests pass with this change.
Unit tests:
```
python -m pytest -s -v \
tests/unit/providers/agents/meta_reference/test_openai_responses.py
```
Integration tests:
```
llama stack run llama_stack/templates/together/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
python -m pytest -s -v \
tests/integration/agents/test_openai_responses.py \
--text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Verification tests:
```
llama stack run llama_stack/templates/together/run.yaml
pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Additionally, the manual example using Codex CLI from #2325 now succeeds
instead of throwing a 500 error.
Closes#2325
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
* Added support postgresql inference store
* Added 'oracle' template that demos how to config postgresql stores
(except for telemetry, which is not supported currently)
## Test Plan
llama stack build --template oracle --image-type conda --run
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -s -v tests/integration/
--text-model accounts/fireworks/models/llama-v3p3-70b-instruct -k
'inference_store'
# What does this PR do?
Updates sambanova inference to use strict as false in json_schema
structured output
## Test Plan
pytest -s -v tests/integration/inference/test_text_inference.py
--stack-config=sambanova
--text-model=sambanova/Meta-Llama-3.3-70B-Instruct
We must store the full (re-hydrated) input not just the original input
in the Response object. Of course, this is not very space efficient and
we should likely find a better storage scheme so that we can only store
unique entries in the database and then re-hydrate them efficiently
later. But that can be done safely later.
Closes https://github.com/meta-llama/llama-stack/issues/2299
## Test Plan
Unit test
# What does this PR do?
Previously prompt guard was hard coded to require cuda which prevented
it from being used on an instance without a cuda support.
This PR allows prompt guard to be configured to use either cpu or cuda.
[//]: # (If resolving an issue, uncomment and update the line below)
Closes [#2133](https://github.com/meta-llama/llama-stack/issues/2133)
## Test Plan (Edited after incorporating suggestion)
1) started stack configured with prompt guard as follows on a system
without a GPU
and validated prompt guard could be used through the APIs
2) validated on a system with a gpu (but without llama stack) that the
python selecting between cpu and cuda support returned the right value
when a cuda device was available.
3) ran the unit tests as per -
https://github.com/meta-llama/llama-stack/blob/main/tests/unit/README.md
[//]: # (## Documentation)
---------
Signed-off-by: Michael Dawson <mdawson@devrus.com>
# What does this PR do?
Use a more common pattern and known terminology from the ecosystem,
where Route is more approved than Endpoint.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Fix a bug in openai_compat where choices are not indexed correctly.
## Test Plan
Added a new test.
Rerun the failed inference_store tests:
llama stack run fireworks --image-type conda
pytest -s -v tests/integration/ --stack-config http://localhost:8321 -k
'test_inference_store' --text-model meta-llama/Llama-3.3-70B-Instruct
--count 10
This adds initial streaming support to the Responses API.
This PR makes sure that the _first_ inference call made to chat
completions streams out.
There's more to be done:
- tool call output tokens need to stream out when possible
- we need to loop through multiple rounds of inference and they all need
to stream out.
## Test Plan
Added a test. Executed as:
```
FIREWORKS_API_KEY=... \
pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
--provider=stack:fireworks --model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Then, started a llama stack fireworks distro and tested against it like
this:
```
OPENAI_API_KEY=blah \
pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
--base-url http://localhost:8321/v1/openai/v1 \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
# What does this PR do?
Handles the case where the vllm config `tls_verify` is set to `false` or
`true`.
Closes: https://github.com/meta-llama/llama-stack/issues/2283
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
It's not used anywhere in the build process. Ancient artifact from an
old attempt of using sub packages to build distros.
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
N/A
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Fixes an issue where running `llama stack build --template ollama
--image-type venv --run` fails with a TypeError when validating external
providers directory paths.
The error occurs because `os.path.exists()` is called with `Path(None)`
instead of converting it to a string first. This change ensures
consistent handling of `None` values for `external_providers_dir` across
both build and
[run](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/cli/stack/run.py#L134)
commands by using `str()` conversion before path validation.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```bash
INFERENCE_MODEL=llama3.2:3b uv run --with llama-stack llama stack build --template ollama --image-type venv --run
```
Command completes successfully without TypeError
[//]: # (## Documentation)
Two somewhat annoying fixes:
- we are going to index tools for non-MCP toolgroups always (like we
used to do). because there are just random assumptions in our tests,
etc. and I don't want to fix them right now
- we need to handle the funny case of toolgroups like
`builtin::rag/knowledge_search` where we added the tool name to use in
the toolgroup itself.
# What does this PR do?
The `tls_verify` can now receive a path to a certificate file if the
endpoint requires it.
Signed-off-by: Sébastien Han <seb@redhat.com>
When registering a MCP endpoint, we cannot list tools (like we used to)
since the MCP endpoint may be behind an auth wall. Registration can
happen much sooner (via run.yaml).
Instead, we do listing only when the _user_ actually calls listing.
Furthermore, we cache the list in-memory in the server. Currently, the
cache is not invalidated -- we may want to periodically re-list for MCP
servers. Note that they must call `list_tools` before calling
`invoke_tool` -- we use this critically.
This will enable us to list MCP servers in run.yaml
## Test Plan
Existing tests, updated tests accordingly.
# What does this PR do?
TSIA
`--enable-ui` to enable
## Test Plan
`llama stack run dev --image-type conda --enable-ui`
`localhost:8322` shows UI
llama stack run dev --image-type conda
`localhost:8322` does not work
# What does this PR do?
This is not part of the official OpenAI API, but we'll use this for the
logs UI.
In order to support more filtering options, I'm adopting the newly
introduced sql store in in place of the kv store.
## Test Plan
Added integration/unit tests.
The most interesting MCP servers are those with an authorization wall in
front of them. This PR uses the existing `provider_data` mechanism of
passing provider API keys for passing MCP access tokens (in fact,
arbitrary headers in the style of the OpenAI Responses API) from the
client through to the MCP server.
```
class MCPProviderDataValidator(BaseModel):
# mcp_endpoint => list of headers to send
mcp_headers: dict[str, list[str]] | None = None
```
Note how we must stuff the headers for all MCP endpoints into a single
"MCPProviderDataValidator". Unlike existing providers (e.g., Together
and Fireworks for inference) where we could name the provider api keys
clearly (`together_api_key`, `fireworks_api_key`), we cannot name these
keys for MCP. We have a single generic MCP provider which can serve
multiple "toolgroups". So we use a dict to combine all the headers for
all MCP endpoints you may want to use in an agentic call.
## Test Plan
See the added integration test for usage.
# What does this PR do?
Since https://github.com/meta-llama/llama-stack/pull/2193 switched to
openai sdk, we need to strip 'openai/' from the model_id
## Test Plan
start server with openai provider and send a chat completion call
# What does this PR do?
* Provide sqlite implementation of the APIs introduced in
https://github.com/meta-llama/llama-stack/pull/2145.
* Introduced a SqlStore API: llama_stack/providers/utils/sqlstore/api.py
and the first Sqlite implementation
* Pagination support will be added in a future PR.
## Test Plan
Unit test on sql store:
<img width="1005" alt="image"
src="https://github.com/user-attachments/assets/9b8b7ec8-632b-4667-8127-5583426b2e29"
/>
Integration test:
```
INFERENCE_MODEL="llama3.2:3b-instruct-fp16" llama stack build --template ollama --image-type conda --run
```
```
LLAMA_STACK_CONFIG=http://localhost:5001 INFERENCE_MODEL="llama3.2:3b-instruct-fp16" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-fp16" -k 'inference_store and openai'
```
# What does this PR do?
Includes SambaNova safety adaptor to use the sambanova cloud served
Meta-Llama-Guard-3-8B
minor updates in sambanova docs
## Test Plan
pytest -s -v tests/integration/safety/test_safety.py
--stack-config=sambanova --safety-shield=sambanova/Meta-Llama-Guard-3-8B