# What does this PR do?
As described in #3134, a langchain example works against OpenAI's
responses implementation, but not against llama stack's. This turned out to be due
to the order of the inputs. The langchain example has the two function
call outputs first, followed by each call result in turn. This appears to
be valid, as it is accepted by OpenAI's implementation. However, in llama stack
these inputs are converted to chat completion inputs, and the resulting
order for that API is not accepted by OpenAI.
This PR fixes the issue by ensuring that the converted chat completions
inputs are in the expected order.
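Not the exact patch, but a minimal sketch of the idea with hypothetical message dicts: after conversion, each assistant message carrying `tool_calls` must be immediately followed by the corresponding tool result messages.
```
# Hypothetical sketch: reorder converted chat-completion messages so that the
# assistant message carrying tool_calls is immediately followed by the tool
# results it refers to (the order the chat completions API expects).
def reorder_tool_messages(messages: list[dict]) -> list[dict]:
    tool_results = {
        m["tool_call_id"]: m for m in messages if m.get("role") == "tool"
    }
    ordered: list[dict] = []
    for m in messages:
        if m.get("role") == "tool":
            continue  # re-inserted right after their assistant message below
        ordered.append(m)
        if m.get("role") == "assistant":
            for call in m.get("tool_calls", []):
                result = tool_results.get(call["id"])
                if result:
                    ordered.append(result)
    return ordered
```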
Closes #3134
## Test Plan
Added unit and integration tests. Verified this fixes original issue as
reported.
---------
Signed-off-by: Gordon Sim <gsim@redhat.com>
# What does this PR do?
Context: https://github.com/meta-llama/llama-stack/issues/2937
The API design is inspired by existing offerings, but is not exactly the
same (a usage sketch follows the list below):
* `top_n` as the parameter to control the number of results, instead of
`top_k`, since `n` is the conventional letter for controlling a count
* `truncation` bool instead of `max_token_per_doc`, since we should just
handle truncation automatically depending on model capability,
instead of having the user set the context length manually.
* `data` field in the response, to be consistent with other OpenAI APIs
(though they don't have a rerank API). Also, it is one less name to
learn in the API.
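A hedged usage sketch of the shape described above; the endpoint path and the `query`/`items` field names are illustrative assumptions, not the final API.
```
# Illustrative only: a request/response shape matching the design points above.
# The route and request field names are assumptions, not the final API.
import requests

resp = requests.post(
    "http://localhost:8321/v1alpha/inference/rerank",  # hypothetical route
    json={
        "model": "example/reranker-model",  # placeholder model id
        "query": "What is the capital of France?",
        "items": ["Paris is the capital of France.", "Berlin is in Germany."],
        "top_n": 1,  # number of results (not `top_k`)
        # truncation is handled automatically; no `max_token_per_doc` knob
    },
)
# Response carries a `data` field, consistent with other OpenAI-style APIs,
# e.g. {"data": [{"index": 0, "relevance_score": 0.98}]} (fields illustrative).
print(resp.json()["data"])
```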
## Test Plan
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
This PR renames the categories of llama_stack loggers. It aligns logging
categories with the package names and addresses reviews from the initial
https://github.com/meta-llama/llama-stack/pull/2868. This is a follow-up
to #3061.
Replaces https://github.com/meta-llama/llama-stack/pull/2868
Part of https://github.com/meta-llama/llama-stack/issues/2865
cc @leseb @rhuss
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
Currently the embedding integration test cases fail due to a
mismatch in the expected error type. This PR fixes the embedding integration
tests by correcting the error type.
## Test Plan
```
pytest -s -v tests/integration/inference/test_embedding.py --stack-config="inference=nvidia" --embedding-model="nvidia/llama-3.2-nv-embedqa-1b-v2" --env NVIDIA_API_KEY={nvidia_api_key} --env NVIDIA_BASE_URL="https://integrate.api.nvidia.com"
```
# What does this PR do?
- Documentation update and fix for the NVIDIA Inference provider.
- Update `run_moderation` in the safety API with a
`NotImplementedError` placeholder; otherwise, initializing the NVIDIA
inference client will raise an error.
## Test Plan
N/A
# What does this PR do?
Handles MCP tool calls in a previous response
Closes #3105
## Test Plan
Made a call to create a response with a tool call, then made a second call with
the first linked through `previous_response_id`. Did not get an error.
Also added a unit test.
Signed-off-by: Gordon Sim <gsim@redhat.com>
# What does this PR do?
We noticed that when llama-stack is running for a long time, we would
run into database errors when trying to run messages through the agent
(which we configured to persist against postgres), seemingly due to the
database connections being stale or disconnected. This commit adds
`pool_pre_ping=True` to the SQLAlchemy engine creation to help mitigate
this issue by checking the connection before using it, and
re-establishing it if necessary.
More information in:
https://docs.sqlalchemy.org/en/20/core/pooling.html#dealing-with-disconnects
We're also open to other suggestions on how to handle this issue; this
PR is just one proposal.
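For reference, a minimal sketch of the change with SQLAlchemy's async engine (the DSN and surrounding wiring are placeholders):
```
# Sketch only: enable pessimistic disconnect handling so stale pooled
# connections are tested with a lightweight ping before being handed out.
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://user:password@localhost:5432/llamastack",  # placeholder DSN
    pool_pre_ping=True,  # ping the connection and reconnect transparently if it is stale
)
```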
## Test Plan
We have not tested it yet (we're in the process of doing that) and we're
hoping it's going to resolve our issue.
# What does this PR do?
NVIDIA asymmetric embedding models (e.g.,
`nvidia/llama-3.2-nv-embedqa-1b-v2`) require an `input_type` parameter
not present in the standard OpenAI embeddings API. This PR adds
`input_type="query"` as the default and updates the documentation to suggest
using the `embedding` API for passage embeddings.
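As a rough illustration, assuming the parameter is passed through `extra_body` to the NVIDIA endpoint (the exact plumbing in the provider may differ):
```
# Illustrative sketch: asymmetric NVIDIA embedding models need an input_type
# ("query" vs "passage") that the standard OpenAI embeddings API does not define.
from openai import OpenAI

client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key="...")

emb = client.embeddings.create(
    model="nvidia/llama-3.2-nv-embedqa-1b-v2",
    input=["What is machine learning?"],
    extra_body={"input_type": "query"},  # the default this PR applies
)
print(len(emb.data[0].embedding))
```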
Resolves #2892
## Test Plan
```
pytest -s -v tests/integration/inference/test_openai_embeddings.py --stack-config="inference=nvidia" --embedding-model="nvidia/llama-3.2-nv-embedqa-1b-v2" --env NVIDIA_API_KEY={nvidia_api_key} --env NVIDIA_BASE_URL="https://integrate.api.nvidia.com"
```
# What does this PR do?
This PR adds a step in pre-commit to enforce using `llama_stack` logger.
Currently, various parts of the codebase use different loggers. Since a
custom `llama_stack` logger exists and is already used in the codebase, it is
better to standardize on it.
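A minimal sketch of the standardized pattern, assuming the helper lives at `llama_stack.log.get_logger` (adjust if the module path differs):
```
# Sketch of the standardized pattern (module path assumed).
from llama_stack.log import get_logger

logger = get_logger(name=__name__, category="core")

logger.info("provider initialized")  # instead of logging.getLogger(...)
```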
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>
# What does this PR do?
Add CodeScanner implementations
## Test Plan
`SAFETY_MODEL=CodeScanner LLAMA_STACK_CONFIG=starter uv run pytest -v
tests/integration/safety/test_safety.py
--text-model=llama3.2:3b-instruct-fp16
--embedding-model=all-MiniLM-L6-v2 --safety-shield=ollama`
This PR needs to land after
https://github.com/meta-llama/llama-stack/pull/3098
This OpenAI client release
0843a11164
ends up breaking litellm
169a17400f/litellm/types/llms/openai.py (L40)
Update the dependency pin. Also make the imports a bit more defensive
in case something else during `llama stack build` ends up moving
openai to a previous version.
## Test Plan
Run pre-release script integration tests.
Replace chat_completion calls with openai_chat_completion to eliminate
dependency on legacy inference APIs.
# What does this PR do?
Closes #3067
## Test Plan
# What does this PR do?
Adds proper streaming events for MCP tool listing (`mcp_list_tools.in_progress` and `mcp_list_tools.completed`). Also refactors things a bit more.
## Test Plan
Verified existing integration tests pass with the refactored code. The test `test_response_streaming_multi_turn_tool_execution` has been updated to check for the new MCP list tools streaming events
# What does this PR do?
Refactors the OpenAI response conversion utilities by moving helper functions from `openai_responses.py` to `utils.py`. Adds unit tests.
# What does this PR do?
Refactors the OpenAI responses implementation by extracting streaming and tool execution logic into separate modules (a rough skeleton follows the list below). This improves code organization by:
1. Creating a new `StreamingResponseOrchestrator` class in `streaming.py` to handle the streaming response generation logic
2. Moving tool execution functionality to a dedicated `ToolExecutor` class in `tool_executor.py`
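A rough skeleton of the resulting layout; class and method signatures are illustrative, not the exact ones in the PR.
```
# Illustrative skeletons only; the real classes live in streaming.py and
# tool_executor.py with richer signatures and state.
from collections.abc import AsyncIterator


class ToolExecutor:
    """Executes a single tool call and yields streaming events while doing so."""

    async def execute(self, tool_call) -> AsyncIterator[dict]:
        yield {"type": "tool_call.in_progress"}
        # ... run the tool ...
        yield {"type": "tool_call.completed"}


class StreamingResponseOrchestrator:
    """Drives response generation, delegating tool execution to ToolExecutor."""

    def __init__(self, tool_executor: ToolExecutor):
        self.tool_executor = tool_executor

    async def create_response(self, request) -> AsyncIterator[dict]:
        yield {"type": "response.created"}
        # ... stream model output, dispatch tool calls via self.tool_executor ...
        yield {"type": "response.completed"}
```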
## Test Plan
Existing tests
The OpenAI compatibility layer was incorrectly importing
ChatCompletionMessageToolCallParam instead of the
ChatCompletionMessageFunctionToolCall class. This caused "Cannot
instantiate typing.Union" errors when processing agent requests with
tool calls.
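The gist of the fix, sketched; the import path for the concrete class in newer `openai` releases is an assumption.
```
# Before: ChatCompletionMessageToolCallParam resolves to a typing.Union alias
# in newer openai releases, so instantiating it raises
# "Cannot instantiate typing.Union".
# from openai.types.chat import ChatCompletionMessageToolCallParam

# After: import the concrete function tool-call class instead (path assumed).
from openai.types.chat import ChatCompletionMessageFunctionToolCall

tool_call = ChatCompletionMessageFunctionToolCall(
    id="call_123",
    type="function",
    function={"name": "get_weather", "arguments": '{"city": "Paris"}'},
)
```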
Closes: #3141
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Adds content part streaming events to the OpenAI-compatible Responses API to support more granular streaming of response content. This introduces:
1. New schema types for content parts: `OpenAIResponseContentPart` with variants for text output and refusals
2. New streaming event types:
- `OpenAIResponseObjectStreamResponseContentPartAdded` for when content parts begin
- `OpenAIResponseObjectStreamResponseContentPartDone` for when content parts complete
3. Implementation in the reference provider to emit these events during streaming responses. It also emits MCP arguments just like function call ones. (A sketch of the emitted event sequence follows below.)
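A sketch of the emitted sequence for a plain text answer; event names are abbreviated and illustrative.
```
# Sketch of the streamed event order for a text answer (abbreviated fields).
events = [
    {"type": "response.created"},
    {"type": "response.content_part.added"},         # content part begins
    {"type": "response.output_text.delta", "delta": "Hel"},
    {"type": "response.output_text.delta", "delta": "lo"},
    {"type": "response.content_part.done"},          # content part completes
    {"type": "response.completed"},
]
```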
## Test Plan
Updated existing streaming tests to verify content part events are properly emitted
# What does this PR do?
Enhances tool execution streaming by adding support for real-time progress events during tool calls. This implementation adds streaming events for MCP and web search tools, including in-progress, searching, completed, and failed states.
The refactored `_execute_tool_call` method now returns an async iterator that yields streaming events throughout the tool execution lifecycle.
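A simplified sketch of the new shape; event names and the `ctx.invoke_tool` helper are illustrative.
```
# Sketch: tool execution as an async generator that yields progress events
# before the final result, instead of returning a single value.
from collections.abc import AsyncIterator


async def _execute_tool_call(tool_call, ctx) -> AsyncIterator[dict]:
    yield {"type": "mcp_call.in_progress", "item_id": tool_call.id}
    try:
        result = await ctx.invoke_tool(tool_call)  # hypothetical helper
        yield {"type": "mcp_call.completed", "item_id": tool_call.id, "result": result}
    except Exception as exc:
        yield {"type": "mcp_call.failed", "item_id": tool_call.id, "error": str(exc)}
```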
## Test Plan
Updated the integration test `test_response_streaming_multi_turn_tool_execution` to verify the presence and structure of new streaming events, including:
- Checking for MCP in-progress and completed events
- Verifying that progress events contain required fields (item_id, output_index, sequence_number)
- Ensuring completed events have the necessary sequence_number field
# What does this PR do?
To be compliant with the Llama model policies, just return the
categories as-is from the provider; we will lose the OpenAI compat in the
moderations API response.
## Test Plan
`SAFETY_MODEL=llama-guard3:8b LLAMA_STACK_CONFIG=starter uv run pytest
-v tests/integration/safety/test_safety.py
--text-model=llama3.2:3b-instruct-fp16
--embedding-model=all-MiniLM-L6-v2 --safety-shield=ollama`
# What does this PR do?
1. Updates `AgentPersistence.list_sessions()` to properly filter out
`Turn` keys from `Session` keys (see the sketch below).
2. Adds a suite of unit tests to confirm the `list_sessions()` behavior
and tests the failed sample in
https://github.com/meta-llama/llama-stack/issues/3048
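A simplified sketch of the filtering idea, assuming `Session` keys look like `session:<agent_id>:<session_id>` and `Turn` keys add a further suffix (the real key layout may differ):
```
# Illustrative only: keep keys that address a Session directly and drop the
# longer Turn keys that share the same prefix (key layout assumed).
def filter_session_keys(keys: list[str], agent_id: str) -> list[str]:
    prefix = f"session:{agent_id}:"
    session_keys = []
    for key in keys:
        suffix = key[len(prefix):] if key.startswith(prefix) else None
        # A Session key has exactly one path component left; Turn keys have more.
        if suffix and ":" not in suffix:
            session_keys.append(key)
    return session_keys
```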
## Fixes https://github.com/meta-llama/llama-stack/issues/3048
## Test Plan
Unit tests added.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Some fixes to MCP tests, and a bunch of fixes for Vector providers.
I also enabled a bunch of Vector IO tests to be used with
`LlamaStackLibraryClient`
## Test Plan
Run Responses tests with llama stack library client:
```
pytest -s -v tests/integration/non_ci/responses/ --stack-config=server:starter \
--text-model openai/gpt-4o \
--embedding-model=sentence-transformers/all-MiniLM-L6-v2 \
-k "client_with_models"
```
Do the same with `-k openai_client`
The rest should be taken care of by CI.
Well our Responses tests use it so we better include it in the API, no?
I discovered it because I want to make sure `llama-stack-client` can be
used always instead of `openai-python` as the client (we do want to be
_truly_ compatible.)
# What does this PR do?
This PR addresses an issue where `PromptGuardSafetyImpl` was an
incomplete implementation of an abstract class. The class was missing
the required run_moderation method from its parent interface.
Currently, running `pre-commit` locally fails with the error below.
```
llama_stack/providers/inline/safety/prompt_guard/__init__.py:15: error: Cannot instantiate abstract class "PromptGuardSafetyImpl" with abstract attribute "run_moderation" [abstract]
Found 1 error in 1 file (checked 410 source files)
```
This PR fixes the issue as follows (a sketch of the stub follows the list):
- Added the missing run_moderation method to PromptGuardSafetyImpl
- Method raises NotImplementedError with appropriate message indicating
this functionality is not implemented for PromptGuard
- This allows the class to be properly instantiated while clearly
indicating the limitation
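The added stub looks roughly like this (bases and parameter list abbreviated in the sketch):
```
# Sketch of the added method (bases and parameters abbreviated). Raising
# NotImplementedError satisfies the abstract interface while making the
# missing capability explicit.
class PromptGuardSafetyImpl:  # real class inherits the Safety interface
    async def run_moderation(self, input, model):
        raise NotImplementedError("run_moderation is not implemented for Prompt Guard")
```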
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
## Test Plan
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
- Add new Vertex AI remote inference provider with litellm integration
- Support for Gemini models through Google Cloud Vertex AI platform
- Uses Google Cloud Application Default Credentials (ADC) for
authentication
- Added VertexAI models: gemini-2.5-flash, gemini-2.5-pro,
gemini-2.0-flash.
- Updated provider registry to include vertexai provider
- Updated starter template to support Vertex AI configuration
- Added comprehensive documentation and sample configuration
Relates to https://github.com/meta-llama/llama-stack/issues/2747
## Test Plan
Signed-off-by: Eran Cohen <eranco@redhat.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
Update Milvus doc on using search modes.
## Test Plan
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
This PR adds Flash-Lite 2.0 and 2.5 models to the Gemini inference provider.
Closes #3046
## Test Plan
I was not able to locate any existing test for this provider, so I
performed manual testing. But the change is really trivial and
straightforward.
# What does this PR do?
This PR implements hybrid search for Milvus DB based on the built-in
Milvus support.
To test:
```
pytest tests/unit/providers/vector_io/remote/test_milvus.py -v -s
--tb=long --disable-warnings --asyncio-mode=auto
```
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
# What does this PR do?
This PR adds an OpenAI-compatible moderations API. Currently it is only
implemented for the Llama Guard safety provider.
Image support, expansion to other safety providers, and deprecation of
`run_shield` will be next steps.
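A hedged usage sketch through the OpenAI client pointed at a local server; the base URL path and model id are placeholders.
```
# Sketch: calling the new OpenAI-compatible moderations endpoint through the
# OpenAI client against a local llama-stack server (URL path and model id
# are placeholders, not guaranteed to match the final wiring).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

result = client.moderations.create(
    model="llama-guard3:8b",
    input="How can I hack into someone's computer?",
)
print(result.results[0].flagged, result.results[0].categories)
```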
## Test Plan
Added 2 new tests with safe/unsafe text prompt examples for the new
OpenAI-compatible moderations API usage:
`SAFETY_MODEL=llama-guard3:8b LLAMA_STACK_CONFIG=starter uv run pytest
-v tests/integration/safety/test_safety.py
--text-model=llama3.2:3b-instruct-fp16
--embedding-model=all-MiniLM-L6-v2 --safety-shield=ollama`
(Had some issues with the previous PR
https://github.com/meta-llama/llama-stack/pull/2994 while updating and
accidentally closed it, so I reopened a new one.)
# What does this PR do?
I found a few issues while adding new metrics for various APIs:
currently metrics are only propagated in `chat_completion` and
`completion`.
Since most providers use the `openai_..` routes as the default in
`llama-stack-client inference chat-completion`, metrics are currently
not working as expected.
In order to get them working, the following had to be done:
1. get the completion as usual
2. use new `openai_` versions of the metric gathering functions which
use `.usage` from the `OpenAI..` response types to gather the metrics,
which are already populated
3. define a `stream_generator` which counts the tokens and computes the
metrics (only for stream=True), as sketched below
4. add metrics to the response
NOTE: I could not add metrics to `openai_completion` where stream=True
because that ONLY returns an `OpenAICompletion`, not an AsyncGenerator
that we can manipulate.
Also: acquire the lock, and add the event to the span as the other `_log_...`
methods do.
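A simplified sketch of the stream wrapper idea; names and metric plumbing are illustrative.
```
# Sketch only: wrap the provider's stream, forward chunks unchanged, and emit
# token metrics once a chunk carrying .usage arrives (stream=True path).
async def stream_generator(stream, emit_metrics):
    async for chunk in stream:
        if getattr(chunk, "usage", None) is not None:
            # usage is populated on the final chunk when include_usage is set
            await emit_metrics(
                prompt_tokens=chunk.usage.prompt_tokens,
                completion_tokens=chunk.usage.completion_tokens,
                total_tokens=chunk.usage.total_tokens,
            )
        yield chunk
```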
some new output:
`llama-stack-client inference chat-completion --message hi`
<img width="2416" height="425" alt="Screenshot 2025-07-16 at 8 28 20 AM"
src="https://github.com/user-attachments/assets/ccdf1643-a184-4ddd-9641-d426c4d51326"
/>
and in the client:
<img width="763" height="319" alt="Screenshot 2025-07-16 at 8 28 32 AM"
src="https://github.com/user-attachments/assets/6bceb811-5201-47e9-9e16-8130f0d60007"
/>
These were not previously being recorded, nor were they being printed to
the server, due to improper console sink handling.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
A bunch of miscellaneous cleanup focusing on tests, but ended up
speeding up starter distro substantially.
- Pulled llama stack client init for tests into `pytest_sessionstart` so
it does not clobber output
- Profiling of that told me where we were doing lots of heavy imports
for starter, so made them lazy
- starter now starts 20+ seconds faster on my Mac
- A few other smallish refactors for `compat_client`
# What does this PR do?
Extend the Shields Protocol and implement the capability to unregister
previously registered shields and CLI for shields management.
Closes #2581
## Test Plan
First off, test the API for shields:
1. Install and start Ollama:
`ollama serve`
2. Pull Llama Guard Model in Ollama:
`ollama pull llama-guard3:8b`
3. Configure env variables:
```
export ENABLE_OLLAMA=ollama
export OLLAMA_URL=http://localhost:11434
```
4. Build Llama Stack distro:
`llama stack build --template starter --image-type venv `
5. Start Llama Stack server:
`llama stack run starter --port 8321`
6. Check if Ollama model is available:
`curl -X GET http://localhost:8321/v1/models | jq '.data[] |
select(.provider_id=="ollama")'`
7. Register a new Shield using Ollama provider:
```
curl -X POST http://localhost:8321/v1/shields \
-H "Content-Type: application/json" \
-d '{
"shield_id": "test-shield",
"provider_id": "llama-guard",
"provider_shield_id": "ollama/llama-guard3:8b",
"params": {}
}'
```
`{"identifier":"test-shield","provider_resource_id":"ollama/llama-guard3:8b","provider_id":"llama-guard","type":"shield","owner":{"principal":"","attributes":{}},"params":{}}%
`
8. Check if shield was registered:
`curl -X GET http://localhost:8321/v1/shields/test-shield`
`{"identifier":"test-shield","provider_resource_id":"ollama/llama-guard3:8b","provider_id":"llama-guard","type":"shield","owner":{"principal":"","attributes":{}},"params":{}}%
`
9. Run shield:
```
curl -X POST http://localhost:8321/v1/safety/run-shield \
-H "Content-Type: application/json" \
-d '{
"shield_id": "test-shield",
"messages": [
{
"role": "user",
"content": "How can I hack into someone computer?"
}
],
"params": {}
}'
```
`{"violation":{"violation_level":"error","user_message":"I can't answer
that. Can I help with something
else?","metadata":{"violation_type":"S2"}}}% `
10. Unregister shield:
`curl -X DELETE http://localhost:8321/v1/shields/test-shield`
`null% `
11. Verify shield was deleted:
`curl -X GET http://localhost:8321/v1/shields/test-shield`
`{"detail":"Invalid value: Shield 'test-shield' not found"}%`
All tests passed ✅
```
========================================================================== 430 passed, 194 warnings in 19.54s ==========================================================================
/Users/iamiller/GitHub/llama-stack/.venv/lib/python3.12/site-packages/litellm/llms/custom_httpx/async_client_cleanup.py:78: RuntimeWarning: coroutine 'close_litellm_async_clients' was never awaited
loop.close()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Wrote HTML report to htmlcov-3.12/index.html
```
# What does this PR do?
1. Creates a new `SessionNotFoundError` class (sketched below)
2. Implements the new class where appropriate
Relates to #2379
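Roughly (module placement and base class are illustrative):
```
# Sketch of the new error type; actual module placement and base class may differ.
class SessionNotFoundError(ValueError):
    """Raised when an agent session id cannot be found."""

    def __init__(self, session_id: str) -> None:
        super().__init__(f"Session '{session_id}' not found")
```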
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>