The OpenAI compatibility layer was incorrectly importing
ChatCompletionMessageToolCallParam instead of the
ChatCompletionMessageFunctionToolCall class. This caused "Cannot
instantiate typing.Union" errors when processing agent requests with
tool calls.
Closes: #3141
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Adds content part streaming events to the OpenAI-compatible Responses API to support more granular streaming of response content. This introduces:
1. New schema types for content parts: `OpenAIResponseContentPart` with variants for text output and refusals
2. New streaming event types:
- `OpenAIResponseObjectStreamResponseContentPartAdded` for when content parts begin
- `OpenAIResponseObjectStreamResponseContentPartDone` for when content parts complete
3. Implementation in the reference provider to emit these events during streaming responses. Also emits MCP arguments just like function call ones.
## Test Plan
Updated existing streaming tests to verify content part events are properly emitted
# What does this PR do?
Enhances tool execution streaming by adding support for real-time progress events during tool calls. This implementation adds streaming events for MCP and web search tools, including in-progress, searching, completed, and failed states.
The refactored `_execute_tool_call` method now returns an async iterator that yields streaming events throughout the tool execution lifecycle.
## Test Plan
Updated the integration test `test_response_streaming_multi_turn_tool_execution` to verify the presence and structure of new streaming events, including:
- Checking for MCP in-progress and completed events
- Verifying that progress events contain required fields (item_id, output_index, sequence_number)
- Ensuring completed events have the necessary sequence_number field
# What does this PR do?
1. Updates `AgentPersistence.list_sessions()` to properly filter out
`Turn` keys from `Session` keys.
2. Adds a suite of unit tests to confirm the `list_sessions()` behavior
and tests the failed sample in
https://github.com/meta-llama/llama-stack/issues/3048
## Fixes https://github.com/meta-llama/llama-stack/issues/3048
## Test Plan
Unit tests added.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Some fixes to MCP tests. And a bunch of fixes for Vector providers.
I also enabled a bunch of Vector IO tests to be used with
`LlamaStackLibraryClient`
## Test Plan
Run Responses tests with llama stack library client:
```
pytest -s -v tests/integration/non_ci/responses/ --stack-config=server:starter \
--text-model openai/gpt-4o \
--embedding-model=sentence-transformers/all-MiniLM-L6-v2 \
-k "client_with_models"
```
Do the same with `-k openai_client`
The rest should be taken care of by CI.
# What does this PR do?
the minimum python version for the project was bumped to 3.12 a couple
months ago, but there remains some artifacts in the repo suggesting we
support >=3.10
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
- Add new Vertex AI remote inference provider with litellm integration
- Support for Gemini models through Google Cloud Vertex AI platform
- Uses Google Cloud Application Default Credentials (ADC) for
authentication
- Added VertexAI models: gemini-2.5-flash, gemini-2.5-pro,
gemini-2.0-flash.
- Updated provider registry to include vertexai provider
- Updated starter template to support Vertex AI configuration
- Added comprehensive documentation and sample configuration
<!-- If resolving an issue, uncomment and update the line below -->
relates to https://github.com/meta-llama/llama-stack/issues/2747
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Eran Cohen <eranco@redhat.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
This PR kills the verifications infrastructure which is no longer used.
It was relocated to the `llama-stack-evals`
(https://github.com/meta-llama/llama-stack-evals) repository previously.
Responses tests used this infrastructure but that wasn't quite
necessary, just a little useful back when @bbrownin introduced the
tests. On Discord, we agreed that tests can be moved to our regular
integrations test infra.
## Test Plan
Some tests currently do fail (although they run!) I will send a
follow-up PR which makes them all pass.
# What does this PR do?
This PR implements hybrid search for Milvus DB based on the inbuilt
milvus support.
To test:
```
pytest tests/unit/providers/vector_io/remote/test_milvus.py -v -s
--tb=long --disable-warnings --asyncio-mode=auto
```
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
# What does this PR do?
This PR adds Open AI Compatible moderations api. Currently only
implementing for llama guard safety provider
Image support, expand to other safety providers and Deprecation of
run_shield will be next steps.
## Test Plan
Added 2 new tests for safe/ unsafe text prompt examples for the new open
ai compatible moderations api usage
`SAFETY_MODEL=llama-guard3:8b LLAMA_STACK_CONFIG=starter uv run pytest
-v tests/integration/safety/test_safety.py
--text-model=llama3.2:3b-instruct-fp16
--embedding-model=all-MiniLM-L6-v2 --safety-shield=ollama`
(Had some issue with previous PR
https://github.com/meta-llama/llama-stack/pull/2994 while updating and
accidentally close it , reopened new one )
# What does this PR do?
I found a few issues while adding new metrics for various APIs:
currently metrics are only propagated in `chat_completion` and
`completion`
since most providers use the `openai_..` routes as the default in
`llama-stack-client inference chat-completion`, metrics are currently
not working as expected.
in order to get them working the following had to be done:
1. get the completion as usual
2. use new `openai_` versions of the metric gathering functions which
use `.usage` from the `OpenAI..` response types to gather the metrics
which are already populated.
3. define a `stream_generator` which counts the tokens and computes the
metrics (only for stream=True)
5. add metrics to response
NOTE: I could not add metrics to `openai_completion` where stream=True
because that ONLY returns an `OpenAICompletion` not an AsyncGenerator
that we can manipulate.
acquire the lock, and add event to the span as the other `_log_...`
methods do
some new output:
`llama-stack-client inference chat-completion --message hi`
<img width="2416" height="425" alt="Screenshot 2025-07-16 at 8 28 20 AM"
src="https://github.com/user-attachments/assets/ccdf1643-a184-4ddd-9641-d426c4d51326"
/>
and in the client:
<img width="763" height="319" alt="Screenshot 2025-07-16 at 8 28 32 AM"
src="https://github.com/user-attachments/assets/6bceb811-5201-47e9-9e16-8130f0d60007"
/>
these were not previously being recorded nor were they being printed to
the server due to the improper console sink handling
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
1. Introduce new base custom exception class `ResourceNotFoundError`
2. All other "not found" exception classes now inherit from
`ResourceNotFoundError`
Closes#3030
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
A bunch of miscellaneous cleanup focusing on tests, but ended up
speeding up starter distro substantially.
- Pulled llama stack client init for tests into `pytest_sessionstart` so
it does not clobber output
- Profiling of that told me where we were doing lots of heavy imports
for starter, so lazied them
- starter now starts 20seconds+ faster on my Mac
- A few other smallish refactors for `compat_client`
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
Extend the Shields Protocol and implement the capability to unregister
previously registered shields and CLI for shields management.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes#2581
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
First of, test API for shields
1. Install and start Ollama:
`ollama serve`
2. Pull Llama Guard Model in Ollama:
`ollama pull llama-guard3:8b`
3. Configure env variables:
```
export ENABLE_OLLAMA=ollama
export OLLAMA_URL=http://localhost:11434
```
4. Build Llama Stack distro:
`llama stack build --template starter --image-type venv `
5. Start Llama Stack server:
`llama stack run starter --port 8321`
6. Check if Ollama model is available:
`curl -X GET http://localhost:8321/v1/models | jq '.data[] |
select(.provider_id=="ollama")'`
7. Register a new Shield using Ollama provider:
```
curl -X POST http://localhost:8321/v1/shields \
-H "Content-Type: application/json" \
-d '{
"shield_id": "test-shield",
"provider_id": "llama-guard",
"provider_shield_id": "ollama/llama-guard3:8b",
"params": {}
}'
```
`{"identifier":"test-shield","provider_resource_id":"ollama/llama-guard3:8b","provider_id":"llama-guard","type":"shield","owner":{"principal":"","attributes":{}},"params":{}}%
`
8. Check if shield was registered:
`curl -X GET http://localhost:8321/v1/shields/test-shield`
`{"identifier":"test-shield","provider_resource_id":"ollama/llama-guard3:8b","provider_id":"llama-guard","type":"shield","owner":{"principal":"","attributes":{}},"params":{}}%
`
9. Run shield:
```
curl -X POST http://localhost:8321/v1/safety/run-shield \
-H "Content-Type: application/json" \
-d '{
"shield_id": "test-shield",
"messages": [
{
"role": "user",
"content": "How can I hack into someone computer?"
}
],
"params": {}
}'
```
`{"violation":{"violation_level":"error","user_message":"I can't answer
that. Can I help with something
else?","metadata":{"violation_type":"S2"}}}% `
10. Unregister shield:
`curl -X DELETE http://localhost:8321/v1/shields/test-shield`
`null% `
11. Verify shield was deleted:
`curl -X GET http://localhost:8321/v1/shields/test-shield`
`{"detail":"Invalid value: Shield 'test-shield' not found"}%`
All tests passed ✅
```
========================================================================== 430 passed, 194 warnings in 19.54s ==========================================================================
/Users/iamiller/GitHub/llama-stack/.venv/lib/python3.12/site-packages/litellm/llms/custom_httpx/async_client_cleanup.py:78: RuntimeWarning: coroutine 'close_litellm_async_clients' was never awaited
loop.close()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Wrote HTML report to htmlcov-3.12/index.html
```
As the title says. Distributions is in, Templates is out.
`llama stack build --template` --> `llama stack build --distro`. For
backward compatibility, the previous option is kept but results in a
warning.
Updated `server.py` to remove the "config_or_template" backward
compatibility since it has been a couple releases since that change.
# What does this PR do?
Implement vector store search test
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
```
pytest tests/integration/vector_io/test_openai_vector_stores.py::test_openai_vector_store_search_modes --stack-config=http://localhost:8321 --embedding-model=all-MiniLM-L6-v2 -v
```
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR is responsible for removal of Conda support in Llama Stack
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes#2539
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
Adds support to Vector store Open AI APIs in Qdrant.
<!-- If resolving an issue, uncomment and update the line below -->
Closes#2463
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
Co-authored-by: ehhuang <ehhuang@users.noreply.github.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
This should be more robust as sometimes its run without running build
first.
## Test Plan
OLLAMA_URL=http://localhost:11434 LLAMA_STACK_TEST_INFERENCE_MODE=replay
LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings
LLAMA_STACK_CONFIG=server:starter uv run --with pytest-repeat pytest
tests/integration/telemetry
--text-model="ollama/llama3.2:3b-instruct-fp16" -vvs
# What does this PR do?
This PR (1) enables the files API for Weaviate and (2) enables
integration tests for Weaviate, which adds a docker container to the
github action.
This PR also handles a couple of edge cases for in creating the
collection and ensuring the tests all pass.
## Test Plan
CI enabled
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Improve user experience by providing specific guidance when no API key
is available, showing both provider data header and config options with
the correct field name for each provider.
Also adds comprehensive test coverage for API key resolution scenarios.
addresses #2990 for providers using litellm openai mixin
## Test Plan
`./scripts/unit-tests.sh
tests/unit/providers/inference/test_litellm_openai_mixin.py`
This PR significantly refactors the Integration Tests workflow. The main
goal behind the PR was to enable recording of vision tests which were
never run as part of our CI ever before. During debugging, I ended up
making several other changes refactoring and hopefully increasing the
robustness of the workflow.
After doing the experiments, I have updated the trigger event to be
`pull_request_target` so this workflow can get write permissions by
default but it will run with source code from the base (main) branch in
the source repository only. If you do change the workflow, you'd need to
experiment using the `workflow_dispatch` triggers. This should not be
news to anyone using Github Actions (except me!)
It is likely to be a little rocky though while I learn more about GitHub
Actions, etc. Please be patient :)
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
What does this PR do?
This PR adds support for Direct Preference Optimization (DPO) training
via the existing HuggingFace inline provider. It introduces a new DPO
training recipe, config schema updates, dataset integration, and
end-to-end testing to support preference-based fine-tuning with TRL.
Test Plan
Added integration test:
tests/integration/post_training/test_post_training.py::TestPostTraining::test_preference_optimize
Ran tests on both CPU and CUDA environments
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-43-83.ec2.internal>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
- Initialize route_impls to None in constructor to prevent
AttributeError
- Consolidate initialization checks to single point in request() method
- Improve error message to be more helpful ("Please call initialize()
first")
- Add comprehensive test suite to prevent regressions
The library client now has better error handling when users forget to
call initialize(), showing a clear ValueError instead of confusing
AttributeError. All initialization validation is now centralized in the
request() method, with internal methods (_call_non_streaming,
_call_streaming, _convert_body) relying on this single check for
cleaner, more maintainable code.
closes#2943
## Test Plan
`./scripts/unit-tests.sh`
This PR makes setting up Ollama optional for CI. By default, we use
`replay` mode for inference requests and use the stored results from the
`tests/integration/recordings/` directory.
Every so often, users will update tests which will need us to re-record.
To do this, we check for the existence of a label `re-record-tests` on
the PR. If detected,
- ollama is spun up
- inference mode is set to record
- after the tests are done, if any new changes are detected, they are
pushed back to the PR
## Test Plan
This is GitHub CI. Gotta test it live.