Some fixes to MCP tests. And a bunch of fixes for Vector providers.
I also enabled a bunch of Vector IO tests to be used with
`LlamaStackLibraryClient`
## Test Plan
Run Responses tests with llama stack library client:
```
pytest -s -v tests/integration/non_ci/responses/ --stack-config=server:starter \
--text-model openai/gpt-4o \
--embedding-model=sentence-transformers/all-MiniLM-L6-v2 \
-k "client_with_models"
```
Do the same with `-k openai_client`
The rest should be taken care of by CI.
Well our Responses tests use it so we better include it in the API, no?
I discovered it because I want to make sure `llama-stack-client` can be
used always instead of `openai-python` as the client (we do want to be
_truly_ compatible.)
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR addresses an issue where `PromptGuardSafetyImpl` was an
incomplete implementation of an abstract class. The class was missing
the required run_moderation method from its parent interface.
Currently, running `pre-commit` locally fails with the error below.
```
llama_stack/providers/inline/safety/prompt_guard/__init__.py:15: error: Cannot instantiate abstract class "PromptGuardSafetyImpl" with abstract attribute "run_moderation" [abstract]
Found 1 error in 1 file (checked 410 source files)
```
This PR fixes the issue as follows
- Added the missing run_moderation method to PromptGuardSafetyImpl
- Method raises NotImplementedError with appropriate message indicating
this functionality is not implemented for PromptGuard
- This allows the class to be properly instantiated while clearly
indicating the limitation
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
Using commas is much more shell-friendly. A semi-colon is a statement
delimiter and must be escaped.
This change is backwards incompatible but I imagine not many people are
using this. I could be wrong. Looking for feedback.
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
This updates the sidebar to look a little more like other popular ones.
<img width="1913" height="1352" alt="Screenshot 2025-08-08 at 11 25
31 PM"
src="https://github.com/user-attachments/assets/00738412-1101-48ec-8864-cde4a8733ec1"
/>
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
- Add new Vertex AI remote inference provider with litellm integration
- Support for Gemini models through Google Cloud Vertex AI platform
- Uses Google Cloud Application Default Credentials (ADC) for
authentication
- Added VertexAI models: gemini-2.5-flash, gemini-2.5-pro,
gemini-2.0-flash.
- Updated provider registry to include vertexai provider
- Updated starter template to support Vertex AI configuration
- Added comprehensive documentation and sample configuration
<!-- If resolving an issue, uncomment and update the line below -->
relates to https://github.com/meta-llama/llama-stack/issues/2747
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Eran Cohen <eranco@redhat.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
Update Milvus doc on using search modes.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
PR adds Flash-Lite 2.0 and 2.5 models to the Gemini inference provider
Closes#3046
## Test Plan
I was not able to locate any existing test for this provider, so I
performed manual testing. But the change is really trivial and
straightforward.
# What does this PR do?
This PR updates the UI to create new:
1. `/files/{file_id}`
2. `files/{file_id}/contents`
3. `files/{file_id}/contents/{content_id}`
The list of files are clickable which brings the user to the FIles
Detail page
The File Details page shows all of the content
The content details page shows the individual chunk/content parsed
These only use our existing OpenAI compatible APIs. I have a separate
branch where I expose the embedding and the portal is correctly
populated. I included the FE rendering code for that in this PR.
1. `vector-stores/{vector_store_id}/files/{file_id}`
<img width="1913" height="1351" alt="Screenshot 2025-08-06 at 10 20
12 PM"
src="https://github.com/user-attachments/assets/08010d5e-60c8-4bd9-9f3e-a2731ed1ad55"
/>
2. `vector-stores/{vector_store_id}/files/{file_id}/contents`
<img width="1920" height="1272" alt="Screenshot 2025-08-06 at 10 21
23 PM"
src="https://github.com/user-attachments/assets/3b91e67b-5d64-4fe6-91b6-18f14587e850"
/>
3.
`vector-stores/{vector_store_id}/files/{file_id}/contents/{content_id}`
<img width="1916" height="1273" alt="Screenshot 2025-08-06 at 10 21
45 PM"
src="https://github.com/user-attachments/assets/d38ca996-e8d9-460c-9e39-7ff0cb5ec0dd"
/>
## Test Plan
I tested this locally and reviewed the code. I generated a significant
share of the code with Claude and some manual intervention. After this,
I'll begin adding tests to the UI.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
This PR implements hybrid search for Milvus DB based on the inbuilt
milvus support.
To test:
```
pytest tests/unit/providers/vector_io/remote/test_milvus.py -v -s
--tb=long --disable-warnings --asyncio-mode=auto
```
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
# What does this PR do?
This PR adds Open AI Compatible moderations api. Currently only
implementing for llama guard safety provider
Image support, expand to other safety providers and Deprecation of
run_shield will be next steps.
## Test Plan
Added 2 new tests for safe/ unsafe text prompt examples for the new open
ai compatible moderations api usage
`SAFETY_MODEL=llama-guard3:8b LLAMA_STACK_CONFIG=starter uv run pytest
-v tests/integration/safety/test_safety.py
--text-model=llama3.2:3b-instruct-fp16
--embedding-model=all-MiniLM-L6-v2 --safety-shield=ollama`
(Had some issue with previous PR
https://github.com/meta-llama/llama-stack/pull/2994 while updating and
accidentally close it , reopened new one )
# What does this PR do?
I found a few issues while adding new metrics for various APIs:
currently metrics are only propagated in `chat_completion` and
`completion`
since most providers use the `openai_..` routes as the default in
`llama-stack-client inference chat-completion`, metrics are currently
not working as expected.
in order to get them working the following had to be done:
1. get the completion as usual
2. use new `openai_` versions of the metric gathering functions which
use `.usage` from the `OpenAI..` response types to gather the metrics
which are already populated.
3. define a `stream_generator` which counts the tokens and computes the
metrics (only for stream=True)
5. add metrics to response
NOTE: I could not add metrics to `openai_completion` where stream=True
because that ONLY returns an `OpenAICompletion` not an AsyncGenerator
that we can manipulate.
acquire the lock, and add event to the span as the other `_log_...`
methods do
some new output:
`llama-stack-client inference chat-completion --message hi`
<img width="2416" height="425" alt="Screenshot 2025-07-16 at 8 28 20 AM"
src="https://github.com/user-attachments/assets/ccdf1643-a184-4ddd-9641-d426c4d51326"
/>
and in the client:
<img width="763" height="319" alt="Screenshot 2025-07-16 at 8 28 32 AM"
src="https://github.com/user-attachments/assets/6bceb811-5201-47e9-9e16-8130f0d60007"
/>
these were not previously being recorded nor were they being printed to
the server due to the improper console sink handling
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
1. Introduce new base custom exception class `ResourceNotFoundError`
2. All other "not found" exception classes now inherit from
`ResourceNotFoundError`
Closes#3030
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
A bunch of miscellaneous cleanup focusing on tests, but ended up
speeding up starter distro substantially.
- Pulled llama stack client init for tests into `pytest_sessionstart` so
it does not clobber output
- Profiling of that told me where we were doing lots of heavy imports
for starter, so lazied them
- starter now starts 20seconds+ faster on my Mac
- A few other smallish refactors for `compat_client`
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
Extend the Shields Protocol and implement the capability to unregister
previously registered shields and CLI for shields management.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes#2581
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
First of, test API for shields
1. Install and start Ollama:
`ollama serve`
2. Pull Llama Guard Model in Ollama:
`ollama pull llama-guard3:8b`
3. Configure env variables:
```
export ENABLE_OLLAMA=ollama
export OLLAMA_URL=http://localhost:11434
```
4. Build Llama Stack distro:
`llama stack build --template starter --image-type venv `
5. Start Llama Stack server:
`llama stack run starter --port 8321`
6. Check if Ollama model is available:
`curl -X GET http://localhost:8321/v1/models | jq '.data[] |
select(.provider_id=="ollama")'`
7. Register a new Shield using Ollama provider:
```
curl -X POST http://localhost:8321/v1/shields \
-H "Content-Type: application/json" \
-d '{
"shield_id": "test-shield",
"provider_id": "llama-guard",
"provider_shield_id": "ollama/llama-guard3:8b",
"params": {}
}'
```
`{"identifier":"test-shield","provider_resource_id":"ollama/llama-guard3:8b","provider_id":"llama-guard","type":"shield","owner":{"principal":"","attributes":{}},"params":{}}%
`
8. Check if shield was registered:
`curl -X GET http://localhost:8321/v1/shields/test-shield`
`{"identifier":"test-shield","provider_resource_id":"ollama/llama-guard3:8b","provider_id":"llama-guard","type":"shield","owner":{"principal":"","attributes":{}},"params":{}}%
`
9. Run shield:
```
curl -X POST http://localhost:8321/v1/safety/run-shield \
-H "Content-Type: application/json" \
-d '{
"shield_id": "test-shield",
"messages": [
{
"role": "user",
"content": "How can I hack into someone computer?"
}
],
"params": {}
}'
```
`{"violation":{"violation_level":"error","user_message":"I can't answer
that. Can I help with something
else?","metadata":{"violation_type":"S2"}}}% `
10. Unregister shield:
`curl -X DELETE http://localhost:8321/v1/shields/test-shield`
`null% `
11. Verify shield was deleted:
`curl -X GET http://localhost:8321/v1/shields/test-shield`
`{"detail":"Invalid value: Shield 'test-shield' not found"}%`
All tests passed ✅
```
========================================================================== 430 passed, 194 warnings in 19.54s ==========================================================================
/Users/iamiller/GitHub/llama-stack/.venv/lib/python3.12/site-packages/litellm/llms/custom_httpx/async_client_cleanup.py:78: RuntimeWarning: coroutine 'close_litellm_async_clients' was never awaited
loop.close()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Wrote HTML report to htmlcov-3.12/index.html
```
# What does this PR do?
1. Creates a new `SessionNotFoundError` class
2. Implements the new class where appropriate
Relates to #2379
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
1. Creates a new `ToolGroupNotFoundError` class
2. Implements the new class where appropriate
Relates to #2379
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
As the title says. Distributions is in, Templates is out.
`llama stack build --template` --> `llama stack build --distro`. For
backward compatibility, the previous option is kept but results in a
warning.
Updated `server.py` to remove the "config_or_template" backward
compatibility since it has been a couple releases since that change.
# What does this PR do?
Remove score_threshold based check from `OpenAIVectorStoreMixin`
Closes: https://github.com/meta-llama/llama-stack/issues/3018
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR is responsible for removal of Conda support in Llama Stack
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes#2539
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
closes#2995
update SambaNovaInferenceAdapter to efficiently use LiteLLMOpenAIMixin
## Test Plan
```
$ uv run pytest -s -v tests/integration/inference --stack-config inference=sambanova --text-model sambanova/Meta-Llama-3.1-8B-Instruct
...
======================== 10 passed, 84 skipped, 3 xfailed, 51 warnings in 8.14s ========================
```
# What does this PR do?
Adds support to Vector store Open AI APIs in Qdrant.
<!-- If resolving an issue, uncomment and update the line below -->
Closes#2463
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
Co-authored-by: ehhuang <ehhuang@users.noreply.github.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
This PR (1) enables the files API for Weaviate and (2) enables
integration tests for Weaviate, which adds a docker container to the
github action.
This PR also handles a couple of edge cases for in creating the
collection and ensuring the tests all pass.
## Test Plan
CI enabled
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Improve user experience by providing specific guidance when no API key
is available, showing both provider data header and config options with
the correct field name for each provider.
Also adds comprehensive test coverage for API key resolution scenarios.
addresses #2990 for providers using litellm openai mixin
## Test Plan
`./scripts/unit-tests.sh
tests/unit/providers/inference/test_litellm_openai_mixin.py`
get_vector_db() will raise an exception if a vector store won't be
returned
client handling is redundant
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
What does this PR do?
This PR adds support for Direct Preference Optimization (DPO) training
via the existing HuggingFace inline provider. It introduces a new DPO
training recipe, config schema updates, dataset integration, and
end-to-end testing to support preference-based fine-tuning with TRL.
Test Plan
Added integration test:
tests/integration/post_training/test_post_training.py::TestPostTraining::test_preference_optimize
Ran tests on both CPU and CUDA environments
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-43-83.ec2.internal>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
I've been tinkering a little with a simple chat playground in the UI, so
I'm opening the PR with what's kind of a WIP.
If you look at the first commit, that includes the big part of the
changes. The rest of the files changed come from adding installing the
`shadcn` components.
Note this is missing a lot; e.g.,
- sessions
- document upload
- audio (the shadcn components install these by default from
https://shadcn-chatbot-kit.vercel.app/docs/components/chat)
I still need to wire up a lot more to make it actually fully functional
but it does basic chat using the LS Typescript Client.
Basic demo:
<img width="1329" height="1430" alt="Image"
src="https://github.com/user-attachments/assets/917a2096-36d4-4925-b83b-f1f2cda98698"
/>
<img width="1319" height="1424" alt="Image"
src="https://github.com/user-attachments/assets/fab1583b-1c72-4bf3-baf2-405aee13c6bb"
/>
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
This PR focuses on improving the developer experience by adding
comprehensive docstrings to the API data models across the Llama Stack.
These docstrings provide detailed explanations for each model and its
fields, making the API easier to understand and use.
**Key changes:**
- **Added Docstrings:** Added reST formatted docstrings to Pydantic
models in the `llama_stack/apis/` directory. This includes models for:
- Agents (`agents.py`)
- Benchmarks (`benchmarks.py`)
- Datasets (`datasets.py`)
- Inference (`inference.py`)
- And many other API modules.
- **OpenAPI Spec Update:** Regenerated the OpenAPI specification
(`docs/_static/llama-stack-spec.yaml` and
`docs/_static/llama-stack-spec.html`) to include the new docstrings.
This will be reflected in the API documentation, providing richer
information to users.
**Impact:**
- Developers using the Llama Stack API will have a better understanding
of the data structures.
- The auto-generated API documentation is now more informative.
---------
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
1. Creates a new `VectorStoreNotFoundError` class
2. Implements the new class where appropriate
Relates to #2379
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>