Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
chore: Enable keyword search for Milvus inline (#3073)
With https://github.com/milvus-io/milvus-lite/pull/294 - Milvus Lite
supports keyword search using BM25. While introducing keyword search we
had explicitly disabled it for inline milvus. This PR removes the need
for the check, and enables `inline::milvus` for tests.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Run llama stack with `inline::milvus` enabled:
```
pytest tests/integration/vector_io/test_openai_vector_stores.py::test_openai_vector_store_search_modes --stack-config=http://localhost:8321 --embedding-model=all-MiniLM-L6-v2 -v
```
```
INFO 2025-08-07 17:06:20,932 tests.integration.conftest:64 tests: Setting DISABLE_CODE_SANDBOX=1 for macOS
=========================================================================================== test session starts ============================================================================================
platform darwin -- Python 3.12.11, pytest-7.4.4, pluggy-1.5.0 -- /Users/vnarsing/miniconda3/envs/stack-client/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.12.11', 'Platform': 'macOS-14.7.6-arm64-arm-64bit', 'Packages': {'pytest': '7.4.4', 'pluggy': '1.5.0'}, 'Plugins': {'asyncio': '0.23.8', 'cov': '6.0.0', 'timeout': '2.2.0', 'socket': '0.7.0', 'html': '3.1.1', 'langsmith': '0.3.39', 'anyio': '4.8.0', 'metadata': '3.0.0'}}
rootdir: /Users/vnarsing/go/src/github/meta-llama/llama-stack
configfile: pyproject.toml
plugins: asyncio-0.23.8, cov-6.0.0, timeout-2.2.0, socket-0.7.0, html-3.1.1, langsmith-0.3.39, anyio-4.8.0, metadata-3.0.0
asyncio: mode=Mode.AUTO
collected 3 items
tests/integration/vector_io/test_openai_vector_stores.py::test_openai_vector_store_search_modes[None-None-all-MiniLM-L6-v2-None-384-vector] PASSED [ 33%]
tests/integration/vector_io/test_openai_vector_stores.py::test_openai_vector_store_search_modes[None-None-all-MiniLM-L6-v2-None-384-keyword] PASSED [ 66%]
tests/integration/vector_io/test_openai_vector_stores.py::test_openai_vector_store_search_modes[None-None-all-MiniLM-L6-v2-None-384-hybrid] PASSED [100%]
============================================================================================ 3 passed in 4.75s =============================================================================================
```
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
chore: Fixup main pre commit (#3204)
build: Bump version to 0.2.18
chore: Faster npm pre-commit (#3206)
Adds npm to pre-commit.yml installation and caches ui
Removes node installation during pre-commit.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
chiecking in for tonight, wip moving to agents api
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
remove log
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
updated
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
fix: disable ui-prettier & ui-eslint (#3207)
chore(pre-commit): add pre-commit hook to enforce llama_stack logger usage (#3061)
This PR adds a step in pre-commit to enforce using `llama_stack` logger.
Currently, various parts of the code base uses different loggers. As a
custom `llama_stack` logger exist and used in the codebase, it is better
to standardize its utilization.
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>
fix: fix ```openai_embeddings``` for asymmetric embedding NIMs (#3205)
NVIDIA asymmetric embedding models (e.g.,
`nvidia/llama-3.2-nv-embedqa-1b-v2`) require an `input_type` parameter
not present in the standard OpenAI embeddings API. This PR adds the
`input_type="query"` as default and updates the documentation to suggest
using the `embedding` API for passage embeddings.
<!-- If resolving an issue, uncomment and update the line below -->
Resolves#2892
```
pytest -s -v tests/integration/inference/test_openai_embeddings.py --stack-config="inference=nvidia" --embedding-model="nvidia/llama-3.2-nv-embedqa-1b-v2" --env NVIDIA_API_KEY={nvidia_api_key} --env NVIDIA_BASE_URL="https://integrate.api.nvidia.com"
```
cleaning up
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
updating session manager to cache messages locally
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
fix linter
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
more cleanup
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
I started this PR trying to unbreak a newly broken test
`test_agent_name`. This test was broken all along but did not show up
because during testing we were pulling the "non-updated" llama stack
client. See this comment:
https://github.com/llamastack/llama-stack/pull/3119#discussion_r2270988205
While fixing this, I encountered a large amount of badness in our CI
workflow definitions.
- We weren't passing `LLAMA_STACK_DIR` or `LLAMA_STACK_CLIENT_DIR`
overrides to `llama stack build` at all in some cases.
- Even when we did, we used `uv run` liberally. The first thing `uv run`
does is "syncs" the project environment. This means, it is going to undo
any mutations we might have done ourselves. But we make many mutations
in our CI runners to these environments. The most important of which is
why `llama stack build` where we install distro dependencies. As a
result, when you tried to run the integration tests, you would see old,
strange versions.
## Test Plan
Re-record using:
```
sh scripts/integration-tests.sh --stack-config ci-tests \
--provider ollama --test-pattern test_agent_name --inference-mode record
```
Then re-run with `--inference-mode replay`. But:
Eventually, this test turned out to be quite flaky for telemetry
reasons. I haven't investigated it for now and just disabled it sadly
since we have a release to push out.
# What does this PR do?
Add CodeScanner implementations
## Test Plan
`SAFETY_MODEL=CodeScanner LLAMA_STACK_CONFIG=starter uv run pytest -v
tests/integration/safety/test_safety.py
--text-model=llama3.2:3b-instruct-fp16
--embedding-model=all-MiniLM-L6-v2 --safety-shield=ollama`
This PR need to land after this
https://github.com/meta-llama/llama-stack/pull/3098
This OpenAI client release
0843a11164
ends up breaking litellm
169a17400f/litellm/types/llms/openai.py (L40)
Update the dependency pin. Also make the imports a bit more defensive
anyhow if something else during `llama stack build` ends up moving
openai to a previous version.
## Test Plan
Run pre-release script integration tests.
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
I noticed somehow
[build_conda_env.sh](https://github.com/llamastack/llama-stack/blob/main/llama_stack/core/build_conda_env.sh)
exists in main branch. We need to kill it to be consistent with
[#2969](https://github.com/llamastack/llama-stack/pull/2969)
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
Update triagers to current state
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
commands where the output is important like `llama stack build
--print-deps-only` (soon to be `llama stack show`) print some log.py
`cprint`'s on _every_ execution of the CLI
for example:
<img width="912" height="331" alt="Screenshot 2025-08-18 at 1 16 30 PM"
src="https://github.com/user-attachments/assets/e5bf18fb-74a1-438c-861a-8a26eea7d014"
/>
the yellow text is likely unnecessary.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Small docs change as requested in
https://github.com/llamastack/llama-stack/pull/3160#pullrequestreview-3125038932
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
See comment here:
https://github.com/llamastack/llama-stack/pull/3162#issuecomment-3192859097
-- TL;DR it is quite complex to invoke the recording workflow correctly
for an end developer writing tests. This script simplifies the work.
No more manual GitHub UI navigation!
## Script Functionality
- Auto-detects your current branch and associated PR
- Finds the right repository context (works from forks!)
- Runs the workflow where it can actually commit back
- Validates prerequisites and provides helpful error messages
## How to Use
First ensure you are on the branch which introduced a new test and want
it recorded. **Make sure you have pushed this branch remotely, easiest
is to create a PR.**
```
# Record tests for current branch
./scripts/github/schedule-record-workflow.sh
# Record specific test subdirectories
./scripts/github/schedule-record-workflow.sh --test-subdirs "agents,inference"
# Record with vision tests enabled
./scripts/github/schedule-record-workflow.sh --run-vision-tests
# Record tests matching a pattern
./scripts/github/schedule-record-workflow.sh --test-pattern "test_streaming"
```
## Test Plan
Ran `./scripts/github/schedule-record-workflow.sh -s inference -k
tool_choice` which started
4820409329
which successfully committed recorded outputs.
# What does this PR do?
Recording tests has become a nightmare. This is the first part of making
that process simpler by making it _less_ automatic. I tried to be too
clever earlier.
It simplifies the record-integration-tests workflow to use workflow
dispatch inputs instead of PR labels. No more opaque stuff. Just go to
the GitHub UI and run the workflow with inputs. I will soon add a helper
script for this also.
Other things to aid re-running just the small set of things you need to
re-record:
- Replaces the `test-types` JSON array parameter with a more intuitive
`test-subdirs` comma-separated list. The whole JSON array crap was for
matrix.
- Adds a new `test-pattern` parameter to allow filtering tests using
pytest's `-k` option
## Test Plan
Note that this PR is in a fork not the source repository.
- Replay tests on this PR are green
- Manually
[ran](1699856292)
the replay workflow with a test-subdir and test-pattern filter, worked
- Manually
[ran](4819508034)
the **record** workflow with a simple pattern, it has worked and updated
_this_ PR.
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Replace chat_completion calls with openai_chat_completion to eliminate
dependency on legacy inference APIs.
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
Closes#3067
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
Creates a structured testing documentation section with multiple detailed pages:
- Testing overview explaining the record-replay architecture
- Integration testing guide with practical usage examples
- Record-replay system technical documentation
- Guide for writing effective tests
- Troubleshooting guide for common testing issues
Hopefully this makes things a bit easier.
# What does this PR do?
Updates test recordings.
## Test Plan
Started ollama serving the 3.2:3b model. Then ran the server:
```
LLAMA_STACK_TEST_INFERENCE_MODE=record \
LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings/ \
SQLITE_STORE_DIR=$(mktemp -d) \
OLLAMA_URL=http://localhost:11434 \
llama stack build --template starter --image-type venv --run
```
Then ran the tests which needed recording:
```
pytest -sv tests/integration/agents/test_openai_responses.py \
--stack-config=server:starter \
--text-model ollama/llama3.2:3b-instruct-fp16 -k test_responses_store
```
Then, restarted the server with `LLAMA_STACK_TEST_INFERENCE_MODE=replay`, re-ran the tests and verified they passed.
# What does this PR do?
A _bunch_ on cleanup for the Responses tests.
- Got rid of YAML test cases, moved them to just use simple pydantic models
- Splitting the large monolithic test file into multiple focused test files:
- `test_basic_responses.py` for basic and image response tests
- `test_tool_responses.py` for tool-related tests
- `test_file_search.py` for file search specific tests
- Adding a `StreamingValidator` helper class to standardize streaming response validation
## Test Plan
Run the tests:
```
pytest -s -v tests/integration/non_ci/responses/ \
--stack-config=starter \
--text-model openai/gpt-4o \
--embedding-model=sentence-transformers/all-MiniLM-L6-v2 \
-k "client_with_models"
```
# What does this PR do?
Adds proper streaming events for MCP tool listing (`mcp_list_tools.in_progress` and `mcp_list_tools.completed`). Also refactors things a bit more.
## Test Plan
Verified existing integration tests pass with the refactored code. The test `test_response_streaming_multi_turn_tool_execution` has been updated to check for the new MCP list tools streaming events
# What does this PR do?
Refactors the OpenAI response conversion utilities by moving helper functions from `openai_responses.py` to `utils.py`. Adds unit tests.
# What does this PR do?
Refactors the OpenAI responses implementation by extracting streaming and tool execution logic into separate modules. This improves code organization by:
1. Creating a new `StreamingResponseOrchestrator` class in `streaming.py` to handle the streaming response generation logic
2. Moving tool execution functionality to a dedicated `ToolExecutor` class in `tool_executor.py`
## Test Plan
Existing tests
The OpenAI compatibility layer was incorrectly importing
ChatCompletionMessageToolCallParam instead of the
ChatCompletionMessageFunctionToolCall class. This caused "Cannot
instantiate typing.Union" errors when processing agent requests with
tool calls.
Closes: #3141
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Adds content part streaming events to the OpenAI-compatible Responses API to support more granular streaming of response content. This introduces:
1. New schema types for content parts: `OpenAIResponseContentPart` with variants for text output and refusals
2. New streaming event types:
- `OpenAIResponseObjectStreamResponseContentPartAdded` for when content parts begin
- `OpenAIResponseObjectStreamResponseContentPartDone` for when content parts complete
3. Implementation in the reference provider to emit these events during streaming responses. Also emits MCP arguments just like function call ones.
## Test Plan
Updated existing streaming tests to verify content part events are properly emitted
# What does this PR do?
Enhances tool execution streaming by adding support for real-time progress events during tool calls. This implementation adds streaming events for MCP and web search tools, including in-progress, searching, completed, and failed states.
The refactored `_execute_tool_call` method now returns an async iterator that yields streaming events throughout the tool execution lifecycle.
## Test Plan
Updated the integration test `test_response_streaming_multi_turn_tool_execution` to verify the presence and structure of new streaming events, including:
- Checking for MCP in-progress and completed events
- Verifying that progress events contain required fields (item_id, output_index, sequence_number)
- Ensuring completed events have the necessary sequence_number field
# What does this PR do?
To be compliant with model policies for LLAMA, just return the
categories as is from provider, we will lose the OAI compat in
moderations api response.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
`SAFETY_MODEL=llama-guard3:8b LLAMA_STACK_CONFIG=starter uv run pytest
-v tests/integration/safety/test_safety.py
--text-model=llama3.2:3b-instruct-fp16
--embedding-model=all-MiniLM-L6-v2 --safety-shield=ollama`
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
The purpose of this PR is to eliminate hardcoded status codes in
server's responses and replace it by `httpx.codes` functionality for
better consistency across the whole project and improvement in code
readability.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Run `./scripts/unit-tests.sh`
**Description:**
The standard markdown [!NOTE] format is not supported on Sphinx
generated documentation, replacing those instances. Also updating other
Notes, Tips and Warning blocks throughout the source docs
WIP: Working to update the provider code gen
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
The purpose of this PR is to make the behavior DELETE API endpoints be
consistent with standard RESTful conventions and eliminate confusion for
API consumers.
Old Behavior
```
HTTP Status: 200 OK
Response Body: null
```
Eg. `curl -X DELETE http://localhost:8321/v1/shields/test-shield`
`null% `
`INFO 2025-08-12 16:11:57,932 console_span_processor:65 telemetry:
15:11:57.929 [INFO] ::1:59805 - "DELETE /v1/shields/test-shield
HTTP/1.1" 200 `
Updated Behavior
```
HTTP Status: 204 No Content
Response Body: empty (no body)
```
Eg. `curl -X DELETE http://localhost:8321/v1/shields/test-shield`
`INFO 2025-08-12 16:18:16,645 console_span_processor:62 telemetry:
15:18:16.637 [INFO] ::1:60283 - "DELETE /v1/shields/test-shield
HTTP/1.1" 204 `
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes#3090
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Run `./scripts/unit-tests.sh`
# What does this PR do?
1. Updates `AgentPersistence.list_sessions()` to properly filter out
`Turn` keys from `Session` keys.
2. Adds a suite of unit tests to confirm the `list_sessions()` behavior
and tests the failed sample in
https://github.com/meta-llama/llama-stack/issues/3048
## Fixes https://github.com/meta-llama/llama-stack/issues/3048
## Test Plan
Unit tests added.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR changes the group name from github.ref to
github.even.pull_request_number. The reason for this is that github.ref
does not act as a unique identifier in the pull_request_target event and
only is unique in pull_request. The github action was getting canceled
was because the group name was not unique in the concurrency section.
<!-- If resolving an issue, uncomment and update the line below -->
Closes#3102
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
To test this I have created a fake github action and ran it trough act
to see what the github.ref variable produced and what alternatives can
be used. This confirmed that the github.ref was not unique and that
github.event.pull_request_number is unique to the PR.