- Add authorization parameter to Tool Runtime API signatures (list_runtime_tools, invoke_tool)
- Update MCP provider implementation to use authorization from request body instead of provider-data
- Deprecate mcp_authorization and mcp_headers from provider-data (MCPProviderDataValidator now empty)
- Update all Tool Runtime tests to pass authorization as request body parameter
- Responses API already uses request body authorization (no changes needed)
This provides a single, consistent way to pass MCP authentication tokens across both APIs, addressing reviewer feedback about avoiding multiple configuration paths.
# Problem
Responses API uses max_tool_calls parameter to limit the number of tool
calls that can be generated in a response. Currently, LLS implementation
of the Responses API does not support this parameter.
# What does this PR do?
This pull request adds the max_tool_calls field to the response object
definition and updates the inline provider. it also ensures that:
- the total number of calls to built-in and mcp tools do not exceed
max_tool_calls
- an error is thrown if max_tool_calls < 1 (behavior seen with the
OpenAI Responses API, but we can change this if needed)
Closes #[3563](https://github.com/llamastack/llama-stack/issues/3563)
## Test Plan
- Tested manually for change in model response w.r.t supplied
max_tool_calls field.
- Added integration tests to test invalid max_tool_calls parameter.
- Added integration tests to check max_tool_calls parameter with
built-in and function tools.
- Added integration tests to check max_tool_calls parameter in the
returned response object.
- Recorded OpenAI Responses API behavior using a sample script:
https://github.com/s-akhtar-baig/llama-stack-examples/blob/main/responses/src/max_tool_calls.py
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Adds OCI GenAI PaaS models for openai chat completion endpoints.
## Test Plan
In an OCI tenancy with access to GenAI PaaS, perform the following
steps:
1. Ensure you have IAM policies in place to use service (check docs
included in this PR)
2. For local development, [setup OCI
cli](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm)
and configure the CLI with your region, tenancy, and auth
[here](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliconfigure.htm)
3. Once configured, go through llama-stack setup and run llama-stack
(uses config based auth) like:
```bash
OCI_AUTH_TYPE=config_file \
OCI_CLI_PROFILE=CHICAGO \
OCI_REGION=us-chicago-1 \
OCI_COMPARTMENT_OCID=ocid1.compartment.oc1..aaaaaaaa5...5a \
llama stack run oci
```
4. Hit the `models` endpoint to list models after server is running:
```bash
curl http://localhost:8321/v1/models | jq
...
{
"identifier": "meta.llama-4-scout-17b-16e-instruct",
"provider_resource_id": "ocid1.generativeaimodel.oc1.us-chicago-1.am...q",
"provider_id": "oci",
"type": "model",
"metadata": {
"display_name": "meta.llama-4-scout-17b-16e-instruct",
"capabilities": [
"CHAT"
],
"oci_model_id": "ocid1.generativeaimodel.oc1.us-chicago-1.a...q"
},
"model_type": "llm"
},
...
```
5. Use the "display_name" field to use the model in a
`/chat/completions` request:
```bash
# Streaming result
curl -X POST http://localhost:8321/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta.llama-4-scout-17b-16e-instruct",
"stream": true,
"temperature": 0.9,
"messages": [
{
"role": "system",
"content": "You are a funny comedian. You can be crass."
},
{
"role": "user",
"content": "Tell me a funny joke about programming."
}
]
}'
# Non-streaming result
curl -X POST http://localhost:8321/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta.llama-4-scout-17b-16e-instruct",
"stream": false,
"temperature": 0.9,
"messages": [
{
"role": "system",
"content": "You are a funny comedian. You can be crass."
},
{
"role": "user",
"content": "Tell me a funny joke about programming."
}
]
}'
```
6. Try out other models from the `/models` endpoint.
Fixed incorrect import in test_mcp_authentication.py:
- Changed: from llama_stack import LlamaStackAsLibraryClient
- To: from llama_stack.core.library_client import LlamaStackAsLibraryClient
This aligns with the correct import pattern used in other test files.
Updates integration tests to use the new mcp_authorization field
instead of the old method of passing Authorization in mcp_headers.
Changes:
- tests/integration/tool_runtime/test_mcp.py
- tests/integration/inference/test_tools_with_schemas.py
- tests/integration/tool_runtime/test_mcp_json_schema.py (6 occurrences)
All tests now use:
provider_data = {"mcp_authorization": {uri: AUTH_TOKEN}}
Instead of the old rejected format:
provider_data = {"mcp_headers": {uri: {"Authorization": f"Bearer {AUTH_TOKEN}"}}}
This aligns with the security architecture that prevents
accidentally leaking inference tokens to MCP servers.
# What does this PR do?
Resolves#4102
1. Added `web_search_2025_08_26` to the `WebSearchToolTypes` list and
the `OpenAIResponseInputToolWebSearch.type` Literal union
2. No changes needed to tool execution logic - all `web_search` types
map to the same underlying tool
3. Backward compatibility is maintained - existing `web_search`,
`web_search_preview`, and `web_search_preview_2025_03_11` types continue
to work
4. Added an integration test case using {"type":
"web_search_2025_08_26"} to verify it works correctly
5. Updated `docs/docs/providers/openai_responses_limitations.mdx` to
reflect that `web_search_2025_08_26` is now supported.
6. Removed incorrect references to `MOD1/MOD2/MOD3` (which don't exist
in the codebase)
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
---------
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
This dependency has been bothering folks for a long time (cc @leseb). We
really needed it due to "library client" which is primarily used for our
tests and is not a part of the Stack server. Anyone who needs to use the
library client can certainly install `llama-stack-client` in their
environment to make that work.
Updated the notebook references to install `llama-stack-client`
additionally when setting things up.
https://github.com/llamastack/llama-stack/pull/4055 cleaned the agents
implementation but while doing so it removed some tests which actually
corresponded to the responses implementation. This PR brings those tests
and assocated recordings back.
(We should likely combine all responses tests into one suite, but that
is beyond the scope of this PR.)
o Introduces vLLM provider support to the record/replay testing
framework
o Enabling both recording and replay of vLLM API interactions alongside
existing Ollama support.
The changes enable testing of vLLM functionality. vLLM tests focus on
inference capabilities, while Ollama continues to exercise the full API
surface
including vision features.
--
This is an alternative to #3128 , using qwen3 instead of llama 3.2 1B
appears to be more capable at structure output and tool calls.
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
# What does this PR do?
- when create vector store is called without chunk strategy, we actually
the strategy used so that the value is persisted instead of
strategy='None'
## Test Plan
updated tests
# What does this PR do?
1. Make telemetry tests as easy as possible for users by expanding the
`SpanStub` data class and creating the `MetricStub` dataclass as a way
to consistently marshal telemetry data in test fixtures and unmarshal
and handle it in tests.
2. Structure server and client tests to always follow the same standards
for consistent testing experience by using the `SpanStub` and
`MetricStub` data class objects.
3. Enable Metrics Testing for completions endpoint
4. Correct token metrics to use histograms instead of counts to capture
tokens per request rather than a cumulative count of tokens over the
lifecycle of the server.
## Test Plan
These are tests
Added a script to cleanup recordings. While doing this, moved the CI
matrix generation to a separate script so there is a single source of
truth for the matrix.
Ran the cleanup script as:
```
PYTHONPATH=. python scripts/cleanup_recordings.py
```
Also added this as part of the pre-commit workflow to ensure that the
recordings are always up to date and that no stale recordings are left
in the repo.
- Removes the deprecated agents (sessions and turns) API that was marked
alpha in 0.3.0
- Cleans up unused imports and orphaned types after the API removal
- Removes `SessionNotFoundError` and `AgentTurnInputType` which are no
longer needed
The agents API is completely superseded by the Responses + Conversations
APIs, and the client SDK Agent class already uses those implementations.
Corresponding client-side PR:
https://github.com/llamastack/llama-stack-client-python/pull/295
The llama-stack-client now uses /`v1/openai/v1/models` which returns
OpenAI-compatible model objects with 'id' and 'custom_metadata' fields
instead of the Resource-style 'identifier' field. Updated api_recorder
to handle the new endpoint and modified tests to access model metadata
appropriately. Deleted stale model recordings for re-recording.
**NOTE: CI will be red on this one since it is dependent on
https://github.com/llamastack/llama-stack-client-python/pull/291/files
landing. I verified locally that it is green.**
Without this hint Qwen3-0.6B tends to reply with the full name
and sometimes doesn't reply with the correct drafted year.
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Allow filtering for v1alpha, v1beta, deprecated and v1. Backward
incompatible change since by default it only returns v1 apis now.
## Test Plan
added unit test
# What does this PR do?
Add rerank API for NVIDIA Inference Provider.
<!-- If resolving an issue, uncomment and update the line below -->
Closes#3278
## Test Plan
Unit test:
```
pytest tests/unit/providers/nvidia/test_rerank_inference.py
```
Integration test:
```
pytest -s -v tests/integration/inference/test_rerank.py --stack-config="inference=nvidia" --rerank-model=nvidia/nvidia/nv-rerankqa-mistral-4b-v3 --env NVIDIA_API_KEY="" --env NVIDIA_BASE_URL="https://integrate.api.nvidia.com"
```
… case variations
The ollama/llama3.2:3b-instruct-fp16 model returns string values with
trailing whitespace in structured JSON output. Updated test assertions
to use case-insensitive substring matching instead of exact equality.
Use .lower() for case-insensitive comparison
Check if expected value is contained in actual value (handles
whitespace)
Closes: #3996
Signed-off-by: Derek Higgins <derekh@redhat.com>
This should be "remote::vllm". This causes some log probs tests to be
skipped with remote vllm. (They
fail if run).
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
chunk_id in the Chunk class executes actual logic to compute a chunk ID.
This sort of logic should not live in the API spec.
Instead, the providers should be in charge of calling generate_chunk_id,
and pass it to `Chunk`.
this removes the incorrect dependency between Provider impl and API impl
Signed-off-by: Charlie Doern <cdoern@redhat.com>