# What does this PR do?
This PR checks whether, if a previous response is linked, there are
mcp_list_tools objects that can be reused instead of listing the tools
explicitly every time.
Closes#3106
## Test Plan
Tested manually.
Added unit tests to cover new behaviour.
---------
Signed-off-by: Gordon Sim <gsim@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
use SecretStr for OpenAIMixin providers
- RemoteInferenceProviderConfig now has auth_credential: SecretStr
- the default alias is api_key (most common name)
- some providers override to use api_token (RunPod, vLLM, Databricks)
- some providers exclude it (Ollama, TGI, Vertex AI)
addresses #3517
## Test Plan
ci w/ new tests
Updated scripts/normalize_recordings.py to dynamically find and process
all 'recordings' directories under tests/ using pathlib.rglob() instead
of hardcoding a single path.
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
Allows model check to fail gracefully instead of crashing on startup.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
set VLLM_URL to your VLLM server
```
(base) akram@Mac llama-stack % LAMA_STACK_LOGGING="all=debug" VLLM_ENABLE_MODEL_DISCOVERY=false MILVUS_DB_PATH=./milvus.db INFERENCE_MODEL=vllm uv run --with llama-stack llama stack build --distro starter --image-type venv --run
```
```
INFO 2025-10-08 20:11:24,637 llama_stack.providers.utils.inference.inference_store:74 inference: Write queue disabled for SQLite to avoid concurrency issues
INFO 2025-10-08 20:11:24,866 llama_stack.providers.utils.responses.responses_store:96 openai_responses: Write queue disabled for SQLite to avoid concurrency issues
ERROR 2025-10-08 20:11:26,160 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: VLLMInferenceAdapter.list_provider_model_ids() failed with: <a
href="https://oauth.akram.a1ey.p3.openshiftapps.com:443/oauth/authorize?approval_prompt=force&client_id=system%3Aserviceaccount%3Arhoai-30-genai%3Adefault&redirect_uri=ht
tps%3A%2F%2Fvllm-rhoai-30-genai.apps.rosa.akram.a1ey.p3.openshiftapps.com%2Foauth%2Fcallback&response_type=code&scope=user%3Ainfo+user%3Acheck-access&state=9fba207425
5851c718aca717a5887d76%3A%2Fmodels">Found</a>.
[...]
INFO 2025-10-08 20:11:26,295 uvicorn.error:84 uncategorized: Started server process [83144]
INFO 2025-10-08 20:11:26,296 uvicorn.error:48 uncategorized: Waiting for application startup.
INFO 2025-10-08 20:11:26,297 llama_stack.core.server.server:170 core::server: Starting up
INFO 2025-10-08 20:11:26,297 llama_stack.core.stack:399 core: starting registry refresh task
INFO 2025-10-08 20:11:26,311 uvicorn.error:62 uncategorized: Application startup complete.
INFO 2025-10-08 20:11:26,312 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
ERROR 2025-10-08 20:11:26,791 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: VLLMInferenceAdapter.list_provider_model_ids() failed with: <a
href="https://oauth.akram.a1ey.p3.openshiftapps.com:443/oauth/authorize?approval_prompt=force&client_id=system%3Aserviceaccount%3Arhoai-30-genai%3Adefault&redirect_uri=ht
tps%3A%2F%2Fvllm-rhoai-30-genai.apps.rosa.akram.a1ey.p3.openshiftapps.com%2Foauth%2Fcallback&response_type=code&scope=user%3Ainfo+user%3Acheck-access&state=8ef0cba3e1
71a4f8b04cb445cfb91a4c%3A%2Fmodels">Found</a>.
```
## Summary
Adds OpenAI-compatible usage tracking types to enable reporting token
consumption for both streaming and non-streaming responses.
## Type Definitions
**Chat Completion Usage** (inference API):
```python
class OpenAIChatCompletionUsage(BaseModel):
prompt_tokens: int
completion_tokens: int
total_tokens: int
prompt_tokens_details: OpenAIChatCompletionUsagePromptTokensDetails | None
completion_tokens_details: OpenAIChatCompletionUsageCompletionTokensDetails | None
```
**Response Usage** (responses API):
```python
class OpenAIResponseUsage(BaseModel):
input_tokens: int
output_tokens: int
total_tokens: int
input_tokens_details: OpenAIResponseUsageInputTokensDetails | None
output_tokens_details: OpenAIResponseUsageOutputTokensDetails | None
```
This matches OpenAI's usage reporting format and enables PR #3766 to
implement usage tracking in streaming responses.
Co-authored-by: Claude <noreply@anthropic.com>
## Summary
After removing model management CLI in #3700, this PR updates remaining
references to the old `llama download` command to use `huggingface-cli
download` instead.
## Changes
- Updated error messages in `meta_reference/common.py` to recommend
`huggingface-cli download`
- Updated error messages in
`torchtune/recipes/lora_finetuning_single_device.py` to use
`huggingface-cli download`
- Updated post-training notebook to use `huggingface-cli download`
instead of `llama download`
- Fixed typo: "you model" -> "your model"
## Test Plan
- Verified error messages provide correct guidance for users
- Checked that notebook instructions are up-to-date with current tooling
## Summary
Fixes#2990
Remote provider authentication errors (401/403) were being converted to
500 Internal Server Error, preventing users from understanding why their
requests failed.
## The Problem
When a request with an invalid API key was sent to a remote provider:
- Provider correctly returns 401 with error details
- Llama Stack's `translate_exception()` didn't recognize provider SDK
exceptions
- Fell through to generic 500 error handler
- User received: "Internal server error: An unexpected error occurred."
## The Fix
Added handler in `translate_exception()` that checks for exceptions with
a `status_code` attribute and preserves the original HTTP status code
and error message.
**Before:**
```json
HTTP 500
{"detail": "Internal server error: An unexpected error occurred."}
```
**After:**
```json
HTTP 401
{"detail": "Error code: 401 - {'error': {'message': 'Invalid API Key', 'type': 'invalid_request_error', 'code': 'invalid_api_key'}}"}
```
## Tested With
- ✅ groq: 401 "Invalid API Key"
- ✅ openai: 401 "Incorrect API key provided"
- ✅ together: 401 "Invalid API key provided"
- ✅ fireworks: 403 "unauthorized"
## Test Plan
**Automated test script:**
https://gist.github.com/ashwinb/1199dd7585ffa3f4be67b111cc65f2f3
The test script:
1. Builds separate stacks for each provider
2. Registers models (with validation temporarily disabled for testing)
3. Sends requests with invalid API keys via `x-llamastack-provider-data`
header
4. Verifies HTTP status codes are 401/403 (not 500)
**Results before fix:** All providers returned 500
**Results after fix:** All providers correctly return 401/403
**Manual verification:**
```bash
# 1. Build stack
llama stack build --image-type venv --providers inference=remote::groq
# 2. Start stack
llama stack run
# 3. Send request with invalid API key
curl http://localhost:8321/v1/chat/completions \
-H "Content-Type: application/json" \
-H 'x-llamastack-provider-data: {"groq_api_key": "invalid-key"}' \
-d '{"model": "groq/llama3-70b-8192", "messages": [{"role": "user", "content": "test"}]}'
# Expected: HTTP 401 with provider error message (not 500)
```
## Impact
- Works with all remote providers using OpenAI SDK (groq, openai,
together, fireworks, etc.)
- Works with any provider SDK that follows the pattern of exceptions
with `status_code` attribute
- No breaking changes - only affects error responses
# What does this PR do?
objects (vector dbs, models, scoring functions, etc) have an identifier
and associated object values.
we allow exact duplicate registrations.
we reject registrations when the identifier exists and the associated
object values differ.
note: model are namespaced, i.e. {provider_id}/{identifier}, while other
object types are not
## Test Plan
ci w/ new tests
This change removes the `llama model` and `llama download` subcommands
from the CLI, replacing them with recommendations to use the Hugging
Face CLI instead.
Rationale for this change:
- The model management functionality was largely duplicating what
Hugging Face CLI already provides, leading to unnecessary maintenance
overhead (except the download source from Meta?)
- Maintaining our own implementation required fixing bugs and keeping up
with changes in model repositories and download mechanisms
- The Hugging Face CLI is more mature, widely adopted, and better
maintained
- This allows us to focus on the core Llama Stack functionality rather
than reimplementing model management tools
Changes made:
- Removed all model-related CLI commands and their implementations
- Updated documentation to recommend using `huggingface-cli` for model
downloads
- Removed Meta-specific download logic and statements
- Simplified the CLI to focus solely on stack management operations
Users should now use:
- `huggingface-cli download` for downloading models
- `huggingface-cli scan-cache` for listing downloaded models
This is a breaking change as it removes previously available CLI
commands.
Signed-off-by: Sébastien Han <seb@redhat.com>
Replaces opaque error messages when recordings are not found with
somewhat better guidance
Before:
```
No recorded response found for request hash: abc123...
To record this response, run with LLAMA_STACK_TEST_INFERENCE_MODE=record
```
After:
```
Recording not found for request hash: abc123
Model: gpt-4 | Request: POST https://api.openai.com/v1/chat/completions
Run './scripts/integration-tests.sh --inference-mode record-if-missing' with required API keys to generate.
```
These vector databases are already thoroughly tested in integration
tests.
Unit tests now focus on sqlite_vec, faiss, and pgvector with mocked
dependencies, removing the need for external service dependencies.
## Changes:
- Deleted test_qdrant.py unit test file
- Removed chroma/qdrant fixtures and parametrization from conftest.py
- Fixed SqliteKVStoreConfig import to use correct location
- Removed chromadb, qdrant-client, pymilvus, milvus-lite, and
weaviate-client from unit test dependencies in pyproject.toml
Renames `inference_recorder.py` to `api_recorder.py` and extends it to
support recording/replaying tool invocations in addition to inference
calls.
This allows us to record web-search, etc. tool calls and thereafter
apply recordings for `tests/integration/responses`
## Test Plan
```
export OPENAI_API_KEY=...
export TAVILY_SEARCH_API_KEY=...
./scripts/integration-tests.sh --stack-config ci-tests \
--suite responses --inference-mode record-if-missing
```
# What does this PR do?
## Test Plan
# What does this PR do?
## Test Plan
# What does this PR do?
## Test Plan
Completes the refactoring started in previous commit by:
1. **Fix library client** (critical): Add logic to detect Pydantic model parameters
and construct them properly from request bodies. The key fix is to NOT exclude
any params when converting the body for Pydantic models - we need all fields
to pass to the Pydantic constructor.
Before: _convert_body excluded all params, leaving body empty for Pydantic construction
After: Check for Pydantic params first, skip exclusion, construct model with full body
2. **Update remaining providers** to use new Pydantic-based signatures:
- litellm_openai_mixin: Extract extra fields via __pydantic_extra__
- databricks: Use TYPE_CHECKING import for params type
- llama_openai_compat: Use TYPE_CHECKING import for params type
- sentence_transformers: Update method signatures to use params
3. **Update unit tests** to use new Pydantic signature:
- test_openai_mixin.py: Use OpenAIChatCompletionRequestParams
This fixes test failures where the library client was trying to construct
Pydantic models with empty dictionaries.
The previous fix had a bug: it called _convert_body() which only keeps fields
that match function parameter names. For Pydantic methods with signature:
openai_chat_completion(params: OpenAIChatCompletionRequestParams)
The signature only has 'params', but the body has 'model', 'messages', etc.
So _convert_body() returned an empty dict.
Fix: Skip _convert_body() entirely for Pydantic params. Use the raw body
directly to construct the Pydantic model (after stripping NOT_GIVENs).
This properly fixes the ValidationError where required fields were missing.
The streaming code path (_call_streaming) had the same issue as non-streaming:
it called _convert_body() which returned empty dict for Pydantic params.
Applied the same fix as commit 7476c0ae:
- Detect Pydantic model parameters before body conversion
- Skip _convert_body() for Pydantic params
- Construct Pydantic model directly from raw body (after stripping NOT_GIVENs)
This fixes streaming endpoints like openai_chat_completion with stream=True.
The streaming code path (_call_streaming) had the same issue as non-streaming:
it called _convert_body() which returned empty dict for Pydantic params.
Applied the same fix as commit 7476c0ae:
- Detect Pydantic model parameters before body conversion
- Skip _convert_body() for Pydantic params
- Construct Pydantic model directly from raw body (after stripping NOT_GIVENs)
This fixes streaming endpoints like openai_chat_completion with stream=True.
# What does this PR do?
## Test Plan
# What does this PR do?
## Test Plan
# What does this PR do?
## Test Plan
Completes the refactoring started in previous commit by:
1. **Fix library client** (critical): Add logic to detect Pydantic model parameters
and construct them properly from request bodies. The key fix is to NOT exclude
any params when converting the body for Pydantic models - we need all fields
to pass to the Pydantic constructor.
Before: _convert_body excluded all params, leaving body empty for Pydantic construction
After: Check for Pydantic params first, skip exclusion, construct model with full body
2. **Update remaining providers** to use new Pydantic-based signatures:
- litellm_openai_mixin: Extract extra fields via __pydantic_extra__
- databricks: Use TYPE_CHECKING import for params type
- llama_openai_compat: Use TYPE_CHECKING import for params type
- sentence_transformers: Update method signatures to use params
3. **Update unit tests** to use new Pydantic signature:
- test_openai_mixin.py: Use OpenAIChatCompletionRequestParams
This fixes test failures where the library client was trying to construct
Pydantic models with empty dictionaries.
The previous fix had a bug: it called _convert_body() which only keeps fields
that match function parameter names. For Pydantic methods with signature:
openai_chat_completion(params: OpenAIChatCompletionRequestParams)
The signature only has 'params', but the body has 'model', 'messages', etc.
So _convert_body() returned an empty dict.
Fix: Skip _convert_body() entirely for Pydantic params. Use the raw body
directly to construct the Pydantic model (after stripping NOT_GIVENs).
This properly fixes the ValidationError where required fields were missing.
The streaming code path (_call_streaming) had the same issue as non-streaming:
it called _convert_body() which returned empty dict for Pydantic params.
Applied the same fix as commit 7476c0ae:
- Detect Pydantic model parameters before body conversion
- Skip _convert_body() for Pydantic params
- Construct Pydantic model directly from raw body (after stripping NOT_GIVENs)
This fixes streaming endpoints like openai_chat_completion with stream=True.
The streaming code path (_call_streaming) had the same issue as non-streaming:
it called _convert_body() which returned empty dict for Pydantic params.
Applied the same fix as commit 7476c0ae:
- Detect Pydantic model parameters before body conversion
- Skip _convert_body() for Pydantic params
- Construct Pydantic model directly from raw body (after stripping NOT_GIVENs)
This fixes streaming endpoints like openai_chat_completion with stream=True.
# What does this PR do?
Adds traces around tool execution and mcp tool listing for better
observability.
Closes#3108
## Test Plan
Manually examined traces in jaeger to verify the added information was
available.
Signed-off-by: Gordon Sim <gsim@redhat.com>