The AuthenticationMiddleware was blocking all requests without an
Authorization header, including health and version endpoints that are
needed by monitoring tools, load balancers, and Kubernetes probes.
This commit allows endpoints ending in /health or /version to bypass
authentication, enabling operational tooling to function properly
without requiring credentials.
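A minimal sketch of the bypass check, assuming an ASGI-style middleware; the class name comes from the codebase, but the structure shown here is illustrative:
```python
# Illustrative ASGI-style sketch; the real middleware likely differs in detail.
class AuthenticationMiddleware:
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http":
            path = scope.get("path", "")
            # Let probes, load balancers, and monitoring tools reach health
            # and version endpoints without an Authorization header.
            if path.endswith("/health") or path.endswith("/version"):
                return await self.app(scope, receive, send)
            # ... otherwise validate the Authorization header as before ...
        return await self.app(scope, receive, send)
```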
Closes: #3735
Signed-off-by: Derek Higgins <derekh@redhat.com>
Implements usage accumulation in StreamingResponseOrchestrator.
The most important part was to pass `stream_options = { "include_usage":
true }` to the chat_completion call. This means I will have to re-record
all responses tests because the request hash will change :)
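A sketch of what the accumulation looks like, assuming an AsyncOpenAI-compatible client; only `stream_options` reflects the actual parameter that makes providers report token usage, the surrounding names are illustrative:
```python
# Sketch: request usage reporting, then pick it up from the trailing chunk.
async def stream_with_usage(client, model, messages):
    usage = None
    stream = await client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
        stream_options={"include_usage": True},
    )
    async for chunk in stream:
        # OpenAI sends usage on a final, otherwise-empty chunk when
        # include_usage is set; keep it for the terminal response event.
        if chunk.usage is not None:
            usage = chunk.usage
        yield chunk
```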
Test changes:
- Add usage assertions to streaming and non-streaming tests
- Update test recordings with actual usage data from OpenAI
Many changes to responses are landing that introduce fundamental new
types. This would mean re-recording even the inference calls, so let's
avoid that for now.
Once everything lands I will re-record everything, make the tests pass,
and re-enable them.
# What does this PR do?
When a previous response is linked, this PR checks whether it contains
mcp_list_tools objects that can be reused instead of listing the tools
again on every request.
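A rough sketch of the reuse check; the field and helper names here are illustrative, not necessarily the ones used in the implementation:
```python
# Illustrative sketch: reuse the mcp_list_tools output attached to the
# previous response instead of calling the MCP server's list_tools again.
def find_reusable_list_tools(previous_response, server_label):
    if previous_response is None:
        return None
    for item in previous_response.output:
        if getattr(item, "type", None) == "mcp_list_tools" and item.server_label == server_label:
            return item  # cached tool listing for this server
    return None
```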
Closes #3106
## Test Plan
Tested manually.
Added unit tests to cover new behaviour.
---------
Signed-off-by: Gordon Sim <gsim@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Use `SecretStr` for OpenAIMixin providers:
- RemoteInferenceProviderConfig now has auth_credential: SecretStr (see the sketch after this list)
- the default alias is api_key (most common name)
- some providers override to use api_token (RunPod, vLLM, Databricks)
- some providers exclude it (Ollama, TGI, Vertex AI)
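A minimal sketch of what the shared config could look like with Pydantic; the exact base class and alias wiring may differ:
```python
from pydantic import BaseModel, Field, SecretStr


class RemoteInferenceProviderConfig(BaseModel):
    # SecretStr keeps the credential masked in repr()/logs; the alias is
    # "api_key" by default, and individual providers can override it
    # (e.g. "api_token") or exclude the field entirely.
    auth_credential: SecretStr | None = Field(default=None, alias="api_key")
```
Reading the value then goes through `auth_credential.get_secret_value()`, which is the point of the change: the raw key never shows up in `repr()` or serialized config dumps.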
addresses #3517
## Test Plan
ci w/ new tests
Updated scripts/normalize_recordings.py to dynamically find and process
all 'recordings' directories under tests/ using Path.rglob() instead of
hardcoding a single path.
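Roughly the pattern used (a sketch; `normalize_recordings` stands in for whatever the script does per directory):
```python
from pathlib import Path

# Walk tests/ for every directory literally named "recordings" rather than
# assuming a single hardcoded location.
for recordings_dir in sorted(Path("tests").rglob("recordings")):
    if recordings_dir.is_dir():
        normalize_recordings(recordings_dir)  # hypothetical per-directory helper
```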
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Allows the model check to fail gracefully instead of crashing on startup.
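Conceptually the change is just to catch and log the failure instead of letting it propagate; this is an illustrative sketch, not the exact code:
```python
import logging

logger = logging.getLogger(__name__)


async def list_provider_model_ids_safe(adapter):
    # Sketch: a failing model listing (e.g. an OAuth redirect instead of a
    # model list) is logged and tolerated rather than aborting startup.
    try:
        return await adapter.list_provider_model_ids()
    except Exception as exc:
        logger.error(f"{type(adapter).__name__}.list_provider_model_ids() failed with: {exc}")
        return []
```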
## Test Plan
Set VLLM_URL to point at your vLLM server, then start the stack:
```
(base) akram@Mac llama-stack % LAMA_STACK_LOGGING="all=debug" VLLM_ENABLE_MODEL_DISCOVERY=false MILVUS_DB_PATH=./milvus.db INFERENCE_MODEL=vllm uv run --with llama-stack llama stack build --distro starter --image-type venv --run
```
```
INFO 2025-10-08 20:11:24,637 llama_stack.providers.utils.inference.inference_store:74 inference: Write queue disabled for SQLite to avoid concurrency issues
INFO 2025-10-08 20:11:24,866 llama_stack.providers.utils.responses.responses_store:96 openai_responses: Write queue disabled for SQLite to avoid concurrency issues
ERROR 2025-10-08 20:11:26,160 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: VLLMInferenceAdapter.list_provider_model_ids() failed with: <a
href="https://oauth.akram.a1ey.p3.openshiftapps.com:443/oauth/authorize?approval_prompt=force&client_id=system%3Aserviceaccount%3Arhoai-30-genai%3Adefault&redirect_uri=ht
tps%3A%2F%2Fvllm-rhoai-30-genai.apps.rosa.akram.a1ey.p3.openshiftapps.com%2Foauth%2Fcallback&response_type=code&scope=user%3Ainfo+user%3Acheck-access&state=9fba207425
5851c718aca717a5887d76%3A%2Fmodels">Found</a>.
[...]
INFO 2025-10-08 20:11:26,295 uvicorn.error:84 uncategorized: Started server process [83144]
INFO 2025-10-08 20:11:26,296 uvicorn.error:48 uncategorized: Waiting for application startup.
INFO 2025-10-08 20:11:26,297 llama_stack.core.server.server:170 core::server: Starting up
INFO 2025-10-08 20:11:26,297 llama_stack.core.stack:399 core: starting registry refresh task
INFO 2025-10-08 20:11:26,311 uvicorn.error:62 uncategorized: Application startup complete.
INFO 2025-10-08 20:11:26,312 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
ERROR 2025-10-08 20:11:26,791 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: VLLMInferenceAdapter.list_provider_model_ids() failed with: <a
href="https://oauth.akram.a1ey.p3.openshiftapps.com:443/oauth/authorize?approval_prompt=force&client_id=system%3Aserviceaccount%3Arhoai-30-genai%3Adefault&redirect_uri=ht
tps%3A%2F%2Fvllm-rhoai-30-genai.apps.rosa.akram.a1ey.p3.openshiftapps.com%2Foauth%2Fcallback&response_type=code&scope=user%3Ainfo+user%3Acheck-access&state=8ef0cba3e1
71a4f8b04cb445cfb91a4c%3A%2Fmodels">Found</a>.
```
## Summary
Adds OpenAI-compatible usage tracking types to enable reporting token
consumption for both streaming and non-streaming responses.
## Type Definitions
**Chat Completion Usage** (inference API):
```python
class OpenAIChatCompletionUsage(BaseModel):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    prompt_tokens_details: OpenAIChatCompletionUsagePromptTokensDetails | None
    completion_tokens_details: OpenAIChatCompletionUsageCompletionTokensDetails | None
```
**Response Usage** (responses API):
```python
class OpenAIResponseUsage(BaseModel):
    input_tokens: int
    output_tokens: int
    total_tokens: int
    input_tokens_details: OpenAIResponseUsageInputTokensDetails | None
    output_tokens_details: OpenAIResponseUsageOutputTokensDetails | None
```
This matches OpenAI's usage reporting format and enables PR #3766 to
implement usage tracking in streaming responses.
Co-authored-by: Claude <noreply@anthropic.com>
## Summary
After removing the model management CLI in #3700, this PR updates
remaining references to the old `llama download` command to use
`huggingface-cli download` instead.
## Changes
- Updated error messages in `meta_reference/common.py` to recommend
`huggingface-cli download`
- Updated error messages in
`torchtune/recipes/lora_finetuning_single_device.py` to use
`huggingface-cli download`
- Updated post-training notebook to use `huggingface-cli download`
instead of `llama download`
- Fixed typo: "you model" -> "your model"
## Test Plan
- Verified error messages provide correct guidance for users
- Checked that notebook instructions are up-to-date with current tooling
## Summary
Fixes #2990
Remote provider authentication errors (401/403) were being converted to
500 Internal Server Error, preventing users from understanding why their
requests failed.
## The Problem
When a request with an invalid API key was sent to a remote provider:
- The provider correctly returned 401 with error details
- Llama Stack's `translate_exception()` didn't recognize provider SDK
exceptions
- The error fell through to the generic 500 handler
- The user received: "Internal server error: An unexpected error occurred."
## The Fix
Added a handler in `translate_exception()` that checks for exceptions
with a `status_code` attribute and preserves the original HTTP status
code and error message.
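Conceptually, the new branch looks something like this (a sketch; the real function handles more cases):
```python
from fastapi import HTTPException


def translate_exception(exc: Exception) -> HTTPException:
    # Provider SDK errors (e.g. openai.AuthenticationError) carry the
    # upstream HTTP status; preserve it instead of collapsing to 500.
    status_code = getattr(exc, "status_code", None)
    if isinstance(status_code, int):
        return HTTPException(status_code=status_code, detail=str(exc))
    # ... existing handling, falling back to a generic 500 ...
    return HTTPException(status_code=500, detail="Internal server error: An unexpected error occurred.")
```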
**Before:**
```json
HTTP 500
{"detail": "Internal server error: An unexpected error occurred."}
```
**After:**
```json
HTTP 401
{"detail": "Error code: 401 - {'error': {'message': 'Invalid API Key', 'type': 'invalid_request_error', 'code': 'invalid_api_key'}}"}
```
## Tested With
- ✅ groq: 401 "Invalid API Key"
- ✅ openai: 401 "Incorrect API key provided"
- ✅ together: 401 "Invalid API key provided"
- ✅ fireworks: 403 "unauthorized"
## Test Plan
**Automated test script:**
https://gist.github.com/ashwinb/1199dd7585ffa3f4be67b111cc65f2f3
The test script:
1. Builds separate stacks for each provider
2. Registers models (with validation temporarily disabled for testing)
3. Sends requests with invalid API keys via `x-llamastack-provider-data`
header
4. Verifies HTTP status codes are 401/403 (not 500)
**Results before fix:** All providers returned 500
**Results after fix:** All providers correctly return 401/403
**Manual verification:**
```bash
# 1. Build stack
llama stack build --image-type venv --providers inference=remote::groq
# 2. Start stack
llama stack run
# 3. Send request with invalid API key
curl http://localhost:8321/v1/chat/completions \
-H "Content-Type: application/json" \
-H 'x-llamastack-provider-data: {"groq_api_key": "invalid-key"}' \
-d '{"model": "groq/llama3-70b-8192", "messages": [{"role": "user", "content": "test"}]}'
# Expected: HTTP 401 with provider error message (not 500)
```
## Impact
- Works with all remote providers using OpenAI SDK (groq, openai,
together, fireworks, etc.)
- Works with any provider SDK that follows the pattern of exceptions
with a `status_code` attribute
- No breaking changes - only affects error responses
# What does this PR do?
Objects (vector DBs, models, scoring functions, etc.) have an identifier
and associated object values.
We allow exact duplicate registrations.
We reject registrations when the identifier already exists and the
associated object values differ.
Note: models are namespaced, i.e. {provider_id}/{identifier}, while other
object types are not.
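A condensed sketch of the rule (illustrative names; the actual check lives in the registry/routing-table code):
```python
# Sketch: re-registering an identical object is a no-op; a conflicting
# registration under the same identifier is rejected.
async def register_object(registry, obj):
    existing = await registry.get(obj.type, obj.identifier)
    if existing is not None:
        if existing == obj:
            return existing  # exact duplicate: allowed
        raise ValueError(f"Object '{obj.identifier}' already exists with different values")
    return await registry.register(obj)
```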
## Test Plan
ci w/ new tests
This change removes the `llama model` and `llama download` subcommands
from the CLI, replacing them with recommendations to use the Hugging
Face CLI instead.
Rationale for this change:
- The model management functionality was largely duplicating what
Hugging Face CLI already provides, leading to unnecessary maintenance
overhead (except the download source from Meta?)
- Maintaining our own implementation required fixing bugs and keeping up
with changes in model repositories and download mechanisms
- The Hugging Face CLI is more mature, widely adopted, and better
maintained
- This allows us to focus on the core Llama Stack functionality rather
than reimplementing model management tools
Changes made:
- Removed all model-related CLI commands and their implementations
- Updated documentation to recommend using `huggingface-cli` for model
downloads
- Removed Meta-specific download logic and statements
- Simplified the CLI to focus solely on stack management operations
Users should now use:
- `huggingface-cli download` for downloading models
- `huggingface-cli scan-cache` for listing downloaded models
This is a breaking change as it removes previously available CLI
commands.
Signed-off-by: Sébastien Han <seb@redhat.com>