Renames `inference_recorder.py` to `api_recorder.py` and extends it to
support recording/replaying tool invocations in addition to inference
calls.
This allows us to record tool calls (web search, etc.) and then replay those recordings for `tests/integration/responses`.
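For illustration, a minimal sketch of the record-or-replay idea (the function and directory names below are hypothetical, not the actual `api_recorder.py` code):
```python
import hashlib
import json
from pathlib import Path

RECORDINGS_DIR = Path("tests/integration/responses/recordings")  # assumed location

def _request_key(endpoint: str, payload: dict) -> str:
    """Stable hash of the request so a replay can find its recording."""
    canonical = json.dumps({"endpoint": endpoint, "payload": payload}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def record_or_replay(endpoint: str, payload: dict, call_real_api):
    """Replay a stored response if one exists; otherwise call the real API and record it."""
    path = RECORDINGS_DIR / f"{_request_key(endpoint, payload)}.json"
    if path.exists():  # replay mode
        return json.loads(path.read_text())
    response = call_real_api(endpoint, payload)  # e.g. a web-search tool invocation
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(response))
    return response
```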
## Test Plan
```
export OPENAI_API_KEY=...
export TAVILY_SEARCH_API_KEY=...
./scripts/integration-tests.sh --stack-config ci-tests \
--suite responses --inference-mode record-if-missing
```
# What does this PR do?
Completes the refactoring started in the previous commit by:
1. **Fix library client** (critical): Add logic to detect Pydantic model parameters
and construct them properly from request bodies. The key fix is to NOT exclude
any params when converting the body for Pydantic models - we need all fields
to pass to the Pydantic constructor.
Before: _convert_body excluded all params, leaving body empty for Pydantic construction
After: Check for Pydantic params first, skip exclusion, construct model with full body
2. **Update remaining providers** to use new Pydantic-based signatures:
- litellm_openai_mixin: Extract extra fields via __pydantic_extra__
- databricks: Use TYPE_CHECKING import for params type
- llama_openai_compat: Use TYPE_CHECKING import for params type
- sentence_transformers: Update method signatures to use params
3. **Update unit tests** to use new Pydantic signature:
- test_openai_mixin.py: Use OpenAIChatCompletionRequestParams
This fixes test failures where the library client was trying to construct
Pydantic models with empty dictionaries.
The previous fix had a bug: it called _convert_body() which only keeps fields
that match function parameter names. For Pydantic methods with the signature
`openai_chat_completion(params: OpenAIChatCompletionRequestParams)`,
the signature only has 'params', but the body has 'model', 'messages', etc.
So _convert_body() returned an empty dict.
Fix: Skip _convert_body() entirely for Pydantic params. Use the raw body
directly to construct the Pydantic model (after stripping NOT_GIVENs).
This properly fixes the ValidationError where required fields were missing.
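A condensed sketch of the non-streaming fix, assuming a `NOT_GIVEN` sentinel and a `_convert_body()` helper shaped like the description above (the real library-client code differs in detail):
```python
import inspect
from pydantic import BaseModel

NOT_GIVEN = object()  # stand-in for the client's NOT_GIVEN sentinel

def _convert_body(func, body: dict) -> dict:
    """Old path: keep only fields whose names match the function's parameters."""
    names = set(inspect.signature(func).parameters)
    return {k: v for k, v in body.items() if k in names and v is not NOT_GIVEN}

def prepare_kwargs(func, body: dict) -> dict:
    params = list(inspect.signature(func).parameters.values())
    # Pydantic case, e.g. openai_chat_completion(params: OpenAIChatCompletionRequestParams):
    # the signature only has 'params', so _convert_body() would return an empty dict.
    if len(params) == 1 and isinstance(params[0].annotation, type) and issubclass(params[0].annotation, BaseModel):
        clean = {k: v for k, v in body.items() if v is not NOT_GIVEN}  # strip NOT_GIVENs only
        return {params[0].name: params[0].annotation(**clean)}  # construct from the full body
    return _convert_body(func, body)
```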
The streaming code path (_call_streaming) had the same issue as non-streaming:
it called _convert_body() which returned empty dict for Pydantic params.
Applied the same fix as commit 7476c0ae:
- Detect Pydantic model parameters before body conversion
- Skip _convert_body() for Pydantic params
- Construct Pydantic model directly from raw body (after stripping NOT_GIVENs)
This fixes streaming endpoints like openai_chat_completion with stream=True.
# What does this PR do?
Adds traces around tool execution and MCP tool listing for better observability.
Closes #3108
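As a rough illustration of the shape of the change (llama-stack has its own telemetry helpers; the OpenTelemetry API and the `invoke_tool` signature below are assumptions used only to sketch the idea):
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def invoke_tool_with_trace(tool_runtime, tool_name: str, kwargs: dict):
    # Wrap the tool invocation in a span so the tool name and argument count
    # show up alongside the request trace in Jaeger.
    with tracer.start_as_current_span("execute_tool") as span:
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("tool.num_args", len(kwargs))
        return await tool_runtime.invoke_tool(tool_name=tool_name, kwargs=kwargs)
```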
## Test Plan
Manually examined traces in Jaeger to verify the added information was available.
Signed-off-by: Gordon Sim <gsim@redhat.com>
Adds a `--collect-only` flag to `scripts/integration-tests.sh` that skips server startup and passes the flag to pytest for test collection only. When specified, only minimal flags are required (no `--stack-config` or `--setup` needed).
## Changes
- Added `--collect-only` flag that skips server startup
- Made `--stack-config` and `--setup` optional when using
`--collect-only`
- Skip `llama` command check when collecting tests only
## Usage
```bash
# Collect tests without starting server
./scripts/integration-tests.sh --subdirs inference --collect-only
```
# What does this PR do?
Removing Weaviate, Postgres, and Milvus unit tests
## Test Plan
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Propagate test IDs from client to server via HTTP headers to maintain proper test isolation when running with server-based stack configs. Without this, recorded/replayed inference requests in server mode would leak across tests.
Changes:
- Patch client _prepare_request to inject the test ID into the provider data
header (sketched after this list)
- Sync test context from provider data on server side before storage
operations
- Set LLAMA_STACK_TEST_STACK_CONFIG_TYPE env var based on stack config
- Configure console width for cleaner log output in CI
- Add SQLITE_STORE_DIR temp directory for test data isolation
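A rough sketch of the client-side patch mentioned above (the header name follows the provider-data convention used elsewhere in the project; the `__test_id` key and dict-like `request.headers` access are assumptions):
```python
import functools
import json

TEST_ID_HEADER = "X-LlamaStack-Provider-Data"

def patch_prepare_request(client, test_id: str) -> None:
    original = client._prepare_request

    @functools.wraps(original)
    def wrapper(request):
        # Merge the test ID into the provider-data header so the server can
        # restore the test context before any storage operation.
        provider_data = json.loads(request.headers.get(TEST_ID_HEADER, "{}"))
        provider_data["__test_id"] = test_id  # illustrative key name
        request.headers[TEST_ID_HEADER] = json.dumps(provider_data)
        return original(request)

    client._prepare_request = wrapper
```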
# What does this PR do?
It prevents a tool call message from being added to the chat completion messages without a corresponding tool call result, which is needed when an approval is required first or when the approval request is denied. In both of these cases the tool call message is popped off the next turn's messages.
Closes #3728
## Test Plan
Ran the integration tests
Manual check of both approval and denial against gpt-4o
Signed-off-by: Gordon Sim <gsim@redhat.com>
# What does this PR do?
- The watsonx.ai provider now uses the LiteLLM mixin instead of using
IBM's library, which does not seem to be working (see #3165 for
context).
- The watsonx.ai provider now lists all the models available by calling the
watsonx.ai server instead of having a hard-coded list of known models (that
list gets out of date quickly).
- An edge case in
[llama_stack/core/routers/inference.py](https://github.com/llamastack/llama-stack/pull/3674/files#diff-a34bc966ed9befd9f13d4883c23705dff49be0ad6211c850438cdda6113f3455)
is addressed that was causing my manual tests to fail.
- Fixes `b64_encode_openai_embeddings_response`, which was trying to enumerate
over a dictionary and then reference elements of the dictionary using `.field`
instead of `["field"]` (see the sketch after this list). That method is called
by the LiteLLM mixin for embedding models, so it is needed to get the
watsonx.ai embedding models to work.
- A unit test along the lines of the one in #3348 is added. A more
comprehensive plan for automatically testing the end-to-end
functionality for inference providers would be a good idea, but is out
of scope for this PR.
- Updates to the watsonx distribution. Some were in response to the
switch to LiteLLM (e.g., updating the Python packages needed). Others
seem to be things that were already broken that I found along the way
(e.g., a reference to a watsonx specific doc template that doesn't seem
to exist).
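A simplified sketch of the corrected access pattern in `b64_encode_openai_embeddings_response` (not the actual implementation; it only illustrates the list iteration and `["field"]` lookups described above):
```python
import base64
import struct

def b64_encode_openai_embeddings_response(response: dict) -> list[dict]:
    encoded = []
    # Iterate over the list under the "data" key rather than the dict itself,
    # and use ["field"] lookups instead of .field attribute access.
    for i, item in enumerate(response["data"]):
        floats = item["embedding"]
        packed = struct.pack(f"{len(floats)}f", *floats)
        encoded.append({
            "object": "embedding",
            "index": i,
            "embedding": base64.b64encode(packed).decode("utf-8"),
        })
    return encoded
```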
Closes #3165
Also, it is related to a line item in #3387 but doesn't really address
that goal (because it uses the LiteLLM mixin, not the OpenAI one). I
tried the OpenAI one and it doesn't work with watsonx.ai, presumably
because the watsonx.ai service is not OpenAI compatible. It works with
LiteLLM because LiteLLM has a provider implementation for watsonx.ai.
## Test Plan
The test script below goes back and forth between the OpenAI and watsonx
providers. The idea is that the OpenAI provider shows how it should work
and then the watsonx provider output shows that it is also working with
watsonx. Note that the result from the MCP test is not as good (the
Llama 3.3 70b model does not choose tools as wisely as gpt-4o), but it
is still working and providing a valid response. For more details on
setup and the MCP server being used for testing, see [the AI Alliance
sample
notebook](https://github.com/The-AI-Alliance/llama-stack-examples/blob/main/notebooks/01-responses/)
that these examples are drawn from.
```python
#!/usr/bin/env python3

import json
from llama_stack_client import LlamaStackClient
from litellm import completion
import http.client


def print_response(response):
    """Print response in a nicely formatted way"""
    print(f"ID: {response.id}")
    print(f"Status: {response.status}")
    print(f"Model: {response.model}")
    print(f"Created at: {response.created_at}")
    print(f"Output items: {len(response.output)}")

    for i, output_item in enumerate(response.output):
        if len(response.output) > 1:
            print(f"\n--- Output Item {i+1} ---")
        print(f"Output type: {output_item.type}")

        if output_item.type in ("text", "message"):
            print(f"Response content: {output_item.content[0].text}")
        elif output_item.type == "file_search_call":
            print(f" Tool Call ID: {output_item.id}")
            print(f" Tool Status: {output_item.status}")
            # 'queries' is a list, so we join it for clean printing
            print(f" Queries: {', '.join(output_item.queries)}")
            # Display results if they exist, otherwise note they are empty
            print(f" Results: {output_item.results if output_item.results else 'None'}")
        elif output_item.type == "mcp_list_tools":
            print_mcp_list_tools(output_item)
        elif output_item.type == "mcp_call":
            print_mcp_call(output_item)
        else:
            print(f"Response content: {output_item.content}")


def print_mcp_call(mcp_call):
    """Print MCP call in a nicely formatted way"""
    print(f"\n🛠️ MCP Tool Call: {mcp_call.name}")
    print(f" Server: {mcp_call.server_label}")
    print(f" ID: {mcp_call.id}")
    print(f" Arguments: {mcp_call.arguments}")

    if mcp_call.error:
        print(f"Error: {mcp_call.error}")
    elif mcp_call.output:
        print("Output:")
        # Try to format JSON output nicely
        try:
            parsed_output = json.loads(mcp_call.output)
            print(json.dumps(parsed_output, indent=4))
        except (json.JSONDecodeError, TypeError):
            # If not valid JSON, print as-is
            print(f" {mcp_call.output}")
    else:
        print(" ⏳ No output yet")


def print_mcp_list_tools(mcp_list_tools):
    """Print MCP list tools in a nicely formatted way"""
    print(f"\n🔧 MCP Server: {mcp_list_tools.server_label}")
    print(f" ID: {mcp_list_tools.id}")
    print(f" Available Tools: {len(mcp_list_tools.tools)}")
    print("=" * 80)

    for i, tool in enumerate(mcp_list_tools.tools, 1):
        print(f"\n{i}. {tool.name}")
        print(f" Description: {tool.description}")

        # Parse and display input schema
        schema = tool.input_schema
        if schema and 'properties' in schema:
            properties = schema['properties']
            required = schema.get('required', [])
            print(" Parameters:")
            for param_name, param_info in properties.items():
                param_type = param_info.get('type', 'unknown')
                param_desc = param_info.get('description', 'No description')
                required_marker = " (required)" if param_name in required else " (optional)"
                print(f" • {param_name} ({param_type}){required_marker}")
                if param_desc:
                    print(f" {param_desc}")

        if i < len(mcp_list_tools.tools):
            print("-" * 40)


def main():
    """Main function to run all the tests"""
    # Configuration
    LLAMA_STACK_URL = "http://localhost:8321/"
    LLAMA_STACK_MODEL_IDS = [
        "openai/gpt-3.5-turbo",
        "openai/gpt-4o",
        "llama-openai-compat/Llama-3.3-70B-Instruct",
        "watsonx/meta-llama/llama-3-3-70b-instruct"
    ]

    # Using gpt-4o for this demo, but feel free to try one of the others or add more to run.yaml.
    OPENAI_MODEL_ID = LLAMA_STACK_MODEL_IDS[1]
    WATSONX_MODEL_ID = LLAMA_STACK_MODEL_IDS[-1]
    NPS_MCP_URL = "http://localhost:3005/sse/"

    print("=== Llama Stack Testing Script ===")
    print(f"Using OpenAI model: {OPENAI_MODEL_ID}")
    print(f"Using WatsonX model: {WATSONX_MODEL_ID}")
    print(f"MCP URL: {NPS_MCP_URL}")
    print()

    # Initialize client
    print("Initializing LlamaStackClient...")
    client = LlamaStackClient(base_url="http://localhost:8321")

    # Test 1: List models
    print("\n=== Test 1: List Models ===")
    try:
        models = client.models.list()
        print(f"Found {len(models)} models")
    except Exception as e:
        print(f"Error listing models: {e}")
        raise e

    # Test 2: Basic chat completion with OpenAI
    print("\n=== Test 2: Basic Chat Completion (OpenAI) ===")
    try:
        chat_completion_response = client.chat.completions.create(
            model=OPENAI_MODEL_ID,
            messages=[{"role": "user", "content": "What is the capital of France?"}]
        )
        print("OpenAI Response:")
        for chunk in chat_completion_response.choices[0].message.content:
            print(chunk, end="", flush=True)
        print()
    except Exception as e:
        print(f"Error with OpenAI chat completion: {e}")
        raise e

    # Test 3: Basic chat completion with WatsonX
    print("\n=== Test 3: Basic Chat Completion (WatsonX) ===")
    try:
        chat_completion_response_wxai = client.chat.completions.create(
            model=WATSONX_MODEL_ID,
            messages=[{"role": "user", "content": "What is the capital of France?"}],
        )
        print("WatsonX Response:")
        for chunk in chat_completion_response_wxai.choices[0].message.content:
            print(chunk, end="", flush=True)
        print()
    except Exception as e:
        print(f"Error with WatsonX chat completion: {e}")
        raise e

    # Test 4: Tool calling with OpenAI
    print("\n=== Test 4: Tool Calling (OpenAI) ===")
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather for a specific location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g., San Francisco, CA",
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"]
                        },
                    },
                    "required": ["location"],
                },
            },
        }
    ]
    messages = [
        {"role": "user", "content": "What's the weather like in Boston, MA?"}
    ]

    try:
        print("--- Initial API Call ---")
        response = client.chat.completions.create(
            model=OPENAI_MODEL_ID,
            messages=messages,
            tools=tools,
            tool_choice="auto",  # "auto" is the default
        )
        print("OpenAI tool calling response received")
    except Exception as e:
        print(f"Error with OpenAI tool calling: {e}")
        raise e

    # Test 5: Tool calling with WatsonX
    print("\n=== Test 5: Tool Calling (WatsonX) ===")
    try:
        wxai_response = client.chat.completions.create(
            model=WATSONX_MODEL_ID,
            messages=messages,
            tools=tools,
            tool_choice="auto",  # "auto" is the default
        )
        print("WatsonX tool calling response received")
    except Exception as e:
        print(f"Error with WatsonX tool calling: {e}")
        raise e

    # Test 6: Streaming with WatsonX
    print("\n=== Test 6: Streaming Response (WatsonX) ===")
    try:
        chat_completion_response_wxai_stream = client.chat.completions.create(
            model=WATSONX_MODEL_ID,
            messages=[{"role": "user", "content": "What is the capital of France?"}],
            stream=True
        )
        print("Model response: ", end="")
        for chunk in chat_completion_response_wxai_stream:
            # Each 'chunk' is a ChatCompletionChunk object.
            # We want the content from the 'delta' attribute.
            if hasattr(chunk, 'choices') and chunk.choices is not None:
                content = chunk.choices[0].delta.content
                # The first few chunks might have None content, so we check for it.
                if content is not None:
                    print(content, end="", flush=True)
        print()
    except Exception as e:
        print(f"Error with streaming: {e}")
        raise e

    # Test 7: MCP with OpenAI
    print("\n=== Test 7: MCP Integration (OpenAI) ===")
    try:
        mcp_llama_stack_client_response = client.responses.create(
            model=OPENAI_MODEL_ID,
            input="Tell me about some parks in Rhode Island, and let me know if there are any upcoming events at them.",
            tools=[
                {
                    "type": "mcp",
                    "server_url": NPS_MCP_URL,
                    "server_label": "National Parks Service tools",
                    "allowed_tools": ["search_parks", "get_park_events"],
                }
            ]
        )
        print_response(mcp_llama_stack_client_response)
    except Exception as e:
        print(f"Error with MCP (OpenAI): {e}")
        raise e

    # Test 8: MCP with WatsonX
    print("\n=== Test 8: MCP Integration (WatsonX) ===")
    try:
        mcp_llama_stack_client_response = client.responses.create(
            model=WATSONX_MODEL_ID,
            input="What is the capital of France?"
        )
        print_response(mcp_llama_stack_client_response)
    except Exception as e:
        print(f"Error with MCP (WatsonX): {e}")
        raise e

    # Test 9: MCP with Llama 3.3
    print("\n=== Test 9: MCP Integration (Llama 3.3) ===")
    try:
        mcp_llama_stack_client_response = client.responses.create(
            model=WATSONX_MODEL_ID,
            input="Tell me about some parks in Rhode Island, and let me know if there are any upcoming events at them.",
            tools=[
                {
                    "type": "mcp",
                    "server_url": NPS_MCP_URL,
                    "server_label": "National Parks Service tools",
                    "allowed_tools": ["search_parks", "get_park_events"],
                }
            ]
        )
        print_response(mcp_llama_stack_client_response)
    except Exception as e:
        print(f"Error with MCP (Llama 3.3): {e}")
        raise e

    # Test 10: Embeddings
    print("\n=== Test 10: Embeddings ===")
    try:
        conn = http.client.HTTPConnection("localhost:8321")
        payload = json.dumps({
            "model": "watsonx/ibm/granite-embedding-278m-multilingual",
            "input": "Hello, world!",
        })
        headers = {
            'Content-Type': 'application/json',
            'Accept': 'application/json'
        }
        conn.request("POST", "/v1/openai/v1/embeddings", payload, headers)
        res = conn.getresponse()
        data = res.read()
        print(data.decode("utf-8"))
    except Exception as e:
        print(f"Error with Embeddings: {e}")
        raise e

    print("\n=== Testing Complete ===")


if __name__ == "__main__":
    main()
```
---------
Signed-off-by: Bill Murdock <bmurdock@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Bumps [actions/stale](https://github.com/actions/stale) from 10.0.0 to
10.1.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/actions/stale/releases">actions/stale's
releases</a>.</em></p>
<blockquote>
<h2>v10.1.0</h2>
<h2>What's Changed</h2>
<ul>
<li>Add <code>only-issue-types</code> option to filter issues by type by
<a href="https://github.com/Bibo-Joshi"><code>@Bibo-Joshi</code></a> in
<a
href="https://redirect.github.com/actions/stale/pull/1255">actions/stale#1255</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a
href="https://github.com/Bibo-Joshi"><code>@Bibo-Joshi</code></a> made
their first contribution in <a
href="https://redirect.github.com/actions/stale/pull/1255">actions/stale#1255</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/actions/stale/compare/v10...v10.1.0">https://github.com/actions/stale/compare/v10...v10.1.0</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="5f858e3efb"><code>5f858e3</code></a>
Add <code>only-issue-types</code> option to filter issues by type (<a
href="https://redirect.github.com/actions/stale/issues/1255">#1255</a>)</li>
<li>See full diff in <a
href="3a9db7e6a4...5f858e3efb">compare
view</a></li>
</ul>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
When the user wants to change the attributes (which could include model name, dimensions, etc.) of an already registered provider, they will get an error message asking that they first unregister the provider before registering a new one.
# What does this PR do?
This PR updates the register function to raise an error when the user attempts to register a provider that was already registered, asking them to unregister the existing provider first.
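A minimal sketch of the behaviour, assuming a simple in-memory registry (class and error names are illustrative, not the actual registry code):
```python
class ProviderAlreadyRegisteredError(ValueError):
    pass

class Registry:
    def __init__(self) -> None:
        self._providers: dict[str, dict] = {}

    def register(self, provider_id: str, attributes: dict) -> None:
        # Refuse to silently overwrite an existing registration.
        if provider_id in self._providers:
            raise ProviderAlreadyRegisteredError(
                f"Provider '{provider_id}' is already registered; "
                "unregister it before registering it with new attributes."
            )
        self._providers[provider_id] = attributes

    def unregister(self, provider_id: str) -> None:
        self._providers.pop(provider_id, None)
```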
Closes #2313
## Test Plan
Tested the change with /tests/unit/registry/test_registry.py
---------
Co-authored-by: Omar Abdelwahab <omara@fb.com>
# What does this PR do?
Users can simply set env vars at the beginning of the command: `FOO=BAR llama stack run ...`
## Test Plan
Run:
`TELEMETRY_SINKS=coneol uv run --with llama-stack llama stack build --distro=starter --image-type=venv --run`
---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/llamastack/llama-stack/pull/3711).
* #3714
* __->__ #3711
# What does this PR do?
Removes some dead code found by vulture; checked with Claude that there are no remaining references or imports for these.
## Test Plan
CI
# What does this PR do?
(Used Claude to solve #3715; coded with Claude but tested by me)
## From Claude's summary:
**Problem**: The `NVIDIAInferenceAdapter` class was missing the
`alias_to_provider_id_map` attribute, which caused the error:
`ERROR 'NVIDIAInferenceAdapter' object has no attribute
'alias_to_provider_id_map'`
**Root Cause**: The `NVIDIAInferenceAdapter` only inherited from
`OpenAIMixin`, but some parts of the system expected it to have the
`alias_to_provider_id_map` attribute, which is provided by the
`ModelRegistryHelper` class.
**Solution**:
1. **Added ModelRegistryHelper import**: Imported the
`ModelRegistryHelper` class from
`llama_stack.providers.utils.inference.model_registry`
2. **Updated inheritance**: Changed the class declaration to inherit
from both `OpenAIMixin` and `ModelRegistryHelper`
3. **Added proper initialization**: Added an `__init__` method that
properly initializes the `ModelRegistryHelper` with empty model entries
(since NVIDIA uses dynamic model discovery) and the allowed models from
the configuration
**Key Changes**:
* Added `from llama_stack.providers.utils.inference.model_registry
import ModelRegistryHelper`
* Changed class declaration from `class
NVIDIAInferenceAdapter(OpenAIMixin):` to `class
NVIDIAInferenceAdapter(OpenAIMixin, ModelRegistryHelper):`
* Added `__init__` method that calls `ModelRegistryHelper.__init__(self,
model_entries=[], allowed_models=config.allowed_models)`
The inheritance order is important - `OpenAIMixin` comes first to ensure
its `check_model_availability()` method takes precedence over the
`ModelRegistryHelper` version, as mentioned in the class documentation.
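A condensed sketch of the resulting class (simplified; the `OpenAIMixin` import path and the exact `__init__` body are assumptions, see the PR diff for the real code):
```python
from llama_stack.providers.utils.inference.model_registry import ModelRegistryHelper
from llama_stack.providers.utils.inference.openai_mixin import OpenAIMixin  # assumed import path

class NVIDIAInferenceAdapter(OpenAIMixin, ModelRegistryHelper):
    def __init__(self, config) -> None:
        # Empty model_entries because NVIDIA discovers models dynamically;
        # allowed_models still comes from the provider configuration.
        ModelRegistryHelper.__init__(self, model_entries=[], allowed_models=config.allowed_models)
        self.config = config
```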
This fix ensures that the `NVIDIAInferenceAdapter` has the required
`alias_to_provider_id_map` attribute while maintaining all existing
functionality.
## Test Plan
Launched the llama-stack server successfully; see logs:
```
NVIDIA_API_KEY=dummy NVIDIA_BASE_URL=http://localhost:8912 llama stack run /home/nvidia/.llama/distributions/starter/starter-run.yaml --image-type venv &
[2] 3753042
(venv) nvidia@nv-meta-H100-testing-gpu01:~/kai/llama-stack$ WARNING 2025-10-07 00:29:09,848 root:266 uncategorized: Unknown logging category:
openai::conversations. Falling back to default 'root' level: 20
WARNING 2025-10-07 00:29:09,932 root:266 uncategorized: Unknown logging category: cli.
Falling back to default 'root' level: 20
INFO 2025-10-07 00:29:09,937 llama_stack.core.utils.config_resolution:45 core:
Using file path: /home/nvidia/.llama/distributions/starter/starter-run.yaml
INFO 2025-10-07 00:29:09,937 llama_stack.cli.stack.run:136 cli: Using run
configuration: /home/nvidia/.llama/distributions/starter/starter-run.yaml
Using virtual environment: /home/nvidia/kai/venv
Virtual environment already activated
+ '[' -n /home/nvidia/.llama/distributions/starter/starter-run.yaml ']'
+ yaml_config_arg=/home/nvidia/.llama/distributions/starter/starter-run.yaml
+ llama stack run /home/nvidia/.llama/distributions/starter/starter-run.yaml --port 8321
WARNING 2025-10-07 00:29:11,432 root:266 uncategorized: Unknown logging category:
openai::conversations. Falling back to default 'root' level: 20
WARNING 2025-10-07 00:29:11,593 root:266 uncategorized: Unknown logging category: cli.
Falling back to default 'root' level: 20
INFO 2025-10-07 00:29:11,603 llama_stack.core.utils.config_resolution:45 core:
Using file path: /home/nvidia/.llama/distributions/starter/starter-run.yaml
INFO 2025-10-07 00:29:11,604 llama_stack.cli.stack.run:136 cli: Using run
configuration: /home/nvidia/.llama/distributions/starter/starter-run.yaml
INFO 2025-10-07 00:29:11,624 llama_stack.cli.stack.run:155 cli: No image type or
image name provided. Assuming environment packages.
INFO 2025-10-07 00:29:11,625 llama_stack.core.utils.config_resolution:45 core:
Using file path: /home/nvidia/.llama/distributions/starter/starter-run.yaml
INFO 2025-10-07 00:29:11,644 llama_stack.cli.stack.run:230 cli: HTTPS enabled with
certificates:
Key: None
Cert: None
INFO 2025-10-07 00:29:11,645 llama_stack.cli.stack.run:232 cli: Listening on ['::',
'0.0.0.0']:8321
INFO 2025-10-07 00:29:11,816 llama_stack.core.utils.config_resolution:45 core:
Using file path: /home/nvidia/.llama/distributions/starter/starter-run.yaml
INFO 2025-10-07 00:29:11,836 llama_stack.core.server.server:480 core::server: Run
configuration:
INFO 2025-10-07 00:29:11,845 llama_stack.core.server.server:483 core::server: apis:
- agents
- batches
- datasetio
- eval
- files
- inference
- post_training
- safety
- scoring
- telemetry
- tool_runtime
- vector_io
benchmarks: []
datasets: []
image_name: starter
inference_store:
db_path: /home/nvidia/.llama/distributions/starter/inference_store.db
type: sqlite
metadata_store:
db_path: /home/nvidia/.llama/distributions/starter/registry.db
type: sqlite
models: []
providers:
agents:
- config:
persistence_store:
db_path: /home/nvidia/.llama/distributions/starter/agents_store.db
type: sqlite
responses_store:
db_path: /home/nvidia/.llama/distributions/starter/responses_store.db
type: sqlite
provider_id: meta-reference
provider_type: inline::meta-reference
batches:
- config:
kvstore:
db_path: /home/nvidia/.llama/distributions/starter/batches.db
type: sqlite
provider_id: reference
provider_type: inline::reference
datasetio:
- config:
kvstore:
db_path:
/home/nvidia/.llama/distributions/starter/huggingface_datasetio.db
type: sqlite
provider_id: huggingface
provider_type: remote::huggingface
- config:
kvstore:
db_path:
/home/nvidia/.llama/distributions/starter/localfs_datasetio.db
type: sqlite
provider_id: localfs
provider_type: inline::localfs
eval:
- config:
kvstore:
db_path:
/home/nvidia/.llama/distributions/starter/meta_reference_eval.db
type: sqlite
provider_id: meta-reference
provider_type: inline::meta-reference
files:
- config:
metadata_store:
db_path: /home/nvidia/.llama/distributions/starter/files_metadata.db
type: sqlite
storage_dir: /home/nvidia/.llama/distributions/starter/files
provider_id: meta-reference-files
provider_type: inline::localfs
inference:
- config:
api_key: '********'
url: https://api.fireworks.ai/inference/v1
provider_id: fireworks
provider_type: remote::fireworks
- config:
api_key: '********'
url: https://api.together.xyz/v1
provider_id: together
provider_type: remote::together
- config: {}
provider_id: bedrock
provider_type: remote::bedrock
- config:
api_key: '********'
append_api_version: true
url: http://localhost:8912
provider_id: nvidia
provider_type: remote::nvidia
- config:
api_key: '********'
base_url: https://api.openai.com/v1
provider_id: openai
provider_type: remote::openai
- config:
api_key: '********'
provider_id: anthropic
provider_type: remote::anthropic
- config:
api_key: '********'
provider_id: gemini
provider_type: remote::gemini
- config:
api_key: '********'
url: https://api.groq.com
provider_id: groq
provider_type: remote::groq
- config:
api_key: '********'
url: https://api.sambanova.ai/v1
provider_id: sambanova
provider_type: remote::sambanova
- config: {}
provider_id: sentence-transformers
provider_type: inline::sentence-transformers
post_training:
- config:
checkpoint_format: meta
provider_id: torchtune-cpu
provider_type: inline::torchtune-cpu
safety:
- config:
excluded_categories: []
provider_id: llama-guard
provider_type: inline::llama-guard
- config: {}
provider_id: code-scanner
provider_type: inline::code-scanner
scoring:
- config: {}
provider_id: basic
provider_type: inline::basic
- config: {}
provider_id: llm-as-judge
provider_type: inline::llm-as-judge
- config:
openai_api_key: '********'
provider_id: braintrust
provider_type: inline::braintrust
telemetry:
- config:
service_name: "\u200B"
sinks: sqlite
sqlite_db_path: /home/nvidia/.llama/distributions/starter/trace_store.db
provider_id: meta-reference
provider_type: inline::meta-reference
tool_runtime:
- config:
api_key: '********'
max_results: 3
provider_id: brave-search
provider_type: remote::brave-search
- config:
api_key: '********'
max_results: 3
provider_id: tavily-search
provider_type: remote::tavily-search
- config: {}
provider_id: rag-runtime
provider_type: inline::rag-runtime
- config: {}
provider_id: model-context-protocol
provider_type: remote::model-context-protocol
vector_io:
- config:
kvstore:
db_path: /home/nvidia/.llama/distributions/starter/faiss_store.db
type: sqlite
provider_id: faiss
provider_type: inline::faiss
- config:
db_path: /home/nvidia/.llama/distributions/starter/sqlite_vec.db
kvstore:
db_path:
/home/nvidia/.llama/distributions/starter/sqlite_vec_registry.db
type: sqlite
provider_id: sqlite-vec
provider_type: inline::sqlite-vec
scoring_fns: []
server:
port: 8321
shields: []
tool_groups:
- provider_id: tavily-search
toolgroup_id: builtin::websearch
- provider_id: rag-runtime
toolgroup_id: builtin::rag
vector_dbs: []
version: 2
INFO 2025-10-07 00:29:12,138
llama_stack.providers.remote.inference.nvidia.nvidia:49 inference::nvidia:
Initializing NVIDIAInferenceAdapter(http://localhost:8912)...
INFO 2025-10-07 00:29:12,921
llama_stack.providers.utils.inference.inference_store:74 inference: Write
queue disabled for SQLite to avoid concurrency issues
INFO 2025-10-07 00:29:13,524
llama_stack.providers.utils.responses.responses_store:96 openai_responses:
Write queue disabled for SQLite to avoid concurrency issues
ERROR 2025-10-07 00:29:13,679 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: FireworksInferenceAdapter.list_provider_model_ids() failed
with: API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"fireworks_api_key": "<API_KEY>"},
or in the provider config.
WARNING 2025-10-07 00:29:13,681 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider fireworks: API key is
not set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"fireworks_api_key": "<API_KEY>"}, or in the
provider config.
ERROR 2025-10-07 00:29:13,682 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: TogetherInferenceAdapter.list_provider_model_ids() failed
with: Pass Together API Key in the header X-LlamaStack-Provider-Data as {
"together_api_key": <your api key>}
WARNING 2025-10-07 00:29:13,684 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider together: Pass
Together API Key in the header X-LlamaStack-Provider-Data as {
"together_api_key": <your api key>}
Handling connection for 8912
INFO 2025-10-07 00:29:14,047 llama_stack.providers.utils.inference.openai_mixin:448
providers::utils: NVIDIAInferenceAdapter.list_provider_model_ids() returned 3
models
ERROR 2025-10-07 00:29:14,062 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: OpenAIInferenceAdapter.list_provider_model_ids() failed
with: API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"openai_api_key": "<API_KEY>"}, or
in the provider config.
WARNING 2025-10-07 00:29:14,063 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider openai: API key is not
set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"openai_api_key": "<API_KEY>"}, or in the
provider config.
ERROR 2025-10-07 00:29:14,099 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: AnthropicInferenceAdapter.list_provider_model_ids() failed
with: "Could not resolve authentication method. Expected either api_key or
auth_token to be set. Or for one of the `X-Api-Key` or `Authorization` headers
to be explicitly omitted"
WARNING 2025-10-07 00:29:14,100 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider anthropic: "Could not
resolve authentication method. Expected either api_key or auth_token to be
set. Or for one of the `X-Api-Key` or `Authorization` headers to be explicitly
omitted"
ERROR 2025-10-07 00:29:14,102 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: GeminiInferenceAdapter.list_provider_model_ids() failed
with: API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"gemini_api_key": "<API_KEY>"}, or
in the provider config.
WARNING 2025-10-07 00:29:14,103 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider gemini: API key is not
set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"gemini_api_key": "<API_KEY>"}, or in the
provider config.
ERROR 2025-10-07 00:29:14,105 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: GroqInferenceAdapter.list_provider_model_ids() failed with:
API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"groq_api_key": "<API_KEY>"}, or in
the provider config.
WARNING 2025-10-07 00:29:14,106 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider groq: API key is not
set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"groq_api_key": "<API_KEY>"}, or in the provider
config.
ERROR 2025-10-07 00:29:14,107 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: SambaNovaInferenceAdapter.list_provider_model_ids() failed
with: API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"sambanova_api_key": "<API_KEY>"},
or in the provider config.
WARNING 2025-10-07 00:29:14,109 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider sambanova: API key is
not set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"sambanova_api_key": "<API_KEY>"}, or in the
provider config.
INFO 2025-10-07 00:29:14,454 uvicorn.error:84 uncategorized: Started server process
[3753046]
INFO 2025-10-07 00:29:14,455 uvicorn.error:48 uncategorized: Waiting for
application startup.
INFO 2025-10-07 00:29:14,457 llama_stack.core.server.server:170 core::server:
Starting up
INFO 2025-10-07 00:29:14,458 llama_stack.core.stack:415 core: starting registry
refresh task
ERROR 2025-10-07 00:29:14,459 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: FireworksInferenceAdapter.list_provider_model_ids() failed
with: API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"fireworks_api_key": "<API_KEY>"},
or in the provider config.
WARNING 2025-10-07 00:29:14,461 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider fireworks: API key is
not set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"fireworks_api_key": "<API_KEY>"}, or in the
provider config.
ERROR 2025-10-07 00:29:14,462 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: TogetherInferenceAdapter.list_provider_model_ids() failed
with: Pass Together API Key in the header X-LlamaStack-Provider-Data as {
"together_api_key": <your api key>}
WARNING 2025-10-07 00:29:14,463 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider together: Pass
Together API Key in the header X-LlamaStack-Provider-Data as {
"together_api_key": <your api key>}
ERROR 2025-10-07 00:29:14,465 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: OpenAIInferenceAdapter.list_provider_model_ids() failed
with: API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"openai_api_key": "<API_KEY>"}, or
in the provider config.
WARNING 2025-10-07 00:29:14,466 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider openai: API key is not
set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"openai_api_key": "<API_KEY>"}, or in the
provider config.
INFO 2025-10-07 00:29:14,500 uvicorn.error:62 uncategorized: Application startup
complete.
ERROR 2025-10-07 00:29:14,502 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: AnthropicInferenceAdapter.list_provider_model_ids() failed
with: "Could not resolve authentication method. Expected either api_key or
auth_token to be set. Or for one of the `X-Api-Key` or `Authorization` headers
to be explicitly omitted"
WARNING 2025-10-07 00:29:14,503 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider anthropic: "Could not
resolve authentication method. Expected either api_key or auth_token to be
set. Or for one of the `X-Api-Key` or `Authorization` headers to be explicitly
omitted"
ERROR 2025-10-07 00:29:14,504 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: GeminiInferenceAdapter.list_provider_model_ids() failed
with: API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"gemini_api_key": "<API_KEY>"}, or
in the provider config.
WARNING 2025-10-07 00:29:14,506 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider gemini: API key is not
set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"gemini_api_key": "<API_KEY>"}, or in the
provider config.
ERROR 2025-10-07 00:29:14,507 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: GroqInferenceAdapter.list_provider_model_ids() failed with:
API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"groq_api_key": "<API_KEY>"}, or in
the provider config.
WARNING 2025-10-07 00:29:14,508 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider groq: API key is not
set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"groq_api_key": "<API_KEY>"}, or in the provider
config.
ERROR 2025-10-07 00:29:14,510 llama_stack.providers.utils.inference.openai_mixin:439
providers::utils: SambaNovaInferenceAdapter.list_provider_model_ids() failed
with: API key is not set. Please provide a valid API key in the provider data
header, e.g. x-llamastack-provider-data: {"sambanova_api_key": "<API_KEY>"},
or in the provider config.
WARNING 2025-10-07 00:29:14,511 llama_stack.core.routing_tables.models:36
core::routing_tables: Model refresh failed for provider sambanova: API key is
not set. Please provide a valid API key in the provider data header, e.g.
x-llamastack-provider-data: {"sambanova_api_key": "<API_KEY>"}, or in the
provider config.
INFO 2025-10-07 00:29:14,513 uvicorn.error:216 uncategorized: Uvicorn running on
http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
```
Tested listing models with curl; it also works:
```
curl http://localhost:8321/v1/models
{"data":[{"identifier":"bedrock/meta.llama3-1-8b-instruct-v1:0","provider_resource_id":"meta.llama3-1-8b-instruct-v1:0","provider_id":"bedrock","type":"model","metadata":{},"model_type":"llm"},{"identifier":"bedrock/meta.llama3-1-70b-instruct-v1:0","provider_resource_id":"meta.llama3-1-70b-instruct-v1:0","provider_id":"bedrock","type":"model","metadata":{},"model_type":"llm"},{"identifier":"bedrock/meta.llama3-1-405b-instruct-v1:0","provider_resource_id":"meta.llama3-1-405b-instruct-v1:0","provider_id":"bedrock","type":"model","metadata":{},"model_type":"llm"},{"identifier":"nvidia/bigcode/starcoder2-7b","provider_resource_id":"bigcode/starcoder2-7b","provider_id":"nvidia","type":"model","metadata":{},"model_type":"llm"},{"identifier":"nvidia/meta/llama-3.3-70b-instruct","provider_resource_id":"meta/llama-3.3-70b-instruct","provider_id":"nvidia","type":"model","metadata":{},"model_type":"llm"},{"identifier":"nvidia/nvidia/llama-3.2-nv-embedqa-1b-v2","provider_resource_id":"nvidia/llama-3.2-nv-embedqa-1b-v2","provider_id":"nvidia","type":"model","metadata":{"embedding_dimension":2048,"context_length":8192},"model_type":"embedding"},{"identifier":"sentence-transformers/all-MiniLM-L6-v2","provider_resource_id":"all-MiniLM-L6-v2","provider_id":"sentence-transformers","type":"model","metadata":{"embedding_dimension":384},"model_type":"embedding"}]}%
```
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
# What does this PR do?
We have been running into flaky unit test failures:
5217035494
Fixed by:
1. Shutting down properly by cancelling any stale file batch tasks running in the background (see the sketch after this list).
2. Using unique_kvstore_config so the tests don't share the same DB path, maintaining test isolation.
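A generic asyncio pattern for item 1 (the provider's actual attribute and method names will differ):
```python
import asyncio

class FileBatchesProvider:
    def __init__(self) -> None:
        self._background_tasks: set[asyncio.Task] = set()

    def start_batch(self, coro) -> asyncio.Task:
        # Track every background batch task so it can be cancelled later.
        task = asyncio.create_task(coro)
        self._background_tasks.add(task)
        task.add_done_callback(self._background_tasks.discard)
        return task

    async def shutdown(self) -> None:
        # Cancel any still-running batch tasks so nothing leaks across tests.
        for task in list(self._background_tasks):
            task.cancel()
        await asyncio.gather(*self._background_tasks, return_exceptions=True)
```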
## Test Plan
Ran unit tests locally and in CI.
- Allow use of unavailable models on startup
- Add a has_model method to ModelsRoutingTable for checking pre-registered models
- Update check_model_availability to check model_store before provider APIs (see the sketch below)
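A hedged sketch of the lookup-order change (method names follow the bullets above; the surrounding classes are heavily simplified):
```python
class ModelsRoutingTable:
    def __init__(self, registered_models: dict[str, object]) -> None:
        self._models = registered_models

    def has_model(self, model_id: str) -> bool:
        """True if the model was pre-registered, even if its provider is unreachable."""
        return model_id in self._models

async def check_model_availability(model_id: str, model_store: ModelsRoutingTable, provider) -> bool:
    # Consult the local model_store first so startup does not fail just because
    # a provider (e.g. an unreachable vLLM endpoint) cannot be queried yet.
    if model_store.has_model(model_id):
        return True
    return await provider.check_model_availability(model_id)
```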
# What does this PR do?
## Test Plan
Start Llama Stack and point it at an unavailable vLLM:
```
VLLM_URL=https://my-unavailable-vllm/v1 MILVUS_DB_PATH=./milvus.db INFERENCE_MODEL=vllm uv run --with llama-stack llama stack build --distro starter --image-type venv --run
```
Llama Stack will start without crashing, only logging the error.
```
- provider_id: rag-runtime
toolgroup_id: builtin::rag
vector_dbs: []
version: 2
INFO 2025-10-07 06:40:41,804 llama_stack.providers.utils.inference.inference_store:74 inference: Write queue disabled for SQLite to avoid concurrency issues
INFO 2025-10-07 06:40:42,066 llama_stack.providers.utils.responses.responses_store:96 openai_responses: Write queue disabled for SQLite to avoid concurrency issues
ERROR 2025-10-07 06:40:58,882 llama_stack.providers.utils.inference.openai_mixin:436 providers::utils: VLLMInferenceAdapter.list_provider_model_ids() failed with: Request timed out.
WARNING 2025-10-07 06:40:58,883 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider vllm: Request timed out.
[...]
INFO 2025-10-07 06:40:59,036 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
INFO 2025-10-07 06:41:04,064 openai._base_client:1618 uncategorized: Retrying request to /models in 0.398814 seconds
INFO 2025-10-07 06:41:09,497 openai._base_client:1618 uncategorized: Retrying request to /models in 0.781908 seconds
ERROR 2025-10-07 06:41:15,282 llama_stack.providers.utils.inference.openai_mixin:436 providers::utils: VLLMInferenceAdapter.list_provider_model_ids() failed with: Request timed out.
WARNING 2025-10-07 06:41:15,283 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider vllm: Request timed out.
```