# What does this PR do?
Fireworks doesn't allow repsonse_format with tool use. The default
response format is 'text' anyway, so we can safely omit.
## Test Plan
Below script failed without the change, runs after.
```
#!/usr/bin/env python3
"""
Script to test Responses API with kubernetes-mcp-server.
This script:
1. Connects to the llama stack server
2. Uses the Responses API with MCP tools
3. Asks for the list of Kubernetes namespaces using the kubernetes-mcp-server
"""
import json
from openai import OpenAI
# Connect to the llama stack server
base_url = "http://localhost:8321/v1"
client = OpenAI(base_url=base_url, api_key="fake")
# Define the MCP tool pointing to the kubernetes-mcp-server
# The kubernetes-mcp-server is running on port 3000 with SSE endpoint at /sse
mcp_server_url = "http://localhost:3000/sse"
tools = [
{
"type": "mcp",
"server_label": "k8s",
"server_url": mcp_server_url,
}
]
# Create a response request asking for k8s namespaces
print("Sending request to list Kubernetes namespaces...")
print(f"Using MCP server at: {mcp_server_url}")
print("Available tools will be listed automatically by the MCP server.")
print()
response = client.responses.create(
# model="meta-llama/Llama-3.2-3B-Instruct", # Using the vllm model
model="fireworks/accounts/fireworks/models/llama4-scout-instruct-basic",
# model="openai/gpt-4o",
input="what are all the Kubernetes namespaces? Use tool call to `namespaces_list`. make sure to adhere to the tool calling format UNDER ALL CIRCUMSTANCES.",
tools=tools,
stream=False,
)
print("\n" + "=" * 80)
print("RESPONSE OUTPUT:")
print("=" * 80)
# Print the output
for i, output in enumerate(response.output):
print(f"\n[Output {i + 1}] Type: {output.type}")
if output.type == "mcp_list_tools":
print(f" Server: {output.server_label}")
print(f" Tools available: {[t.name for t in output.tools]}")
elif output.type == "mcp_call":
print(f" Tool called: {output.name}")
print(f" Arguments: {output.arguments}")
print(f" Result: {output.output}")
if output.error:
print(f" Error: {output.error}")
elif output.type == "message":
print(f" Role: {output.role}")
print(f" Content: {output.content}")
print("\n" + "=" * 80)
print("FINAL RESPONSE TEXT:")
print("=" * 80)
print(response.output_text)
```
# What does this PR do?
migrate safety api implementation from /inference/chat-completion to
/v1/chat/completions
## Test Plan
ci w/ recordings
---------
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
move the eval=inline::meta-reference implementation to use
openai_completion/openai_chat_completion
note: this breaks backward compatibility if eval setup used sampling
params' repetition_penalty or strategy
## Test Plan
ci w/ new recordings
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
Add items and title to ToolParameter/ToolParamDefinition. Adding items
will resolve the issue that occurs with Gemini LLM when an MCP tool has
array-type properties.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Unite test cases will be added.
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Kai Wu <kaiwu@meta.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
recorded for: ./scripts/integration-tests.sh --stack-config
server:ci-tests --suite base --setup fireworks --subdirs inference
--pattern openai
## Test Plan
./scripts/integration-tests.sh --stack-config server:ci-tests --suite
base --setup fireworks --subdirs inference --pattern openai
# What does this PR do?
unpublish (make unavailable to users) the following apis -
- `/v1/inference/completion`, replaced by `/v1/openai/v1/completions`
- `/v1/inference/chat-completion`, replaced by
`/v1/openai/v1/chat/completions`
- `/v1/inference/embeddings`, replaced by `/v1/openai/v1/embeddings`
- `/v1/inference/batch-completion`, replaced by `/v1/openai/v1/batches`
- `/v1/inference/batch-chat-completion`, replaced by
`/v1/openai/v1/batches`
note: the implementations are still available for internal use, e.g.
agents uses chat-completion.
# What does this PR do?
simplify Ollama inference adapter by -
- moving image_url download code to OpenAIMixin
- being a ModelRegistryHelper instead of having one (mypy blocks
check_model_availability method assignment)
## Test Plan
- add unit tests for new download feature
- add integration tests for openai_chat_completion w/ image_url (close
test gap)
# What does this PR do?
use together's new base64 support
## Test Plan
recordings for: ./scripts/integration-tests.sh --stack-config
server:ci-tests --suite base --setup together --subdirs inference
--pattern openai
# What does this PR do?
- remove auto-download of ollama embedding models
- add embedding model metadata to dynamic listing w/ unit test
- add support and tests for allowed_models
- removed inference provider models.py files where dynamic listing is
enabled
- store embedding metadata in embedding_model_metadata field on
inference providers
- make model_entries optional on ModelRegistryHelper and
LiteLLMOpenAIMixin
- make OpenAIMixin a ModelRegistryHelper
- skip base64 embedding test for remote::ollama, always returns floats
- only use OpenAI client for ollama model listing
- remove unused build_model_entry function
- remove unused get_huggingface_repo function
## Test Plan
ci w/ new tests
# What does this PR do?
use ollama embedding models for ollama test, previously using
sentence-transformer
recordings:
- ./scripts/integration-tests.sh --stack-config server:ci-tests --suite
base --setup ollama --inference-mode record
- ./scripts/integration-tests.sh --stack-config server:ci-tests --suite
vision --setup ollama-vision --inference-mode record
## Test Plan
ci w/ added skip base64 embedding test
# What does this PR do?
add/enable the Databricks inference adapter
Databricks inference adapter was broken, closes#3486
- remove deprecated completion / chat_completion endpoints
- enable dynamic model listing w/o refresh, listing is not async
- use SecretStr instead of str for token
- backward incompatible change: for consistency with databricks docs,
env DATABRICKS_URL -> DATABRICKS_HOST and DATABRICKS_API_TOKEN ->
DATABRICKS_TOKEN
- databricks urls are custom per user/org, add special recorder handling
for databricks urls
- add integration test --setup databricks
- enable chat completions tests
- enable embeddings tests
- disable n > 1 tests
- disable embeddings base64 tests
- disable embeddings dimensions tests
note: reasoning models, e.g. gpt oss, fail because databricks has a
custom, incompatible response format
## Test Plan
ci and
```
./scripts/integration-tests.sh --stack-config server:ci-tests --setup databricks --subdirs inference --pattern openai
```
note: databricks needs to be manually added to the ci-tests distro for
replay testing
# What does this PR do?
adds dynamic model support to TGI
add new overwrite_completion_id feature to OpenAIMixin to deal with TGI
always returning id=""
## Test Plan
tgi: `docker run --gpus all --shm-size 1g -p 8080:80 -v /data:/data
ghcr.io/huggingface/text-generation-inference --model-id
Qwen/Qwen3-0.6B`
stack: `TGI_URL=http://localhost:8080 uv run llama stack build
--image-type venv --distro ci-tests --run`
test: `./scripts/integration-tests.sh --stack-config
http://localhost:8321 --setup tgi --subdirs inference --pattern openai`
One needed to specify record-replay related environment variables for
running integration tests. We could not use defaults because integration
tests could be run against Ollama instances which could be running
different models. For example, text vs vision tests needed separate
instances of Ollama because a single instance typically cannot serve
both of these models if you assume the standard CI worker configuration
on Github. As a result, `client.list()` as returned by the Ollama client
would be different between these runs and we'd end up overwriting
responses.
This PR "solves" it by adding a small amount of complexity -- we store
model list responses specially, keyed by the hashes of the models they
return. At replay time, we merge all of them and pretend that we have
the union of all models available.
## Test Plan
Re-recorded all the tests using `scripts/integration-tests.sh
--inference-mode record`, including the vision tests.
Recording files use a predictable naming format, making the SQLite index
redundant. The binary SQLite file was causing frequent git conflicts.
Simplify by calculating file paths directly from request hashes.
Signed-off-by: Derek Higgins <derekh@redhat.com>
I started this PR trying to unbreak a newly broken test
`test_agent_name`. This test was broken all along but did not show up
because during testing we were pulling the "non-updated" llama stack
client. See this comment:
https://github.com/llamastack/llama-stack/pull/3119#discussion_r2270988205
While fixing this, I encountered a large amount of badness in our CI
workflow definitions.
- We weren't passing `LLAMA_STACK_DIR` or `LLAMA_STACK_CLIENT_DIR`
overrides to `llama stack build` at all in some cases.
- Even when we did, we used `uv run` liberally. The first thing `uv run`
does is "syncs" the project environment. This means, it is going to undo
any mutations we might have done ourselves. But we make many mutations
in our CI runners to these environments. The most important of which is
why `llama stack build` where we install distro dependencies. As a
result, when you tried to run the integration tests, you would see old,
strange versions.
## Test Plan
Re-record using:
```
sh scripts/integration-tests.sh --stack-config ci-tests \
--provider ollama --test-pattern test_agent_name --inference-mode record
```
Then re-run with `--inference-mode replay`. But:
Eventually, this test turned out to be quite flaky for telemetry
reasons. I haven't investigated it for now and just disabled it sadly
since we have a release to push out.
# What does this PR do?
Recording tests has become a nightmare. This is the first part of making
that process simpler by making it _less_ automatic. I tried to be too
clever earlier.
It simplifies the record-integration-tests workflow to use workflow
dispatch inputs instead of PR labels. No more opaque stuff. Just go to
the GitHub UI and run the workflow with inputs. I will soon add a helper
script for this also.
Other things to aid re-running just the small set of things you need to
re-record:
- Replaces the `test-types` JSON array parameter with a more intuitive
`test-subdirs` comma-separated list. The whole JSON array crap was for
matrix.
- Adds a new `test-pattern` parameter to allow filtering tests using
pytest's `-k` option
## Test Plan
Note that this PR is in a fork not the source repository.
- Replay tests on this PR are green
- Manually
[ran](1699856292)
the replay workflow with a test-subdir and test-pattern filter, worked
- Manually
[ran](4819508034)
the **record** workflow with a simple pattern, it has worked and updated
_this_ PR.
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
# What does this PR do?
Updates test recordings.
## Test Plan
Started ollama serving the 3.2:3b model. Then ran the server:
```
LLAMA_STACK_TEST_INFERENCE_MODE=record \
LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings/ \
SQLITE_STORE_DIR=$(mktemp -d) \
OLLAMA_URL=http://localhost:11434 \
llama stack build --template starter --image-type venv --run
```
Then ran the tests which needed recording:
```
pytest -sv tests/integration/agents/test_openai_responses.py \
--stack-config=server:starter \
--text-model ollama/llama3.2:3b-instruct-fp16 -k test_responses_store
```
Then, restarted the server with `LLAMA_STACK_TEST_INFERENCE_MODE=replay`, re-ran the tests and verified they passed.
# What does this PR do?
I found a few issues while adding new metrics for various APIs:
currently metrics are only propagated in `chat_completion` and
`completion`
since most providers use the `openai_..` routes as the default in
`llama-stack-client inference chat-completion`, metrics are currently
not working as expected.
in order to get them working the following had to be done:
1. get the completion as usual
2. use new `openai_` versions of the metric gathering functions which
use `.usage` from the `OpenAI..` response types to gather the metrics
which are already populated.
3. define a `stream_generator` which counts the tokens and computes the
metrics (only for stream=True)
5. add metrics to response
NOTE: I could not add metrics to `openai_completion` where stream=True
because that ONLY returns an `OpenAICompletion` not an AsyncGenerator
that we can manipulate.
acquire the lock, and add event to the span as the other `_log_...`
methods do
some new output:
`llama-stack-client inference chat-completion --message hi`
<img width="2416" height="425" alt="Screenshot 2025-07-16 at 8 28 20 AM"
src="https://github.com/user-attachments/assets/ccdf1643-a184-4ddd-9641-d426c4d51326"
/>
and in the client:
<img width="763" height="319" alt="Screenshot 2025-07-16 at 8 28 32 AM"
src="https://github.com/user-attachments/assets/6bceb811-5201-47e9-9e16-8130f0d60007"
/>
these were not previously being recorded nor were they being printed to
the server due to the improper console sink handling
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
A bunch of miscellaneous cleanup focusing on tests, but ended up
speeding up starter distro substantially.
- Pulled llama stack client init for tests into `pytest_sessionstart` so
it does not clobber output
- Profiling of that told me where we were doing lots of heavy imports
for starter, so lazied them
- starter now starts 20seconds+ faster on my Mac
- A few other smallish refactors for `compat_client`
This PR significantly refactors the Integration Tests workflow. The main
goal behind the PR was to enable recording of vision tests which were
never run as part of our CI ever before. During debugging, I ended up
making several other changes refactoring and hopefully increasing the
robustness of the workflow.
After doing the experiments, I have updated the trigger event to be
`pull_request_target` so this workflow can get write permissions by
default but it will run with source code from the base (main) branch in
the source repository only. If you do change the workflow, you'd need to
experiment using the `workflow_dispatch` triggers. This should not be
news to anyone using Github Actions (except me!)
It is likely to be a little rocky though while I learn more about GitHub
Actions, etc. Please be patient :)
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
This PR makes setting up Ollama optional for CI. By default, we use
`replay` mode for inference requests and use the stored results from the
`tests/integration/recordings/` directory.
Every so often, users will update tests which will need us to re-record.
To do this, we check for the existence of a label `re-record-tests` on
the PR. If detected,
- ollama is spun up
- inference mode is set to record
- after the tests are done, if any new changes are detected, they are
pushed back to the PR
## Test Plan
This is GitHub CI. Gotta test it live.
Continuing with https://github.com/meta-llama/llama-stack/pull/2952
This also includes a "fix" to inference store related tests so that we
pull a large number of inference responses from the DB so as to always
find the one we just wrote.
Continue to build on top of
https://github.com/meta-llama/llama-stack/pull/2941
## Test Plan
Run server with `LLAMA_STACK_TEST_INFERENCE_MODE=record` and then run
the integration tests with `--stack-config=server:starter`. Then restart
the server with `LLAMA_STACK_TEST_INFERENCE_MODE=replay` and re-run the
tests. Verify that no request hit Ollama at any point.