Introduces vLLM provider support to the record/replay testing framework, enabling both recording and replay of vLLM API interactions alongside the existing Ollama support.
The changes enable testing of vLLM functionality. vLLM tests focus on inference capabilities, while Ollama continues to exercise the full API surface, including vision features.
--
This is an alternative to #3128; qwen3 appears to be more capable at structured output and tool calls than llama 3.2 1B.
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
# What does this PR do?
1. Make telemetry tests as easy as possible for users by expanding the `SpanStub` dataclass and creating the `MetricStub` dataclass as a way to consistently marshal telemetry data in test fixtures and unmarshal and handle it in tests (see the sketch below).
2. Structure server and client tests to always follow the same standards for a consistent testing experience by using the `SpanStub` and `MetricStub` dataclass objects.
3. Enable metrics testing for the completions endpoint.
4. Correct token metrics to use histograms instead of counters to capture tokens per request rather than a cumulative count of tokens over the lifetime of the server.
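A hypothetical sketch of what such a stub could look like; the field names and helper below are illustrative, not the exact classes added by this PR:
```python
from dataclasses import dataclass
from typing import Any

# Hypothetical sketch: one typed shape for marshalling telemetry data in
# fixtures and unmarshalling it in test assertions. Field names are
# illustrative and may differ from the real SpanStub/MetricStub.
@dataclass
class MetricStub:
    name: str
    value: float
    unit: str | None = None
    attributes: dict[str, Any] | None = None

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "MetricStub":
        return cls(
            name=data["name"],
            value=data["value"],
            unit=data.get("unit"),
            attributes=data.get("attributes"),
        )
```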
## Test Plan
The changes are themselves tests.
Added a script to clean up recordings. While doing this, moved the CI matrix generation to a separate script so there is a single source of truth for the matrix (sketched below).
Ran the cleanup script as:
```
PYTHONPATH=. python scripts/cleanup_recordings.py
```
Also added this as part of the pre-commit workflow to ensure that the
recordings are always up to date and that no stale recordings are left
in the repo.
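A hypothetical sketch of what the single-source matrix generator could look like; the script name, directory layout, and matrix keys below are assumptions, not the repo's actual code:
```python
import json
from pathlib import Path

# Hypothetical sketch: derive the CI matrix from the test tree so the
# workflow and local scripts share one source of truth.
def generate_matrix(test_root: str = "tests/integration") -> str:
    subdirs = sorted(
        p.name
        for p in Path(test_root).iterdir()
        if p.is_dir() and not p.name.startswith((".", "_"))
    )
    return json.dumps({"test-subdir": subdirs})

if __name__ == "__main__":
    print(generate_matrix())
```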
# What does this PR do?
Pick an unused port for the test server instead of always using the default. This prevents interference from already running servers and allows multiple concurrent integration test runs. Unleash the AIs!
## Test Plan
Start a Llama Stack server on port 8321, then observe that the test run uses port 8322:
❯ uv run --no-sync ./scripts/integration-tests.sh --stack-config
server:ci-tests --inference-mode replay --setup ollama --suite base
--pattern '(telemetry or safety)'
=== Llama Stack Integration Test Runner ===
Stack Config: server:ci-tests
Setup: ollama
Inference Mode: replay
Test Suite: base
Test Subdirs:
Test Pattern: (telemetry or safety)
Checking llama packages
llama-stack 0.4.0.dev0 /Users/erichuang/projects/new_test_server
llama-stack-client 0.3.0
ollama 0.6.0
=== Applying Setup Environment Variables ===
Setting SQLITE_STORE_DIR:
/var/folders/cz/vyh7y1d11xg881lsxsshnc5c0000gn/T/tmp.bKLsaVAxyU
Setting stack config type: server
Setting up environment variables:
export OLLAMA_URL='http://0.0.0.0:11434'
export SAFETY_MODEL='ollama/llama-guard3:1b'
Will use port: 8322
=== Starting Llama Stack Server ===
Waiting for Llama Stack Server to start on port 8322...
✅ Llama Stack Server started successfully
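A minimal Python sketch of the port-selection idea; the real logic lives in the shell test scripts, so this is illustrative only:
```python
import socket

# Walk forward from the default port until an unused one is found.
def find_free_port(start: int = 8321, limit: int = 100) -> int:
    for port in range(start, start + limit):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            # connect_ex returns 0 if something is already listening there
            if sock.connect_ex(("127.0.0.1", port)) != 0:
                return port
    raise RuntimeError("no free port found")

print(find_free_port())  # prints 8322 when 8321 is already taken
```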
Backports UV index configuration fixes from `release-0.3.x` (PR #4002).
The main issue: when we created the release branch infrastructure, we
configured UV to use `test.pypi` as the PRIMARY index to resolve RC
dependencies. This caused UV to look for ALL packages there first, which
led to problems - some packages don't have binary wheels on `test.pypi`,
so UV tried building from source and failed (like the `psycopg2-binary`
issue we hit).
The fix is simple: use PyPI as primary (default) and `test.pypi` as an
EXTRA index. UV will check PyPI first for everything, and only fall back
to `test.pypi` for packages not found there (like our RC client
versions).
This PR includes:
- Fixed `install-llama-stack-client` action to output
`UV_EXTRA_INDEX_URL` instead of `UV_INDEX_URL`
- New `uv-run-with-index.sh` wrapper that auto-detects release branches
and sets UV env vars
- Updated pre-commit hooks (`uv-lock`, codegen, etc.) to use the wrapper
- Pass UV env vars as Docker build args in all locations
- Scope UV env vars properly in Containerfile (inline for llama-stack
install, explicitly unset before distribution deps)
- Export UV env vars to `GITHUB_ENV` in setup-runner for cross-step
persistence
The wrapper detects release branches automatically in both CI and local
environments, so this "just works" without manual configuration. On main
(non-release branch), the wrapper becomes a no-op.
Tested and validated on `release-0.3.x` where all CI checks pass.
# What does this PR do?
Running `./scripts/integration-tests.sh --network host` on macOS fails regularly due to how Docker runs on macOS. If on macOS, keep bridge network mode instead.
before:
=== Starting Docker Container ===
Using image: localhost/distribution-ci-tests:dev
WARNING: Published ports are discarded when using host network mode
Waiting for Docker container to start...
❌ Docker container failed to start
Container logs:
INFO 2025-10-29 18:38:32,180 llama_stack.cli.stack.run:100 cli: Using
run configuration:
/workspace/src/llama_stack/distributions/ci-tests/run.yaml
... (stack starts but is not reachable on network)
after:
=== Starting Docker Container ===
Using image: localhost/distribution-ci-tests:dev
Using bridge networking with port mapping (non-Linux)
Waiting for Docker container to start...
✅ Docker container started successfully
=== Running Integration Tests ===
## Test Plan
integration tests pass!
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Migrates package structure to src/ layout following Python packaging
best practices.
All code moved from `llama_stack/` to `src/llama_stack/`. Public API
unchanged - imports remain `import llama_stack.*`.
Updated build configs, pre-commit hooks, scripts, and GitHub workflows
accordingly. All hooks pass, package builds cleanly.
**Developer note**: Reinstall after pulling: `pip install -e .`
## Summary
- Link pre-commit bot comment to workflow run instead of PR for better
debugging
- Dump docker container logs before removal to ensure logs are actually
captured
## Changes
1. **Pre-commit bot**: Changed the initial bot comment to link
"pre-commit hooks" text to the actual workflow run URL instead of just
having the PR number auto-link
2. **Docker logs**: Moved docker container log dumping from GitHub
Actions to the integration-tests.sh script's stop_container() function,
ensuring logs are captured before container removal
## Test plan
- Pre-commit bot comment will now have a clickable link to the workflow
run
- Docker container logs will be successfully captured in CI runs
**This PR changes configurations in a backward incompatible way.**
Run configs today repeat full SQLite/Postgres snippets everywhere a
store is needed, which means duplicated credentials, extra connection
pools, and lots of drift between files. This PR introduces named storage
backends so the stack and providers can share a single catalog and
reference those backends by name.
## Key Changes
- Add `storage.backends` to `StackRunConfig`, register each KV/SQL
backend once at startup, and validate that references point to the right
family.
- Move server stores under `storage.stores` with lightweight references
(backend + namespace/table) instead of full configs.
- Update every provider/config/doc to use the new reference style;
docs/codegen now surface the simplified YAML.
## Migration
Before:
```yaml
metadata_store:
  type: sqlite
  db_path: ~/.llama/distributions/foo/registry.db
inference_store:
  type: postgres
  host: ${env.POSTGRES_HOST}
  port: ${env.POSTGRES_PORT}
  db: ${env.POSTGRES_DB}
  user: ${env.POSTGRES_USER}
  password: ${env.POSTGRES_PASSWORD}
conversations_store:
  type: postgres
  host: ${env.POSTGRES_HOST}
  port: ${env.POSTGRES_PORT}
  db: ${env.POSTGRES_DB}
  user: ${env.POSTGRES_USER}
  password: ${env.POSTGRES_PASSWORD}
```
After:
```yaml
storage:
  backends:
    kv_default:
      type: kv_sqlite
      db_path: ~/.llama/distributions/foo/kvstore.db
    sql_default:
      type: sql_postgres
      host: ${env.POSTGRES_HOST}
      port: ${env.POSTGRES_PORT}
      db: ${env.POSTGRES_DB}
      user: ${env.POSTGRES_USER}
      password: ${env.POSTGRES_PASSWORD}
  stores:
    metadata:
      backend: kv_default
      namespace: registry
    inference:
      backend: sql_default
      table_name: inference_store
      max_write_queue_size: 10000
      num_writers: 4
    conversations:
      backend: sql_default
      table_name: openai_conversations
```
Provider configs follow the same pattern—for example, a Chroma vector
adapter switches from:
```yaml
providers:
  vector_io:
  - provider_id: chromadb
    provider_type: remote::chromadb
    config:
      url: ${env.CHROMADB_URL}
      kvstore:
        type: sqlite
        db_path: ~/.llama/distributions/foo/chroma.db
```
to:
```yaml
providers:
  vector_io:
  - provider_id: chromadb
    provider_type: remote::chromadb
    config:
      url: ${env.CHROMADB_URL}
      persistence:
        backend: kv_default
        namespace: vector_io::chroma_remote
```
Once the backends are declared, everything else just points at them, so
rotating credentials or swapping to Postgres happens in one place and
the stack reuses a single connection pool.
# What does this PR do?
## Test Plan
.venv ❯ sh ./scripts/install.sh
⚠️ Found existing container(s) for 'ollama-server', removing...
⚠️ Found existing container(s) for 'llama-stack', removing...
⚠️ Found existing container(s) for 'jaeger', removing...
⚠️ Found existing container(s) for 'otel-collector', removing...
⚠️ Found existing container(s) for 'prometheus', removing...
⚠️ Found existing container(s) for 'grafana', removing...
📡 Starting telemetry stack...
🦙 Starting Ollama...
⏳ Waiting for Ollama daemon...
📦 Ensuring model is pulled: llama3.2:3b...
🦙 Starting Llama Stack...
⏳ Waiting for Llama Stack API...
..
🎉 Llama Stack is ready!
👉 API endpoint: http://localhost:8321
📖 Documentation: https://llamastack.github.io/latest/references/api_reference/index.html
💻 To access the llama stack CLI, exec into the container:
docker exec -ti llama-stack bash
📡 Telemetry dashboards:
Jaeger UI: http://localhost:16686
Prometheus UI: http://localhost:9090
Grafana UI: http://localhost:3000 (admin/admin)
OTEL Collector: http://localhost:4318
🐛 Report an issue @ https://github.com/llamastack/llama-stack/issues if you think it's a bug
**!!BREAKING CHANGE!!**
The lookup is also straightforward: we always look up the fully-qualified identifier and don't try to find a match for something without the provider_id prefix.
Note that this ideally means we also need to update the `register_model()` API (we should kill "identifier" from there), but I am not doing that as part of this PR.
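A tiny illustrative sketch of the stricter lookup; the registry structure and names here are hypothetical:
```python
# Identifiers are always "provider_id/model_id"; there is no fallback that
# strips or guesses the provider prefix.
registry: dict[str, dict] = {
    "ollama/llama3.2:3b": {"provider_id": "ollama", "model_id": "llama3.2:3b"},
}

def lookup_model(identifier: str) -> dict:
    try:
        return registry[identifier]
    except KeyError:
        raise ValueError(
            f"unknown model {identifier!r} (expected provider_id/model_id)"
        ) from None

lookup_model("ollama/llama3.2:3b")  # ok
# lookup_model("llama3.2:3b")       # raises: no prefix-less fallback
```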
## Test Plan
Existing unit tests
# What does this PR do?
The purpose of this PR is to replace Llama Stack's default embedding model with nomic-embed-text-v1.5.
These are the key reasons why the Llama Stack community decided to switch from all-MiniLM-L6-v2 to nomic-embed-text-v1.5:
1. The training data for
[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2#training-data)
includes a lot of data sets with various licensing terms, so it is
tricky to know when/whether it is appropriate to use this model for
commercial applications.
2. The model is not particularly competitive on major benchmarks. For
example, if you look at the [MTEB
Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) and click
on Miscellaneous/BEIR to see English information retrieval accuracy, you
see that the top of the leaderboard is dominated by enormous models but
also that there are many, many models of relatively modest size with
much higher Retrieval scores. If you want to look closely at the data, I
recommend clicking "Download Table" because it is easier to browse that
way.
More discussion can be found [here](https://github.com/llamastack/llama-stack/issues/2418).
Closes #2418
## Test Plan
1. Run `./scripts/unit-tests.sh`
2. Integration tests via CI workflow
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
use SecretStr for OpenAIMixin providers
- RemoteInferenceProviderConfig now has auth_credential: SecretStr
- the default alias is api_key (most common name)
- some providers override to use api_token (RunPod, vLLM, Databricks)
- some providers exclude it (Ollama, TGI, Vertex AI)
addresses #3517
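A minimal sketch of the pattern with pydantic v2; the exact field definition and provider overrides in the PR may differ:
```python
from pydantic import BaseModel, Field, SecretStr

class RemoteInferenceProviderConfig(BaseModel):
    # Stored as SecretStr so it is masked in logs and repr; aliased to the
    # most common field name, "api_key". Providers that need "api_token"
    # would override the alias; others exclude the field entirely.
    auth_credential: SecretStr | None = Field(default=None, alias="api_key")

cfg = RemoteInferenceProviderConfig(api_key="sk-test")
print(cfg.auth_credential)                     # **********
print(cfg.auth_credential.get_secret_value())  # sk-test
```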
## Test Plan
ci w/ new tests
Updated scripts/normalize_recordings.py to dynamically find and process
all 'recordings' directories under tests/ using pathlib.rglob() instead
of hardcoding a single path.
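A minimal sketch of the discovery logic (the full script does more than this):
```python
from pathlib import Path

# Find every recordings directory under tests/ instead of assuming one
# hard-coded location.
for recordings_dir in sorted(Path("tests").rglob("recordings")):
    if recordings_dir.is_dir():
        print(f"normalizing {recordings_dir}")
```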
Signed-off-by: Derek Higgins <derekh@redhat.com>
Renames `inference_recorder.py` to `api_recorder.py` and extends it to
support recording/replaying tool invocations in addition to inference
calls.
This allows us to record web-search and other tool calls, and thereafter apply the recordings for `tests/integration/responses`.
## Test Plan
```
export OPENAI_API_KEY=...
export TAVILY_SEARCH_API_KEY=...
./scripts/integration-tests.sh --stack-config ci-tests \
--suite responses --inference-mode record-if-missing
```
Adds --collect-only flag to scripts/integration-tests.sh that skips
server startup and passes the flag to pytest for test collection only.
When specified, minimal flags are required (no --stack-config or --setup
needed).
## Changes
- Added `--collect-only` flag that skips server startup
- Made `--stack-config` and `--setup` optional when using
`--collect-only`
- Skip `llama` command check when collecting tests only
## Usage
```bash
# Collect tests without starting server
./scripts/integration-tests.sh --subdirs inference --collect-only
```
Propagate test IDs from client to server via HTTP headers to maintain proper test isolation when running with server-based stack configs. Without this, recorded/replayed inference requests in server mode would leak across tests.
Changes:
- Patch client _prepare_request to inject the test ID into the provider data header (sketched below)
- Sync test context from provider data on server side before storage
operations
- Set LLAMA_STACK_TEST_STACK_CONFIG_TYPE env var based on stack config
- Configure console width for cleaner log output in CI
- Add SQLITE_STORE_DIR temp directory for test data isolation
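A minimal, illustrative sketch of the header-injection idea; the header name and helper below are assumptions, not the exact code:
```python
import json

# Attach the current test ID to the provider-data header so the server can
# restore the test context before storage operations. The header name and
# the "__test_id" key are illustrative assumptions.
TEST_ID_HEADER = "X-LlamaStack-Provider-Data"

def inject_test_id(headers: dict[str, str], test_id: str) -> dict[str, str]:
    provider_data = json.loads(headers.get(TEST_ID_HEADER, "{}"))
    provider_data["__test_id"] = test_id
    headers[TEST_ID_HEADER] = json.dumps(provider_data)
    return headers

# The server side does the reverse: parse the header, pull out the test ID,
# and set it as the active test context before touching storage.
```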
# What does this PR do?
Users can simply set env vars at the beginning of the command: `FOO=BAR llama stack run ...`
## Test Plan
Run
TELEMETRY_SINKS=console uv run --with llama-stack llama stack build
--distro=starter --image-type=venv --run
---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/llamastack/llama-stack/pull/3711).
* #3714
* __->__ #3711
# What does this PR do?
Added missing configuration files
## Test Plan
run ./scripts/telemetry/setup_telemetry.sh
```
OTEL_SERVICE_NAME=llama_stack OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 TELEMETRY_SINKS=otel_trace,otel_metric uv run --with llama-stack llama stack build --distro=starter --image-type=venv --run
```
Navigate to Grafana at localhost:3000 and query metrics and traces.
IDs are now deterministic hashes based on request content, and
timestamps are normalized to constants, eliminating spurious changes
when re-recording tests.
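A minimal sketch of the normalization idea (helper names are illustrative):
```python
import hashlib

# Derive a stable ID from the request body so re-recording the same request
# yields the same identifier; timestamps get pinned to a constant.
FIXED_TIMESTAMP = 0

def deterministic_id(request_body: str, prefix: str = "rec") -> str:
    digest = hashlib.sha256(request_body.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}-{digest}"

print(deterministic_id('{"model": "llama3.2:3b", "prompt": "hi"}'))
```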
## Changes
- Updated `inference_recorder.py` to normalize IDs and timestamps during
recording
- Added `scripts/normalize_recordings.py` utility to re-normalize
existing recordings
- Created documentation in `tests/integration/recordings/README.md`
- Normalized 350 existing recording files
# What does this PR do?
- Fixes broken links and Docusaurus search
Closes #3518
## Test Plan
The following should produce a clean build with no warnings and search enabled:
```
npm install
npm run gen-api-docs all
npm run build
npm run serve
```
# What does this PR do?
- Fixes Docusaurus build errors
## Test Plan
- `npm run build` compiles the site properly
- Broken links are expected and will be fixed in a follow-on PR
# What does this PR do?
- Updates provider and distro codegen to handle the new format
- Migrates provider and distro files to the new format
## Test Plan
- Manual testing
Add default value for PR_HEAD_REPO to prevent 'unbound variable' error
when no PR exists for a branch.
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Fix pre-commit issues: a non-executable file with a shebang, and an `@pytest.mark.asyncio` decorator.
## Test Plan
This PR refactors the integration test system to use global "setups"
which provides better separation of concerns:
**suites = what to test, setups = how to configure.**
NOTE: if you have naming suggestions, please provide feedback
Changes:
- New `tests/integration/setups.py` with global, reusable configurations (ollama, vllm, gpt, claude); a sketch of the idea appears below
- Modified `scripts/integration-tests.sh` options to match with the
underlying pytest options
- Updated documentation to reflect the new global setup system
The main benefit is that setups can be reused across multiple suites (e.g., use "gpt" with any suite) even though sometimes they may be specifically tailored to a suite (vision <> ollama-vision). It is now easier to add new configurations without modifying existing suites.
Usage examples:
- `pytest tests/integration --suite=responses --setup=gpt`
- `pytest tests/integration --suite=vision` # auto-selects
"ollama-vision" setup
- `pytest tests/integration --suite=base --setup=vllm`
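A hypothetical sketch of what a global setup entry might look like; the real structure in `tests/integration/setups.py` may differ:
```python
from dataclasses import dataclass, field

# Hypothetical shape for a reusable setup: which env vars to apply and
# which default models to use, independent of any particular suite.
@dataclass
class Setup:
    name: str
    description: str
    env: dict[str, str] = field(default_factory=dict)
    defaults: dict[str, str] = field(default_factory=dict)

SETUPS = {
    "ollama": Setup(
        name="ollama",
        description="Local Ollama server for text models",
        env={"OLLAMA_URL": "http://localhost:11434"},
        defaults={"text_model": "ollama/llama3.2:3b"},
    ),
}
```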
Our integration tests need to be 'grouped' because each group often
needs a specific set of models it works with. We separated vision tests
due to this, and we have a separate set of tests which test "Responses"
API.
This PR makes this system a bit more official so it is very easy to
target these groups and apply all testing infrastructure towards all the
groups (for example, record-replay) uniformly.
There are three suites declared:
- base
- vision
- responses
Note that our CI currently runs the "base" and "vision" suites.
You can use the `--suite` option when running pytest (or any of the testing scripts or workflows). For example:
```
OLLAMA_URL=http://localhost:11434 \
pytest -s -v tests/integration/ --stack-config starter --suite vision
```
Previously, one needed to specify record/replay-related environment variables to run integration tests. We could not use defaults because integration tests could be run against Ollama instances running different models. For example, text vs. vision tests needed separate instances of Ollama because a single instance typically cannot serve both models given the standard CI worker configuration on GitHub. As a result, `client.list()` as returned by the Ollama client would differ between these runs and we'd end up overwriting responses.
This PR "solves" it by adding a small amount of complexity -- we store
model list responses specially, keyed by the hashes of the models they
return. At replay time, we merge all of them and pretend that we have
the union of all models available.
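A minimal sketch of the idea; the helper names are illustrative:
```python
import hashlib
import json

# Store each recorded model-list response keyed by a hash of the models it
# contains, then merge every stored list at replay time.
recorded: dict[str, list[str]] = {}

def models_digest(models: list[str]) -> str:
    return hashlib.sha256(json.dumps(sorted(models)).encode()).hexdigest()[:12]

def record_model_list(models: list[str]) -> None:
    recorded[models_digest(models)] = sorted(models)

def replay_model_list() -> list[str]:
    merged: set[str] = set()
    for models in recorded.values():
        merged.update(models)
    return sorted(merged)

record_model_list(["llama3.2:3b"])
record_model_list(["llama3.2-vision:11b"])
print(replay_model_list())  # union of all recorded model lists
```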
## Test Plan
Re-recorded all the tests using `scripts/integration-tests.sh
--inference-mode record`, including the vision tests.
# What does this PR do?
1. Adds `scripts/run-ui-linter.sh`
- Light script that checks whether `node_modules`, `eslint`, and `prettier` exist before running the linter
- When I introduced [the linter for the UI](https://github.com/llamastack/llama-stack/pull/3156/files#diff-63a9c44a44acf85fea213a857769990937107cf072831e1a26808cfde9d096b9), it forced the UI linter on all users. The small `node_modules` check means that only users who have installed the UI locally (since `node_modules` is in the gitignore) will actually end up running it. Additionally, this does not do any install and just runs the existing linter/prettier, as requested by @mattf
2. Updates `.github/workflows/pre-commit.yml` to run CI again
- When I introduced the UI linter in the CI [in this
PR](https://github.com/llamastack/llama-stack/pull/3191) a failure
occurred because dependabot needed to be updated to also bump the
`package-lock.json` which was done [in this
PR](https://github.com/llamastack/llama-stack/pull/3212). All of this to
say, we shouldn't observe failures from dependabot again.
3. Updates `.pre-commit-config.yaml`
- Calls `scripts/run-ui-linter.sh`
## AI Assistance Notice
I used Copilot minimally.
## Test Plan
As
[requested](https://github.com/llamastack/llama-stack/pull/3207#discussion_r2288004872)
by @mattf I ran this after removing all of my `node_modules` and the
linter passed.
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
I started this PR trying to unbreak a newly broken test
`test_agent_name`. This test was broken all along but did not show up
because during testing we were pulling the "non-updated" llama stack
client. See this comment:
https://github.com/llamastack/llama-stack/pull/3119#discussion_r2270988205
While fixing this, I encountered a large amount of badness in our CI
workflow definitions.
- We weren't passing `LLAMA_STACK_DIR` or `LLAMA_STACK_CLIENT_DIR`
overrides to `llama stack build` at all in some cases.
- Even when we did, we used `uv run` liberally. The first thing `uv run` does is "sync" the project environment. This means it is going to undo any mutations we might have made ourselves. But we make many mutations to these environments in our CI runners, the most important of which is `llama stack build`, where we install distro dependencies. As a result, when you tried to run the integration tests, you would see old, strange versions.
## Test Plan
Re-record using:
```
sh scripts/integration-tests.sh --stack-config ci-tests \
--provider ollama --test-pattern test_agent_name --inference-mode record
```
Then re-run with `--inference-mode replay`. But:
Eventually, this test turned out to be quite flaky for telemetry reasons. I haven't investigated it yet and have sadly just disabled it, since we have a release to push out.
This OpenAI client release (0843a11164) ends up breaking litellm (169a17400f/litellm/types/llms/openai.py (L40)).
Update the dependency pin. Also make the imports a bit more defensive, in case something else during `llama stack build` ends up moving openai to a previous version.
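A minimal sketch of the defensive-import idea; the symbol name below is illustrative, not the specific one that broke litellm:
```python
# Tolerate an older openai release that may not expose a newer symbol,
# instead of crashing at import time.
try:
    from openai.types.chat import ChatCompletionMessageFunctionToolCall
except ImportError:  # older openai pinned by some other dependency
    ChatCompletionMessageFunctionToolCall = None
```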
## Test Plan
Run pre-release script integration tests.
See comment here:
https://github.com/llamastack/llama-stack/pull/3162#issuecomment-3192859097
-- TL;DR it is quite complex to invoke the recording workflow correctly
for an end developer writing tests. This script simplifies the work.
No more manual GitHub UI navigation!
## Script Functionality
- Auto-detects your current branch and associated PR
- Finds the right repository context (works from forks!)
- Runs the workflow where it can actually commit back
- Validates prerequisites and provides helpful error messages
## How to Use
First ensure you are on the branch which introduced a new test and want
it recorded. **Make sure you have pushed this branch remotely, easiest
is to create a PR.**
```
# Record tests for current branch
./scripts/github/schedule-record-workflow.sh
# Record specific test subdirectories
./scripts/github/schedule-record-workflow.sh --test-subdirs "agents,inference"
# Record with vision tests enabled
./scripts/github/schedule-record-workflow.sh --run-vision-tests
# Record tests matching a pattern
./scripts/github/schedule-record-workflow.sh --test-pattern "test_streaming"
```
## Test Plan
Ran `./scripts/github/schedule-record-workflow.sh -s inference -k tool_choice`, which started workflow run 4820409329, which successfully committed recorded outputs.
# What does this PR do?
Recording tests has become a nightmare. This is the first part of making
that process simpler by making it _less_ automatic. I tried to be too
clever earlier.
It simplifies the record-integration-tests workflow to use workflow
dispatch inputs instead of PR labels. No more opaque stuff. Just go to
the GitHub UI and run the workflow with inputs. I will soon add a helper
script for this also.
Other things to aid re-running just the small set of things you need to
re-record:
- Replaces the `test-types` JSON array parameter with a more intuitive `test-subdirs` comma-separated list. The whole JSON array crap was only there for the CI matrix.
- Adds a new `test-pattern` parameter to allow filtering tests using
pytest's `-k` option
## Test Plan
Note that this PR is in a fork, not the source repository.
- Replay tests on this PR are green
- Manually [ran](1699856292) the replay workflow with a test-subdir and test-pattern filter; it worked
- Manually [ran](4819508034) the **record** workflow with a simple pattern; it worked and updated _this_ PR.
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>