llama-stack-mirror/tests/integration
Charlie Doern 66f3cf4002
feat: wire Stainless preview SDK into integration tests (#4360)
# What does this PR do?

Enable stainless-builds workflow to test preview SDKs by calling
integration-tests workflow with python_url parameter. Add stainless
matrix config for faster CI runs on SDK changes.

- Make integration-tests.yml reusable with workflow_call inputs
- Thread python_url through test setup actions to install preview SDK
- Add matrix_key parameter to generate_ci_matrix.py for custom matrices
- Update stainless-builds.yml to call integration tests with preview URL

This allows us to test a client on the PR introducing the new changes,
before merging. Contributors can even write new tests using the
generated client; tests that pass on the PR should also pass on main
after merge.

## Test Plan

See the triggered action using the workflows on this branch:
5810594042
which installs the Stainless SDK from the given URL.

---------

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-12-16 09:20:40 -08:00

Integration Testing Guide

Integration tests verify complete workflows across different providers using Llama Stack's record-replay system.

Quick Start

# Run all integration tests with existing recordings
uv run --group test \
  pytest -sv tests/integration/ --stack-config=starter

Configuration Options

You can see all options with:

cd tests/integration

# this will show a long list of options, look for "Custom options:"
pytest --help

Here are the most important options:

  • --stack-config: specify the stack config to use. There are several ways to point to a stack:
    • server:<config> - automatically start a server with the given config (e.g., server:starter). This provides one-step testing by auto-starting the server if the port is available, or reusing an existing server if already running.
    • server:<config>:<port> - same as above but with a custom port (e.g., server:starter:8322)
    • a URL which points to a Llama Stack distribution server
    • a distribution name (e.g., starter) or a path to a config.yaml file
    • a comma-separated list of api=provider pairs, e.g. inference=ollama,safety=llama-guard,agents=meta-reference. This is most useful for testing a single API surface.
  • --env: set environment variables, e.g. --env KEY=value. This is a utility option for setting environment variables required by various providers.

Model parameters can be influenced by the following options:

  • --text-model: comma-separated list of text models.
  • --vision-model: comma-separated list of vision models.
  • --embedding-model: comma-separated list of embedding models.
  • --safety-shield: comma-separated list of safety shields.
  • --judge-model: comma-separated list of judge models.
  • --embedding-dimension: output dimensionality of the embedding model to use for testing. Default: 768

Each of these is a comma-separated list and can be used to generate multiple parameter combinations. Note that tests are skipped if no relevant model is specified.
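As a sketch of how these comma-separated options expand (the helper name here is illustrative, not the actual conftest implementation), each list is split on commas and the cross-product becomes the parameter grid the tests run over:

```python
from itertools import product


def expand_model_options(text_models: str, embedding_models: str):
    """Illustrative sketch: split comma-separated CLI values and build
    the cross-product of (text_model, embedding_model) combinations
    that tests would be parametrized over."""
    texts = [m.strip() for m in text_models.split(",") if m.strip()]
    embeds = [m.strip() for m in embedding_models.split(",") if m.strip()]
    # If either list is empty, there is nothing to parametrize and
    # the corresponding tests would be skipped.
    return list(product(texts, embeds))


combos = expand_model_options(
    "ollama/llama3.2:3b-instruct-fp16,gpt-4o",
    "nomic-embed-text-v1.5",
)
# Two text models x one embedding model -> two combinations.
```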

Suites and Setups

  • --suite: single named suite that narrows which tests are collected.
  • Available suites:
    • base: collects most tests (excludes responses and post_training)
    • responses: collects tests under tests/integration/responses (needs strong tool-calling models)
    • vision: collects only tests/integration/inference/test_vision_inference.py
  • --setup: global configuration that can be used with any suite. Setups prefill model/env defaults; explicit CLI flags always win.
    • Available setups:
      • ollama: Local Ollama provider with lightweight models (sets OLLAMA_URL, uses llama3.2:3b-instruct-fp16)
      • vllm: VLLM provider for efficient local inference (sets VLLM_URL, uses Llama-3.2-1B-Instruct)
      • gpt: OpenAI GPT models for high-quality responses (uses gpt-4o)
      • claude: Anthropic Claude models for high-quality responses (uses claude-3-5-sonnet)
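The precedence rule above ("setups prefill model/env defaults; explicit CLI flags always win") can be sketched as a simple dict merge. The setup contents below are illustrative, not the actual setup definitions:

```python
def resolve_options(setup_defaults: dict, cli_flags: dict) -> dict:
    """Illustrative precedence sketch: start from the setup's defaults,
    then overlay any explicitly passed CLI flags."""
    resolved = dict(setup_defaults)
    # Only explicitly provided flags (non-None) override the setup.
    resolved.update({k: v for k, v in cli_flags.items() if v is not None})
    return resolved


gpt_setup = {"text_model": "gpt-4o", "embedding_model": None}  # illustrative
resolved = resolve_options(
    gpt_setup,
    {"embedding_model": "text-embedding-3-small"},  # --embedding-model flag
)
# The setup's text model survives; the embedding model is overridden.
```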

Examples

# Fast responses run with a strong tool-calling model
pytest -s -v tests/integration --stack-config=server:starter --suite=responses --setup=gpt

# Fast single-file vision run with Ollama defaults
pytest -s -v tests/integration --stack-config=server:starter --suite=vision --setup=ollama

# Base suite with VLLM for performance
pytest -s -v tests/integration --stack-config=server:starter --suite=base --setup=vllm

# Override a default from setup
pytest -s -v tests/integration --stack-config=server:starter \
  --suite=responses --setup=gpt --embedding-model=text-embedding-3-small

More Examples

Testing against a Server

Run all text inference tests by auto-starting a server with the starter config:

OLLAMA_URL=http://localhost:11434 \
  pytest -s -v tests/integration/inference/test_text_inference.py \
   --stack-config=server:starter \
   --text-model=ollama/llama3.2:3b-instruct-fp16 \
   --embedding-model=nomic-embed-text-v1.5

Run tests with auto-server startup on a custom port:

OLLAMA_URL=http://localhost:11434 \
  pytest -s -v tests/integration/inference/ \
   --stack-config=server:starter:8322 \
   --text-model=ollama/llama3.2:3b-instruct-fp16 \
   --embedding-model=nomic-embed-text-v1.5

Testing with Library Client

The library client constructs the Stack "in-process" instead of using a server. This is useful during the iterative development process since you don't need to constantly start and stop servers.

You can do this by simply using --stack-config=starter instead of --stack-config=server:starter.

Using ad-hoc distributions

Sometimes, you may want to make up a distribution on the fly. This is useful for testing a single provider or a single API or a small combination of providers. You can do so by specifying a comma-separated list of api=provider pairs to the --stack-config option, e.g. inference=remote::ollama,safety=inline::llama-guard,agents=inline::meta-reference.

pytest -s -v tests/integration/inference/ \
   --stack-config=inference=remote::ollama,safety=inline::llama-guard,agents=inline::meta-reference \
   --text-model=$TEXT_MODELS \
   --vision-model=$VISION_MODELS \
   --embedding-model=$EMBEDDING_MODELS

Another example: Running Vector IO tests for embedding models:

pytest -s -v tests/integration/vector_io/ \
   --stack-config=inference=inline::sentence-transformers,vector_io=inline::sqlite-vec \
   --embedding-model=nomic-embed-text-v1.5

Recording Modes

The testing system supports four modes, selected with the --inference-mode option:

REPLAY Mode (Default)

Uses cached responses instead of making API calls:

pytest tests/integration/

RECORD-IF-MISSING Mode

Records only when no recording exists, otherwise replays. This is the preferred mode for iterative development:

pytest tests/integration/inference/test_new_feature.py --inference-mode=record-if-missing
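Conceptually, record-if-missing behaves like this (a sketch, not the actual implementation):

```python
def handle_request(request_key: str, recordings: dict, make_live_call):
    """Sketch of record-if-missing: replay a stored response when one
    exists for this request, otherwise make the live call and store it."""
    if request_key in recordings:
        return recordings[request_key]  # replay the cached response
    response = make_live_call()         # no recording yet: go live
    recordings[request_key] = response  # and persist it for next time
    return response


store = {}
first = handle_request("chat:hello", store, lambda: "live-response")
# The second call never hits the live backend; it replays the recording.
second = handle_request("chat:hello", store, lambda: "should-not-run")
```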

RECORD Mode

Force-records all API interactions, overwriting existing recordings. Use with caution as this will re-record everything:

pytest tests/integration/inference/test_new_feature.py --inference-mode=record

LIVE Mode

Tests make real API calls (not recorded):

pytest tests/integration/ --inference-mode=live

By default, the recording directory is tests/integration/recordings. You can override this by setting the LLAMA_STACK_TEST_RECORDING_DIR environment variable.
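Resolving the recording directory can be sketched as below; the default path and the environment variable name match the ones documented above, while the helper name is illustrative:

```python
import os


def recording_dir(default: str = "tests/integration/recordings") -> str:
    """Illustrative: the env var, when set, overrides the default
    recording directory."""
    return os.environ.get("LLAMA_STACK_TEST_RECORDING_DIR", default)


os.environ.pop("LLAMA_STACK_TEST_RECORDING_DIR", None)
assert recording_dir() == "tests/integration/recordings"

os.environ["LLAMA_STACK_TEST_RECORDING_DIR"] = "/tmp/my-recordings"
assert recording_dir() == "/tmp/my-recordings"
```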

Managing Recordings

Viewing Recordings

# See what's recorded
sqlite3 recordings/index.sqlite "SELECT endpoint, model, timestamp FROM recordings;"

# Inspect specific response
cat recordings/responses/abc123.json | jq '.'
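The same index can also be inspected programmatically with Python's built-in sqlite3 module. This sketch builds a throwaway in-memory stand-in using the columns referenced in the query above (the real index may have additional columns):

```python
import sqlite3

# In-memory stand-in for recordings/index.sqlite, with the columns
# used in the SELECT above; the real schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recordings (endpoint TEXT, model TEXT, timestamp TEXT)"
)
conn.execute(
    "INSERT INTO recordings VALUES (?, ?, ?)",
    ("/v1/chat/completions", "llama3.2:3b-instruct-fp16", "2025-12-16T09:20:40"),
)

rows = conn.execute(
    "SELECT endpoint, model, timestamp FROM recordings"
).fetchall()
```

To inspect the real index, swap ":memory:" for the path to recordings/index.sqlite.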

Re-recording Tests

Use the automated workflow script for easier re-recording:

./scripts/github/schedule-record-workflow.sh --subdirs "inference,agents"

See the main testing guide for full details.

Local Re-recording

# Re-record specific tests
pytest -s -v --stack-config=server:starter tests/integration/inference/test_modified.py --inference-mode=record

Note that when re-recording tests, you must use a Stack pointing to a server (i.e., server:starter). This subtlety exists because the set of tests run against a server is a superset of those run with the library client.

Writing Tests

Basic Test Pattern

def test_basic_chat_completion(llama_stack_client, text_model_id):
    response = llama_stack_client.chat.completions.create(
        model=text_model_id,
        messages=[{"role": "user", "content": "Hello"}],
    )

    # Test structure, not AI output quality
    assert response.choices[0].message is not None
    assert isinstance(response.choices[0].message.content, str)
    assert len(response.choices[0].message.content) > 0

Provider-Specific Tests

def test_asymmetric_embeddings(llama_stack_client, embedding_model_id):
    if embedding_model_id not in MODELS_SUPPORTING_TASK_TYPE:
        pytest.skip(f"Model {embedding_model_id} doesn't support task types")

    query_response = llama_stack_client.inference.embeddings(
        model_id=embedding_model_id,
        contents=["What is machine learning?"],
        task_type="query",
    )

    assert query_response.embeddings is not None

TypeScript Client Replays

TypeScript SDK tests can run alongside Python tests when testing against server:<config> stacks. Set TS_CLIENT_PATH to either a published package version or a local checkout path of llama-stack-client-typescript to enable them:

# Use published npm package (responses suite)
TS_CLIENT_PATH=^0.3.2 scripts/integration-tests.sh --stack-config server:ci-tests --suite responses --setup gpt

# Use local checkout from ~/.cache (recommended for development)
git clone https://github.com/llamastack/llama-stack-client-typescript.git ~/.cache/llama-stack-client-typescript
TS_CLIENT_PATH=~/.cache/llama-stack-client-typescript scripts/integration-tests.sh --stack-config server:ci-tests --suite responses --setup gpt

# Run base suite with TypeScript tests
TS_CLIENT_PATH=~/.cache/llama-stack-client-typescript scripts/integration-tests.sh --stack-config server:ci-tests --suite base --setup ollama

TypeScript tests run immediately after Python tests pass, using the same replay fixtures. The mapping between Python suites/setups and TypeScript test files is defined in tests/integration/client-typescript/suites.json.

If TS_CLIENT_PATH is unset, TypeScript tests are skipped entirely.