llama-stack-mirror/tests/integration
Ashwin Bharambe 08b4a1deb3
feat(tests): introduce inference record/replay to increase test reliability (#2941)
Implements a comprehensive recording and replay system for inference API
calls that eliminates dependency on online inference providers during
testing. The system treats inference as deterministic by recording real
API responses and replaying them in subsequent test runs. Applies to
OpenAI clients (which should cover many inference requests) as well as
Ollama AsyncClient.

For storage, we use a hybrid system: SQLite for fast lookups and JSON
files for easy greppability / debuggability.
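
A minimal sketch of how such a hybrid store could work, not the implementation from this commit: the class name, table schema, file layout, and hashing scheme below are all assumptions made for illustration. Requests are hashed into a key, SQLite maps the key to a file name, and the full request/response pair is kept as indented JSON so it can be grepped.

```python
import hashlib
import json
import sqlite3
from pathlib import Path


def request_key(endpoint: str, body: dict) -> str:
    """Hash the endpoint plus a canonical (key-sorted) JSON form of the body."""
    canonical = json.dumps({"endpoint": endpoint, "body": body}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RecordingStore:
    """Hypothetical store: a SQLite index pointing at JSON response files."""

    def __init__(self, root: Path) -> None:
        self.responses = root / "responses"
        self.responses.mkdir(parents=True, exist_ok=True)
        self.db = sqlite3.connect(root / "index.sqlite")
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS recordings (key TEXT PRIMARY KEY, file TEXT)"
        )

    def record(self, endpoint: str, body: dict, response: dict) -> None:
        """Write the exchange to a JSON file and index it by request hash."""
        key = request_key(endpoint, body)
        file_name = f"{key[:12]}.json"
        payload = {"request": {"endpoint": endpoint, "body": body}, "response": response}
        (self.responses / file_name).write_text(json.dumps(payload, indent=2))
        self.db.execute(
            "INSERT OR REPLACE INTO recordings (key, file) VALUES (?, ?)",
            (key, file_name),
        )
        self.db.commit()

    def replay(self, endpoint: str, body: dict) -> dict:
        """Return the recorded response, or fail loudly if none exists."""
        key = request_key(endpoint, body)
        row = self.db.execute(
            "SELECT file FROM recordings WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(f"no recording for {endpoint}; re-run in record mode")
        return json.loads((self.responses / row[0]).read_text())["response"]
```

Because the key is computed from a key-sorted serialization, two requests that differ only in dictionary ordering map to the same recording. A miss raises instead of silently calling a live provider, which is one way to keep replay runs fully offline.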

As expected, tests become much faster (more than 3x for the inference
tests alone).

```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record LLAMA_STACK_TEST_RECORDING_DIR=<...> \
  uv run pytest -s -v tests/integration/inference \
  --stack-config=starter \
  -k "not( builtin_tool or safety_with_image or code_interpreter or test_rag )" \
  --text-model="ollama/llama3.2:3b-instruct-fp16" \
  --embedding-model=sentence-transformers/all-MiniLM-L6-v2
```

```bash
LLAMA_STACK_TEST_INFERENCE_MODE=replay LLAMA_STACK_TEST_RECORDING_DIR=<...> \
  uv run pytest -s -v tests/integration/inference \
  --stack-config=starter \
  -k "not( builtin_tool or safety_with_image or code_interpreter or test_rag )" \
  --text-model="ollama/llama3.2:3b-instruct-fp16" \
  --embedding-model=sentence-transformers/all-MiniLM-L6-v2
```

- `LLAMA_STACK_TEST_INFERENCE_MODE`: `live` (default), `record`, or
`replay`
- `LLAMA_STACK_TEST_RECORDING_DIR`: Storage location (must be specified
for record or replay modes)
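
As an illustration of the rules in the list above, the following sketch resolves the two environment variables and rejects a `record` or `replay` run that has no recording directory. The function name and the error messages are hypothetical; the actual test harness may validate these settings differently.

```python
import os
from typing import Mapping, Optional, Tuple

VALID_MODES = ("live", "record", "replay")


def resolve_inference_mode(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Return (mode, recording_dir) from the environment (hypothetical helper)."""
    env = os.environ if env is None else env

    # `live` is the documented default when the variable is unset.
    mode = env.get("LLAMA_STACK_TEST_INFERENCE_MODE", "live")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown inference mode {mode!r}; expected one of {VALID_MODES}")

    # The recording directory is only mandatory for record and replay.
    recording_dir = env.get("LLAMA_STACK_TEST_RECORDING_DIR")
    if mode != "live" and not recording_dir:
        raise ValueError(f"LLAMA_STACK_TEST_RECORDING_DIR must be set in {mode} mode")

    return mode, recording_dir
```
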
2025-07-29 12:41:31 -07:00
agents fix: Safety in starter (#2731) 2025-07-14 15:07:40 -07:00
datasets fix: test_datasets HF scenario in CI (#2090) 2025-05-06 14:09:15 +02:00
eval fix: fix jobs api literal return type (#1757) 2025-03-21 14:04:21 -07:00
files feat: enable auth for LocalFS Files Provider (#2773) 2025-07-18 19:11:01 -07:00
fixtures feat: enable ls client for files tests (#2769) 2025-07-18 12:10:30 -07:00
inference feat(tests): introduce inference record/replay to increase test reliability (#2941) 2025-07-29 12:41:31 -07:00
inspect chore: default to pytest asyncio-mode=auto (#2730) 2025-07-11 13:00:24 -07:00
post_training fix: error on failed job, do not wait for timeout (#2945) 2025-07-29 11:07:51 -07:00
providers chore: default to pytest asyncio-mode=auto (#2730) 2025-07-11 13:00:24 -07:00
safety feat: add llama guard 4 model (#2579) 2025-07-03 22:29:04 -07:00
scoring feat(api): (1/n) datasets api clean up (#1573) 2025-03-17 16:55:45 -07:00
telemetry chore(test): fix flaky telemetry tests (#2815) 2025-07-22 12:30:14 -07:00
test_cases feat: Add suffix to openai_completions (#2449) 2025-06-13 16:06:06 -07:00
tool_runtime fix: allow running vector tests with embedding dimension (#2467) 2025-06-19 13:29:04 +05:30
tools fix: toolgroups unregister (#1704) 2025-03-19 13:43:51 -07:00
vector_io feat: implement chunk deletion for vector stores (#2701) 2025-07-25 10:30:30 -04:00
__init__.py fix: remove ruff N999 (#1388) 2025-03-07 11:14:04 -08:00
conftest.py feat: consolidate most distros into "starter" (#2516) 2025-07-04 15:58:03 +02:00
README.md feat: consolidate most distros into "starter" (#2516) 2025-07-04 15:58:03 +02:00

Llama Stack Integration Tests

We use pytest for parameterizing and running tests. You can see all options with:

cd tests/integration

# this will show a long list of options, look for "Custom options:"
pytest --help

Here are the most important options:

  • --stack-config: specify the stack config to use. There are five ways to point to a stack:
    • server:<config> - automatically start a server with the given config (e.g., server:fireworks). This provides one-step testing by auto-starting the server if the port is available, or reusing an existing server if already running.
    • server:<config>:<port> - same as above but with a custom port (e.g., server:together:8322)
    • a URL which points to a Llama Stack distribution server
    • a template (e.g., starter) or a path to a run.yaml file
    • a comma-separated list of api=provider pairs, e.g. inference=fireworks,safety=llama-guard,agents=meta-reference. This is most useful for testing a single API surface.
  • --env: set environment variables, e.g. --env KEY=value. This is a utility option for setting environment variables required by various providers.
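
To make the accepted forms of --stack-config concrete, here is a small sketch that classifies a value into one of them. It is not the parser used by the test suite: the function name, the returned dictionary shape, and the order of the checks are assumptions chosen for illustration.

```python
def classify_stack_config(value: str) -> dict:
    """Classify a --stack-config value into one of the documented forms."""
    # server:<config> or server:<config>:<port>
    if value.startswith("server:"):
        parts = value.split(":")
        info = {"kind": "server", "config": parts[1]}
        if len(parts) > 2:
            info["port"] = int(parts[2])
        return info

    # URL of an already running Llama Stack distribution server
    if value.startswith(("http://", "https://")):
        return {"kind": "url", "url": value}

    # comma-separated api=provider pairs (an ad hoc stack)
    if "=" in value:
        providers = dict(pair.split("=", 1) for pair in value.split(","))
        return {"kind": "adhoc", "providers": providers}

    # path to a run.yaml file
    if value.endswith((".yaml", ".yml")):
        return {"kind": "run_yaml", "path": value}

    # anything else is treated as a template name such as "starter"
    return {"kind": "template", "name": value}
```

For example, `classify_stack_config("server:together:8322")` yields a `server` entry with config `together` and port `8322`, while `inference=fireworks,safety=llama-guard` yields an `adhoc` entry with one provider per API.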

The following options select the models and related parameters used by the tests:

  • --text-model: comma-separated list of text models.
  • --vision-model: comma-separated list of vision models.
  • --embedding-model: comma-separated list of embedding models.
  • --safety-shield: comma-separated list of safety shields.
  • --judge-model: comma-separated list of judge models.
  • --embedding-dimension: output dimensionality of the embedding model to use for testing. Default: 384

Each of these options except --embedding-dimension takes a comma-separated list, and the lists can be used to generate multiple parameter combinations. Note that tests will be skipped if no model is specified.
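
The sketch below shows one way comma-separated option values could expand into parameter combinations. It assumes a plain Cartesian product over the options that were supplied, which may not match how the suite actually parametrizes individual tests; the helper names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Optional


def split_models(raw: Optional[str]) -> List[str]:
    """Split a comma-separated option value, ignoring empty entries."""
    if not raw:
        return []
    return [item.strip() for item in raw.split(",") if item.strip()]


def model_combinations(options: Dict[str, Optional[str]]) -> List[Dict[str, str]]:
    """Build every combination of the options that have at least one value."""
    names: List[str] = []
    values: List[List[str]] = []
    for name, raw in options.items():
        models = split_models(raw)
        if models:
            names.append(name)
            values.append(models)
    if not names:
        return []  # nothing specified, so there is nothing to run
    return [dict(zip(names, combo)) for combo in product(*values)]


combos = model_combinations(
    {
        "text_model": "meta-llama/Llama-3.1-8B-Instruct,meta-llama/Llama-3.1-70B-Instruct",
        "vision_model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
        "embedding_model": None,
    }
)
for combo in combos:
    print(combo)
```

With two text models and one vision model, this produces two combinations; the unspecified embedding model contributes nothing.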

Examples

Testing against a Server

Run all text inference tests by auto-starting a server with the fireworks config:

pytest -s -v tests/integration/inference/test_text_inference.py \
   --stack-config=server:fireworks \
   --text-model=meta-llama/Llama-3.1-8B-Instruct

Run tests with auto-server startup on a custom port:

pytest -s -v tests/integration/inference/ \
   --stack-config=server:together:8322 \
   --text-model=meta-llama/Llama-3.1-8B-Instruct

Run multiple test suites with auto-server (eliminates manual server management):

# Auto-start the server and run the inference, safety, and agents suites
export FIREWORKS_API_KEY=<your_key>

pytest -s -v tests/integration/inference/ tests/integration/safety/ tests/integration/agents/ \
   --stack-config=server:fireworks \
   --text-model=meta-llama/Llama-3.1-8B-Instruct

Testing with Library Client

Run all text inference tests with the starter distribution using the together provider:

ENABLE_TOGETHER=together pytest -s -v tests/integration/inference/test_text_inference.py \
   --stack-config=starter \
   --text-model=meta-llama/Llama-3.1-8B-Instruct

Running all inference tests for a number of models using the together provider:

TEXT_MODELS=meta-llama/Llama-3.1-8B-Instruct,meta-llama/Llama-3.1-70B-Instruct
VISION_MODELS=meta-llama/Llama-3.2-11B-Vision-Instruct
EMBEDDING_MODELS=all-MiniLM-L6-v2
export ENABLE_TOGETHER=together
export TOGETHER_API_KEY=<together_api_key>

pytest -s -v tests/integration/inference/ \
   --stack-config=together \
   --text-model=$TEXT_MODELS \
   --vision-model=$VISION_MODELS \
   --embedding-model=$EMBEDDING_MODELS

The same run, but using an ad hoc stack with just one provider (fireworks for inference) instead of a distribution:

export FIREWORKS_API_KEY=<fireworks_api_key>

pytest -s -v tests/integration/inference/ \
   --stack-config=inference=fireworks \
   --text-model=$TEXT_MODELS \
   --vision-model=$VISION_MODELS \
   --embedding-model=$EMBEDDING_MODELS

Running Vector IO tests for a number of embedding models:

EMBEDDING_MODELS=all-MiniLM-L6-v2

pytest -s -v tests/integration/vector_io/ \
   --stack-config=inference=sentence-transformers,vector_io=sqlite-vec \
   --embedding-model=$EMBEDDING_MODELS