mirror of https://github.com/meta-llama/llama-stack.git synced 2025-12-03 09:53:45 +00:00


Ashwin Bharambe 08b4a1deb3 feat(tests): introduce inference record/replay to increase test reliability (#2941)	2025-07-29 12:41:31 -07:00

Implements a comprehensive recording and replay system for inference API calls that eliminates dependency on online inference providers during testing. The system treats inference as deterministic by recording real API responses and replaying them in subsequent test runs.

Applies to OpenAI clients (which should cover many inference requests) as well as Ollama AsyncClient.

For storing, we use a hybrid system: Sqlite for fast lookups and JSON files for easy greppability / debuggability.

As expected, tests become much faster (more than 3x in inference testing alone).

```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record LLAMA_STACK_TEST_RECORDING_DIR=<...> \
uv run pytest -s -v tests/integration/inference \
  --stack-config=starter \
  -k "not( builtin_tool or safety_with_image or code_interpreter or test_rag )" \
  --text-model="ollama/llama3.2:3b-instruct-fp16" \
  --embedding-model=sentence-transformers/all-MiniLM-L6-v2
```

```bash
LLAMA_STACK_TEST_INFERENCE_MODE=replay LLAMA_STACK_TEST_RECORDING_DIR=<...> \
uv run pytest -s -v tests/integration/inference \
  --stack-config=starter \
  -k "not( builtin_tool or safety_with_image or code_interpreter or test_rag )" \
  --text-model="ollama/llama3.2:3b-instruct-fp16" \
  --embedding-model=sentence-transformers/all-MiniLM-L6-v2
```

- `LLAMA_STACK_TEST_INFERENCE_MODE`: `live` (default), `record`, or `replay`
- `LLAMA_STACK_TEST_RECORDING_DIR`: Storage location (must be specified for record or replay modes)
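
The rules stated for the two environment variables can be checked before a run with a small guard. This is only a sketch: `check_inference_mode` is a hypothetical helper, not something shipped in the repository, and it encodes nothing beyond the documented rules (`live` as the default, a recording directory required for `record` and `replay`).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of llama-stack).
# Prints the effective mode, or returns non-zero with a message on stderr.
check_inference_mode() {
  # An unset mode falls back to the documented default, "live".
  local mode="${LLAMA_STACK_TEST_INFERENCE_MODE:-live}"
  local dir="${LLAMA_STACK_TEST_RECORDING_DIR:-}"

  case "$mode" in
    live)
      ;;
    record|replay)
      # Both non-live modes need a storage location.
      if [ -z "$dir" ]; then
        echo "error: LLAMA_STACK_TEST_RECORDING_DIR must be set for mode '$mode'" >&2
        return 1
      fi
      ;;
    *)
      echo "error: unknown mode '$mode' (expected live, record, or replay)" >&2
      return 1
      ;;
  esac

  echo "$mode"
}

# Usage (illustrative): check_inference_mode >/dev/null && uv run pytest ...
```

Failing early this way avoids starting a long test session only to find that the recording directory was never set.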
agents	fix: Safety in starter (#2731)	2025-07-14 15:07:40 -07:00
datasets	fix: test_datasets HF scenario in CI (#2090)	2025-05-06 14:09:15 +02:00
eval	fix: fix jobs api literal return type (#1757)	2025-03-21 14:04:21 -07:00
files	feat: enable auth for LocalFS Files Provider (#2773)	2025-07-18 19:11:01 -07:00
fixtures	feat: enable ls client for files tests (#2769)	2025-07-18 12:10:30 -07:00
inference	feat(tests): introduce inference record/replay to increase test reliability (#2941)	2025-07-29 12:41:31 -07:00
inspect	chore: default to pytest asyncio-mode=auto (#2730)	2025-07-11 13:00:24 -07:00
post_training	fix: error on failed job, do not wait for timeout (#2945)	2025-07-29 11:07:51 -07:00
providers	chore: default to pytest asyncio-mode=auto (#2730)	2025-07-11 13:00:24 -07:00
safety	feat: add llama guard 4 model (#2579)	2025-07-03 22:29:04 -07:00
scoring	feat(api): (1/n) datasets api clean up (#1573)	2025-03-17 16:55:45 -07:00
telemetry	chore(test): fix flaky telemetry tests (#2815)	2025-07-22 12:30:14 -07:00
test_cases	feat: Add `suffix` to openai_completions (#2449)	2025-06-13 16:06:06 -07:00
tool_runtime	fix: allow running vector tests with embedding dimension (#2467)	2025-06-19 13:29:04 +05:30
tools	fix: toolgroups unregister (#1704)	2025-03-19 13:43:51 -07:00
vector_io	feat: implement chunk deletion for vector stores (#2701)	2025-07-25 10:30:30 -04:00
__init__.py	fix: remove ruff N999 (#1388)	2025-03-07 11:14:04 -08:00
conftest.py	feat: consolidate most distros into "starter" (#2516)	2025-07-04 15:58:03 +02:00
README.md	feat: consolidate most distros into "starter" (#2516)	2025-07-04 15:58:03 +02:00

README.md

Llama Stack Integration Tests

We use pytest for parameterizing and running tests. You can see all options with:

cd tests/integration

# This prints a long list of options; look for "Custom options:"
pytest --help

Here are the most important options:

--stack-config: specify the stack config to use. You have five ways to point to a stack:
- server:<config> - automatically start a server with the given config (e.g., server:fireworks). This provides one-step testing by auto-starting the server if the port is available, or reusing an existing server if already running.
- server:<config>:<port> - same as above but with a custom port (e.g., server:together:8322)
- a URL which points to a Llama Stack distribution server
- a template (e.g., starter) or a path to a run.yaml file
- a comma-separated list of api=provider pairs, e.g. inference=fireworks,safety=llama-guard,agents=meta-reference. This is most useful for testing a single API surface.
--env: set environment variables, e.g. --env KEY=value. This is a utility option for setting environment variables required by various providers.
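
To make the accepted forms of --stack-config concrete, here is a rough sketch that sorts a value into one of the forms listed above. `classify_stack_config` and its labels are hypothetical names used only for illustration; the matching rules (for example, treating anything ending in .yaml as a run config path) are assumptions and are not how the test harness itself parses the option.

```shell
#!/usr/bin/env bash
# Illustrative only: classify a --stack-config value by its shape.
# Pattern order matters: the most specific forms are tested first.
classify_stack_config() {
  local value="$1"
  case "$value" in
    server:*:*)          echo "server-with-port" ;;  # e.g. server:together:8322
    server:*)            echo "server" ;;            # e.g. server:fireworks
    http://*|https://*)  echo "url" ;;               # a running distribution server
    *=*)                 echo "provider-pairs" ;;    # e.g. inference=fireworks,safety=llama-guard
    *.yaml|*.yml)        echo "run-config-path" ;;   # assumed: path to a run.yaml file
    *)                   echo "template" ;;          # e.g. starter
  esac
}

for example in server:fireworks server:together:8322 starter \
               "inference=fireworks,safety=llama-guard,agents=meta-reference"; do
  echo "$example -> $(classify_stack_config "$example")"
done
```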

Model parameters can be influenced by the following options:

--text-model: comma-separated list of text models.
--vision-model: comma-separated list of vision models.
--embedding-model: comma-separated list of embedding models.
--safety-shield: comma-separated list of safety shields.
--judge-model: comma-separated list of judge models.
--embedding-dimension: output dimensionality of the embedding model to use for testing. Default: 384

Each of these options except --embedding-dimension takes a comma-separated list, which can be used to generate multiple parameter combinations. Note that tests will be skipped if no model is specified.
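
As a rough picture of what "multiple parameter combinations" can mean, the sketch below splits two comma-separated lists and prints every pairing. `list_combinations` is a hypothetical helper, and the assumption that combinations form a full cross product is ours; how pytest actually combines the values is decided by the test suite's own parametrization.

```shell
#!/usr/bin/env bash
# Illustrative only: enumerate pairings of two comma-separated model lists.
list_combinations() {
  local -a text_models embedding_models
  local t e
  # Split each argument on commas into an array.
  IFS=',' read -ra text_models <<< "$1"
  IFS=',' read -ra embedding_models <<< "$2"

  for t in "${text_models[@]}"; do
    for e in "${embedding_models[@]}"; do
      echo "text=$t embedding=$e"
    done
  done
}

TEXT_MODELS="meta-llama/Llama-3.1-8B-Instruct,meta-llama/Llama-3.1-70B-Instruct"
EMBEDDING_MODELS="all-MiniLM-L6-v2"
list_combinations "$TEXT_MODELS" "$EMBEDDING_MODELS"
```

With two text models and one embedding model, this prints two lines, one per pairing.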

Examples

Testing against a Server

Run all text inference tests by auto-starting a server with the fireworks config:

pytest -s -v tests/integration/inference/test_text_inference.py \
   --stack-config=server:fireworks \
   --text-model=meta-llama/Llama-3.1-8B-Instruct

Run tests with auto-server startup on a custom port:

pytest -s -v tests/integration/inference/ \
   --stack-config=server:together:8322 \
   --text-model=meta-llama/Llama-3.1-8B-Instruct

Run multiple test suites with auto-server (eliminates manual server management):

# Auto-start server and run the inference, safety, and agents test suites
export FIREWORKS_API_KEY=<your_key>

pytest -s -v tests/integration/inference/ tests/integration/safety/ tests/integration/agents/ \
   --stack-config=server:fireworks \
   --text-model=meta-llama/Llama-3.1-8B-Instruct
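
Because the hosted providers in these examples need an API key, it can help to fail fast when the key is missing. The guard below is a minimal sketch; `require_env` is a hypothetical helper name, not part of the test suite.

```shell
#!/usr/bin/env bash
# Illustrative only: return non-zero if the named environment variable
# is unset or empty.
require_env() {
  local name="$1"
  # ${!name} is bash indirect expansion: the value of the variable named by $name.
  if [ -z "${!name:-}" ]; then
    echo "error: $name is not set" >&2
    return 1
  fi
}

# Usage (illustrative):
#   require_env FIREWORKS_API_KEY && pytest -s -v tests/integration/inference/ ...
```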

Testing with Library Client

Run all text inference tests with the starter distribution using the together provider:

ENABLE_TOGETHER=together pytest -s -v tests/integration/inference/test_text_inference.py \
   --stack-config=starter \
   --text-model=meta-llama/Llama-3.1-8B-Instruct

Run all inference tests for several models using the together provider:

TEXT_MODELS=meta-llama/Llama-3.1-8B-Instruct,meta-llama/Llama-3.1-70B-Instruct
VISION_MODELS=meta-llama/Llama-3.2-11B-Vision-Instruct
EMBEDDING_MODELS=all-MiniLM-L6-v2
export ENABLE_TOGETHER=together
export TOGETHER_API_KEY=<together_api_key>

pytest -s -v tests/integration/inference/ \
   --stack-config=together \
   --text-model=$TEXT_MODELS \
   --vision-model=$VISION_MODELS \
   --embedding-model=$EMBEDDING_MODELS

The same run, but using an ad hoc stack with a single provider (fireworks for inference) instead of a distribution:

export FIREWORKS_API_KEY=<fireworks_api_key>

pytest -s -v tests/integration/inference/ \
   --stack-config=inference=fireworks \
   --text-model=$TEXT_MODELS \
   --vision-model=$VISION_MODELS \
   --embedding-model=$EMBEDDING_MODELS

Run Vector IO tests for one or more embedding models:

EMBEDDING_MODELS=all-MiniLM-L6-v2

pytest -s -v tests/integration/vector_io/ \
   --stack-config=inference=sentence-transformers,vector_io=sqlite-vec \
   --embedding-model=$EMBEDDING_MODELS
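
If the embedding model does not produce 384-dimensional vectors, the --embedding-dimension option described earlier can be passed alongside --embedding-model. The sketch below only prints the command it would run; `build_vector_io_cmd` is a hypothetical helper, and it falls back to the documented default of 384 when no dimension is given.

```shell
#!/usr/bin/env bash
# Illustrative only: print (do not execute) a Vector IO test command
# for one embedding model and an optional embedding dimension.
build_vector_io_cmd() {
  local model="$1"
  local dimension="${2:-384}"   # documented default for --embedding-dimension
  echo "pytest -s -v tests/integration/vector_io/" \
       "--stack-config=inference=sentence-transformers,vector_io=sqlite-vec" \
       "--embedding-model=$model" \
       "--embedding-dimension=$dimension"
}

build_vector_io_cmd all-MiniLM-L6-v2
```

Printing the command first makes it easy to review the flags before starting a real run.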