The tests in `tests/integration/inference/test_openai_completion.py` fail in a few scenarios, for example:
```
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n
FAILED tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_02] - AssertionError: assert 1 == 2
 +  where 1 = len({0: 'thethenamenameofofthetheususcapitalcapitalisiswashingtonwashington,,dd.c.c..'})

test_openai_completion_logprobs
E   openai.BadRequestError: Error code: 400 - {'error': {'detail': {'errors': [{'loc': ['body', 'logprobs'], 'msg': 'Input should be a valid boolean, unable to interpret input', 'type': 'bool_parsing'}]}}}

test_openai_completion_stop_sequence
E   openai.BadRequestError: Error code: 400 - {'detail': 'litellm.BadRequestError: OpenAIException - {"errors":[{"code":"json_type_error","message":"Json field type error: CommonTextChatParameters.stop must be an array, and the element must be of type string","more_info":"https://cloud.ibm.com/apidocs/watsonx-ai#text-chat"}],"trace":"f758b3bbd4f357aa9b16f3dc5ee1170e","status_code":400}'}
```
So this adds the appropriate skips for these cases, while still providing some OpenAI coverage through litellm.
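A minimal sketch of the skip pattern, assuming a `provider_from_model`-style helper; the names below are illustrative and not the exact code in this change:

```python
import pytest


def skip_if_provider_doesnt_support_n(client_with_models, model_id):
    # Illustrative only: look up the provider serving this model and skip
    # when it is known not to support the `n` parameter.
    provider = provider_from_model(client_with_models, model_id)  # assumed helper
    if provider.provider_type == "remote::watsonx":
        pytest.skip(
            f"Model {model_id} hosted by {provider.provider_type} doesn't support n param."
        )
```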
Now tests pass:
```
INFO 2025-10-14 14:20:17,115 tests.integration.conftest:50 tests: Test stack config type: library_client
(stack_config=None)
======================================================== test session starts =========================================================
platform darwin -- Python 3.12.8, pytest-8.4.2, pluggy-1.6.0 -- /Users/leseb/Documents/AI/llama-stack/.venv/bin/python3
cachedir: .pytest_cache
metadata: {'Python': '3.12.8', 'Platform': 'macOS-26.0.1-arm64-arm-64bit', 'Packages': {'pytest': '8.4.2', 'pluggy': '1.6.0'}, 'Plugins': {'anyio': '4.9.0', 'html': '4.1.1', 'socket': '0.7.0', 'asyncio': '1.1.0', 'json-report': '1.5.0', 'timeout': '2.4.0', 'metadata': '3.1.1', 'cov': '6.2.1', 'nbval': '0.11.0'}}
rootdir: /Users/leseb/Documents/AI/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0, html-4.1.1, socket-0.7.0, asyncio-1.1.0, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, cov-6.2.1, nbval-0.11.0
asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 32 items
tests/integration/inference/test_openai_completion.py::test_openai_completion_non_streaming[txt=meta-llama/llama-3-3-70b-instruct-inference:completion:sanity] PASSED [ 3%]
tests/integration/inference/test_openai_completion.py::test_openai_completion_non_streaming_suffix[txt=meta-llama/llama-3-3-70b-instruct-inference:completion:suffix] SKIPPED [ 6%]
tests/integration/inference/test_openai_completion.py::test_openai_completion_streaming[txt=meta-llama/llama-3-3-70b-instruct-inference:completion:sanity] PASSED [ 9%]
tests/integration/inference/test_openai_completion.py::test_openai_completion_guided_choice[txt=meta-llama/llama-3-3-70b-instruct] SKIPPED [ 12%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[openai_client-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:non_streaming_01] PASSED [ 15%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[openai_client-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_01] PASSED [ 18%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[openai_client-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_01] SKIPPED [ 21%]
tests/integration/inference/test_openai_completion.py::test_inference_store[openai_client-txt=meta-llama/llama-3-3-70b-instruct-True] PASSED [ 25%]
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=meta-llama/llama-3-3-70b-instruct-True] PASSED [ 28%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming_with_file[txt=meta-llama/llama-3-3-70b-instruct] SKIPPED [ 31%]
tests/integration/inference/test_openai_completion.py::test_openai_completion_stop_sequence[txt=meta-llama/llama-3-3-70b-instruct-inference:completion:stop_sequence] SKIPPED [ 34%]
tests/integration/inference/test_openai_completion.py::test_openai_completion_logprobs[txt=meta-llama/llama-3-3-70b-instruct-inference:completion:log_probs] SKIPPED [ 37%]
tests/integration/inference/test_openai_completion.py::test_openai_completion_logprobs_streaming[txt=meta-llama/llama-3-3-70b-instruct-inference:completion:log_probs] SKIPPED [ 40%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_with_tools[txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:tool_calling] PASSED [ 43%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_with_tools_and_streaming[txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:tool_calling] PASSED [ 46%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_with_tool_choice_none[txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:tool_calling] PASSED [ 50%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_structured_output[txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:structured_output] PASSED [ 53%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[openai_client-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:non_streaming_02] PASSED [ 56%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[openai_client-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_02] PASSED [ 59%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[openai_client-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_02] SKIPPED [ 62%]
tests/integration/inference/test_openai_completion.py::test_inference_store[openai_client-txt=meta-llama/llama-3-3-70b-instruct-False] PASSED [ 65%]
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=meta-llama/llama-3-3-70b-instruct-False] PASSED [ 68%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:non_streaming_01] PASSED [ 71%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_01] PASSED [ 75%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_01] SKIPPED [ 78%]
tests/integration/inference/test_openai_completion.py::test_inference_store[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-True] PASSED [ 81%]
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-True] PASSED [ 84%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:non_streaming_02] PASSED [ 87%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_02] PASSED [ 90%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_02] SKIPPED [ 93%]
tests/integration/inference/test_openai_completion.py::test_inference_store[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-False] PASSED [ 96%]
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-False] PASSED [100%]
======================================================== slowest 10 durations ========================================================
5.97s call tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_with_tool_choice_none[txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:tool_calling]
3.39s call tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:non_streaming_02]
3.26s call tests/integration/inference/test_openai_completion.py::test_inference_store[openai_client-txt=meta-llama/llama-3-3-70b-instruct-True]
2.64s call tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_with_tools_and_streaming[txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:tool_calling]
1.78s call tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_structured_output[txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:structured_output]
1.73s call tests/integration/inference/test_openai_completion.py::test_openai_completion_streaming[txt=meta-llama/llama-3-3-70b-instruct-inference:completion:sanity]
1.58s call tests/integration/inference/test_openai_completion.py::test_inference_store[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-True]
1.51s call tests/integration/inference/test_openai_completion.py::test_openai_completion_non_streaming[txt=meta-llama/llama-3-3-70b-instruct-inference:completion:sanity]
1.41s call tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_02]
1.20s call tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[openai_client-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:non_streaming_02]
====================================================== short test summary info =======================================================
SKIPPED [1] tests/integration/inference/test_openai_completion.py:85: Suffix is not supported for the model: meta-llama/llama-3-3-70b-instruct.
SKIPPED [1] tests/integration/inference/test_openai_completion.py:135: Model meta-llama/llama-3-3-70b-instruct hosted by remote::watsonx doesn't support vllm extra_body parameters.
SKIPPED [4] tests/integration/inference/test_openai_completion.py:115: Model meta-llama/llama-3-3-70b-instruct hosted by remote::watsonx doesn't support n param.
SKIPPED [1] tests/integration/inference/test_openai_completion.py:141: Model meta-llama/llama-3-3-70b-instruct hosted by remote::watsonx doesn't support chat completion calls with base64 encoded files.
SKIPPED [1] tests/integration/inference/test_openai_completion.py:514: Model meta-llama/llama-3-3-70b-instruct hosted by remote::watsonx doesn't support /v1/completions stop sequence.
SKIPPED [2] tests/integration/inference/test_openai_completion.py:72: Model meta-llama/llama-3-3-70b-instruct hosted by remote::watsonx doesn't support /v1/completions logprobs.
============================================ 22 passed, 10 skipped, 2 warnings in 35.11s =============================================
```
Signed-off-by: Sébastien Han <seb@redhat.com>
# Integration Testing Guide
Integration tests verify complete workflows across different providers using Llama Stack's record-replay system.
## Quick Start

```bash
# Run all integration tests with existing recordings
uv run --group test \
  pytest -sv tests/integration/ --stack-config=starter
```
## Configuration Options

You can see all options with:

```bash
cd tests/integration

# this will show a long list of options, look for "Custom options:"
pytest --help
```
Here are the most important options:
- `--stack-config`: specify the stack config to use. There are several ways to point to a stack:
  - `server:<config>` - automatically start a server with the given config (e.g., `server:starter`). This provides one-step testing by auto-starting the server if the port is available, or reusing an existing server if already running.
  - `server:<config>:<port>` - same as above but with a custom port (e.g., `server:starter:8322`)
  - a URL which points to a Llama Stack distribution server
  - a distribution name (e.g., `starter`) or a path to a `run.yaml` file
  - a comma-separated list of api=provider pairs, e.g. `inference=ollama,safety=llama-guard,agents=meta-reference`. This is most useful for testing a single API surface.
- `--env`: set environment variables, e.g. `--env KEY=value`. This is a utility option to set environment variables required by various providers.
Model parameters can be influenced by the following options:
- `--text-model`: comma-separated list of text models.
- `--vision-model`: comma-separated list of vision models.
- `--embedding-model`: comma-separated list of embedding models.
- `--safety-shield`: comma-separated list of safety shields.
- `--judge-model`: comma-separated list of judge models.
- `--embedding-dimension`: output dimensionality of the embedding model to use for testing. Default: 384
Each of these is a comma-separated list and can be used to generate multiple parameter combinations. Note that tests will be skipped if no model is specified.
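For instance, a hypothetical invocation combining these options might look like the following (the model identifiers are placeholders, not defaults):

```bash
# Run inference tests with explicit model selections
pytest -sv tests/integration/inference \
  --stack-config=starter \
  --text-model=ollama/llama3.2:3b-instruct-fp16 \
  --embedding-model=sentence-transformers/all-MiniLM-L6-v2 \
  --embedding-dimension=384
```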
## Suites and Setups
- `--suite`: single named suite that narrows which tests are collected.
  - Available suites:
    - `base`: collects most tests (excludes responses and post_training)
    - `responses`: collects tests under `tests/integration/responses` (needs strong tool-calling models)
    - `vision`: collects only `tests/integration/inference/test_vision_inference.py`
- `--setup`: global configuration that can be used with any suite. Setups prefill model/env defaults; explicit CLI flags always win.
  - Available setups:
    - `ollama`: Local Ollama provider with lightweight models (sets OLLAMA_URL, uses llama3.2:3b-instruct-fp16)
    - `vllm`: VLLM provider for efficient local inference (sets VLLM_URL, uses Llama-3.2-1B-Instruct)
    - `gpt`: OpenAI GPT models for high-quality responses (uses gpt-4o)
    - `claude`: Anthropic Claude models for high-quality responses (uses claude-3-5-sonnet)
### Examples

```bash
# Fast responses run with a strong tool-calling model
pytest -s -v tests/integration --stack-config=server:starter --suite=responses --setup=gpt

# Fast single-file vision run with Ollama defaults
pytest -s -v tests/integration --stack-config=server:starter --suite=vision --setup=ollama

# Base suite with VLLM for performance
pytest -s -v tests/integration --stack-config=server:starter --suite=base --setup=vllm

# Override a default from setup
pytest -s -v tests/integration --stack-config=server:starter \
  --suite=responses --setup=gpt --embedding-model=text-embedding-3-small
```
## Examples

### Testing against a Server
Run all text inference tests by auto-starting a server with the starter config:
```bash
OLLAMA_URL=http://localhost:11434 \
  pytest -s -v tests/integration/inference/test_text_inference.py \
  --stack-config=server:starter \
  --text-model=ollama/llama3.2:3b-instruct-fp16 \
  --embedding-model=sentence-transformers/all-MiniLM-L6-v2
```
Run tests with auto-server startup on a custom port:
```bash
OLLAMA_URL=http://localhost:11434 \
  pytest -s -v tests/integration/inference/ \
  --stack-config=server:starter:8322 \
  --text-model=ollama/llama3.2:3b-instruct-fp16 \
  --embedding-model=sentence-transformers/all-MiniLM-L6-v2
```
### Testing with Library Client
The library client constructs the Stack "in-process" instead of using a server. This is useful during the iterative development process since you don't need to constantly start and stop servers.
You can do this by simply using `--stack-config=starter` instead of `--stack-config=server:starter`.
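For example, the server example above can run fully in-process just by dropping the `server:` prefix (the model flags here are illustrative):

```bash
OLLAMA_URL=http://localhost:11434 \
  pytest -s -v tests/integration/inference/test_text_inference.py \
  --stack-config=starter \
  --text-model=ollama/llama3.2:3b-instruct-fp16 \
  --embedding-model=sentence-transformers/all-MiniLM-L6-v2
```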
### Using ad-hoc distributions
Sometimes, you may want to make up a distribution on the fly. This is useful for testing a single provider, a single API, or a small combination of providers. You can do so by specifying a comma-separated list of api=provider pairs to the `--stack-config` option, e.g. `inference=remote::ollama,safety=inline::llama-guard,agents=inline::meta-reference`.
```bash
pytest -s -v tests/integration/inference/ \
  --stack-config=inference=remote::ollama,safety=inline::llama-guard,agents=inline::meta-reference \
  --text-model=$TEXT_MODELS \
  --vision-model=$VISION_MODELS \
  --embedding-model=$EMBEDDING_MODELS
```
Another example: Running Vector IO tests for embedding models:
```bash
pytest -s -v tests/integration/vector_io/ \
  --stack-config=inference=inline::sentence-transformers,vector_io=inline::sqlite-vec \
  --embedding-model=sentence-transformers/all-MiniLM-L6-v2
```
## Recording Modes
The testing system supports four modes, controlled by the `--inference-mode` option:
### REPLAY Mode (Default)

Uses cached responses instead of making API calls:

```bash
pytest tests/integration/
```
### RECORD-IF-MISSING Mode (Recommended for adding new tests)

Records only when no recording exists, otherwise replays. This is the preferred mode for iterative development:

```bash
pytest tests/integration/inference/test_new_feature.py --inference-mode=record-if-missing
```
### RECORD Mode

Force-records all API interactions, overwriting existing recordings. Use with caution as this will re-record everything:

```bash
pytest tests/integration/inference/test_new_feature.py --inference-mode=record
```
### LIVE Mode

Tests make real API calls (not recorded):

```bash
pytest tests/integration/ --inference-mode=live
```
By default, the recording directory is `tests/integration/recordings`. You can override this by setting the `LLAMA_STACK_TEST_RECORDING_DIR` environment variable.
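For example, to record into a scratch directory instead (the path below is just an example):

```bash
LLAMA_STACK_TEST_RECORDING_DIR=/tmp/my-recordings \
  pytest tests/integration/inference/test_new_feature.py --inference-mode=record-if-missing
```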
## Managing Recordings

### Viewing Recordings
```bash
# See what's recorded
sqlite3 recordings/index.sqlite "SELECT endpoint, model, timestamp FROM recordings;"

# Inspect specific response
cat recordings/responses/abc123.json | jq '.'
```
### Re-recording Tests

#### Remote Re-recording (Recommended)
Use the automated workflow script for easier re-recording:
```bash
./scripts/github/schedule-record-workflow.sh --subdirs "inference,agents"
```
See the main testing guide for full details.
#### Local Re-recording
```bash
# Re-record specific tests
pytest -s -v --stack-config=server:starter tests/integration/inference/test_modified.py --inference-mode=record
```
Note that when re-recording tests, you must use a Stack pointing to a server (i.e., `server:starter`). This subtlety exists because the set of tests run against a server is a superset of the set of tests run with the library client.
## Writing Tests

### Basic Test Pattern
```python
def test_basic_chat_completion(llama_stack_client, text_model_id):
    response = llama_stack_client.chat.completions.create(
        model=text_model_id,
        messages=[{"role": "user", "content": "Hello"}],
    )

    # Test structure, not AI output quality
    assert response.choices[0].message is not None
    assert isinstance(response.choices[0].message.content, str)
    assert len(response.choices[0].message.content) > 0
```
### Provider-Specific Tests
```python
def test_asymmetric_embeddings(llama_stack_client, embedding_model_id):
    if embedding_model_id not in MODELS_SUPPORTING_TASK_TYPE:
        pytest.skip(f"Model {embedding_model_id} doesn't support task types")

    query_response = llama_stack_client.inference.embeddings(
        model_id=embedding_model_id,
        contents=["What is machine learning?"],
        task_type="query",
    )

    assert query_response.embeddings is not None
```
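To exercise a provider-specific test like this one, you might select it by name and pass an explicit embedding model (flags as documented above; the model choice is illustrative):

```bash
pytest -sv tests/integration/inference -k asymmetric_embeddings \
  --stack-config=starter \
  --embedding-model=sentence-transformers/all-MiniLM-L6-v2
```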