llama-stack-mirror/tests/integration
Sébastien Han 53eda78993
tests: adapt openai test for watsonx
The tests/integration/inference/test_openai_completion.py tests fail in a few scenarios, such as:

tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n

FAILED tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_02] - AssertionError: assert 1 == 2
 +  where 1 = len({0: 'thethenamenameofofthetheususcapitalcapitalisiswashingtonwashington,,dd.c.c..'})

test_openai_completion_logprobs
E   openai.BadRequestError: Error code: 400 - {'error': {'detail': {'errors': [{'loc': ['body', 'logprobs'], 'msg': 'Input should be a valid boolean, unable to interpret input', 'type': 'bool_parsing'}]}}}

test_openai_completion_stop_sequence
E   openai.BadRequestError: Error code: 400 - {'detail': 'litellm.BadRequestError: OpenAIException - {"errors":[{"code":"json_type_error","message":"Json field type error: CommonTextChatParameters.stop must be an array, and the element must be of type string","more_info":"https://cloud.ibm.com/apidocs/watsonx-ai#text-chat"}],"trace":"f758b3bbd4f357aa9b16f3dc5ee1170e","status_code":400}'}

So this adds the appropriate skip exceptions while still providing some coverage for
OpenAI through litellm.
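
As a rough sketch, the added guards follow the same provider-based skip pattern already used in the file (the helper and check below are illustrative, not the exact code in this change):

```python
# Hypothetical sketch: skip a test when the provider serving the model
# doesn't support the OpenAI "n" parameter (e.g. watsonx via litellm).
import pytest


def skip_if_model_doesnt_support_n(client_with_models, model_id):
    provider_type = provider_type_for_model(client_with_models, model_id)  # assumed helper
    if provider_type == "remote::watsonx":
        pytest.skip(f"Model {model_id} hosted by {provider_type} doesn't support n param.")
```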

Now tests pass:

```
INFO     2025-10-14 14:20:17,115 tests.integration.conftest:50 tests: Test stack config type: library_client
         (stack_config=None)
======================================================== test session starts =========================================================
platform darwin -- Python 3.12.8, pytest-8.4.2, pluggy-1.6.0 -- /Users/leseb/Documents/AI/llama-stack/.venv/bin/python3
cachedir: .pytest_cache
metadata: {'Python': '3.12.8', 'Platform': 'macOS-26.0.1-arm64-arm-64bit', 'Packages': {'pytest': '8.4.2', 'pluggy': '1.6.0'}, 'Plugins': {'anyio': '4.9.0', 'html': '4.1.1', 'socket': '0.7.0', 'asyncio': '1.1.0', 'json-report': '1.5.0', 'timeout': '2.4.0', 'metadata': '3.1.1', 'cov': '6.2.1', 'nbval': '0.11.0'}}
rootdir: /Users/leseb/Documents/AI/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0, html-4.1.1, socket-0.7.0, asyncio-1.1.0, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, cov-6.2.1, nbval-0.11.0
asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 32 items

tests/integration/inference/test_openai_completion.py::test_openai_completion_non_streaming[txt=meta-llama/llama-3-3-70b-instruct-inference:completion:sanity] PASSED [  3%]
tests/integration/inference/test_openai_completion.py::test_openai_completion_non_streaming_suffix[txt=meta-llama/llama-3-3-70b-instruct-inference:completion:suffix] SKIPPED [  6%]
tests/integration/inference/test_openai_completion.py::test_openai_completion_streaming[txt=meta-llama/llama-3-3-70b-instruct-inference:completion:sanity] PASSED [  9%]
tests/integration/inference/test_openai_completion.py::test_openai_completion_guided_choice[txt=meta-llama/llama-3-3-70b-instruct] SKIPPED [ 12%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[openai_client-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:non_streaming_01] PASSED [ 15%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[openai_client-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_01] PASSED [ 18%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[openai_client-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_01] SKIPPED [ 21%]
tests/integration/inference/test_openai_completion.py::test_inference_store[openai_client-txt=meta-llama/llama-3-3-70b-instruct-True] PASSED [ 25%]
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=meta-llama/llama-3-3-70b-instruct-True] PASSED [ 28%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming_with_file[txt=meta-llama/llama-3-3-70b-instruct] SKIPPED [ 31%]
tests/integration/inference/test_openai_completion.py::test_openai_completion_stop_sequence[txt=meta-llama/llama-3-3-70b-instruct-inference:completion:stop_sequence] SKIPPED [ 34%]
tests/integration/inference/test_openai_completion.py::test_openai_completion_logprobs[txt=meta-llama/llama-3-3-70b-instruct-inference:completion:log_probs] SKIPPED [ 37%]
tests/integration/inference/test_openai_completion.py::test_openai_completion_logprobs_streaming[txt=meta-llama/llama-3-3-70b-instruct-inference:completion:log_probs] SKIPPED [ 40%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_with_tools[txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:tool_calling] PASSED [ 43%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_with_tools_and_streaming[txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:tool_calling] PASSED [ 46%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_with_tool_choice_none[txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:tool_calling] PASSED [ 50%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_structured_output[txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:structured_output] PASSED [ 53%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[openai_client-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:non_streaming_02] PASSED [ 56%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[openai_client-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_02] PASSED [ 59%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[openai_client-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_02] SKIPPED [ 62%]
tests/integration/inference/test_openai_completion.py::test_inference_store[openai_client-txt=meta-llama/llama-3-3-70b-instruct-False] PASSED [ 65%]
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=meta-llama/llama-3-3-70b-instruct-False] PASSED [ 68%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:non_streaming_01] PASSED [ 71%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_01] PASSED [ 75%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_01] SKIPPED [ 78%]
tests/integration/inference/test_openai_completion.py::test_inference_store[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-True] PASSED [ 81%]
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-True] PASSED [ 84%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:non_streaming_02] PASSED [ 87%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_02] PASSED [ 90%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_02] SKIPPED [ 93%]
tests/integration/inference/test_openai_completion.py::test_inference_store[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-False] PASSED [ 96%]
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-False] PASSED [100%]

======================================================== slowest 10 durations ========================================================
5.97s call     tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_with_tool_choice_none[txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:tool_calling]
3.39s call     tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:non_streaming_02]
3.26s call     tests/integration/inference/test_openai_completion.py::test_inference_store[openai_client-txt=meta-llama/llama-3-3-70b-instruct-True]
2.64s call     tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_with_tools_and_streaming[txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:tool_calling]
1.78s call     tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_structured_output[txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:structured_output]
1.73s call     tests/integration/inference/test_openai_completion.py::test_openai_completion_streaming[txt=meta-llama/llama-3-3-70b-instruct-inference:completion:sanity]
1.58s call     tests/integration/inference/test_openai_completion.py::test_inference_store[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-True]
1.51s call     tests/integration/inference/test_openai_completion.py::test_openai_completion_non_streaming[txt=meta-llama/llama-3-3-70b-instruct-inference:completion:sanity]
1.41s call     tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[client_with_models-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:streaming_02]
1.20s call     tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[openai_client-txt=meta-llama/llama-3-3-70b-instruct-inference:chat_completion:non_streaming_02]
====================================================== short test summary info =======================================================
SKIPPED [1] tests/integration/inference/test_openai_completion.py:85: Suffix is not supported for the model: meta-llama/llama-3-3-70b-instruct.
SKIPPED [1] tests/integration/inference/test_openai_completion.py:135: Model meta-llama/llama-3-3-70b-instruct hosted by remote::watsonx doesn't support vllm extra_body parameters.
SKIPPED [4] tests/integration/inference/test_openai_completion.py:115: Model meta-llama/llama-3-3-70b-instruct hosted by remote::watsonx doesn't support n param.
SKIPPED [1] tests/integration/inference/test_openai_completion.py:141: Model meta-llama/llama-3-3-70b-instruct hosted by remote::watsonx doesn't support chat completion calls with base64 encoded files.
SKIPPED [1] tests/integration/inference/test_openai_completion.py:514: Model meta-llama/llama-3-3-70b-instruct hosted by remote::watsonx doesn't support /v1/completions stop sequence.
SKIPPED [2] tests/integration/inference/test_openai_completion.py:72: Model meta-llama/llama-3-3-70b-instruct hosted by remote::watsonx doesn't support /v1/completions logprobs.
============================================ 22 passed, 10 skipped, 2 warnings in 35.11s =============================================
```

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-10-14 14:32:42 +02:00
agents chore!: BREAKING CHANGE removing VectorDB APIs (#3774) 2025-10-11 14:07:08 -07:00
batches feat(api)!: BREAKING CHANGE: support passing extra_body through to providers (#3777) 2025-10-10 16:21:44 -07:00
common/recordings feat(api)!: BREAKING CHANGE: support passing extra_body through to providers (#3777) 2025-10-10 16:21:44 -07:00
conversations feat: Add OpenAI Conversations API (#3429) 2025-10-03 08:47:18 -07:00
datasets fix: test_datasets HF scenario in CI (#2090) 2025-05-06 14:09:15 +02:00
eval fix(tests): ensure test isolation in server mode (#3737) 2025-10-08 12:03:36 -07:00
files fix(tests): ensure test isolation in server mode (#3737) 2025-10-08 12:03:36 -07:00
fixtures feat(tests): make inference_recorder into api_recorder (include tool_invoke) (#3403) 2025-10-09 14:27:51 -07:00
inference tests: adapt openai test for watsonx 2025-10-14 14:32:42 +02:00
inspect chore: default to pytest asyncio-mode=auto (#2730) 2025-07-11 13:00:24 -07:00
post_training chore(pre-commit): add pre-commit hook to enforce llama_stack logger usage (#3061) 2025-08-20 07:15:35 -04:00
providers fix(tests): ensure test isolation in server mode (#3737) 2025-10-08 12:03:36 -07:00
recordings feat(api): Add vector store file batches api (#3642) 2025-10-06 16:58:22 -07:00
responses feat(responses)!: add reasoning and annotation added events (#3793) 2025-10-11 16:47:14 -07:00
safety chore!: Safety api refactoring to use OpenAIMessageParam (#3796) 2025-10-12 08:01:00 -07:00
scoring feat: create HTTP DELETE API endpoints to unregister ScoringFn and Benchmark resources in Llama Stack (#3371) 2025-09-15 12:43:38 -07:00
telemetry fix(tests): ensure test isolation in server mode (#3737) 2025-10-08 12:03:36 -07:00
test_cases chore(apis): unpublish deprecated /v1/inference apis (#3297) 2025-09-27 11:20:06 -07:00
tool_runtime chore!: BREAKING CHANGE removing VectorDB APIs (#3774) 2025-10-11 14:07:08 -07:00
tools fix: toolgroups unregister (#1704) 2025-03-19 13:43:51 -07:00
vector_io chore: Auto-detect Provider ID when only 1 Vector Store Provider avai… (#3802) 2025-10-13 10:25:36 -07:00
__init__.py fix: remove ruff N999 (#1388) 2025-03-07 11:14:04 -08:00
conftest.py feat(tests): make inference_recorder into api_recorder (include tool_invoke) (#3403) 2025-10-09 14:27:51 -07:00
README.md feat(tests): implement test isolation for inference recordings (#3681) 2025-10-04 11:34:18 -07:00
suites.py feat(tests): make inference_recorder into api_recorder (include tool_invoke) (#3403) 2025-10-09 14:27:51 -07:00

Integration Testing Guide

Integration tests verify complete workflows across different providers using Llama Stack's record-replay system.

Quick Start

# Run all integration tests with existing recordings
uv run --group test \
  pytest -sv tests/integration/ --stack-config=starter

Configuration Options

You can see all options with:

cd tests/integration

# this will show a long list of options, look for "Custom options:"
pytest --help

Here are the most important options:

  • --stack-config: specify the stack config to use. You have several ways to point to a stack:
    • server:<config> - automatically start a server with the given config (e.g., server:starter). This provides one-step testing by auto-starting the server if the port is available, or reusing an existing server if already running.
    • server:<config>:<port> - same as above but with a custom port (e.g., server:starter:8322)
    • a URL which points to a Llama Stack distribution server
    • a distribution name (e.g., starter) or a path to a run.yaml file
    • a comma-separated list of api=provider pairs, e.g. inference=ollama,safety=llama-guard,agents=meta-reference. This is most useful for testing a single API surface.
  • --env: set environment variables, e.g. --env KEY=value. This is a utility option for passing environment variables required by various providers; see the example after this list.
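
For instance, a run that auto-starts a server and passes a provider credential through --env might look like this (the TOGETHER_API_KEY variable name is only illustrative):

pytest -sv tests/integration/inference \
  --stack-config=server:starter \
  --env TOGETHER_API_KEY=<your-key>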

Model parameters can be influenced by the following options:

  • --text-model: comma-separated list of text models.
  • --vision-model: comma-separated list of vision models.
  • --embedding-model: comma-separated list of embedding models.
  • --safety-shield: comma-separated list of safety shields.
  • --judge-model: comma-separated list of judge models.
  • --embedding-dimension: output dimensionality of the embedding model to use for testing. Default: 384

Each of these accepts a comma-separated list and can be used to generate multiple parameter combinations. Note that tests will be skipped if no model is specified.
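
For example, passing two text models parametrizes each text-model test once per model (the model IDs below are illustrative):

pytest -sv tests/integration/inference \
  --stack-config=starter \
  --text-model=ollama/llama3.2:3b-instruct-fp16,ollama/llama3.1:8b-instruct-fp16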

Suites and Setups

  • --suite: single named suite that narrows which tests are collected.
  • Available suites:
    • base: collects most tests (excludes responses and post_training)
    • responses: collects tests under tests/integration/responses (needs strong tool-calling models)
    • vision: collects only tests/integration/inference/test_vision_inference.py
  • --setup: global configuration that can be used with any suite. Setups prefill model/env defaults; explicit CLI flags always win.
    • Available setups:
      • ollama: Local Ollama provider with lightweight models (sets OLLAMA_URL, uses llama3.2:3b-instruct-fp16)
      • vllm: VLLM provider for efficient local inference (sets VLLM_URL, uses Llama-3.2-1B-Instruct)
      • gpt: OpenAI GPT models for high-quality responses (uses gpt-4o)
      • claude: Anthropic Claude models for high-quality responses (uses claude-3-5-sonnet)

Examples

# Fast responses run with a strong tool-calling model
pytest -s -v tests/integration --stack-config=server:starter --suite=responses --setup=gpt

# Fast single-file vision run with Ollama defaults
pytest -s -v tests/integration --stack-config=server:starter --suite=vision --setup=ollama

# Base suite with VLLM for performance
pytest -s -v tests/integration --stack-config=server:starter --suite=base --setup=vllm

# Override a default from setup
pytest -s -v tests/integration --stack-config=server:starter \
  --suite=responses --setup=gpt --embedding-model=text-embedding-3-small

Examples

Testing against a Server

Run all text inference tests by auto-starting a server with the starter config:

OLLAMA_URL=http://localhost:11434 \
  pytest -s -v tests/integration/inference/test_text_inference.py \
   --stack-config=server:starter \
   --text-model=ollama/llama3.2:3b-instruct-fp16 \
   --embedding-model=sentence-transformers/all-MiniLM-L6-v2

Run tests with auto-server startup on a custom port:

OLLAMA_URL=http://localhost:11434 \
  pytest -s -v tests/integration/inference/ \
   --stack-config=server:starter:8322 \
   --text-model=ollama/llama3.2:3b-instruct-fp16 \
   --embedding-model=sentence-transformers/all-MiniLM-L6-v2

Testing with Library Client

The library client constructs the Stack "in-process" instead of using a server. This is useful during the iterative development process since you don't need to constantly start and stop servers.

You can do this by simply using --stack-config=starter instead of --stack-config=server:starter.
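
For instance, the first server example above becomes the following when run in-process (identical flags, only the --stack-config value changes):

OLLAMA_URL=http://localhost:11434 \
  pytest -s -v tests/integration/inference/test_text_inference.py \
   --stack-config=starter \
   --text-model=ollama/llama3.2:3b-instruct-fp16 \
   --embedding-model=sentence-transformers/all-MiniLM-L6-v2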

Using ad-hoc distributions

Sometimes, you may want to make up a distribution on the fly. This is useful for testing a single provider, a single API, or a small combination of providers. You can do so by specifying a comma-separated list of api=provider pairs to the --stack-config option, e.g. inference=remote::ollama,safety=inline::llama-guard,agents=inline::meta-reference.

pytest -s -v tests/integration/inference/ \
   --stack-config=inference=remote::ollama,safety=inline::llama-guard,agents=inline::meta-reference \
   --text-model=$TEXT_MODELS \
   --vision-model=$VISION_MODELS \
   --embedding-model=$EMBEDDING_MODELS

Another example: Running Vector IO tests for embedding models:

pytest -s -v tests/integration/vector_io/ \
   --stack-config=inference=inline::sentence-transformers,vector_io=inline::sqlite-vec \
   --embedding-model=sentence-transformers/all-MiniLM-L6-v2

Recording Modes

The testing system supports four modes, controlled by the --inference-mode option:

REPLAY Mode (Default)

Uses cached responses instead of making API calls:

pytest tests/integration/

RECORD-IF-MISSING Mode

Records only when no recording exists, otherwise replays. This is the preferred mode for iterative development:

pytest tests/integration/inference/test_new_feature.py --inference-mode=record-if-missing

RECORD Mode

Force-records all API interactions, overwriting existing recordings. Use with caution as this will re-record everything:

pytest tests/integration/inference/test_new_feature.py --inference-mode=record

LIVE Mode

Tests make real API calls (not recorded):

pytest tests/integration/ --inference-mode=live

By default, the recording directory is tests/integration/recordings. You can override this by setting the LLAMA_STACK_TEST_RECORDING_DIR environment variable.
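
For example, to record new interactions into a scratch directory instead of the default location (the path shown is arbitrary):

LLAMA_STACK_TEST_RECORDING_DIR=/tmp/llama-stack-recordings \
  pytest tests/integration/inference/test_new_feature.py --inference-mode=record-if-missing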

Managing Recordings

Viewing Recordings

# See what's recorded
sqlite3 recordings/index.sqlite "SELECT endpoint, model, timestamp FROM recordings;"

# Inspect specific response
cat recordings/responses/abc123.json | jq '.'

Re-recording Tests

Use the automated workflow script for easier re-recording:

./scripts/github/schedule-record-workflow.sh --subdirs "inference,agents"

See the main testing guide for full details.

Local Re-recording

# Re-record specific tests
pytest -s -v --stack-config=server:starter tests/integration/inference/test_modified.py --inference-mode=record

Note that when re-recording tests, you must use a Stack pointing to a server (i.e., server:starter). This subtlety exists because the set of tests run in server mode is a superset of the set of tests run with the library client.

Writing Tests

Basic Test Pattern

def test_basic_chat_completion(llama_stack_client, text_model_id):
    response = llama_stack_client.chat.completions.create(
        model=text_model_id,
        messages=[{"role": "user", "content": "Hello"}],
    )

    # Test structure, not AI output quality
    assert response.choices[0].message is not None
    assert isinstance(response.choices[0].message.content, str)
    assert len(response.choices[0].message.content) > 0
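
A streaming variant follows the same principle of asserting on structure rather than output quality (a minimal sketch, assuming the same llama_stack_client fixture and an OpenAI-compatible streaming response):

def test_basic_chat_completion_streaming(llama_stack_client, text_model_id):
    stream = llama_stack_client.chat.completions.create(
        model=text_model_id,
        messages=[{"role": "user", "content": "Hello"}],
        stream=True,
    )

    # Collect streamed content deltas; assert on structure, not wording
    pieces = [
        chunk.choices[0].delta.content
        for chunk in stream
        if chunk.choices and chunk.choices[0].delta.content
    ]
    assert len(pieces) > 0
    assert isinstance("".join(pieces), str)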

Provider-Specific Tests

def test_asymmetric_embeddings(llama_stack_client, embedding_model_id):
    if embedding_model_id not in MODELS_SUPPORTING_TASK_TYPE:
        pytest.skip(f"Model {embedding_model_id} doesn't support task types")

    query_response = llama_stack_client.inference.embeddings(
        model_id=embedding_model_id,
        contents=["What is machine learning?"],
        task_type="query",
    )

    assert query_response.embeddings is not None