feat(tests): migrate to global "setups" system for test configuration (#3390)

This PR refactors the integration test system to use global "setups", which provide better separation of concerns: **suites = what to test, setups = how to configure.**

NOTE: if you have naming suggestions, please provide feedback.

Changes:
- New `tests/integration/setups.py` with global, reusable configurations (ollama, vllm, gpt, claude)
- Modified `scripts/integration-tests.sh` options to match the underlying pytest options
- Updated documentation to reflect the new global setup system

The main benefit is that setups can be reused across multiple suites (e.g., use "gpt" with any suite), even though some setups may be tailored to a specific suite (vision <> ollama-vision). It is now easier to add new configurations without modifying existing suites.

Usage examples:
- `pytest tests/integration --suite=responses --setup=gpt`
- `pytest tests/integration --suite=vision`  # auto-selects "ollama-vision" setup
- `pytest tests/integration --suite=base --setup=vllm`
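
The resolution order described here (a suite's default setup is used when `--setup` is omitted, existing environment variables are left alone, and explicit CLI flags win over setup defaults) can be sketched roughly as follows. This is a minimal illustration, not the code in `conftest.py`: the `resolve` function, the trimmed `SETUPS` and `SUITE_DEFAULT_SETUP` tables, and the use of `SimpleNamespace` in place of pytest's option object are all assumptions made for the example.

```python
import os
from dataclasses import dataclass, field
from types import SimpleNamespace


@dataclass
class Setup:
    # Stand-in for the pydantic Setup model in the diff (hypothetical, trimmed).
    defaults: dict = field(default_factory=dict)
    env: dict = field(default_factory=dict)


# Hypothetical, reduced registries; values are copied from the diff for illustration.
SETUPS = {
    "gpt": Setup(defaults={"text_model": "openai/gpt-4o"}),
    "ollama-vision": Setup(
        defaults={"vision_model": "ollama/llama3.2-vision:11b"},
        env={"OLLAMA_URL": "http://0.0.0.0:11434"},
    ),
}
SUITE_DEFAULT_SETUP = {"responses": "gpt", "vision": "ollama-vision"}


def resolve(options, suite=None, setup=None, environ=None):
    """Apply a setup to `options` and `environ`; return the name of the setup used."""
    environ = os.environ if environ is None else environ
    # Fall back to the suite's default setup when none was requested.
    if suite and not setup:
        setup = SUITE_DEFAULT_SETUP.get(suite)
    if not setup:
        return None
    if setup not in SETUPS:
        raise ValueError(f"Unknown setup '{setup}'. Available: {', '.join(sorted(SETUPS))}")
    chosen = SETUPS[setup]
    # Environment first: never overwrite a variable the caller already set.
    for key, value in chosen.env.items():
        environ.setdefault(key, str(value))
    # Then option defaults: only fill options that are still unset.
    for dest, value in chosen.defaults.items():
        if not getattr(options, dest, None):
            setattr(options, dest, value)
    return setup


if __name__ == "__main__":
    opts = SimpleNamespace(text_model="custom/model", vision_model=None)
    env = {}
    print(resolve(opts, suite="vision", environ=env), opts.vision_model, env)
```

Passing `environ` explicitly keeps the sketch testable without touching the real process environment.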
2025-12-03 09:53:45 +00:00 · 2025-09-09 15:50:56 -07:00 · 2025-09-09 15:50:56 -07:00 · a8aa815b6a
commit a8aa815b6a
parent 28696c3f30
11 changed files with 385 additions and 206 deletions
--- a/tests/integration/README.md
+++ b/tests/integration/README.md
@@ -6,9 +6,7 @@ Integration tests verify complete workflows across different providers using Lla

 ```bash
 # Run all integration tests with existing recordings
-LLAMA_STACK_TEST_INFERENCE_MODE=replay \
-  LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings \
-  uv run --group test \
+uv run --group test \
  pytest -sv tests/integration/ --stack-config=starter
 ```

@@ -42,25 +40,35 @@ Model parameters can be influenced by the following options:
 Each of these are comma-separated lists and can be used to generate multiple parameter combinations. Note that tests will be skipped
 if no model is specified.

-### Suites (fast selection + sane defaults)
+### Suites and Setups

- `--suite`: comma-separated list of named suites that both narrow which tests are collected and prefill common model options (unless you pass them explicitly).
+- `--suite`: single named suite that narrows which tests are collected.
 - Available suites:
-  - `responses`: collects tests under `tests/integration/responses`; this is a separate suite because it needs a strong tool-calling model.
-  - `vision`: collects only `tests/integration/inference/test_vision_inference.py`; defaults `--vision-model=ollama/llama3.2-vision:11b`, `--embedding-model=sentence-transformers/all-MiniLM-L6-v2`.
- Explicit flags always win. For example, `--suite=responses --text-model=<X>` overrides the suite’s text model.
+  - `base`: collects most tests (excludes responses and post_training)
+  - `responses`: collects tests under `tests/integration/responses` (needs strong tool-calling models)
+  - `vision`: collects only `tests/integration/inference/test_vision_inference.py`
+- `--setup`: global configuration that can be used with any suite. Setups prefill model/env defaults; explicit CLI flags always win.
+  - Available setups:
+    - `ollama`: Local Ollama provider with lightweight models (sets OLLAMA_URL, uses llama3.2:3b-instruct-fp16)
+    - `vllm`: VLLM provider for efficient local inference (sets VLLM_URL, uses Llama-3.2-1B-Instruct)
+    - `gpt`: OpenAI GPT models for high-quality responses (uses gpt-4o)
+    - `ollama-vision`: Local Ollama provider with a vision model (sets OLLAMA_URL, uses llama3.2-vision:11b)

-Examples:
+Examples

 ```bash
-# Fast responses run with defaults
-pytest -s -v tests/integration --stack-config=server:starter --suite=responses
+# Fast responses run with a strong tool-calling model
+pytest -s -v tests/integration --stack-config=server:starter --suite=responses --setup=gpt

-# Fast single-file vision run with defaults
-pytest -s -v tests/integration --stack-config=server:starter --suite=vision
+# Fast single-file vision run with Ollama defaults
+pytest -s -v tests/integration --stack-config=server:starter --suite=vision --setup=ollama-vision

-# Combine suites and override a default
-pytest -s -v tests/integration --stack-config=server:starter --suite=responses,vision --embedding-model=text-embedding-3-small
+# Base suite with VLLM for performance
+pytest -s -v tests/integration --stack-config=server:starter --suite=base --setup=vllm
+
+# Override a default from setup
+pytest -s -v tests/integration --stack-config=server:starter \
+  --suite=responses --setup=gpt --embedding-model=text-embedding-3-small
 ```

 ## Examples
@@ -127,14 +135,13 @@ pytest tests/integration/
 ### RECORD Mode
 Captures API interactions for later replay:
 ```bash
-LLAMA_STACK_TEST_INFERENCE_MODE=record \
-pytest tests/integration/inference/test_new_feature.py
+pytest tests/integration/inference/test_new_feature.py --inference-mode=record
 ```

 ### LIVE Mode
 Tests make real API calls (but not recorded):
 ```bash
-LLAMA_STACK_TEST_INFERENCE_MODE=live pytest tests/integration/
+pytest tests/integration/ --inference-mode=live
 ```

 By default, the recording directory is `tests/integration/recordings`. You can override this by setting the `LLAMA_STACK_TEST_RECORDING_DIR` environment variable.
@@ -155,15 +162,14 @@ cat recordings/responses/abc123.json | jq '.'
 #### Remote Re-recording (Recommended)
 Use the automated workflow script for easier re-recording:
 ```bash
-./scripts/github/schedule-record-workflow.sh --test-subdirs "inference,agents"
+./scripts/github/schedule-record-workflow.sh --subdirs "inference,agents"
 ```
 See the [main testing guide](../README.md#remote-re-recording-recommended) for full details.

 #### Local Re-recording
 ```bash
 # Re-record specific tests
-LLAMA_STACK_TEST_INFERENCE_MODE=record \
-pytest -s -v --stack-config=server:starter tests/integration/inference/test_modified.py
+pytest -s -v --stack-config=server:starter tests/integration/inference/test_modified.py --inference-mode=record
 ```

 Note that when re-recording tests, you must use a Stack pointing to a server (i.e., `server:starter`). This subtlety exists because the set of tests run in server are a superset of the set of tests run in the library client.
--- a/tests/integration/conftest.py
+++ b/tests/integration/conftest.py
@@ -15,7 +15,7 @@ from dotenv import load_dotenv

 from llama_stack.log import get_logger

-from .suites import SUITE_DEFINITIONS
+from .suites import SETUP_DEFINITIONS, SUITE_DEFINITIONS

 logger = get_logger(__name__, category="tests")

@@ -63,19 +63,33 @@ def pytest_configure(config):
        key, value = env_var.split("=", 1)
        os.environ[key] = value

-    suites_raw = config.getoption("--suite")
-    suites: list[str] = []
-    if suites_raw:
-        suites = [p.strip() for p in str(suites_raw).split(",") if p.strip()]
-        unknown = [p for p in suites if p not in SUITE_DEFINITIONS]
-        if unknown:
+    inference_mode = config.getoption("--inference-mode")
+    os.environ["LLAMA_STACK_TEST_INFERENCE_MODE"] = inference_mode
+
+    suite = config.getoption("--suite")
+    if suite:
+        if suite not in SUITE_DEFINITIONS:
+            raise pytest.UsageError(f"Unknown suite: {suite}. Available: {', '.join(sorted(SUITE_DEFINITIONS.keys()))}")
+
+    # Apply setups (global parameterizations): env + defaults
+    setup = config.getoption("--setup")
+    if suite and not setup:
+        setup = SUITE_DEFINITIONS[suite].default_setup
+
+    if setup:
+        if setup not in SETUP_DEFINITIONS:
            raise pytest.UsageError(
-                f"Unknown suite(s): {', '.join(unknown)}. Available: {', '.join(sorted(SUITE_DEFINITIONS.keys()))}"
+                f"Unknown setup '{setup}'. Available: {', '.join(sorted(SETUP_DEFINITIONS.keys()))}"
            )
-    for suite in suites:
-        suite_def = SUITE_DEFINITIONS.get(suite, {})
-        defaults: dict = suite_def.get("defaults", {})
-        for dest, value in defaults.items():
+
+        setup_obj = SETUP_DEFINITIONS[setup]
+        logger.info(f"Applying setup '{setup}'{' for suite ' + suite if suite else ''}")
+        # Apply env first
+        for k, v in setup_obj.env.items():
+            if k not in os.environ:
+                os.environ[k] = str(v)
+        # Apply defaults if not provided explicitly
+        for dest, value in setup_obj.defaults.items():
            current = getattr(config.option, dest, None)
            if not current:
                setattr(config.option, dest, value)
@@ -120,6 +134,13 @@ def pytest_addoption(parser):
        default=384,
        help="Output dimensionality of the embedding model to use for testing. Default: 384",
    )
+
+    parser.addoption(
+        "--inference-mode",
+        help="Inference mode: { record, replay, live } (default: replay)",
+        choices=["record", "replay", "live"],
+        default="replay",
+    )
    parser.addoption(
        "--report",
        help="Path where the test report should be written, e.g. --report=/path/to/report.md",
@@ -127,14 +148,18 @@ def pytest_addoption(parser):

    available_suites = ", ".join(sorted(SUITE_DEFINITIONS.keys()))
    suite_help = (
-        "Comma-separated integration test suites to narrow collection and prefill defaults. "
-        "Available: "
-        f"{available_suites}. "
-        "Explicit CLI flags (e.g., --text-model) override suite defaults. "
-        "Examples: --suite=responses or --suite=responses,vision."
+        f"Single test suite to run (narrows collection). Available: {available_suites}. Example: --suite=responses"
    )
    parser.addoption("--suite", help=suite_help)

+    # Global setups for any suite
+    available_setups = ", ".join(sorted(SETUP_DEFINITIONS.keys()))
+    setup_help = (
+        f"Global test setup configuration. Available: {available_setups}. "
+        "Can be used with any suite. Example: --setup=ollama"
+    )
+    parser.addoption("--setup", help=setup_help)
+

 MODEL_SHORT_IDS = {
    "meta-llama/Llama-3.2-3B-Instruct": "3B",
@@ -221,16 +246,12 @@ pytest_plugins = ["tests.integration.fixtures.common"]

 def pytest_ignore_collect(path: str, config: pytest.Config) -> bool:
    """Skip collecting paths outside the selected suite roots for speed."""
-    suites_raw = config.getoption("--suite")
-    if not suites_raw:
+    suite = config.getoption("--suite")
+    if not suite:
        return False

-    names = [p.strip() for p in str(suites_raw).split(",") if p.strip()]
-    roots: list[str] = []
-    for name in names:
-        suite_def = SUITE_DEFINITIONS.get(name)
-        if suite_def:
-            roots.extend(suite_def.get("roots", []))
+    sobj = SUITE_DEFINITIONS.get(suite)
+    roots: list[str] = sobj.get("roots", []) if isinstance(sobj, dict) else getattr(sobj, "roots", [])
    if not roots:
        return False

--- a/tests/integration/suites.py
+++ b/tests/integration/suites.py
@@ -8,46 +8,112 @@
 # For example:
 #
 # ```bash
-# pytest tests/integration/ --suite=vision
+# pytest tests/integration/ --suite=vision --setup=ollama-vision
 # ```
 #
-# Each suite can:
-# - restrict collection to specific roots (dirs or files)
-# - provide default CLI option values (e.g. text_model, embedding_model, etc.)
+"""
+Each suite defines what to run (roots). Suites can be run with different global setups, defined below in SETUP_DEFINITIONS.
+Setups provide environment variables and model defaults that can be reused across multiple suites.
+
+CLI examples:
+  pytest tests/integration --suite=responses --setup=gpt
+  pytest tests/integration --suite=vision --setup=ollama-vision
+  pytest tests/integration --suite=base --setup=vllm
+"""

 from pathlib import Path

+from pydantic import BaseModel, Field
+
 this_dir = Path(__file__).parent
-default_roots = [
+
+
+class Suite(BaseModel):
+    name: str
+    roots: list[str]
+    default_setup: str | None = None
+
+
+class Setup(BaseModel):
+    """A reusable test configuration with environment and CLI defaults."""
+
+    name: str
+    description: str
+    defaults: dict[str, str] = Field(default_factory=dict)
+    env: dict[str, str] = Field(default_factory=dict)
+
+
+# Global setups: technically usable with any suite, although in practice some setups may only
+# work with specific test suites.
+SETUP_DEFINITIONS: dict[str, Setup] = {
+    "ollama": Setup(
+        name="ollama",
+        description="Local Ollama provider with text + safety models",
+        env={
+            "OLLAMA_URL": "http://0.0.0.0:11434",
+            "SAFETY_MODEL": "ollama/llama-guard3:1b",
+        },
+        defaults={
+            "text_model": "ollama/llama3.2:3b-instruct-fp16",
+            "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
+            "safety_model": "ollama/llama-guard3:1b",
+            "safety_shield": "llama-guard",
+        },
+    ),
+    "ollama-vision": Setup(
+        name="ollama-vision",
+        description="Local Ollama provider with a vision model",
+        env={
+            "OLLAMA_URL": "http://0.0.0.0:11434",
+        },
+        defaults={
+            "vision_model": "ollama/llama3.2-vision:11b",
+            "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
+        },
+    ),
+    "vllm": Setup(
+        name="vllm",
+        description="vLLM provider with a text model",
+        env={
+            "VLLM_URL": "http://localhost:8000/v1",
+        },
+        defaults={
+            "text_model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
+            "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
+        },
+    ),
+    "gpt": Setup(
+        name="gpt",
+        description="OpenAI GPT models for high-quality responses and tool calling",
+        defaults={
+            "text_model": "openai/gpt-4o",
+            "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
+        },
+    ),
+}
+
+
+base_roots = [
    str(p)
    for p in this_dir.glob("*")
    if p.is_dir()
    and p.name not in ("__pycache__", "fixtures", "test_cases", "recordings", "responses", "post_training")
 ]

-SUITE_DEFINITIONS: dict[str, dict] = {
-    "base": {
-        "description": "Base suite that includes most tests but runs them with a text Ollama model",
-        "roots": default_roots,
-        "defaults": {
-            "text_model": "ollama/llama3.2:3b-instruct-fp16",
-            "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
-        },
-    },
-    "responses": {
-        "description": "Suite that includes only the OpenAI Responses tests; needs a strong tool-calling model",
-        "roots": ["tests/integration/responses"],
-        "defaults": {
-            "text_model": "openai/gpt-4o",
-            "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
-        },
-    },
-    "vision": {
-        "description": "Suite that includes only the vision tests",
-        "roots": ["tests/integration/inference/test_vision_inference.py"],
-        "defaults": {
-            "vision_model": "ollama/llama3.2-vision:11b",
-            "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
-        },
-    },
+SUITE_DEFINITIONS: dict[str, Suite] = {
+    "base": Suite(
+        name="base",
+        roots=base_roots,
+        default_setup="ollama",
+    ),
+    "responses": Suite(
+        name="responses",
+        roots=["tests/integration/responses"],
+        default_setup="gpt",
+    ),
+    "vision": Suite(
+        name="vision",
+        roots=["tests/integration/inference/test_vision_inference.py"],
+        default_setup="ollama-vision",
+    ),
 }
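
Because both registries are keyed by name and each entry also stores its own `name`, the two can drift apart, and a suite's `default_setup` can point at a setup that does not exist. A small consistency check along these lines could catch that at import or test time. This is only a sketch: `find_registry_problems` is a hypothetical helper, and plain dataclasses stand in for the pydantic models so the example has no third-party dependencies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Suite:
    # Dataclass stand-in for the pydantic Suite model (hypothetical).
    name: str
    roots: list
    default_setup: Optional[str] = None


@dataclass
class Setup:
    # Dataclass stand-in for the pydantic Setup model (hypothetical, trimmed).
    name: str
    description: str = ""


def find_registry_problems(suites, setups):
    """Return human-readable descriptions of inconsistencies between the registries."""
    problems = []
    for key, setup in setups.items():
        if setup.name != key:
            problems.append(f"setup key '{key}' has name '{setup.name}'")
    for key, suite in suites.items():
        if suite.name != key:
            problems.append(f"suite key '{key}' has name '{suite.name}'")
        if suite.default_setup is not None and suite.default_setup not in setups:
            problems.append(f"suite '{key}' defaults to unknown setup '{suite.default_setup}'")
    return problems


if __name__ == "__main__":
    # Illustrative data only: one mismatched name and one dangling default.
    setups = {"ollama": Setup(name="ollama"), "ollama-vision": Setup(name="ollama")}
    suites = {"vision": Suite(name="vision", roots=[], default_setup="claude")}
    for problem in find_registry_problems(suites, setups):
        print(problem)
```

Returning a list rather than raising on the first mismatch lets a single test report every inconsistency at once.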