docs(tests): Add a bunch of documentation for our testing systems
This commit is contained in:
parent e1e161553c
commit f4281ce66a
7 changed files with 1006 additions and 1 deletions
@@ -19,7 +19,8 @@ new_vector_database

 ## Testing

-See the [Test Page](testing.md) which describes how to test your changes.
+Llama Stack uses a record-replay testing system for reliable, cost-effective testing. See the [Testing Documentation](testing.md) for comprehensive guides on writing and running tests.

 ```{toctree}
 :maxdepth: 1
 :hidden:
@@ -1,3 +1,35 @@
+# Testing
+
+Llama Stack uses a record-replay system for reliable, fast, and cost-effective testing of AI applications.
+
+## Testing Documentation
+
+```{toctree}
+:maxdepth: 1
+
+testing/index
+testing/integration-testing
+testing/record-replay
+testing/writing-tests
+testing/troubleshooting
+```
+
+## Quick Start
+
+```bash
+# Run tests with existing recordings
+uv run pytest tests/integration/
+
+# Test against live APIs
+FIREWORKS_API_KEY=... pytest tests/integration/ --stack-config=server:fireworks
+```
+
+For detailed information, see the [Testing Overview](testing/index.md).
+
+---
+
+## Original Documentation
+
 ```{include} ../../../tests/README.md
 ```
docs/source/contributing/testing/index.md (new file, 103 lines)

@@ -0,0 +1,103 @@
# Testing in Llama Stack

Llama Stack uses a record-replay testing system to handle AI API costs, non-deterministic responses, and multiple provider integrations.

## Core Problems

Testing AI applications creates three challenges:

- **API costs** accumulate quickly during development and CI
- **Non-deterministic responses** make tests unreliable
- **Multiple providers** require testing the same logic across different APIs

## Solution

Record real API responses once, replay them for fast, deterministic tests.

## Architecture Overview

### Test Types

- **Unit tests** (`tests/unit/`) - Test components in isolation with mocks (see the sketch below)
- **Integration tests** (`tests/integration/`) - Test complete workflows with record-replay
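
As an illustration of the unit-test style, here is a minimal sketch; `route_chat` is a hypothetical stand-in for a real component, not code from the repository:

```python
from unittest.mock import MagicMock


def route_chat(providers: dict, model_id: str, messages: list) -> dict:
    """Hypothetical component under test: dispatch a chat call to the provider that serves the model."""
    return providers[model_id].chat_completion(messages=messages)


def test_route_chat_uses_the_selected_provider():
    provider = MagicMock()
    provider.chat_completion.return_value = {"content": "hi"}

    result = route_chat({"llama3.2:3b": provider}, "llama3.2:3b", [{"role": "user", "content": "Hello"}])

    assert result == {"content": "hi"}
    provider.chat_completion.assert_called_once()
```

Integration tests, by contrast, exercise full workflows and rely on the record-replay system described below.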

### Core Components

#### Record-Replay System

Captures API calls and replays them deterministically:

```python
# Record real API responses
with inference_recording(mode=InferenceMode.RECORD, storage_dir="recordings"):
    response = await client.chat.completions.create(...)

# Replay cached responses
with inference_recording(mode=InferenceMode.REPLAY, storage_dir="recordings"):
    response = await client.chat.completions.create(...)  # No API call made
```

#### Provider Testing

Write tests once, run against any provider:

```bash
# Same test, different providers
pytest tests/integration/inference/ --stack-config=openai --text-model=gpt-4
pytest tests/integration/inference/ --stack-config=starter --text-model=llama3.2:3b
```

#### Test Parametrization

Generate test combinations from CLI arguments:

```bash
# Creates a test for each model/provider combination
pytest tests/integration/ \
  --stack-config=inference=fireworks \
  --text-model=llama-3.1-8b,llama-3.1-70b
```

## How It Works

### Recording Storage

Recordings use SQLite for lookup and JSON for storage:

```
recordings/
├── index.sqlite           # Fast lookup by request hash
└── responses/
    ├── abc123def456.json  # Individual response files
    └── def789ghi012.json
```
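
Recordings can also be inspected by hand. The rough sketch below assumes the `recordings` table and `request_hash` column shown in the troubleshooting guide; the exact schema may differ:

```python
import json
import sqlite3
from pathlib import Path

recordings = Path("recordings")
con = sqlite3.connect(str(recordings / "index.sqlite"))

# List a few recorded requests and peek at the matching JSON files.
for (request_hash,) in con.execute("SELECT request_hash FROM recordings LIMIT 5"):
    payload = json.loads((recordings / "responses" / f"{request_hash}.json").read_text())
    print(request_hash, sorted(payload.keys()))  # keys typically include 'request'; exact schema may vary
```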

### Why Record-Replay?

Mocking AI APIs is brittle. Real API responses:

- Include edge cases and realistic data structures
- Preserve streaming behavior
- Can be inspected and debugged

### Why Test All Providers?

One test verifies behavior across all providers, catching integration bugs early.

## Workflow

1. **Develop tests** in `LIVE` mode against real APIs
2. **Record responses** with `RECORD` mode
3. **Commit recordings** for deterministic CI
4. **Tests replay** cached responses in CI

## Quick Start

```bash
# Run tests with existing recordings
uv run pytest tests/integration/

# Test against live APIs
FIREWORKS_API_KEY=... pytest tests/integration/ --stack-config=server:fireworks
```

See [Integration Testing](integration-testing.md) for usage details and [Record-Replay](record-replay.md) for system internals.
docs/source/contributing/testing/integration-testing.md (new file, 136 lines)

@@ -0,0 +1,136 @@
# Integration Testing Guide

Practical usage of Llama Stack's integration testing system.

## Basic Usage

```bash
# Run all integration tests
uv run pytest tests/integration/

# Run specific test suites
uv run pytest tests/integration/inference/
uv run pytest tests/integration/agents/
```

## Live API Testing

```bash
# Auto-start server
export FIREWORKS_API_KEY=your_key
pytest tests/integration/inference/ \
  --stack-config=server:fireworks \
  --text-model=meta-llama/Llama-3.1-8B-Instruct

# Library client
export TOGETHER_API_KEY=your_key
pytest tests/integration/inference/ \
  --stack-config=starter \
  --text-model=meta-llama/Llama-3.1-8B-Instruct
```

## Configuration

### Stack Config

```bash
--stack-config=server:fireworks        # Auto-start server
--stack-config=server:together:8322    # Custom port
--stack-config=starter                 # Template
--stack-config=/path/to/run.yaml       # Config file
--stack-config=inference=fireworks     # Adhoc providers
--stack-config=http://localhost:5001   # Existing server
```

### Models

```bash
--text-model=meta-llama/Llama-3.1-8B-Instruct
--vision-model=meta-llama/Llama-3.2-11B-Vision-Instruct
--embedding-model=sentence-transformers/all-MiniLM-L6-v2
```

### Environment

```bash
--env FIREWORKS_API_KEY=your_key
--env OPENAI_BASE_URL=http://localhost:11434/v1
```

## Test Scenarios

### New Provider Testing

```bash
# Test a new provider
pytest tests/integration/inference/ \
  --stack-config=inference=your-new-provider \
  --text-model=your-model-id
```

### Multiple Models

```bash
# Test multiple models
pytest tests/integration/inference/ \
  --text-model=llama-3.1-8b,llama-3.1-70b
```

### Local Development

```bash
# Test with local Ollama
pytest tests/integration/inference/ \
  --stack-config=starter \
  --text-model=llama3.2:3b
```

## Recording Modes

```bash
# Live API calls (default)
LLAMA_STACK_TEST_INFERENCE_MODE=live pytest tests/integration/

# Record new responses
LLAMA_STACK_TEST_INFERENCE_MODE=record \
LLAMA_STACK_TEST_RECORDING_DIR=./recordings \
pytest tests/integration/inference/test_new.py

# Replay cached responses
LLAMA_STACK_TEST_INFERENCE_MODE=replay \
LLAMA_STACK_TEST_RECORDING_DIR=./recordings \
pytest tests/integration/
```

## Recording Management

```bash
# View recordings
sqlite3 recordings/index.sqlite "SELECT * FROM recordings;"
cat recordings/responses/abc123.json

# Re-record tests
rm -rf recordings/
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/test_specific.py
```

## Debugging

```bash
# Verbose output
pytest -vvs tests/integration/inference/

# Debug logging
LLAMA_STACK_LOG_LEVEL=DEBUG pytest tests/integration/test_failing.py

# Custom port for conflicts
pytest tests/integration/ --stack-config=server:fireworks:8322
```

## Best Practices

- Use existing recordings for development
- Record new interactions only when needed
- Test across multiple providers
- Use descriptive test names
- Commit recordings to version control
docs/source/contributing/testing/record-replay.md (new file, 80 lines)

@@ -0,0 +1,80 @@
# Record-Replay System

The record-replay system captures real API interactions and replays them deterministically for fast, reliable testing.

## How It Works

### Request Hashing

API requests are hashed to enable consistent lookup:

```python
import hashlib
import json
from urllib.parse import urlparse


def normalize_request(method: str, url: str, headers: dict, body: dict) -> str:
    normalized = {
        "method": method.upper(),
        "endpoint": urlparse(url).path,
        "body": body,
    }
    return hashlib.sha256(json.dumps(normalized, sort_keys=True).encode()).hexdigest()
```

Hashing is precise - different whitespace or float precision produces different hashes.
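
For example, a quick check that near-identical requests hash differently (the import path follows the troubleshooting guide; the exact signature may differ in the source):

```python
from llama_stack.testing.inference_recorder import normalize_request

url = "http://localhost:11434/v1/chat/completions"
base = {"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello"}], "temperature": 0.7}
tweaked = {**base, "temperature": 0.7000001}  # tiny float difference

print(normalize_request("POST", url, {}, base) == normalize_request("POST", url, {}, tweaked))  # False
```

If a test intentionally changes its request parameters, re-record; otherwise replay lookups will miss.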

### Client Interception

The system patches OpenAI and Ollama client methods to intercept API calls before they leave the client.
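
Conceptually, the patching looks something like the sketch below. This is illustrative only, not the recorder's actual code; the `storage` object and its `hash_request`/`load`/`save` helpers are hypothetical:

```python
import functools


def patch_method(owner: type, name: str, storage):
    """Wrap a client method so calls are recorded or replayed. Returns an unpatch function."""
    original = getattr(owner, name)

    @functools.wraps(original)
    async def wrapper(self, *args, **kwargs):
        key = storage.hash_request(name, kwargs)  # hypothetical request-hash helper
        if storage.mode == "replay":
            return storage.load(key)  # cached response, no network call
        response = await original(self, *args, **kwargs)
        if storage.mode == "record":
            storage.save(key, response)  # capture for later replay
        return response

    setattr(owner, name, wrapper)
    return lambda: setattr(owner, name, original)
```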

## Storage

Recordings use SQLite for indexing and JSON for storage:

```
recordings/
├── index.sqlite           # Fast lookup by request hash
└── responses/
    ├── abc123def456.json  # Individual response files
    └── def789ghi012.json
```

## Recording Modes

### LIVE Mode

Direct API calls, no recording/replay:

```python
with inference_recording(mode=InferenceMode.LIVE):
    response = await client.chat.completions.create(...)
```

### RECORD Mode

Captures API interactions:

```python
with inference_recording(mode=InferenceMode.RECORD, storage_dir="./recordings"):
    response = await client.chat.completions.create(...)
    # Response captured AND returned
```

### REPLAY Mode

Uses stored recordings:

```python
with inference_recording(mode=InferenceMode.REPLAY, storage_dir="./recordings"):
    response = await client.chat.completions.create(...)
    # Returns cached response, no API call
```

## Streaming Support

Streaming responses are captured completely before any chunks are yielded, then replayed as an async generator that matches the original API behavior.
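
In other words, replay looks roughly like the simplified sketch below (not the recorder's actual code), where `stored_chunks` stands in for the chunk list loaded from a recording:

```python
from typing import Any, AsyncIterator


async def replay_stream(stored_chunks: list[Any]) -> AsyncIterator[Any]:
    """Yield previously recorded chunks one at a time, mimicking a live streaming response."""
    for chunk in stored_chunks:
        yield chunk


# Consuming the replayed stream looks identical to consuming a live one:
#     async for chunk in replay_stream(chunks_loaded_from_json):
#         handle(chunk)
```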

## Environment Variables

```bash
export LLAMA_STACK_TEST_INFERENCE_MODE=replay
export LLAMA_STACK_TEST_RECORDING_DIR=/path/to/recordings
pytest tests/integration/
```

## Common Issues

- **"No recorded response found"** - Re-record with `RECORD` mode
- **Serialization errors** - Response types changed, re-record
- **Hash mismatches** - Request parameters changed slightly
docs/source/contributing/testing/troubleshooting.md (new file, 528 lines)

@@ -0,0 +1,528 @@
# Testing Troubleshooting Guide

This guide covers common issues encountered when working with Llama Stack's testing infrastructure and how to resolve them.

## Quick Diagnosis

### Test Status Quick Check

```bash
# Check if tests can run at all
uv run pytest tests/integration/inference/test_embedding.py::test_basic_embeddings -v

# Check available models and providers
uv run llama stack list-providers
uv run llama stack list-models

# Verify server connectivity
curl http://localhost:5001/v1/health
```

## Recording and Replay Issues

### "No recorded response found for request hash"

**Symptom:**
```
RuntimeError: No recorded response found for request hash: abc123def456
Endpoint: /v1/chat/completions
Model: meta-llama/Llama-3.1-8B-Instruct
```

**Causes and Solutions:**

1. **Missing recording** - Most common cause

   ```bash
   # Record the missing interaction
   LLAMA_STACK_TEST_INFERENCE_MODE=record \
   LLAMA_STACK_TEST_RECORDING_DIR=./test_recordings \
   pytest tests/integration/inference/test_failing.py -v
   ```

2. **Request parameters changed**

   ```bash
   # Check what changed by comparing requests
   sqlite3 test_recordings/index.sqlite \
     "SELECT request_hash, endpoint, model, timestamp FROM recordings WHERE endpoint='/v1/chat/completions';"

   # View specific request details
   cat test_recordings/responses/abc123def456.json | jq '.request'
   ```

3. **Different environment/provider**

   ```bash
   # Ensure a consistent test environment
   pytest tests/integration/ --stack-config=starter --text-model=llama3.2:3b
   ```

### Recording Failures

**Symptom:**
```
sqlite3.OperationalError: database is locked
```

**Solutions:**

1. **Concurrent access** - Multiple test processes

   ```bash
   # Run tests sequentially
   pytest tests/integration/ -n 1

   # Or use separate recording directories
   LLAMA_STACK_TEST_RECORDING_DIR=./recordings_$(date +%s) pytest ...
   ```

2. **Incomplete recording cleanup**

   ```bash
   # Clear and restart recording
   rm -rf test_recordings/
   LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/inference/test_specific.py
   ```

### Serialization/Deserialization Errors

**Symptom:**
```
Failed to deserialize object of type llama_stack.apis.inference.OpenAIChatCompletion
```

**Causes and Solutions:**

1. **API response format changed**

   ```bash
   # Re-record with the updated format
   rm test_recordings/responses/abc123*.json
   LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/inference/test_failing.py
   ```

2. **Missing dependencies for deserialization**

   ```bash
   # Ensure all required packages are installed
   uv sync --group dev
   ```

3. **Version mismatch between record and replay**

   ```bash
   # Check Python environment consistency
   uv run python -c "import llama_stack; print(llama_stack.__version__)"
   ```

## Server Connection Issues

### "Connection refused" Errors

**Symptom:**
```
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5001)
```

**Diagnosis and Solutions:**

1. **Server not running**

   ```bash
   # Check if the server is running
   curl http://localhost:5001/v1/health

   # Start the server manually for debugging
   llama stack run --template starter --port 5001
   ```

2. **Port conflicts**

   ```bash
   # Check what's using the port
   lsof -i :5001

   # Use a different port
   pytest tests/integration/ --stack-config=server:starter:8322
   ```

3. **Server startup timeout**

   ```bash
   # Increase the startup timeout or check server logs
   tail -f server.log

   # Manual server management
   llama stack run --template starter &
   sleep 30  # Wait for startup
   pytest tests/integration/
   ```

### Auto-Server Startup Issues

**Symptom:**
```
Server failed to respond within 30 seconds
```

**Solutions:**

1. **Check server logs**

   ```bash
   # Server logs are written to server.log
   tail -f server.log

   # Look for startup errors
   grep -i error server.log
   ```

2. **Dependencies missing**

   ```bash
   # Ensure all dependencies are installed
   uv sync --group dev

   # Check specific provider requirements
   pip list | grep -i fireworks
   ```

3. **Resource constraints**

   ```bash
   # Check system resources
   htop
   df -h

   # Use a lighter config for testing
   pytest tests/integration/ --stack-config=starter
   ```

## Provider and Model Issues

### "Model not found" Errors

**Symptom:**
```
Model 'meta-llama/Llama-3.1-8B-Instruct' not found
```

**Solutions:**

1. **Check available models**

   ```bash
   # List models for the current provider
   uv run llama stack list-models

   # Use an available model
   pytest tests/integration/ --text-model=llama3.2:3b
   ```

2. **Model not downloaded for local providers**

   ```bash
   # Download the missing model
   ollama pull llama3.2:3b

   # Verify the model is available
   ollama list
   ```

3. **Provider configuration issues**

   ```bash
   # Check provider setup
   uv run llama stack list-providers

   # Verify API keys are set
   echo $FIREWORKS_API_KEY
   ```

### Provider Authentication Failures

**Symptom:**
```
HTTP 401: Invalid API key
```

**Solutions:**

1. **Missing API keys**

   ```bash
   # Set the required API key
   export FIREWORKS_API_KEY=your_key_here
   export OPENAI_API_KEY=your_key_here

   # Verify the key is set
   echo $FIREWORKS_API_KEY
   ```

2. **Invalid API keys**

   ```bash
   # Test the API key directly
   curl -H "Authorization: Bearer $FIREWORKS_API_KEY" \
     https://api.fireworks.ai/inference/v1/models
   ```

3. **API key environment issues**

   ```bash
   # Pass the environment explicitly
   pytest tests/integration/ --env FIREWORKS_API_KEY=your_key
   ```

## Parametrization Issues

### "No tests ran matching the given pattern"

**Symptom:**
```
collected 0 items
```

**Causes and Solutions:**

1. **No models specified**

   ```bash
   # Specify the required models
   pytest tests/integration/inference/ --text-model=llama3.2:3b
   ```

2. **Model/provider mismatch**

   ```bash
   # Use a model compatible with the provider
   pytest tests/integration/ \
     --stack-config=starter \
     --text-model=llama3.2:3b  # Available in Ollama
   ```

3. **Missing fixtures**

   ```bash
   # Check test requirements
   pytest tests/integration/inference/test_embedding.py --collect-only
   ```

### Excessive Test Combinations

**Symptom:**
Tests run for too many parameter combinations, taking too long.

**Solutions:**

1. **Limit model combinations**

   ```bash
   # Test a single model instead of a list
   pytest tests/integration/ --text-model=llama3.2:3b
   ```

2. **Use specific test selection**

   ```bash
   # Run a specific test pattern
   pytest tests/integration/ -k "basic and not vision"
   ```

3. **Separate test runs**

   ```bash
   # Split by functionality
   pytest tests/integration/inference/ --text-model=model1
   pytest tests/integration/agents/ --text-model=model2
   ```

## Performance Issues

### Slow Test Execution

**Symptom:**
Tests take much longer than expected.

**Diagnosis and Solutions:**

1. **Using LIVE mode instead of REPLAY**

   ```bash
   # Verify the recording mode
   echo $LLAMA_STACK_TEST_INFERENCE_MODE

   # Force replay mode
   LLAMA_STACK_TEST_INFERENCE_MODE=replay pytest tests/integration/
   ```

2. **Network latency to providers**

   ```bash
   # Use local providers for development
   pytest tests/integration/ --stack-config=starter
   ```

3. **Large recording files**

   ```bash
   # Check the recording directory size
   du -sh test_recordings/

   # Clean up old recordings
   find test_recordings/ -name "*.json" -mtime +30 -delete
   ```

### Memory Usage Issues

**Symptom:**
```
MemoryError: Unable to allocate memory
```

**Solutions:**

1. **Large recordings in memory**

   ```bash
   # Run tests in smaller batches
   pytest tests/integration/inference/ -k "not batch"
   ```

2. **Model memory requirements**

   ```bash
   # Use smaller models for testing
   pytest tests/integration/ --text-model=llama3.2:3b  # Instead of 70B
   ```

## Environment Issues

### Python Environment Problems

**Symptom:**
```
ModuleNotFoundError: No module named 'llama_stack'
```

**Solutions:**

1. **Wrong Python environment**

   ```bash
   # Verify the uv environment
   uv run python -c "import llama_stack; print('OK')"

   # Reinstall if needed
   uv sync --group dev
   ```

2. **Development installation issues**

   ```bash
   # Reinstall in development mode
   pip install -e .

   # Verify the installation
   python -c "import llama_stack; print(llama_stack.__file__)"
   ```

### Path and Import Issues

**Symptom:**
```
ImportError: cannot import name 'LlamaStackClient'
```

**Solutions:**

1. **PYTHONPATH issues**

   ```bash
   # Run from the project root
   cd /path/to/llama-stack
   uv run pytest tests/integration/
   ```

2. **Relative import issues**

   ```python
   # Use absolute imports in tests
   from llama_stack_client import LlamaStackClient  # Not relative
   ```

## Debugging Techniques

### Verbose Logging

Enable detailed logging to understand what's happening:

```bash
# Enable debug logging
LLAMA_STACK_LOG_LEVEL=DEBUG pytest tests/integration/inference/test_failing.py -v -s

# Enable request/response logging
LLAMA_STACK_TEST_INFERENCE_MODE=live \
LLAMA_STACK_LOG_LEVEL=DEBUG \
pytest tests/integration/inference/test_failing.py -v -s
```

### Interactive Debugging

Drop into the debugger when tests fail:

```bash
# Run with pdb on failure
pytest tests/integration/inference/test_failing.py --pdb
```

```python
# Or add a breakpoint in the test code
def test_something(llama_stack_client):
    import pdb; pdb.set_trace()
    # ... test code
```

### Isolation Testing

Run tests in isolation to identify interactions:

```bash
# Run a single test
pytest tests/integration/inference/test_embedding.py::test_basic_embeddings

# Run without recordings
rm -rf test_recordings/
LLAMA_STACK_TEST_INFERENCE_MODE=live pytest tests/integration/inference/test_failing.py
```

### Recording Inspection

Examine recordings to understand what's stored:

```bash
# Check the recording database
sqlite3 test_recordings/index.sqlite ".tables"
sqlite3 test_recordings/index.sqlite ".schema recordings"
sqlite3 test_recordings/index.sqlite "SELECT * FROM recordings LIMIT 5;"

# Examine a specific recording
find test_recordings/responses/ -name "*.json" | head -1 | xargs cat | jq '.'

# Compare request hashes
python -c "
from llama_stack.testing.inference_recorder import normalize_request
print(normalize_request('POST', 'http://localhost:11434/v1/chat/completions', {}, {'model': 'llama3.2:3b', 'messages': [{'role': 'user', 'content': 'Hello'}]}))
"
```

## Getting Help

### Information to Gather

When reporting issues, include:

1. **Environment details:**

   ```bash
   uv run python --version
   uv run python -c "import llama_stack; print(llama_stack.__version__)"
   uv pip list
   ```

2. **Test command and output:**

   ```bash
   # Full command that failed
   pytest tests/integration/inference/test_failing.py -v

   # Error message and stack trace
   ```

3. **Configuration details:**

   ```bash
   # Stack configuration used
   echo $LLAMA_STACK_TEST_INFERENCE_MODE
   ls -la test_recordings/
   ```

4. **Provider status:**

   ```bash
   uv run llama stack list-providers
   uv run llama stack list-models
   ```

### Common Solutions Summary

| Issue | Quick Fix |
|-------|-----------|
| Missing recordings | `LLAMA_STACK_TEST_INFERENCE_MODE=record pytest ...` |
| Connection refused | Check the server: `curl http://localhost:5001/v1/health` |
| No tests collected | Add a model: `--text-model=llama3.2:3b` |
| Authentication error | Set the API key: `export PROVIDER_API_KEY=...` |
| Serialization error | Re-record: delete the stale `responses/*.json` files and rerun in `record` mode |
| Slow tests | Use replay: `LLAMA_STACK_TEST_INFERENCE_MODE=replay` |

Most testing issues stem from configuration mismatches or missing recordings. The record-replay system is designed to be forgiving, but it requires consistent environment setup to work well.
docs/source/contributing/testing/writing-tests.md (new file, 125 lines)

@@ -0,0 +1,125 @@
# Writing Tests

How to write effective tests for Llama Stack.

## Basic Test Pattern

```python
def test_basic_completion(llama_stack_client, text_model_id):
    """Test basic text completion functionality."""
    response = llama_stack_client.inference.completion(
        model_id=text_model_id,
        content=CompletionMessage(role="user", content="Hello"),
    )

    # Test structure, not AI output quality
    assert response.completion_message is not None
    assert isinstance(response.completion_message.content, str)
    assert len(response.completion_message.content) > 0
```

## Parameterized Tests

```python
@pytest.mark.parametrize("temperature", [0.0, 0.5, 1.0])
def test_completion_temperature(llama_stack_client, text_model_id, temperature):
    response = llama_stack_client.inference.completion(
        model_id=text_model_id,
        content=CompletionMessage(role="user", content="Hello"),
        sampling_params={"temperature": temperature},
    )
    assert response.completion_message is not None
```

## Provider-Specific Tests

```python
def test_asymmetric_embeddings(llama_stack_client, embedding_model_id):
    if embedding_model_id not in MODELS_SUPPORTING_TASK_TYPE:
        pytest.skip(f"Model {embedding_model_id} doesn't support task types")

    query_response = llama_stack_client.inference.embeddings(
        model_id=embedding_model_id,
        contents=["What is machine learning?"],
        task_type="query",
    )

    passage_response = llama_stack_client.inference.embeddings(
        model_id=embedding_model_id,
        contents=["Machine learning is a subset of AI..."],
        task_type="passage",
    )

    assert query_response.embeddings != passage_response.embeddings
```

## Fixtures

```python
@pytest.fixture(scope="session")
def agent_config(llama_stack_client, text_model_id):
    """Reusable agent configuration."""
    return {
        "model": text_model_id,
        "instructions": "You are a helpful assistant",
        "tools": [],
        "enable_session_persistence": False,
    }


@pytest.fixture(scope="function")
def fresh_session(llama_stack_client):
    """Each test gets fresh state."""
    session = llama_stack_client.create_session()
    yield session
    session.delete()
```
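
A minimal sketch of a test consuming these fixtures; it only exercises the fixture wiring (scopes and cleanup), not any real agent behavior:

```python
def test_uses_shared_config_and_fresh_session(agent_config, fresh_session):
    # agent_config is session-scoped: built once and shared across the whole test run.
    assert agent_config["model"]
    assert agent_config["enable_session_persistence"] is False

    # fresh_session is function-scoped: created for this test and deleted afterwards.
    assert fresh_session is not None
```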

## Common Test Patterns

### Streaming Tests

```python
def test_streaming_completion(llama_stack_client, text_model_id):
    stream = llama_stack_client.inference.completion(
        model_id=text_model_id,
        content=CompletionMessage(role="user", content="Count to 5"),
        stream=True,
    )

    chunks = list(stream)
    assert len(chunks) > 1
    assert all(hasattr(chunk, "delta") for chunk in chunks)
```

### Error Testing

```python
def test_invalid_model_error(llama_stack_client):
    with pytest.raises(Exception) as exc_info:
        llama_stack_client.inference.completion(
            model_id="nonexistent-model",
            content=CompletionMessage(role="user", content="Hello"),
        )
    assert "model" in str(exc_info.value).lower()
```

## What NOT to Test

```python
# BAD: Testing AI output quality
def test_completion_quality(llama_stack_client, text_model_id):
    response = llama_stack_client.inference.completion(...)
    assert "correct answer" in response.content  # Fragile!


# GOOD: Testing response structure
def test_completion_structure(llama_stack_client, text_model_id):
    response = llama_stack_client.inference.completion(...)
    assert isinstance(response.completion_message.content, str)
    assert len(response.completion_message.content) > 0
```

## Best Practices

- Test API contracts, not AI output quality
- Use descriptive test names
- Keep tests simple and focused
- Record new interactions only when needed
- Use appropriate fixture scopes (session vs. function)