rewrote all slop

This commit is contained in:
Ashwin Bharambe 2025-08-14 16:51:13 -07:00
parent f4281ce66a
commit 1e2bbd08da
9 changed files with 452 additions and 930 deletions


@@ -4,11 +4,11 @@
## Adding a New Provider
See:
- [Adding a New API Provider Page](new_api_provider.md) which describes how to add new API providers to the Stack.
- [Vector Database Page](new_vector_database.md) which describes how to add a new vector database with Llama Stack.
- [External Provider Page](../providers/external/index.md) which describes how to add external providers to the Stack.
```{toctree}
:maxdepth: 1
:hidden:
@@ -19,12 +19,17 @@ new_vector_database
## Testing
Llama Stack uses a record-replay testing system for reliable, cost-effective testing. See the [Testing Documentation](testing.md) for comprehensive guides on writing and running tests.
```{include} ../../../tests/README.md
```
### Advanced Topics
For developers who need deeper understanding of the testing system internals:
```{toctree}
:maxdepth: 1
:hidden:
:caption: Testing
testing/record-replay
testing/troubleshooting
```


@@ -1,40 +0,0 @@
# Testing
Llama Stack uses a record-replay system for reliable, fast, and cost-effective testing of AI applications.
## Testing Documentation
```{toctree}
:maxdepth: 1
testing/index
testing/integration-testing
testing/record-replay
testing/writing-tests
testing/troubleshooting
```
## Quick Start
```bash
# Run tests with existing recordings
uv run pytest tests/integration/
# Test against live APIs
FIREWORKS_API_KEY=... pytest tests/integration/ --stack-config=server:fireworks
```
For detailed information, see the [Testing Overview](testing/index.md).
---
## Original Documentation
```{include} ../../../tests/README.md
```
```{include} ../../../tests/unit/README.md
```
```{include} ../../../tests/integration/README.md
```


@@ -1,103 +0,0 @@
# Testing in Llama Stack
Llama Stack uses a record-replay testing system to handle AI API costs, non-deterministic responses, and multiple provider integrations.
## Core Problems
Testing AI applications creates three challenges:
- **API costs** accumulate quickly during development and CI
- **Non-deterministic responses** make tests unreliable
- **Multiple providers** require testing the same logic across different APIs
## Solution
Record real API responses once, replay them for fast, deterministic tests.
## Architecture Overview
### Test Types
- **Unit tests** (`tests/unit/`) - Test components in isolation with mocks
- **Integration tests** (`tests/integration/`) - Test complete workflows with record-replay
### Core Components
#### Record-Replay System
Captures API calls and replays them deterministically:
```python
# Record real API responses
with inference_recording(mode=InferenceMode.RECORD, storage_dir="recordings"):
    response = await client.chat.completions.create(...)

# Replay cached responses
with inference_recording(mode=InferenceMode.REPLAY, storage_dir="recordings"):
    response = await client.chat.completions.create(...)  # No API call made
```
#### Provider Testing
Write tests once, run against any provider:
```bash
# Same test, different providers
pytest tests/integration/inference/ --stack-config=openai --text-model=gpt-4
pytest tests/integration/inference/ --stack-config=starter --text-model=llama3.2:3b
```
#### Test Parametrization
Generate test combinations from CLI arguments:
```bash
# Creates test for each model/provider combination
pytest tests/integration/ \
--stack-config=inference=fireworks \
--text-model=llama-3.1-8b,llama-3.1-70b
```
## How It Works
### Recording Storage
Recordings use SQLite for lookup and JSON for storage:
```
recordings/
├── index.sqlite # Fast lookup by request hash
└── responses/
├── abc123def456.json # Individual response files
└── def789ghi012.json
```
### Why Record-Replay?
Mocking AI APIs is brittle. Real API responses:
- Include edge cases and realistic data structures
- Preserve streaming behavior
- Can be inspected and debugged
### Why Test All Providers?
One test verifies behavior across all providers, catching integration bugs early.
## Workflow
1. **Develop tests** in `LIVE` mode against real APIs
2. **Record responses** with `RECORD` mode
3. **Commit recordings** for deterministic CI
4. **Tests replay** cached responses in CI
## Quick Start
```bash
# Run tests with existing recordings
uv run pytest tests/integration/
# Test against live APIs
FIREWORKS_API_KEY=... pytest tests/integration/ --stack-config=server:fireworks
```
See [Integration Testing](integration-testing.md) for usage details and [Record-Replay](record-replay.md) for system internals.


@@ -1,136 +0,0 @@
# Integration Testing Guide
Practical usage of Llama Stack's integration testing system.
## Basic Usage
```bash
# Run all integration tests
uv run pytest tests/integration/
# Run specific test suites
uv run pytest tests/integration/inference/
uv run pytest tests/integration/agents/
```
## Live API Testing
```bash
# Auto-start server
export FIREWORKS_API_KEY=your_key
pytest tests/integration/inference/ \
--stack-config=server:fireworks \
--text-model=meta-llama/Llama-3.1-8B-Instruct
# Library client
export TOGETHER_API_KEY=your_key
pytest tests/integration/inference/ \
--stack-config=starter \
--text-model=meta-llama/Llama-3.1-8B-Instruct
```
## Configuration
### Stack Config
```bash
--stack-config=server:fireworks # Auto-start server
--stack-config=server:together:8322 # Custom port
--stack-config=starter # Template
--stack-config=/path/to/run.yaml # Config file
--stack-config=inference=fireworks # Adhoc providers
--stack-config=http://localhost:5001 # Existing server
```
### Models
```bash
--text-model=meta-llama/Llama-3.1-8B-Instruct
--vision-model=meta-llama/Llama-3.2-11B-Vision-Instruct
--embedding-model=sentence-transformers/all-MiniLM-L6-v2
```
### Environment
```bash
--env FIREWORKS_API_KEY=your_key
--env OPENAI_BASE_URL=http://localhost:11434/v1
```
## Test Scenarios
### New Provider Testing
```bash
# Test new provider
pytest tests/integration/inference/ \
--stack-config=inference=your-new-provider \
--text-model=your-model-id
```
### Multiple Models
```bash
# Test multiple models
pytest tests/integration/inference/ \
--text-model=llama-3.1-8b,llama-3.1-70b
```
### Local Development
```bash
# Test with local Ollama
pytest tests/integration/inference/ \
--stack-config=starter \
--text-model=llama3.2:3b
```
## Recording Modes
```bash
# Live API calls (default)
LLAMA_STACK_TEST_INFERENCE_MODE=live pytest tests/integration/
# Record new responses
LLAMA_STACK_TEST_INFERENCE_MODE=record \
LLAMA_STACK_TEST_RECORDING_DIR=./recordings \
pytest tests/integration/inference/test_new.py
# Replay cached responses
LLAMA_STACK_TEST_INFERENCE_MODE=replay \
LLAMA_STACK_TEST_RECORDING_DIR=./recordings \
pytest tests/integration/
```
## Recording Management
```bash
# View recordings
sqlite3 recordings/index.sqlite "SELECT * FROM recordings;"
cat recordings/responses/abc123.json
# Re-record tests
rm -rf recordings/
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/test_specific.py
```
## Debugging
```bash
# Verbose output
pytest -vvs tests/integration/inference/
# Debug logging
LLAMA_STACK_LOG_LEVEL=DEBUG pytest tests/integration/test_failing.py
# Custom port for conflicts
pytest tests/integration/ --stack-config=server:fireworks:8322
```
## Best Practices
- Use existing recordings for development
- Record new interactions only when needed
- Test across multiple providers
- Use descriptive test names
- Commit recordings to version control


@@ -1,32 +1,46 @@
# Record-Replay System
Understanding how Llama Stack captures and replays API interactions for testing.
## Overview
The record-replay system solves a fundamental challenge in AI testing: how do you test against expensive, non-deterministic APIs without breaking the bank or dealing with flaky tests?
The solution: intercept API calls, store real responses, and replay them later. This gives you real API behavior without the cost or variability.
## How It Works
### Request Hashing
Every API request gets converted to a deterministic hash for lookup:
```python
import hashlib
import json
from urllib.parse import urlparse


def normalize_request(method: str, url: str, headers: dict, body: dict) -> str:
    normalized = {
        "method": method.upper(),
        "endpoint": urlparse(url).path,  # Just the path, not full URL
        "body": body,  # Request parameters
    }
    return hashlib.sha256(json.dumps(normalized, sort_keys=True).encode()).hexdigest()
```
**Key insight:** The hashing is intentionally precise. Different whitespace, float precision, or parameter order produces different hashes. This prevents subtle bugs from false cache hits.
```python
# These produce DIFFERENT hashes:
{"content": "Hello world"}
{"content": "Hello world\n"}
{"temperature": 0.7}
{"temperature": 0.7000001}
```
### Client Interception
The system patches OpenAI and Ollama client methods to intercept calls before they leave your application. This happens transparently - your test code doesn't change.
### Storage Architecture
Recordings use a two-tier storage system optimized for both speed and debuggability:
```
recordings/
├── index.sqlite          # Fast lookup by request hash
└── responses/
    ├── abc123def456.json # Individual response files
    └── def789ghi012.json
```
**SQLite index** enables O(log n) hash lookups and metadata queries without loading response bodies.
**JSON files** store complete request/response pairs in human-readable format for debugging.
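Conceptually, a lookup hits the index first and only then reads a single JSON file. A minimal sketch of the idea, assuming hypothetical table and column names in `index.sqlite`:
```python
import json
import sqlite3
from pathlib import Path


def load_recording(storage_dir: str, request_hash: str):
    """Illustrative two-tier lookup: SQLite finds the hash, JSON holds the body."""
    index = sqlite3.connect(str(Path(storage_dir) / "index.sqlite"))
    try:
        # Column names here are assumptions for illustration.
        row = index.execute(
            "SELECT response_file FROM recordings WHERE request_hash = ?",
            (request_hash,),
        ).fetchone()
    finally:
        index.close()
    if row is None:
        return None  # caller decides whether to record or raise
    return json.loads((Path(storage_dir) / "responses" / row[0]).read_text())
```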
## Recording Modes
### LIVE Mode
Direct API calls with no recording or replay:
```python
with inference_recording(mode=InferenceMode.LIVE):
    response = await client.chat.completions.create(...)
```
Use for initial development and debugging against real APIs.
### RECORD Mode
Captures API interactions while passing through real responses:
```python
with inference_recording(mode=InferenceMode.RECORD, storage_dir="./recordings"):
    response = await client.chat.completions.create(...)
    # Real API call made, response captured AND returned
```
The recording process:
1. Request intercepted and hashed
2. Real API call executed
3. Response captured and serialized
4. Recording stored to disk
5. Original response returned to caller
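Condensed into code, the non-streaming record path looks roughly like this (a sketch that reuses the `storage.store_recording` and `_serialize_response` helpers shown later on this page; the real signatures may differ):
```python
async def handle_record(real_call, request_hash, request_data, storage):
    # Steps 1-2: the request was already intercepted and hashed; make the real API call.
    response = await real_call()
    # Steps 3-4: serialize the response and persist it next to the request details.
    storage.store_recording(request_hash, request_data, {
        "body": _serialize_response(response),
        "is_streaming": False,
    })
    # Step 5: hand the original response back to the caller unchanged.
    return response
```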
### REPLAY Mode
Returns stored responses instead of making API calls:
```python
with inference_recording(mode=InferenceMode.REPLAY, storage_dir="./recordings"):
    response = await client.chat.completions.create(...)
    # No API call made, cached response returned instantly
```
The replay process:
1. Request intercepted and hashed
2. Hash looked up in SQLite index
3. Response loaded from JSON file
4. Response deserialized and returned
5. Error if no recording found
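In code form, the replay path is a lookup that either returns a rebuilt response or fails loudly. A sketch, assuming a `load_recording` counterpart to the storage call used elsewhere on this page and the deserializer described below:
```python
async def handle_replay(request_hash, endpoint, model, storage):
    # Steps 2-3: find the hash in the index and load the JSON body.
    recording = storage.load_recording(request_hash)
    if recording is None:
        # Step 5: this is the error surfaced in the troubleshooting guide.
        raise RuntimeError(
            f"No recorded response found for request hash: {request_hash}\n"
            f"Endpoint: {endpoint}\nModel: {model}"
        )
    # Step 4: rebuild the original response object before returning it.
    return _deserialize_response(recording["response"]["body"])
```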
## Streaming Support
Streaming APIs present a unique challenge: how do you capture an async generator?
### The Problem
```python
# How do you record this?
async for chunk in client.chat.completions.create(stream=True):
    process(chunk)
```
### The Solution
The system captures all chunks immediately before yielding any:
```python
async def handle_streaming_record(response):
    # Capture complete stream first
    chunks = []
    async for chunk in response:
        chunks.append(chunk)

    # Store complete recording
    storage.store_recording(request_hash, request_data, {
        "body": chunks,
        "is_streaming": True,
    })

    # Return generator that replays captured chunks
    async def replay_stream():
        for chunk in chunks:
            yield chunk

    return replay_stream()
```
This ensures:
- **Complete capture** - The entire stream is saved atomically
- **Interface preservation** - The returned object behaves like the original API
- **Deterministic replay** - Same chunks in the same order every time
## Serialization
API responses contain complex Pydantic objects that need careful serialization:
```python
def _serialize_response(response):
    if hasattr(response, "model_dump"):
        # Preserve type information for proper deserialization
        return {
            "__type__": f"{response.__class__.__module__}.{response.__class__.__qualname__}",
            "__data__": response.model_dump(mode="json"),
        }
    return response
```
This preserves type safety - when replayed, you get the same Pydantic objects with all their validation and methods.
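The matching deserializer is roughly the inverse (a sketch, not the actual implementation; it assumes top-level Pydantic classes reachable by their module path):
```python
import importlib


def _deserialize_response(data):
    if isinstance(data, dict) and "__type__" in data:
        module_name, _, class_name = data["__type__"].rpartition(".")
        cls = getattr(importlib.import_module(module_name), class_name)
        # model_validate re-runs Pydantic validation on the stored JSON payload.
        return cls.model_validate(data["__data__"])
    return data
```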
## Environment Integration
### Environment Variables
Control recording behavior globally:
```bash
export LLAMA_STACK_TEST_INFERENCE_MODE=replay
export LLAMA_STACK_TEST_RECORDING_DIR=/path/to/recordings
pytest tests/integration/
```
### Pytest Integration
The system integrates automatically based on environment variables, requiring no changes to test code.
## Debugging Recordings
### Inspecting Storage
```bash
# See what's recorded
sqlite3 recordings/index.sqlite "SELECT endpoint, model, timestamp FROM recordings LIMIT 10;"
# View specific response
cat recordings/responses/abc123def456.json | jq '.response.body'
# Find recordings by endpoint
sqlite3 recordings/index.sqlite "SELECT * FROM recordings WHERE endpoint='/v1/chat/completions';"
```
### Common Issues
**Hash mismatches:** Request parameters changed slightly between record and replay
```bash
# Compare request details
cat recordings/responses/abc123.json | jq '.request'
```
**Serialization errors:** Response types changed between versions
```bash
# Re-record with updated types
rm recordings/responses/failing_hash.json
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_failing.py
```
**Missing recordings:** New test or changed parameters
```bash
# Record the missing interaction
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_new.py
```
## Design Decisions
### Why Not Mocks?
Traditional mocking breaks down with AI APIs because:
- Response structures are complex and evolve frequently
- Streaming behavior is hard to mock correctly
- Edge cases in real APIs get missed
- Mocks become brittle maintenance burdens
### Why Precise Hashing?
Loose hashing (normalizing whitespace, rounding floats) seems convenient but hides bugs. If a test changes slightly, you want to know about it rather than accidentally getting the wrong cached response.
### Why JSON + SQLite?
- **JSON** - Human readable, diff-friendly, easy to inspect and modify
- **SQLite** - Fast indexed lookups without loading response bodies
- **Hybrid** - Best of both worlds for different use cases
This system provides reliable, fast testing against real AI APIs while maintaining the ability to debug issues when they arise.


@@ -1,528 +1,140 @@
# Testing Troubleshooting Guide
This guide covers common issues encountered when working with Llama Stack's testing infrastructure and how to resolve them.
## Quick Diagnosis
### Test Status Quick Check
```bash
# Check if tests can run at all
uv run pytest tests/integration/inference/test_embedding.py::test_basic_embeddings -v
# Check available models and providers
uv run llama stack list-providers
uv run llama stack list-models
# Verify server connectivity
curl http://localhost:5001/v1/health
```
## Recording and Replay Issues
### "No recorded response found for request hash"
**Symptom:**
```
RuntimeError: No recorded response found for request hash: abc123def456
Endpoint: /v1/chat/completions
Model: meta-llama/Llama-3.1-8B-Instruct
```
**Causes and Solutions:**
1. **Missing recording** - Most common cause
```bash
# Record the missing interaction
LLAMA_STACK_TEST_INFERENCE_MODE=record \
LLAMA_STACK_TEST_RECORDING_DIR=./test_recordings \
pytest tests/integration/inference/test_failing.py -v
```
2. **Request parameters changed** - Different whitespace, float precision, or parameter order produces a different hash; the hashing is intentionally precise (see the hash-check snippet after this list)
```bash
# Check what changed by comparing requests
sqlite3 test_recordings/index.sqlite \
"SELECT request_hash, endpoint, model, timestamp FROM recordings WHERE endpoint='/v1/chat/completions';"
# View specific request details
cat test_recordings/responses/abc123def456.json | jq '.request'
```
3. **Different environment/provider**
```bash
# Ensure consistent test environment
pytest tests/integration/ --stack-config=starter --text-model=llama3.2:3b
```
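If you suspect cause 2, you can recompute the hash for the request you expect to be recorded, using the same `normalize_request` helper the recorder uses (the endpoint and body below are example values), and look for it in the index:
```python
from llama_stack.testing.inference_recorder import normalize_request

request_hash = normalize_request(
    "POST",
    "http://localhost:11434/v1/chat/completions",  # example endpoint
    {},
    {"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello"}]},
)
# Compare against: sqlite3 test_recordings/index.sqlite "SELECT request_hash FROM recordings;"
print(request_hash)
```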
### Recording Failures
**Symptom:**
```
sqlite3.OperationalError: database is locked
``` ```
**Solutions:**
1. **Concurrent access** - Multiple test processes
```bash
# Run tests sequentially
pytest tests/integration/ -n 1
# Or use separate recording directories
LLAMA_STACK_TEST_RECORDING_DIR=./recordings_$(date +%s) pytest ...
```
2. **Incomplete recording cleanup**
```bash
# Clear and restart recording
rm -rf test_recordings/
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/inference/test_specific.py
```
### Serialization/Deserialization Errors
**Symptom:**
```
Failed to deserialize object of type llama_stack.apis.inference.OpenAIChatCompletion
```
**Causes and Solutions:**
1. **API response format changed**
```bash
# Re-record with updated format
rm test_recordings/responses/abc123*.json
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/inference/test_failing.py
```
2. **Missing dependencies for deserialization**
```bash
# Ensure all required packages installed
uv install --group dev
```
3. **Version mismatch between record and replay**
```bash
# Check Python environment consistency
uv run python -c "import llama_stack; print(llama_stack.__version__)"
```
## Server Connection Issues
### "Connection refused" Errors
**Symptom:**
```
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5001)
```
**Diagnosis and Solutions:**
1. **Server not running**
```bash
# Check if server is running
curl http://localhost:5001/v1/health
# Start server manually for debugging
llama stack run --template starter --port 5001
```
2. **Port conflicts**
```bash
# Check what's using the port
lsof -i :5001
# Use different port
pytest tests/integration/ --stack-config=server:starter:8322
```
3. **Server startup timeout**
```bash
# Increase startup timeout or check server logs
tail -f server.log
# Manual server management
llama stack run --template starter &
sleep 30 # Wait for startup
pytest tests/integration/
```
### Auto-Server Startup Issues
**Symptom:**
```
Server failed to respond within 30 seconds
```
**Solutions:**
1. **Check server logs**
```bash
# Server logs are written to server.log
tail -f server.log
# Look for startup errors
grep -i error server.log
```
2. **Dependencies missing**
```bash
# Ensure all dependencies installed
uv install --group dev
# Check specific provider requirements
pip list | grep -i fireworks
```
3. **Resource constraints**
```bash
# Check system resources
htop
df -h
# Use lighter config for testing
pytest tests/integration/ --stack-config=starter
```
## Provider and Model Issues
### "Model not found" Errors
**Symptom:**
```
Model 'meta-llama/Llama-3.1-8B-Instruct' not found
```
**Solutions:**
1. **Check available models**
```bash
# List models for current provider
uv run llama stack list-models
# Use available model
pytest tests/integration/ --text-model=llama3.2:3b
```
2. **Model not downloaded for local providers**
```bash
# Download missing model
ollama pull llama3.2:3b
# Verify model available
ollama list
```
3. **Provider configuration issues**
```bash
# Check provider setup
uv run llama stack list-providers
# Verify API keys set
echo $FIREWORKS_API_KEY
```
### Provider Authentication Failures
**Symptom:**
```
HTTP 401: Invalid API key
```
**Cause:** Missing or invalid API key for the provider you're testing.
**Solutions:**
1. **Missing API keys**
```bash
# Set the required API key
export FIREWORKS_API_KEY=your_key_here
export OPENAI_API_KEY=your_key_here
# Verify it's set
echo $FIREWORKS_API_KEY
```
2. **Invalid API keys**
```bash
# Test API key directly
curl -H "Authorization: Bearer $FIREWORKS_API_KEY" \
https://api.fireworks.ai/inference/v1/models
```
3. **API key environment issues**
```bash
# Pass environment explicitly
pytest tests/integration/ --env FIREWORKS_API_KEY=your_key
```
## Parametrization Issues
### "No tests ran matching the given pattern"
**Symptom:**
```
collected 0 items
```
**Causes and Solutions:**
1. **No models specified**
```bash
# Specify required models
pytest tests/integration/inference/ --text-model=llama3.2:3b
```
2. **Model/provider mismatch**
```bash
# Use compatible model for provider
pytest tests/integration/ \
--stack-config=starter \
--text-model=llama3.2:3b # Available in Ollama
```
3. **Missing fixtures**
```bash
# Check test requirements
pytest tests/integration/inference/test_embedding.py --collect-only
```
### Excessive Test Combinations
**Symptom:**
Tests run for too many parameter combinations, taking too long.
**Solutions:**
1. **Limit model combinations**
```bash
# Test single model instead of list
pytest tests/integration/ --text-model=llama3.2:3b
```
2. **Use specific test selection**
```bash
# Run specific test pattern
pytest tests/integration/ -k "basic and not vision"
```
3. **Separate test runs**
```bash
# Split by functionality
pytest tests/integration/inference/ --text-model=model1
pytest tests/integration/agents/ --text-model=model2
```
## Performance Issues
### Slow Test Execution
**Symptom:**
Tests take much longer than expected.
**Diagnosis and Solutions:**
1. **Using LIVE mode instead of REPLAY**
```bash
# Verify recording mode
echo $LLAMA_STACK_TEST_INFERENCE_MODE
# Force replay mode
LLAMA_STACK_TEST_INFERENCE_MODE=replay pytest tests/integration/
```
2. **Network latency to providers**
```bash
# Use local providers for development
pytest tests/integration/ --stack-config=starter
```
3. **Large recording files**
```bash
# Check recording directory size
du -sh test_recordings/
# Clean up old recordings
find test_recordings/ -name "*.json" -mtime +30 -delete
```
### Memory Usage Issues
**Symptom:**
```
MemoryError: Unable to allocate memory
```
**Solutions:**
1. **Large recordings in memory**
```bash
# Run tests in smaller batches
pytest tests/integration/inference/ -k "not batch"
```
2. **Model memory requirements**
```bash
# Use smaller models for testing
pytest tests/integration/ --text-model=llama3.2:3b # Instead of 70B
```
## Environment Issues
### Python Environment Problems
**Symptom:**
```
ModuleNotFoundError: No module named 'llama_stack'
```
**Solutions:**
1. **Wrong Python environment**
```bash
# Verify uv environment
uv run python -c "import llama_stack; print('OK')"
# Reinstall if needed
uv install --group dev
```
2. **Development installation issues**
```bash
# Reinstall in development mode
pip install -e .
# Verify installation
python -c "import llama_stack; print(llama_stack.__file__)"
```
### Path and Import Issues
**Symptom:**
```
ImportError: cannot import name 'LlamaStackClient'
```
**Solutions:**
1. **PYTHONPATH issues**
```bash
# Run from project root
cd /path/to/llama-stack
uv run pytest tests/integration/
```
2. **Relative import issues**
```bash
# Use absolute imports in tests
from llama_stack_client import LlamaStackClient # Not relative
```
## Debugging Techniques
### Verbose Logging
Enable detailed logging to understand what's happening:
```bash
# Enable debug logging
LLAMA_STACK_LOG_LEVEL=DEBUG pytest tests/integration/inference/test_failing.py -v -s
# Enable request/response logging
LLAMA_STACK_TEST_INFERENCE_MODE=live \
LLAMA_STACK_LOG_LEVEL=DEBUG \
pytest tests/integration/inference/test_failing.py -v -s
```
### Interactive Debugging
Drop into debugger when tests fail:
```bash
# Run with pdb on failure
pytest tests/integration/inference/test_failing.py --pdb
```
```python
# Or add a breakpoint in test code
def test_something(llama_stack_client):
    import pdb; pdb.set_trace()
    # ... test code
```
### Isolation Testing
Run tests in isolation to identify interactions:
```bash
# Run single test
pytest tests/integration/inference/test_embedding.py::test_basic_embeddings
# Run without recordings
rm -rf test_recordings/
LLAMA_STACK_TEST_INFERENCE_MODE=live pytest tests/integration/inference/test_failing.py
```
### Recording Inspection
Examine recordings to understand what's stored:
```bash
# Check recording database
sqlite3 test_recordings/index.sqlite ".tables"
sqlite3 test_recordings/index.sqlite ".schema recordings"
sqlite3 test_recordings/index.sqlite "SELECT * FROM recordings LIMIT 5;"
# Examine specific recording
find test_recordings/responses/ -name "*.json" | head -1 | xargs cat | jq '.'
# Compare request hashes
python -c "
from llama_stack.testing.inference_recorder import normalize_request
print(normalize_request('POST', 'http://localhost:11434/v1/chat/completions', {}, {'model': 'llama3.2:3b', 'messages': [{'role': 'user', 'content': 'Hello'}]}))
"
``` ```
## Getting Help
### Information to Gather
When reporting issues, include:
1. **Environment details:**
```bash
uv run python --version
uv run python -c "import llama_stack; print(llama_stack.__version__)"
uv list
```
2. **Test command and output:**
```bash
# Full command that failed
pytest tests/integration/inference/test_failing.py -v
# Error message and stack trace
```
3. **Configuration details:**
```bash
# Stack configuration used
echo $LLAMA_STACK_TEST_INFERENCE_MODE
ls -la test_recordings/
```
4. **Provider status:**
```bash
uv run llama stack list-providers
uv run llama stack list-models
```
### Common Solutions Summary
| Issue | Quick Fix |
|-------|-----------|
| Missing recordings | `LLAMA_STACK_TEST_INFERENCE_MODE=record pytest ...` |
| Connection refused | Check server: `curl http://localhost:5001/v1/health` |
| No tests collected | Add model: `--text-model=llama3.2:3b` |
| Authentication error | Set API key: `export PROVIDER_API_KEY=...` |
| Serialization error | Re-record: `rm recordings/*.json && record mode` |
| Slow tests | Use replay: `LLAMA_STACK_TEST_INFERENCE_MODE=replay` |
Most testing issues stem from configuration mismatches or missing recordings. The record-replay system is designed to be forgiving, but requires consistent environment setup for optimal performance.


@@ -1,125 +0,0 @@
# Writing Tests
How to write effective tests for Llama Stack.
## Basic Test Pattern
```python
def test_basic_completion(llama_stack_client, text_model_id):
    """Test basic text completion functionality."""
    response = llama_stack_client.inference.completion(
        model_id=text_model_id,
        content=CompletionMessage(role="user", content="Hello"),
    )

    # Test structure, not AI output quality
    assert response.completion_message is not None
    assert isinstance(response.completion_message.content, str)
    assert len(response.completion_message.content) > 0
```
## Parameterized Tests
```python
@pytest.mark.parametrize("temperature", [0.0, 0.5, 1.0])
def test_completion_temperature(llama_stack_client, text_model_id, temperature):
    response = llama_stack_client.inference.completion(
        model_id=text_model_id,
        content=CompletionMessage(role="user", content="Hello"),
        sampling_params={"temperature": temperature},
    )
    assert response.completion_message is not None
```
## Provider-Specific Tests
```python
def test_asymmetric_embeddings(llama_stack_client, embedding_model_id):
    if embedding_model_id not in MODELS_SUPPORTING_TASK_TYPE:
        pytest.skip(f"Model {embedding_model_id} doesn't support task types")

    query_response = llama_stack_client.inference.embeddings(
        model_id=embedding_model_id,
        contents=["What is machine learning?"],
        task_type="query",
    )
    passage_response = llama_stack_client.inference.embeddings(
        model_id=embedding_model_id,
        contents=["Machine learning is a subset of AI..."],
        task_type="passage",
    )

    assert query_response.embeddings != passage_response.embeddings
```
## Fixtures
```python
@pytest.fixture(scope="session")
def agent_config(llama_stack_client, text_model_id):
    """Reusable agent configuration."""
    return {
        "model": text_model_id,
        "instructions": "You are a helpful assistant",
        "tools": [],
        "enable_session_persistence": False,
    }


@pytest.fixture(scope="function")
def fresh_session(llama_stack_client):
    """Each test gets fresh state."""
    session = llama_stack_client.create_session()
    yield session
    session.delete()
```
## Common Test Patterns
### Streaming Tests
```python
def test_streaming_completion(llama_stack_client, text_model_id):
    stream = llama_stack_client.inference.completion(
        model_id=text_model_id,
        content=CompletionMessage(role="user", content="Count to 5"),
        stream=True,
    )

    chunks = list(stream)
    assert len(chunks) > 1
    assert all(hasattr(chunk, "delta") for chunk in chunks)
```
### Error Testing
```python
def test_invalid_model_error(llama_stack_client):
    with pytest.raises(Exception) as exc_info:
        llama_stack_client.inference.completion(
            model_id="nonexistent-model",
            content=CompletionMessage(role="user", content="Hello"),
        )
    assert "model" in str(exc_info.value).lower()
```
## What NOT to Test
```python
# BAD: Testing AI output quality
def test_completion_quality(llama_stack_client, text_model_id):
    response = llama_stack_client.inference.completion(...)
    assert "correct answer" in response.content  # Fragile!


# GOOD: Testing response structure
def test_completion_structure(llama_stack_client, text_model_id):
    response = llama_stack_client.inference.completion(...)
    assert isinstance(response.completion_message.content, str)
    assert len(response.completion_message.content) > 0
```
## Best Practices
- Test API contracts, not AI output quality
- Use descriptive test names
- Keep tests simple and focused
- Record new interactions only when needed
- Use appropriate fixture scopes (session vs function)


@@ -1,9 +1,64 @@
# Llama Stack Tests
There are two main types of tests:

| Type | Location | Purpose |
|------|----------|---------|
| **Unit** | [`tests/unit/`](unit/README.md) | Fast, isolated component testing |
| **Integration** | [`tests/integration/`](integration/README.md) | End-to-end workflows with record-replay |

Both have their place. For unit tests, it is important to create minimal mocks and instead rely more on "fakes". Mocks are too brittle. In either case, tests must be very fast and reliable.
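As an illustration of the "fake over mock" idea, here is a tiny in-memory stand-in for a key-value backend; the class and test are illustrative, not part of the Llama Stack codebase:
```python
class FakeKVStore:
    """In-memory stand-in for a persistent KV store: real behavior, no I/O."""

    def __init__(self):
        self._data: dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str, default=None):
        return self._data.get(key, default)


def test_session_state_round_trip():
    store = FakeKVStore()
    store.set("session:1", "state")
    # Unlike a mock, the fake enforces real get/set semantics across calls.
    assert store.get("session:1") == "state"
    assert store.get("missing") is None
```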
### Record-replay for integration tests

Testing AI applications end-to-end creates some challenges:
- **API costs** accumulate quickly during development and CI
- **Non-deterministic responses** make tests unreliable
- **Multiple providers** require testing the same logic across different APIs
Our solution: **Record real API responses once, replay them for fast, deterministic tests.** This is better than mocking because AI APIs have complex response structures and streaming behavior; mocks can miss edge cases that real APIs exhibit. A single test can also exercise the underlying APIs in several complex ways, which makes them very hard to mock.
This gives you:
- Cost control - No repeated API calls during development
- Speed - Instant test execution with cached responses
- Reliability - Consistent results regardless of external service state
- Provider coverage - Same tests work across OpenAI, Anthropic, local models, etc.
### Testing Quick Start
You can run the unit tests with:
```bash
uv run --group unit pytest -sv tests/unit/
```
For running integration tests, you must provide a few things:
- A stack config, which is a pointer to a stack. You can point to a stack in a few ways:
- **`server:<config>`** - automatically start a server with the given config (e.g., `server:starter`). This provides one-step testing by auto-starting the server if the port is available, or reusing an existing server if already running.
- **`server:<config>:<port>`** - same as above but with a custom port (e.g., `server:starter:8322`)
- a URL which points to a Llama Stack distribution server
- a distribution name (e.g., `starter`) or a path to a `run.yaml` file
- a comma-separated list of api=provider pairs, e.g. `inference=fireworks,safety=llama-guard,agents=meta-reference`. This is most useful for testing a single API surface.
- Whether you are using replay or live mode for inference, specified with the `LLAMA_STACK_TEST_INFERENCE_MODE` environment variable. The default mode is currently "live" -- that is certainly surprising, but we will fix this soon.
- Any API keys you need should be set in the environment, or can be passed in with the `--env` option.
You can run the integration tests in replay mode with:
```bash
# Run all tests with existing recordings
LLAMA_STACK_TEST_INFERENCE_MODE=replay \
LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings \
uv run --group test \
pytest -sv tests/integration/ --stack-config=starter
```
If you don't specify `LLAMA_STACK_TEST_INFERENCE_MODE`, tests run in "live" mode by default -- that is, they make real API calls.
```bash
# Test against live APIs
FIREWORKS_API_KEY=your_key pytest -sv tests/integration/inference --stack-config=starter
```
### Next Steps
- [Integration Testing Guide](integration/README.md) - Detailed usage and configuration
- [Unit Testing Guide](unit/README.md) - Fast component testing


@@ -1,6 +1,23 @@
# Integration Testing Guide
Integration tests verify complete workflows across different providers using Llama Stack's record-replay system.
## Quick Start
```bash
# Run all integration tests with existing recordings
uv run pytest tests/integration/
# Test against live APIs with auto-server
export FIREWORKS_API_KEY=your_key
pytest tests/integration/inference/ \
--stack-config=server:fireworks \
--text-model=meta-llama/Llama-3.1-8B-Instruct
```
## Configuration Options
You can see all options with:
```bash
cd tests/integration
```
@@ -114,3 +131,86 @@
```bash
pytest -s -v tests/integration/vector_io/ \
  --stack-config=inference=sentence-transformers,vector_io=sqlite-vec \
  --embedding-model=$EMBEDDING_MODELS
```
## Recording Modes
The testing system supports three modes controlled by environment variables:
### LIVE Mode (Default)
Tests make real API calls:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=live pytest tests/integration/
```
### RECORD Mode
Captures API interactions for later replay:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record \
LLAMA_STACK_TEST_RECORDING_DIR=./recordings \
pytest tests/integration/inference/test_new_feature.py
```
### REPLAY Mode
Uses cached responses instead of making API calls:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=replay \
LLAMA_STACK_TEST_RECORDING_DIR=./recordings \
pytest tests/integration/
```
## Managing Recordings
### Viewing Recordings
```bash
# See what's recorded
sqlite3 recordings/index.sqlite "SELECT endpoint, model, timestamp FROM recordings;"
# Inspect specific response
cat recordings/responses/abc123.json | jq '.'
```
### Re-recording Tests
```bash
# Re-record specific tests
rm -rf recordings/
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/test_modified.py
```
## Writing Tests
### Basic Test Pattern
```python
def test_basic_completion(llama_stack_client, text_model_id):
    response = llama_stack_client.inference.completion(
        model_id=text_model_id,
        content=CompletionMessage(role="user", content="Hello"),
    )

    # Test structure, not AI output quality
    assert response.completion_message is not None
    assert isinstance(response.completion_message.content, str)
    assert len(response.completion_message.content) > 0
```
### Provider-Specific Tests
```python
def test_asymmetric_embeddings(llama_stack_client, embedding_model_id):
    if embedding_model_id not in MODELS_SUPPORTING_TASK_TYPE:
        pytest.skip(f"Model {embedding_model_id} doesn't support task types")

    query_response = llama_stack_client.inference.embeddings(
        model_id=embedding_model_id,
        contents=["What is machine learning?"],
        task_type="query",
    )
    assert query_response.embeddings is not None
```
## Best Practices
- **Test API contracts, not AI output quality** - Focus on response structure, not content
- **Use existing recordings for development** - Fast iteration without API costs
- **Record new interactions only when needed** - Adding new functionality
- **Test across providers** - Ensure compatibility
- **Commit recordings to version control** - Deterministic CI builds