From 1e2bbd08da7eb7c9b10e702206759467f38c3ed0 Mon Sep 17 00:00:00 2001 From: Ashwin Bharambe Date: Thu, 14 Aug 2025 16:51:13 -0700 Subject: [PATCH] rewrote all slop --- docs/source/contributing/index.md | 23 +- docs/source/contributing/testing.md | 40 -- docs/source/contributing/testing/index.md | 103 --- .../testing/integration-testing.md | 136 ---- .../contributing/testing/record-replay.md | 192 +++++- .../contributing/testing/troubleshooting.md | 590 +++--------------- .../contributing/testing/writing-tests.md | 125 ---- tests/README.md | 69 +- tests/integration/README.md | 104 ++- 9 files changed, 452 insertions(+), 930 deletions(-) delete mode 100644 docs/source/contributing/testing.md delete mode 100644 docs/source/contributing/testing/index.md delete mode 100644 docs/source/contributing/testing/integration-testing.md delete mode 100644 docs/source/contributing/testing/writing-tests.md diff --git a/docs/source/contributing/index.md b/docs/source/contributing/index.md index 9f3fd8ea4..228258cdf 100644 --- a/docs/source/contributing/index.md +++ b/docs/source/contributing/index.md @@ -4,11 +4,11 @@ ## Adding a New Provider -See the [Adding a New API Provider Page](new_api_provider.md) which describes how to add new API providers to the Stack. +See: +- [Adding a New API Provider Page](new_api_provider.md) which describes how to add new API providers to the Stack. +- [Vector Database Page](new_vector_database.md) which describes how to add a new vector databases with Llama Stack. +- [External Provider Page](../providers/external/index.md) which describes how to add external providers to the Stack. -See the [Vector Database Page](new_vector_database.md) which describes how to add a new vector databases with Llama Stack. - -See the [External Provider Page](../providers/external/index.md) which describes how to add external providers to the Stack. ```{toctree} :maxdepth: 1 :hidden: @@ -19,12 +19,17 @@ new_vector_database ## Testing -Llama Stack uses a record-replay testing system for reliable, cost-effective testing. See the [Testing Documentation](testing.md) for comprehensive guides on writing and running tests. + +```{include} ../../../tests/README.md +``` + +### Advanced Topics + +For developers who need deeper understanding of the testing system internals: ```{toctree} :maxdepth: 1 -:hidden: -:caption: Testing -testing -``` \ No newline at end of file +testing/record-replay +testing/troubleshooting +``` diff --git a/docs/source/contributing/testing.md b/docs/source/contributing/testing.md deleted file mode 100644 index 32318c3b9..000000000 --- a/docs/source/contributing/testing.md +++ /dev/null @@ -1,40 +0,0 @@ -# Testing - -Llama Stack uses a record-replay system for reliable, fast, and cost-effective testing of AI applications. - -## Testing Documentation - -```{toctree} -:maxdepth: 1 - -testing/index -testing/integration-testing -testing/record-replay -testing/writing-tests -testing/troubleshooting -``` - -## Quick Start - -```bash -# Run tests with existing recordings -uv run pytest tests/integration/ - -# Test against live APIs -FIREWORKS_API_KEY=... pytest tests/integration/ --stack-config=server:fireworks -``` - -For detailed information, see the [Testing Overview](testing/index.md). 
- ---- - -## Original Documentation - -```{include} ../../../tests/README.md -``` - -```{include} ../../../tests/unit/README.md -``` - -```{include} ../../../tests/integration/README.md -``` diff --git a/docs/source/contributing/testing/index.md b/docs/source/contributing/testing/index.md deleted file mode 100644 index e8cb0f02c..000000000 --- a/docs/source/contributing/testing/index.md +++ /dev/null @@ -1,103 +0,0 @@ -# Testing in Llama Stack - -Llama Stack uses a record-replay testing system to handle AI API costs, non-deterministic responses, and multiple provider integrations. - -## Core Problems - -Testing AI applications creates three challenges: - -- **API costs** accumulate quickly during development and CI -- **Non-deterministic responses** make tests unreliable -- **Multiple providers** require testing the same logic across different APIs - -## Solution - -Record real API responses once, replay them for fast, deterministic tests. - -## Architecture Overview - -### Test Types - -- **Unit tests** (`tests/unit/`) - Test components in isolation with mocks -- **Integration tests** (`tests/integration/`) - Test complete workflows with record-replay - -### Core Components - -#### Record-Replay System - -Captures API calls and replays them deterministically: - -```python -# Record real API responses -with inference_recording(mode=InferenceMode.RECORD, storage_dir="recordings"): - response = await client.chat.completions.create(...) - -# Replay cached responses -with inference_recording(mode=InferenceMode.REPLAY, storage_dir="recordings"): - response = await client.chat.completions.create(...) # No API call made -``` - -#### Provider Testing - -Write tests once, run against any provider: - -```bash -# Same test, different providers -pytest tests/integration/inference/ --stack-config=openai --text-model=gpt-4 -pytest tests/integration/inference/ --stack-config=starter --text-model=llama3.2:3b -``` - -#### Test Parametrization - -Generate test combinations from CLI arguments: - -```bash -# Creates test for each model/provider combination -pytest tests/integration/ \ - --stack-config=inference=fireworks \ - --text-model=llama-3.1-8b,llama-3.1-70b -``` - -## How It Works - -### Recording Storage - -Recordings use SQLite for lookup and JSON for storage: - -``` -recordings/ -├── index.sqlite # Fast lookup by request hash -└── responses/ - ├── abc123def456.json # Individual response files - └── def789ghi012.json -``` - -### Why Record-Replay? - -Mocking AI APIs is brittle. Real API responses: -- Include edge cases and realistic data structures -- Preserve streaming behavior -- Can be inspected and debugged - -### Why Test All Providers? - -One test verifies behavior across all providers, catching integration bugs early. - -## Workflow - -1. **Develop tests** in `LIVE` mode against real APIs -2. **Record responses** with `RECORD` mode -3. **Commit recordings** for deterministic CI -4. **Tests replay** cached responses in CI - -## Quick Start - -```bash -# Run tests with existing recordings -uv run pytest tests/integration/ - -# Test against live APIs -FIREWORKS_API_KEY=... pytest tests/integration/ --stack-config=server:fireworks -``` - -See [Integration Testing](integration-testing.md) for usage details and [Record-Replay](record-replay.md) for system internals. 
\ No newline at end of file diff --git a/docs/source/contributing/testing/integration-testing.md b/docs/source/contributing/testing/integration-testing.md deleted file mode 100644 index 17f869d9b..000000000 --- a/docs/source/contributing/testing/integration-testing.md +++ /dev/null @@ -1,136 +0,0 @@ -# Integration Testing Guide - -Practical usage of Llama Stack's integration testing system. - -## Basic Usage - -```bash -# Run all integration tests -uv run pytest tests/integration/ - -# Run specific test suites -uv run pytest tests/integration/inference/ -uv run pytest tests/integration/agents/ -``` - -## Live API Testing - -```bash -# Auto-start server -export FIREWORKS_API_KEY=your_key -pytest tests/integration/inference/ \ - --stack-config=server:fireworks \ - --text-model=meta-llama/Llama-3.1-8B-Instruct - -# Library client -export TOGETHER_API_KEY=your_key -pytest tests/integration/inference/ \ - --stack-config=starter \ - --text-model=meta-llama/Llama-3.1-8B-Instruct -``` - -## Configuration - -### Stack Config - -```bash ---stack-config=server:fireworks # Auto-start server ---stack-config=server:together:8322 # Custom port ---stack-config=starter # Template ---stack-config=/path/to/run.yaml # Config file ---stack-config=inference=fireworks # Adhoc providers ---stack-config=http://localhost:5001 # Existing server -``` - -### Models - -```bash ---text-model=meta-llama/Llama-3.1-8B-Instruct ---vision-model=meta-llama/Llama-3.2-11B-Vision-Instruct ---embedding-model=sentence-transformers/all-MiniLM-L6-v2 -``` - -### Environment - -```bash ---env FIREWORKS_API_KEY=your_key ---env OPENAI_BASE_URL=http://localhost:11434/v1 -``` - -## Test Scenarios - -### New Provider Testing - -```bash -# Test new provider -pytest tests/integration/inference/ \ - --stack-config=inference=your-new-provider \ - --text-model=your-model-id -``` - -### Multiple Models - -```bash -# Test multiple models -pytest tests/integration/inference/ \ - --text-model=llama-3.1-8b,llama-3.1-70b -``` - -### Local Development - -```bash -# Test with local Ollama -pytest tests/integration/inference/ \ - --stack-config=starter \ - --text-model=llama3.2:3b -``` - -## Recording Modes - -```bash -# Live API calls (default) -LLAMA_STACK_TEST_INFERENCE_MODE=live pytest tests/integration/ - -# Record new responses -LLAMA_STACK_TEST_INFERENCE_MODE=record \ -LLAMA_STACK_TEST_RECORDING_DIR=./recordings \ -pytest tests/integration/inference/test_new.py - -# Replay cached responses -LLAMA_STACK_TEST_INFERENCE_MODE=replay \ -LLAMA_STACK_TEST_RECORDING_DIR=./recordings \ -pytest tests/integration/ -``` - -## Recording Management - -```bash -# View recordings -sqlite3 recordings/index.sqlite "SELECT * FROM recordings;" -cat recordings/responses/abc123.json - -# Re-record tests -rm -rf recordings/ -LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/test_specific.py -``` - -## Debugging - -```bash -# Verbose output -pytest -vvs tests/integration/inference/ - -# Debug logging -LLAMA_STACK_LOG_LEVEL=DEBUG pytest tests/integration/test_failing.py - -# Custom port for conflicts -pytest tests/integration/ --stack-config=server:fireworks:8322 -``` - -## Best Practices - -- Use existing recordings for development -- Record new interactions only when needed -- Test across multiple providers -- Use descriptive test names -- Commit recordings to version control \ No newline at end of file diff --git a/docs/source/contributing/testing/record-replay.md b/docs/source/contributing/testing/record-replay.md index 90ee4c7a1..7ec2fc871 100644 --- 
a/docs/source/contributing/testing/record-replay.md +++ b/docs/source/contributing/testing/record-replay.md @@ -1,32 +1,46 @@ # Record-Replay System -The record-replay system captures real API interactions and replays them deterministically for fast, reliable testing. +Understanding how Llama Stack captures and replays API interactions for testing. + +## Overview + +The record-replay system solves a fundamental challenge in AI testing: how do you test against expensive, non-deterministic APIs without breaking the bank or dealing with flaky tests? + +The solution: intercept API calls, store real responses, and replay them later. This gives you real API behavior without the cost or variability. ## How It Works ### Request Hashing -API requests are hashed to enable consistent lookup: +Every API request gets converted to a deterministic hash for lookup: ```python def normalize_request(method: str, url: str, headers: dict, body: dict) -> str: normalized = { "method": method.upper(), - "endpoint": urlparse(url).path, - "body": body + "endpoint": urlparse(url).path, # Just the path, not full URL + "body": body # Request parameters } return hashlib.sha256(json.dumps(normalized, sort_keys=True).encode()).hexdigest() ``` -Hashing is precise - different whitespace or float precision produces different hashes. +**Key insight:** The hashing is intentionally precise. Different whitespace, float precision, or parameter order produces different hashes. This prevents subtle bugs from false cache hits. + +```python +# These produce DIFFERENT hashes: +{"content": "Hello world"} +{"content": "Hello world\n"} +{"temperature": 0.7} +{"temperature": 0.7000001} +``` ### Client Interception -The system patches OpenAI and Ollama client methods to intercept API calls before they leave the client. +The system patches OpenAI and Ollama client methods to intercept calls before they leave your application. This happens transparently - your test code doesn't change. -## Storage +### Storage Architecture -Recordings use SQLite for indexing and JSON for storage: +Recordings use a two-tier storage system optimized for both speed and debuggability: ``` recordings/ @@ -36,36 +50,120 @@ recordings/ └── def789ghi012.json ``` +**SQLite index** enables O(log n) hash lookups and metadata queries without loading response bodies. + +**JSON files** store complete request/response pairs in human-readable format for debugging. + ## Recording Modes ### LIVE Mode -Direct API calls, no recording/replay: + +Direct API calls with no recording or replay: + ```python with inference_recording(mode=InferenceMode.LIVE): response = await client.chat.completions.create(...) ``` +Use for initial development and debugging against real APIs. + ### RECORD Mode -Captures API interactions: + +Captures API interactions while passing through real responses: + ```python with inference_recording(mode=InferenceMode.RECORD, storage_dir="./recordings"): response = await client.chat.completions.create(...) - # Response captured AND returned + # Real API call made, response captured AND returned ``` +The recording process: +1. Request intercepted and hashed +2. Real API call executed +3. Response captured and serialized +4. Recording stored to disk +5. Original response returned to caller + ### REPLAY Mode -Uses stored recordings: + +Returns stored responses instead of making API calls: + ```python with inference_recording(mode=InferenceMode.REPLAY, storage_dir="./recordings"): response = await client.chat.completions.create(...) 
- # Returns cached response, no API call + # No API call made, cached response returned instantly ``` +The replay process: +1. Request intercepted and hashed +2. Hash looked up in SQLite index +3. Response loaded from JSON file +4. Response deserialized and returned +5. Error if no recording found + ## Streaming Support -Streaming responses are captured completely before any chunks are yielded, then replayed as an async generator that matches the original API behavior. +Streaming APIs present a unique challenge: how do you capture an async generator? -## Environment Variables +### The Problem + +```python +# How do you record this? +async for chunk in client.chat.completions.create(stream=True): + process(chunk) +``` + +### The Solution + +The system captures all chunks immediately before yielding any: + +```python +async def handle_streaming_record(response): + # Capture complete stream first + chunks = [] + async for chunk in response: + chunks.append(chunk) + + # Store complete recording + storage.store_recording(request_hash, request_data, { + "body": chunks, + "is_streaming": True + }) + + # Return generator that replays captured chunks + async def replay_stream(): + for chunk in chunks: + yield chunk + return replay_stream() +``` + +This ensures: +- **Complete capture** - The entire stream is saved atomically +- **Interface preservation** - The returned object behaves like the original API +- **Deterministic replay** - Same chunks in the same order every time + +## Serialization + +API responses contain complex Pydantic objects that need careful serialization: + +```python +def _serialize_response(response): + if hasattr(response, "model_dump"): + # Preserve type information for proper deserialization + return { + "__type__": f"{response.__class__.__module__}.{response.__class__.__qualname__}", + "__data__": response.model_dump(mode="json") + } + return response +``` + +This preserves type safety - when replayed, you get the same Pydantic objects with all their validation and methods. + +## Environment Integration + +### Environment Variables + +Control recording behavior globally: ```bash export LLAMA_STACK_TEST_INFERENCE_MODE=replay @@ -73,8 +171,64 @@ export LLAMA_STACK_TEST_RECORDING_DIR=/path/to/recordings pytest tests/integration/ ``` -## Common Issues +### Pytest Integration -- **"No recorded response found"** - Re-record with `RECORD` mode -- **Serialization errors** - Response types changed, re-record -- **Hash mismatches** - Request parameters changed slightly \ No newline at end of file +The system integrates automatically based on environment variables, requiring no changes to test code. 
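+
+A minimal sketch of how that wiring could look in a `conftest.py` fixture. This is illustrative only: the import path and the mapping from environment values to `InferenceMode` members are assumptions, and the actual hook in the repository may differ.
+
+```python
+# conftest.py -- illustrative sketch, not the actual hook
+import os
+
+import pytest
+
+from llama_stack.testing.inference_recorder import (  # import path assumed
+    InferenceMode,
+    inference_recording,
+)
+
+_MODES = {
+    "live": InferenceMode.LIVE,
+    "record": InferenceMode.RECORD,
+    "replay": InferenceMode.REPLAY,
+}
+
+
+@pytest.fixture(autouse=True)
+def _inference_mode():
+    """Wrap each test in the recording mode selected by the environment."""
+    mode = _MODES[os.environ.get("LLAMA_STACK_TEST_INFERENCE_MODE", "live")]
+    storage_dir = os.environ.get("LLAMA_STACK_TEST_RECORDING_DIR", "recordings")
+    if mode is InferenceMode.LIVE:
+        ctx = inference_recording(mode=mode)
+    else:
+        ctx = inference_recording(mode=mode, storage_dir=storage_dir)
+    with ctx:
+        yield
+```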
+ +## Debugging Recordings + +### Inspecting Storage + +```bash +# See what's recorded +sqlite3 recordings/index.sqlite "SELECT endpoint, model, timestamp FROM recordings LIMIT 10;" + +# View specific response +cat recordings/responses/abc123def456.json | jq '.response.body' + +# Find recordings by endpoint +sqlite3 recordings/index.sqlite "SELECT * FROM recordings WHERE endpoint='/v1/chat/completions';" +``` + +### Common Issues + +**Hash mismatches:** Request parameters changed slightly between record and replay +```bash +# Compare request details +cat recordings/responses/abc123.json | jq '.request' +``` + +**Serialization errors:** Response types changed between versions +```bash +# Re-record with updated types +rm recordings/responses/failing_hash.json +LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_failing.py +``` + +**Missing recordings:** New test or changed parameters +```bash +# Record the missing interaction +LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_new.py +``` + +## Design Decisions + +### Why Not Mocks? + +Traditional mocking breaks down with AI APIs because: +- Response structures are complex and evolve frequently +- Streaming behavior is hard to mock correctly +- Edge cases in real APIs get missed +- Mocks become brittle maintenance burdens + +### Why Precise Hashing? + +Loose hashing (normalizing whitespace, rounding floats) seems convenient but hides bugs. If a test changes slightly, you want to know about it rather than accidentally getting the wrong cached response. + +### Why JSON + SQLite? + +- **JSON** - Human readable, diff-friendly, easy to inspect and modify +- **SQLite** - Fast indexed lookups without loading response bodies +- **Hybrid** - Best of both worlds for different use cases + +This system provides reliable, fast testing against real AI APIs while maintaining the ability to debug issues when they arise. \ No newline at end of file diff --git a/docs/source/contributing/testing/troubleshooting.md b/docs/source/contributing/testing/troubleshooting.md index e894d6e4f..170f45b35 100644 --- a/docs/source/contributing/testing/troubleshooting.md +++ b/docs/source/contributing/testing/troubleshooting.md @@ -1,528 +1,140 @@ -# Testing Troubleshooting Guide +# Common Testing Issues -This guide covers common issues encountered when working with Llama Stack's testing infrastructure and how to resolve them. +The most frequent problems when working with Llama Stack's testing system. -## Quick Diagnosis +## Missing Recordings -### Test Status Quick Check - -```bash -# Check if tests can run at all -uv run pytest tests/integration/inference/test_embedding.py::test_basic_embeddings -v - -# Check available models and providers -uv run llama stack list-providers -uv run llama stack list-models - -# Verify server connectivity -curl http://localhost:5001/v1/health -``` - -## Recording and Replay Issues - -### "No recorded response found for request hash" - -**Symptom:** +**Error:** ``` RuntimeError: No recorded response found for request hash: abc123def456 Endpoint: /v1/chat/completions Model: meta-llama/Llama-3.1-8B-Instruct ``` -**Causes and Solutions:** +**Cause:** You're running a test that needs an API interaction that hasn't been recorded yet. -1. **Missing recording** - Most common cause - ```bash - # Record the missing interaction - LLAMA_STACK_TEST_INFERENCE_MODE=record \ - LLAMA_STACK_TEST_RECORDING_DIR=./test_recordings \ - pytest tests/integration/inference/test_failing.py -v - ``` - -2. 
**Request parameters changed** - ```bash - # Check what changed by comparing requests - sqlite3 test_recordings/index.sqlite \ - "SELECT request_hash, endpoint, model, timestamp FROM recordings WHERE endpoint='/v1/chat/completions';" - - # View specific request details - cat test_recordings/responses/abc123def456.json | jq '.request' - ``` - -3. **Different environment/provider** - ```bash - # Ensure consistent test environment - pytest tests/integration/ --stack-config=starter --text-model=llama3.2:3b - ``` - -### Recording Failures - -**Symptom:** -``` -sqlite3.OperationalError: database is locked +**Solution:** +```bash +# Record the missing interaction +LLAMA_STACK_TEST_INFERENCE_MODE=record \ +LLAMA_STACK_TEST_RECORDING_DIR=./recordings \ +pytest tests/integration/inference/test_your_test.py ``` -**Solutions:** +## API Key Issues -1. **Concurrent access** - Multiple test processes - ```bash - # Run tests sequentially - pytest tests/integration/ -n 1 - - # Or use separate recording directories - LLAMA_STACK_TEST_RECORDING_DIR=./recordings_$(date +%s) pytest ... - ``` - -2. **Incomplete recording cleanup** - ```bash - # Clear and restart recording - rm -rf test_recordings/ - LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/inference/test_specific.py - ``` - -### Serialization/Deserialization Errors - -**Symptom:** -``` -Failed to deserialize object of type llama_stack.apis.inference.OpenAIChatCompletion -``` - -**Causes and Solutions:** - -1. **API response format changed** - ```bash - # Re-record with updated format - rm test_recordings/responses/abc123*.json - LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/inference/test_failing.py - ``` - -2. **Missing dependencies for deserialization** - ```bash - # Ensure all required packages installed - uv install --group dev - ``` - -3. **Version mismatch between record and replay** - ```bash - # Check Python environment consistency - uv run python -c "import llama_stack; print(llama_stack.__version__)" - ``` - -## Server Connection Issues - -### "Connection refused" Errors - -**Symptom:** -``` -requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5001) -``` - -**Diagnosis and Solutions:** - -1. **Server not running** - ```bash - # Check if server is running - curl http://localhost:5001/v1/health - - # Start server manually for debugging - llama stack run --template starter --port 5001 - ``` - -2. **Port conflicts** - ```bash - # Check what's using the port - lsof -i :5001 - - # Use different port - pytest tests/integration/ --stack-config=server:starter:8322 - ``` - -3. **Server startup timeout** - ```bash - # Increase startup timeout or check server logs - tail -f server.log - - # Manual server management - llama stack run --template starter & - sleep 30 # Wait for startup - pytest tests/integration/ - ``` - -### Auto-Server Startup Issues - -**Symptom:** -``` -Server failed to respond within 30 seconds -``` - -**Solutions:** - -1. **Check server logs** - ```bash - # Server logs are written to server.log - tail -f server.log - - # Look for startup errors - grep -i error server.log - ``` - -2. **Dependencies missing** - ```bash - # Ensure all dependencies installed - uv install --group dev - - # Check specific provider requirements - pip list | grep -i fireworks - ``` - -3. 
**Resource constraints** - ```bash - # Check system resources - htop - df -h - - # Use lighter config for testing - pytest tests/integration/ --stack-config=starter - ``` - -## Provider and Model Issues - -### "Model not found" Errors - -**Symptom:** -``` -Model 'meta-llama/Llama-3.1-8B-Instruct' not found -``` - -**Solutions:** - -1. **Check available models** - ```bash - # List models for current provider - uv run llama stack list-models - - # Use available model - pytest tests/integration/ --text-model=llama3.2:3b - ``` - -2. **Model not downloaded for local providers** - ```bash - # Download missing model - ollama pull llama3.2:3b - - # Verify model available - ollama list - ``` - -3. **Provider configuration issues** - ```bash - # Check provider setup - uv run llama stack list-providers - - # Verify API keys set - echo $FIREWORKS_API_KEY - ``` - -### Provider Authentication Failures - -**Symptom:** +**Error:** ``` HTTP 401: Invalid API key ``` -**Solutions:** +**Cause:** Missing or invalid API key for the provider you're testing. -1. **Missing API keys** - ```bash - # Set required API key - export FIREWORKS_API_KEY=your_key_here - export OPENAI_API_KEY=your_key_here +**Solution:** +```bash +# Set the required API key +export FIREWORKS_API_KEY=your_key_here +export OPENAI_API_KEY=your_key_here - # Verify key is set - echo $FIREWORKS_API_KEY - ``` +# Verify it's set +echo $FIREWORKS_API_KEY +``` -2. **Invalid API keys** - ```bash - # Test API key directly - curl -H "Authorization: Bearer $FIREWORKS_API_KEY" \ - https://api.fireworks.ai/inference/v1/models - ``` +## Model Not Found -3. **API key environment issues** - ```bash - # Pass environment explicitly - pytest tests/integration/ --env FIREWORKS_API_KEY=your_key - ``` +**Error:** +``` +Model 'meta-llama/Llama-3.1-8B-Instruct' not found +``` -## Parametrization Issues +**Cause:** Model isn't available with the current provider or hasn't been downloaded locally. -### "No tests ran matching the given pattern" +**For local providers (Ollama):** +```bash +# Download the model +ollama pull llama3.2:3b -**Symptom:** +# Use the downloaded model +pytest tests/integration/ --text-model=llama3.2:3b +``` + +**For remote providers:** +```bash +# Check what models are available +uv run llama stack list-models + +# Use an available model +pytest tests/integration/ --text-model=available-model-id +``` + +## Server Connection Issues + +**Error:** +``` +requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5001) +``` + +**Cause:** Server isn't running or is on a different port. + +**Solution:** +```bash +# Check if server is running +curl http://localhost:5001/v1/health + +# Start server manually +llama stack run --template starter --port 5001 + +# Or use auto-server with custom port +pytest tests/integration/ --stack-config=server:starter:8322 +``` + +## Request Hash Mismatches + +**Problem:** Tests worked before but now fail with "No recorded response found" even though you didn't change the test. + +**Cause:** Request parameters changed slightly (different whitespace, float precision, etc.). The hashing is intentionally precise. 
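+
+For example, these near-identical request bodies produce different hashes, so a recording made with one will not be found for the other:
+
+```python
+# Each pair below hashes differently under the precise normalization:
+{"content": "Hello world"}
+{"content": "Hello world\n"}
+
+{"temperature": 0.7}
+{"temperature": 0.7000001}
+```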
+ +**Solution:** +```bash +# Check what's in your recordings +sqlite3 recordings/index.sqlite "SELECT endpoint, model FROM recordings;" + +# Re-record if the request legitimately changed +rm recordings/responses/old_hash.json +LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/your_test.py +``` + +## No Tests Collected + +**Error:** ``` collected 0 items ``` -**Causes and Solutions:** - -1. **No models specified** - ```bash - # Specify required models - pytest tests/integration/inference/ --text-model=llama3.2:3b - ``` - -2. **Model/provider mismatch** - ```bash - # Use compatible model for provider - pytest tests/integration/ \ - --stack-config=starter \ - --text-model=llama3.2:3b # Available in Ollama - ``` - -3. **Missing fixtures** - ```bash - # Check test requirements - pytest tests/integration/inference/test_embedding.py --collect-only - ``` - -### Excessive Test Combinations - -**Symptom:** -Tests run for too many parameter combinations, taking too long. - -**Solutions:** - -1. **Limit model combinations** - ```bash - # Test single model instead of list - pytest tests/integration/ --text-model=llama3.2:3b - ``` - -2. **Use specific test selection** - ```bash - # Run specific test pattern - pytest tests/integration/ -k "basic and not vision" - ``` - -3. **Separate test runs** - ```bash - # Split by functionality - pytest tests/integration/inference/ --text-model=model1 - pytest tests/integration/agents/ --text-model=model2 - ``` - -## Performance Issues - -### Slow Test Execution - -**Symptom:** -Tests take much longer than expected. - -**Diagnosis and Solutions:** - -1. **Using LIVE mode instead of REPLAY** - ```bash - # Verify recording mode - echo $LLAMA_STACK_TEST_INFERENCE_MODE - - # Force replay mode - LLAMA_STACK_TEST_INFERENCE_MODE=replay pytest tests/integration/ - ``` - -2. **Network latency to providers** - ```bash - # Use local providers for development - pytest tests/integration/ --stack-config=starter - ``` - -3. **Large recording files** - ```bash - # Check recording directory size - du -sh test_recordings/ - - # Clean up old recordings - find test_recordings/ -name "*.json" -mtime +30 -delete - ``` - -### Memory Usage Issues - -**Symptom:** -``` -MemoryError: Unable to allocate memory -``` - -**Solutions:** - -1. **Large recordings in memory** - ```bash - # Run tests in smaller batches - pytest tests/integration/inference/ -k "not batch" - ``` - -2. **Model memory requirements** - ```bash - # Use smaller models for testing - pytest tests/integration/ --text-model=llama3.2:3b # Instead of 70B - ``` - -## Environment Issues - -### Python Environment Problems - -**Symptom:** -``` -ModuleNotFoundError: No module named 'llama_stack' -``` - -**Solutions:** - -1. **Wrong Python environment** - ```bash - # Verify uv environment - uv run python -c "import llama_stack; print('OK')" - - # Reinstall if needed - uv install --group dev - ``` - -2. **Development installation issues** - ```bash - # Reinstall in development mode - pip install -e . - - # Verify installation - python -c "import llama_stack; print(llama_stack.__file__)" - ``` - -### Path and Import Issues - -**Symptom:** -``` -ImportError: cannot import name 'LlamaStackClient' -``` - -**Solutions:** - -1. **PYTHONPATH issues** - ```bash - # Run from project root - cd /path/to/llama-stack - uv run pytest tests/integration/ - ``` - -2. 
**Relative import issues** - ```bash - # Use absolute imports in tests - from llama_stack_client import LlamaStackClient # Not relative - ``` - -## Debugging Techniques - -### Verbose Logging - -Enable detailed logging to understand what's happening: +**Cause:** No models specified for tests that require model fixtures. +**Solution:** ```bash -# Enable debug logging -LLAMA_STACK_LOG_LEVEL=DEBUG pytest tests/integration/inference/test_failing.py -v -s - -# Enable request/response logging -LLAMA_STACK_TEST_INFERENCE_MODE=live \ -LLAMA_STACK_LOG_LEVEL=DEBUG \ -pytest tests/integration/inference/test_failing.py -v -s -``` - -### Interactive Debugging - -Drop into debugger when tests fail: - -```bash -# Run with pdb on failure -pytest tests/integration/inference/test_failing.py --pdb - -# Or add breakpoint in test code -def test_something(llama_stack_client): - import pdb; pdb.set_trace() - # ... test code -``` - -### Isolation Testing - -Run tests in isolation to identify interactions: - -```bash -# Run single test -pytest tests/integration/inference/test_embedding.py::test_basic_embeddings - -# Run without recordings -rm -rf test_recordings/ -LLAMA_STACK_TEST_INFERENCE_MODE=live pytest tests/integration/inference/test_failing.py -``` - -### Recording Inspection - -Examine recordings to understand what's stored: - -```bash -# Check recording database -sqlite3 test_recordings/index.sqlite ".tables" -sqlite3 test_recordings/index.sqlite ".schema recordings" -sqlite3 test_recordings/index.sqlite "SELECT * FROM recordings LIMIT 5;" - -# Examine specific recording -find test_recordings/responses/ -name "*.json" | head -1 | xargs cat | jq '.' - -# Compare request hashes -python -c " -from llama_stack.testing.inference_recorder import normalize_request -print(normalize_request('POST', 'http://localhost:11434/v1/chat/completions', {}, {'model': 'llama3.2:3b', 'messages': [{'role': 'user', 'content': 'Hello'}]})) -" +# Specify required models +pytest tests/integration/inference/ --text-model=llama3.2:3b +pytest tests/integration/embedding/ --embedding-model=all-MiniLM-L6-v2 ``` ## Getting Help -### Information to Gather - When reporting issues, include: -1. **Environment details:** - ```bash - uv run python --version - uv run python -c "import llama_stack; print(llama_stack.__version__)" - uv list - ``` +```bash +# Environment info +uv run python --version +uv run python -c "import llama_stack; print(llama_stack.__version__)" -2. **Test command and output:** - ```bash - # Full command that failed - pytest tests/integration/inference/test_failing.py -v +# Test command that failed +pytest tests/integration/your_test.py -v - # Error message and stack trace - ``` +# Stack configuration +echo $LLAMA_STACK_TEST_INFERENCE_MODE +ls -la recordings/ +``` -3. **Configuration details:** - ```bash - # Stack configuration used - echo $LLAMA_STACK_TEST_INFERENCE_MODE - ls -la test_recordings/ - ``` - -4. 
**Provider status:** - ```bash - uv run llama stack list-providers - uv run llama stack list-models - ``` - -### Common Solutions Summary - -| Issue | Quick Fix | -|-------|-----------| -| Missing recordings | `LLAMA_STACK_TEST_INFERENCE_MODE=record pytest ...` | -| Connection refused | Check server: `curl http://localhost:5001/v1/health` | -| No tests collected | Add model: `--text-model=llama3.2:3b` | -| Authentication error | Set API key: `export PROVIDER_API_KEY=...` | -| Serialization error | Re-record: `rm recordings/*.json && record mode` | -| Slow tests | Use replay: `LLAMA_STACK_TEST_INFERENCE_MODE=replay` | - -Most testing issues stem from configuration mismatches or missing recordings. The record-replay system is designed to be forgiving, but requires consistent environment setup for optimal performance. \ No newline at end of file +Most issues are solved by re-recording interactions or checking API keys/model availability. \ No newline at end of file diff --git a/docs/source/contributing/testing/writing-tests.md b/docs/source/contributing/testing/writing-tests.md deleted file mode 100644 index 0f1c937c0..000000000 --- a/docs/source/contributing/testing/writing-tests.md +++ /dev/null @@ -1,125 +0,0 @@ -# Writing Tests - -How to write effective tests for Llama Stack. - -## Basic Test Pattern - -```python -def test_basic_completion(llama_stack_client, text_model_id): - """Test basic text completion functionality.""" - response = llama_stack_client.inference.completion( - model_id=text_model_id, - content=CompletionMessage(role="user", content="Hello"), - ) - - # Test structure, not AI output quality - assert response.completion_message is not None - assert isinstance(response.completion_message.content, str) - assert len(response.completion_message.content) > 0 -``` - -## Parameterized Tests - -```python -@pytest.mark.parametrize("temperature", [0.0, 0.5, 1.0]) -def test_completion_temperature(llama_stack_client, text_model_id, temperature): - response = llama_stack_client.inference.completion( - model_id=text_model_id, - content=CompletionMessage(role="user", content="Hello"), - sampling_params={"temperature": temperature} - ) - assert response.completion_message is not None -``` - -## Provider-Specific Tests - -```python -def test_asymmetric_embeddings(llama_stack_client, embedding_model_id): - if embedding_model_id not in MODELS_SUPPORTING_TASK_TYPE: - pytest.skip(f"Model {embedding_model_id} doesn't support task types") - - query_response = llama_stack_client.inference.embeddings( - model_id=embedding_model_id, - contents=["What is machine learning?"], - task_type="query" - ) - - passage_response = llama_stack_client.inference.embeddings( - model_id=embedding_model_id, - contents=["Machine learning is a subset of AI..."], - task_type="passage" - ) - - assert query_response.embeddings != passage_response.embeddings -``` - -## Fixtures - -```python -@pytest.fixture(scope="session") -def agent_config(llama_stack_client, text_model_id): - """Reusable agent configuration.""" - return { - "model": text_model_id, - "instructions": "You are a helpful assistant", - "tools": [], - "enable_session_persistence": False, - } - -@pytest.fixture(scope="function") -def fresh_session(llama_stack_client): - """Each test gets fresh state.""" - session = llama_stack_client.create_session() - yield session - session.delete() -``` - -## Common Test Patterns - -### Streaming Tests -```python -def test_streaming_completion(llama_stack_client, text_model_id): - stream = 
llama_stack_client.inference.completion( - model_id=text_model_id, - content=CompletionMessage(role="user", content="Count to 5"), - stream=True - ) - - chunks = list(stream) - assert len(chunks) > 1 - assert all(hasattr(chunk, 'delta') for chunk in chunks) -``` - -### Error Testing -```python -def test_invalid_model_error(llama_stack_client): - with pytest.raises(Exception) as exc_info: - llama_stack_client.inference.completion( - model_id="nonexistent-model", - content=CompletionMessage(role="user", content="Hello") - ) - assert "model" in str(exc_info.value).lower() -``` - -## What NOT to Test - -```python -# BAD: Testing AI output quality -def test_completion_quality(llama_stack_client, text_model_id): - response = llama_stack_client.inference.completion(...) - assert "correct answer" in response.content # Fragile! - -# GOOD: Testing response structure -def test_completion_structure(llama_stack_client, text_model_id): - response = llama_stack_client.inference.completion(...) - assert isinstance(response.completion_message.content, str) - assert len(response.completion_message.content) > 0 -``` - -## Best Practices - -- Test API contracts, not AI output quality -- Use descriptive test names -- Keep tests simple and focused -- Record new interactions only when needed -- Use appropriate fixture scopes (session vs function) \ No newline at end of file diff --git a/tests/README.md b/tests/README.md index ed7064bfb..225cea944 100644 --- a/tests/README.md +++ b/tests/README.md @@ -1,9 +1,64 @@ -# Llama Stack Tests +There are two obvious types of tests: -Llama Stack has multiple layers of testing done to ensure continuous functionality and prevent regressions to the codebase. +| Type | Location | Purpose | +|------|----------|---------| +| **Unit** | [`tests/unit/`](unit/README.md) | Fast, isolated component testing | +| **Integration** | [`tests/integration/`](integration/README.md) | End-to-end workflows with record-replay | -| Testing Type | Details | -|--------------|---------| -| Unit | [unit/README.md](unit/README.md) | -| Integration | [integration/README.md](integration/README.md) | -| Verification | [verifications/README.md](verifications/README.md) | +Both have their place. For unit tests, it is important to create minimal mocks and instead rely more on "fakes". Mocks are too brittle. In either case, tests must be very fast and reliable. + +### Record-replay for integration tests + +Testing AI applications end-to-end creates some challenges: +- **API costs** accumulate quickly during development and CI +- **Non-deterministic responses** make tests unreliable +- **Multiple providers** require testing the same logic across different APIs + +Our solution: **Record real API responses once, replay them for fast, deterministic tests.** This is better than mocking because AI APIs have complex response structures and streaming behavior. Mocks can miss edge cases that real APIs exhibit. A single test can exercise underlying APIs in multiple complex ways making it really hard to mock. + +This gives you: +- Cost control - No repeated API calls during development +- Speed - Instant test execution with cached responses +- Reliability - Consistent results regardless of external service state +- Provider coverage - Same tests work across OpenAI, Anthropic, local models, etc. + +### Testing Quick Start + +You can run the unit tests with: +```bash +uv run --group unit pytest -sv tests/unit/ +``` + +For running integration tests, you must provide a few things: + +- A stack config. 
This is a pointer to a stack. You have a few ways to point to a stack: + - **`server:`** - automatically start a server with the given config (e.g., `server:starter`). This provides one-step testing by auto-starting the server if the port is available, or reusing an existing server if already running. + - **`server::`** - same as above but with a custom port (e.g., `server:starter:8322`) + - a URL which points to a Llama Stack distribution server + - a distribution name (e.g., `starter`) or a path to a `run.yaml` file + - a comma-separated list of api=provider pairs, e.g. `inference=fireworks,safety=llama-guard,agents=meta-reference`. This is most useful for testing a single API surface. + +- Whether you are using replay or live mode for inference. This is specified with the LLAMA_STACK_TEST_INFERENCE_MODE environment variable. The default mode currently is "live" -- that is certainly surprising, but we will fix this soon. + +- Any API keys you need to use should be set in the environment, or can be passed in with the --env option. + +You can run the integration tests in replay mode with: +```bash +# Run all tests with existing recordings +LLAMA_STACK_TEST_INFERENCE_MODE=replay \ + LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings \ + uv run --group test \ + pytest -sv tests/integration/ --stack-config=starter +``` + +If you don't specify LLAMA_STACK_TEST_INFERENCE_MODE, by default it will be in "live" mode -- that is, it will make real API calls. + +```bash +# Test against live APIs +FIREWORKS_API_KEY=your_key pytest -sv tests/integration/inference --stack-config=starter +``` + +### Next Steps + +- [Integration Testing Guide](integration/README.md) - Detailed usage and configuration +- [Unit Testing Guide](unit/README.md) - Fast component testing diff --git a/tests/integration/README.md b/tests/integration/README.md index 664116bea..666ebaeb6 100644 --- a/tests/integration/README.md +++ b/tests/integration/README.md @@ -1,6 +1,23 @@ -# Llama Stack Integration Tests +# Integration Testing Guide -We use `pytest` for parameterizing and running tests. You can see all options with: +Integration tests verify complete workflows across different providers using Llama Stack's record-replay system. 
+ +## Quick Start + +```bash +# Run all integration tests with existing recordings +uv run pytest tests/integration/ + +# Test against live APIs with auto-server +export FIREWORKS_API_KEY=your_key +pytest tests/integration/inference/ \ + --stack-config=server:fireworks \ + --text-model=meta-llama/Llama-3.1-8B-Instruct +``` + +## Configuration Options + +You can see all options with: ```bash cd tests/integration @@ -114,3 +131,86 @@ pytest -s -v tests/integration/vector_io/ \ --stack-config=inference=sentence-transformers,vector_io=sqlite-vec \ --embedding-model=$EMBEDDING_MODELS ``` + +## Recording Modes + +The testing system supports three modes controlled by environment variables: + +### LIVE Mode (Default) +Tests make real API calls: +```bash +LLAMA_STACK_TEST_INFERENCE_MODE=live pytest tests/integration/ +``` + +### RECORD Mode +Captures API interactions for later replay: +```bash +LLAMA_STACK_TEST_INFERENCE_MODE=record \ +LLAMA_STACK_TEST_RECORDING_DIR=./recordings \ +pytest tests/integration/inference/test_new_feature.py +``` + +### REPLAY Mode +Uses cached responses instead of making API calls: +```bash +LLAMA_STACK_TEST_INFERENCE_MODE=replay \ +LLAMA_STACK_TEST_RECORDING_DIR=./recordings \ +pytest tests/integration/ +``` + +## Managing Recordings + +### Viewing Recordings +```bash +# See what's recorded +sqlite3 recordings/index.sqlite "SELECT endpoint, model, timestamp FROM recordings;" + +# Inspect specific response +cat recordings/responses/abc123.json | jq '.' +``` + +### Re-recording Tests +```bash +# Re-record specific tests +rm -rf recordings/ +LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/test_modified.py +``` + +## Writing Tests + +### Basic Test Pattern +```python +def test_basic_completion(llama_stack_client, text_model_id): + response = llama_stack_client.inference.completion( + model_id=text_model_id, + content=CompletionMessage(role="user", content="Hello"), + ) + + # Test structure, not AI output quality + assert response.completion_message is not None + assert isinstance(response.completion_message.content, str) + assert len(response.completion_message.content) > 0 +``` + +### Provider-Specific Tests +```python +def test_asymmetric_embeddings(llama_stack_client, embedding_model_id): + if embedding_model_id not in MODELS_SUPPORTING_TASK_TYPE: + pytest.skip(f"Model {embedding_model_id} doesn't support task types") + + query_response = llama_stack_client.inference.embeddings( + model_id=embedding_model_id, + contents=["What is machine learning?"], + task_type="query" + ) + + assert query_response.embeddings is not None +``` + +## Best Practices + +- **Test API contracts, not AI output quality** - Focus on response structure, not content +- **Use existing recordings for development** - Fast iteration without API costs +- **Record new interactions only when needed** - Adding new functionality +- **Test across providers** - Ensure compatibility +- **Commit recordings to version control** - Deterministic CI builds
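+
+For the first practice, prefer structural assertions over assertions on model output. A sketch of the two styles, following the same fixtures as the examples above:
+
+```python
+# Fragile: asserts on the model's wording, which varies between runs and providers
+def test_completion_quality(llama_stack_client, text_model_id):
+    response = llama_stack_client.inference.completion(
+        model_id=text_model_id,
+        content=CompletionMessage(role="user", content="What is 2 + 2?"),
+    )
+    assert "4" in response.completion_message.content  # breaks when phrasing changes
+
+
+# Robust: asserts on the structure the API contract guarantees
+def test_completion_structure(llama_stack_client, text_model_id):
+    response = llama_stack_client.inference.completion(
+        model_id=text_model_id,
+        content=CompletionMessage(role="user", content="What is 2 + 2?"),
+    )
+    assert response.completion_message is not None
+    assert isinstance(response.completion_message.content, str)
+    assert len(response.completion_message.content) > 0
+```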