# Record-Replay System

Understanding how Llama Stack captures and replays API interactions for testing.

## Overview

The record-replay system solves a fundamental challenge in AI testing: how do you test against expensive, non-deterministic APIs without breaking the bank or dealing with flaky tests?

The solution: intercept API calls, store real responses, and replay them later. This gives you real API behavior without the cost or variability.

## How It Works

### Request Hashing

Every API request gets converted to a deterministic hash for lookup:

```python
import hashlib
import json
from urllib.parse import urlparse


def normalize_request(method: str, url: str, headers: dict, body: dict) -> str:
    normalized = {
        "method": method.upper(),
        "endpoint": urlparse(url).path,  # Just the path, not the full URL
        "body": body,  # Request parameters (headers are not part of the hash)
    }
    return hashlib.sha256(json.dumps(normalized, sort_keys=True).encode()).hexdigest()
```

**Key insight:** The hashing is intentionally precise. Different whitespace, float precision, or parameter order produces different hashes. This prevents subtle bugs from false cache hits.

```python
# These produce DIFFERENT hashes:
{"content": "Hello world"}
{"content": "Hello world\n"}

{"temperature": 0.7}
{"temperature": 0.7000001}
```

### Client Interception

The system patches OpenAI and Ollama client methods to intercept calls before they leave your application. This happens transparently - your test code doesn't change.

### Storage Architecture

Recordings use a two-tier storage system optimized for both speed and debuggability:

```
recordings/
├── index.sqlite           # Fast lookup by request hash
└── responses/
    ├── abc123def456.json  # Individual response files
    └── def789ghi012.json
```

**SQLite index** enables O(log n) hash lookups and metadata queries without loading response bodies.

**JSON files** store complete request/response pairs in human-readable format for debugging.

## Recording Modes

### LIVE Mode

Direct API calls with no recording or replay:

```python
with inference_recording(mode=InferenceMode.LIVE):
    response = await client.chat.completions.create(...)
```

Use this mode for initial development and debugging against real APIs.

### RECORD Mode

Captures API interactions while passing through real responses:

```python
with inference_recording(mode=InferenceMode.RECORD, storage_dir="./recordings"):
    response = await client.chat.completions.create(...)
    # Real API call made, response captured AND returned
```

The recording process:

1. Request intercepted and hashed
2. Real API call executed
3. Response captured and serialized
4. Recording stored to disk
5. Original response returned to caller

### REPLAY Mode

Returns stored responses instead of making API calls:

```python
with inference_recording(mode=InferenceMode.REPLAY, storage_dir="./recordings"):
    response = await client.chat.completions.create(...)
    # No API call made, cached response returned instantly
```

The replay process:

1. Request intercepted and hashed
2. Hash looked up in SQLite index
3. Response loaded from JSON file
4. Response deserialized and returned
5. Error raised if no recording is found
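The replay steps above can be made concrete with a small sketch of a lookup against the two-tier store. This is a minimal illustration under stated assumptions, not the actual Llama Stack code: the `ReplayStore` class and the `request_hash`/`response_file` index columns are invented for illustration, and the real system additionally rebuilds the original Pydantic types as described under Serialization below.

```python
import hashlib
import json
import sqlite3
from pathlib import Path
from urllib.parse import urlparse


class RecordingNotFoundError(KeyError):
    """Raised in REPLAY mode when no stored response matches a request."""


class ReplayStore:
    """Illustrative (hypothetical) lookup against the two-tier store."""

    def __init__(self, storage_dir: str):
        self.storage_dir = Path(storage_dir)
        self.index = sqlite3.connect(str(self.storage_dir / "index.sqlite"))

    def _hash_request(self, method: str, url: str, body: dict) -> str:
        # Same precise hashing scheme as normalize_request() above.
        normalized = {
            "method": method.upper(),
            "endpoint": urlparse(url).path,
            "body": body,
        }
        return hashlib.sha256(
            json.dumps(normalized, sort_keys=True).encode()
        ).hexdigest()

    def lookup(self, method: str, url: str, body: dict) -> dict:
        request_hash = self._hash_request(method, url, body)

        # Check the SQLite index first; no response body is loaded yet.
        # The request_hash/response_file column names are assumptions.
        row = self.index.execute(
            "SELECT response_file FROM recordings WHERE request_hash = ?",
            (request_hash,),
        ).fetchone()
        if row is None:
            # Fail loudly instead of silently falling back to a live call.
            raise RecordingNotFoundError(
                f"No recording for hash {request_hash}; re-run in RECORD mode."
            )

        # Load the JSON file and return the stored response body.
        with open(self.storage_dir / "responses" / row[0]) as f:
            return json.load(f)["response"]["body"]
```

The important property is the failure mode: a missing recording raises immediately rather than quietly issuing a real API call.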
## Streaming Support

Streaming APIs present a unique challenge: how do you capture an async generator?

### The Problem

```python
# How do you record this?
async for chunk in client.chat.completions.create(stream=True):
    process(chunk)
```

### The Solution

The system captures all chunks before yielding any:

```python
async def handle_streaming_record(response):
    # Capture complete stream first
    chunks = []
    async for chunk in response:
        chunks.append(chunk)

    # Store complete recording
    storage.store_recording(
        request_hash, request_data, {"body": chunks, "is_streaming": True}
    )

    # Return generator that replays captured chunks
    async def replay_stream():
        for chunk in chunks:
            yield chunk

    return replay_stream()
```

This ensures:

- **Complete capture** - The entire stream is saved atomically
- **Interface preservation** - The returned object behaves like the original API
- **Deterministic replay** - Same chunks in the same order every time

## Serialization

API responses contain complex Pydantic objects that need careful serialization:

```python
def _serialize_response(response):
    if hasattr(response, "model_dump"):
        # Preserve type information for proper deserialization
        return {
            "__type__": f"{response.__class__.__module__}.{response.__class__.__qualname__}",
            "__data__": response.model_dump(mode="json"),
        }
    return response
```

This preserves type safety - when replayed, you get the same Pydantic objects with all their validation and methods.

## Environment Integration

### Environment Variables

Control recording behavior globally:

```bash
export LLAMA_STACK_TEST_INFERENCE_MODE=replay
export LLAMA_STACK_TEST_RECORDING_DIR=/path/to/recordings
pytest tests/integration/
```

### Pytest Integration

The system integrates automatically based on environment variables, requiring no changes to test code.

## Debugging Recordings

### Inspecting Storage

```bash
# See what's recorded
sqlite3 recordings/index.sqlite "SELECT endpoint, model, timestamp FROM recordings LIMIT 10;"

# View specific response
cat recordings/responses/abc123def456.json | jq '.response.body'

# Find recordings by endpoint
sqlite3 recordings/index.sqlite "SELECT * FROM recordings WHERE endpoint='/v1/chat/completions';"
```

### Common Issues

**Hash mismatches:** Request parameters changed slightly between record and replay

```bash
# Compare request details
cat recordings/responses/abc123.json | jq '.request'
```

**Serialization errors:** Response types changed between versions

```bash
# Re-record with updated types
rm recordings/responses/failing_hash.json
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_failing.py
```

**Missing recordings:** New test or changed parameters

```bash
# Record the missing interaction
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_new.py
```

## Design Decisions

### Why Not Mocks?

Traditional mocking breaks down with AI APIs because:

- Response structures are complex and evolve frequently
- Streaming behavior is hard to mock correctly
- Edge cases in real APIs get missed
- Mocks become brittle maintenance burdens

### Why Precise Hashing?

Loose hashing (normalizing whitespace, rounding floats) seems convenient but hides bugs. If a test changes slightly, you want to know about it rather than accidentally getting the wrong cached response.

### Why JSON + SQLite?

- **JSON** - Human readable, diff-friendly, easy to inspect and modify
- **SQLite** - Fast indexed lookups without loading response bodies
- **Hybrid** - Best of both worlds for different use cases

This system provides reliable, fast testing against real AI APIs while maintaining the ability to debug issues when they arise.
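To make the JSON + SQLite trade-off concrete, here is a minimal sketch of what a record-side write under that hybrid layout could look like. The `store_recording` function, the table schema, and the `request_hash`/`response_file` columns are illustrative assumptions rather than the project's actual storage code; the `recordings` table and the `endpoint`, `model`, and `timestamp` columns mirror the queries shown in the debugging section above.

```python
import json
import sqlite3
from pathlib import Path


def store_recording(
    storage_dir: str, request_hash: str, request: dict, response_body: dict
) -> None:
    """Hypothetical record-side write: one index row plus one JSON file."""
    root = Path(storage_dir)
    (root / "responses").mkdir(parents=True, exist_ok=True)

    # JSON file: the complete request/response pair, human readable and
    # diff-friendly (what `jq '.request'` / `jq '.response.body'` inspect).
    response_file = f"{request_hash}.json"
    with open(root / "responses" / response_file, "w") as f:
        json.dump(
            {"request": request, "response": {"body": response_body}},
            f,
            indent=2,
            default=str,
        )

    # SQLite index: just enough metadata for fast lookups and filtering
    # without loading response bodies. Schema details are assumptions.
    with sqlite3.connect(str(root / "index.sqlite")) as db:
        db.execute(
            """CREATE TABLE IF NOT EXISTS recordings (
                   request_hash  TEXT PRIMARY KEY,
                   endpoint      TEXT,
                   model         TEXT,
                   response_file TEXT,
                   timestamp     TEXT DEFAULT CURRENT_TIMESTAMP
               )"""
        )
        db.execute(
            "INSERT OR REPLACE INTO recordings "
            "(request_hash, endpoint, model, response_file) VALUES (?, ?, ?, ?)",
            (request_hash, request.get("endpoint"), request.get("model"), response_file),
        )
```

Each format carries only what it is good at: the index row holds the metadata you filter on, while the JSON file holds the full pair you diff and debug.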