mirror of
https://github.com/meta-llama/llama-stack.git
synced 2025-08-16 06:27:58 +00:00
docs(tests): Add a bunch of documentation for our testing systems (#3139)
# What does this PR do?

Creates a structured testing documentation section with multiple detailed pages:

- Testing overview explaining the record-replay architecture
- Integration testing guide with practical usage examples
- Record-replay system technical documentation
- Guide for writing effective tests
- Troubleshooting guide for common testing issues

Hopefully this makes things a bit easier.
This commit is contained in:
parent
81ecaf6221
commit
f66ae3b3b1
5 changed files with 456 additions and 89 deletions
|
@ -4,11 +4,11 @@
|
|||
|
||||
## Adding a New Provider
|
||||
|
||||
See the [Adding a New API Provider Page](new_api_provider.md) which describes how to add new API providers to the Stack.
|
||||
See:
|
||||
- [Adding a New API Provider Page](new_api_provider.md) which describes how to add new API providers to the Stack.
|
||||
- [Vector Database Page](new_vector_database.md) which describes how to add a new vector database with Llama Stack.
|
||||
- [External Provider Page](../providers/external/index.md) which describes how to add external providers to the Stack.
|
||||
|
||||
See the [Vector Database Page](new_vector_database.md) which describes how to add a new vector database with Llama Stack.
|
||||
|
||||
See the [External Provider Page](../providers/external/index.md) which describes how to add external providers to the Stack.
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
:hidden:
|
||||
|
@ -19,11 +19,16 @@ new_vector_database
|
|||
|
||||
## Testing
|
||||
|
||||
See the [Test Page](testing.md) which describes how to test your changes.
|
||||
|
||||
```{include} ../../../tests/README.md
|
||||
```
|
||||
|
||||
### Advanced Topics
|
||||
|
||||
For developers who need deeper understanding of the testing system internals:
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
:hidden:
|
||||
:caption: Testing
|
||||
|
||||
testing
|
||||
```
|
||||
testing/record-replay
|
||||
```
|
||||
|
|
|
@ -1,8 +0,0 @@
|
|||
```{include} ../../../tests/README.md
|
||||
```
|
||||
|
||||
```{include} ../../../tests/unit/README.md
|
||||
```
|
||||
|
||||
```{include} ../../../tests/integration/README.md
|
||||
```
|
234
docs/source/contributing/testing/record-replay.md
Normal file
|
@ -0,0 +1,234 @@
|
|||
# Record-Replay System
|
||||
|
||||
Understanding how Llama Stack captures and replays API interactions for testing.
|
||||
|
||||
## Overview
|
||||
|
||||
The record-replay system solves a fundamental challenge in AI testing: how do you test against expensive, non-deterministic APIs without breaking the bank or dealing with flaky tests?
|
||||
|
||||
The solution: intercept API calls, store real responses, and replay them later. This gives you real API behavior without the cost or variability.
|
||||
|
||||
## How It Works
|
||||
|
||||
### Request Hashing
|
||||
|
||||
Every API request gets converted to a deterministic hash for lookup:
|
||||
|
||||
```python
import hashlib
import json
from urllib.parse import urlparse


def normalize_request(method: str, url: str, headers: dict, body: dict) -> str:
    # `headers` is accepted but is not part of the hash input.
    normalized = {
        "method": method.upper(),
        "endpoint": urlparse(url).path,  # Just the path, not full URL
        "body": body,  # Request parameters
    }
    return hashlib.sha256(json.dumps(normalized, sort_keys=True).encode()).hexdigest()
```
|
||||
|
||||
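**Key insight:** The hashing is intentionally precise. Any difference in whitespace or float precision produces a different hash, which prevents subtle bugs from false cache hits. Key order within a JSON object is the one exception: `sort_keys=True` canonicalizes it, so reordering parameters does not change the hash.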
|
||||
|
||||
```python
# These produce DIFFERENT hashes:
{"content": "Hello world"}
{"content": "Hello world\n"}
{"temperature": 0.7}
{"temperature": 0.7000001}
```
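
To see this sensitivity concretely, the sketch below reuses the hashing logic of `normalize_request` (without the unused `headers` argument) and compares hashes for near-identical requests. The helper name `request_hash`, the URLs, and the request bodies are made up for illustration. Note that dictionary key order and the host name do not affect the result, because `json.dumps` is called with `sort_keys=True` and only the URL path is hashed.

```python
import hashlib
import json
from urllib.parse import urlparse


def request_hash(method: str, url: str, body: dict) -> str:
    """Hash a request using the same fields as normalize_request."""
    normalized = {
        "method": method.upper(),
        "endpoint": urlparse(url).path,
        "body": body,
    }
    return hashlib.sha256(json.dumps(normalized, sort_keys=True).encode()).hexdigest()


# Hypothetical endpoints, used only for this demonstration.
URL = "http://localhost:11434/v1/chat/completions"
OTHER_HOST = "https://api.example.com/v1/chat/completions"

base = request_hash("POST", URL, {"content": "Hello world", "temperature": 0.7})
trailing_newline = request_hash("POST", URL, {"content": "Hello world\n", "temperature": 0.7})
float_drift = request_hash("POST", URL, {"content": "Hello world", "temperature": 0.7000001})
reordered_keys = request_hash("POST", URL, {"temperature": 0.7, "content": "Hello world"})
other_host = request_hash("POST", OTHER_HOST, {"content": "Hello world", "temperature": 0.7})

print("trailing newline changes hash:", base != trailing_newline)
print("float drift changes hash:", base != float_drift)
print("key order changes hash:", base != reordered_keys)
print("host changes hash:", base != other_host)
```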
|
||||
|
||||
### Client Interception
|
||||
|
||||
The system patches OpenAI and Ollama client methods to intercept calls before they leave your application. This happens transparently - your test code doesn't change.
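
A minimal sketch of the patching idea, assuming a stand-in `FakeClient` class rather than the real OpenAI or Ollama clients. The names `FakeClient`, `patch_method`, and `calls` are hypothetical and only illustrate how a method can be wrapped without touching the calling code.

```python
import asyncio
import functools


class FakeClient:
    """Stand-in for an SDK client; not a real OpenAI or Ollama class."""

    async def create(self, **kwargs):
        return {"echo": kwargs}


calls = []  # every intercepted request is appended here


def patch_method(cls, name):
    """Replace cls.<name> with a wrapper that sees each call first."""
    original = getattr(cls, name)

    @functools.wraps(original)
    async def wrapper(self, *args, **kwargs):
        calls.append({"method": name, "kwargs": kwargs})  # interception point
        return await original(self, *args, **kwargs)  # pass through to the original

    setattr(cls, name, wrapper)
    return original  # keep a handle so the patch can be undone


original_create = patch_method(FakeClient, "create")

# The calling code is identical with or without the patch.
result = asyncio.run(FakeClient().create(model="demo", prompt="hi"))

# Restore the unpatched method once done.
FakeClient.create = original_create
```

Because the wrapper is installed on the class, every instance created by the test picks it up, which is what makes the interception invisible to test code.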
|
||||
|
||||
### Storage Architecture
|
||||
|
||||
Recordings use a two-tier storage system optimized for both speed and debuggability:
|
||||
|
||||
```
recordings/
├── index.sqlite          # Fast lookup by request hash
└── responses/
    ├── abc123def456.json  # Individual response files
    └── def789ghi012.json
```
|
||||
|
||||
**SQLite index** enables O(log n) hash lookups and metadata queries without loading response bodies.
|
||||
|
||||
**JSON files** store complete request/response pairs in human-readable format for debugging.
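
The two tiers can be sketched with the standard library alone. The table layout below (a `recordings` table with `request_hash`, `endpoint`, and `response_file` columns) and the helper names `open_index`, `store_recording`, and `find_recording` are assumptions for illustration; the real schema may differ.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def open_index(root: Path) -> sqlite3.Connection:
    """Create the directory layout and the (assumed) index table."""
    (root / "responses").mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(root / "index.sqlite")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recordings ("
        "request_hash TEXT PRIMARY KEY, endpoint TEXT, response_file TEXT)"
    )
    return conn


def store_recording(conn, root: Path, request_hash: str, request: dict, response: dict) -> None:
    """Write the full request/response pair as JSON and index it by hash."""
    file_name = f"{request_hash}.json"
    payload = {"request": request, "response": response}
    (root / "responses" / file_name).write_text(json.dumps(payload, indent=2))
    conn.execute(
        "INSERT OR REPLACE INTO recordings VALUES (?, ?, ?)",
        (request_hash, request["endpoint"], file_name),
    )
    conn.commit()


def find_recording(conn, root: Path, request_hash: str):
    """Look up the hash in SQLite, then load the body only on a hit."""
    row = conn.execute(
        "SELECT response_file FROM recordings WHERE request_hash = ?", (request_hash,)
    ).fetchone()
    if row is None:
        return None
    return json.loads((root / "responses" / row[0]).read_text())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "recordings"
    conn = open_index(root)
    store_recording(conn, root, "abc123def456", {"endpoint": "/v1/chat/completions"}, {"body": "hi"})
    hit = find_recording(conn, root, "abc123def456")
    miss = find_recording(conn, root, "does-not-exist")
    conn.close()
```

The primary key on `request_hash` is what gives the indexed lookup; a miss is decided from the index alone, without opening any JSON file.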
|
||||
|
||||
## Recording Modes
|
||||
|
||||
### LIVE Mode
|
||||
|
||||
Direct API calls with no recording or replay:
|
||||
|
||||
```python
with inference_recording(mode=InferenceMode.LIVE):
    response = await client.chat.completions.create(...)
```
|
||||
|
||||
Use for initial development and debugging against real APIs.
|
||||
|
||||
### RECORD Mode
|
||||
|
||||
Captures API interactions while passing through real responses:
|
||||
|
||||
```python
with inference_recording(mode=InferenceMode.RECORD, storage_dir="./recordings"):
    response = await client.chat.completions.create(...)
    # Real API call made, response captured AND returned
```
|
||||
|
||||
The recording process:
|
||||
1. Request intercepted and hashed
|
||||
2. Real API call executed
|
||||
3. Response captured and serialized
|
||||
4. Recording stored to disk
|
||||
5. Original response returned to caller
|
||||
|
||||
### REPLAY Mode
|
||||
|
||||
Returns stored responses instead of making API calls:
|
||||
|
||||
```python
with inference_recording(mode=InferenceMode.REPLAY, storage_dir="./recordings"):
    response = await client.chat.completions.create(...)
    # No API call made, cached response returned instantly
```
|
||||
|
||||
The replay process:
|
||||
1. Request intercepted and hashed
|
||||
2. Hash looked up in SQLite index
|
||||
3. Response loaded from JSON file
|
||||
4. Response deserialized and returned
|
||||
5. Error if no recording found
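
The recording and replay flows listed in this section can be condensed into a single dispatcher. This is a simplified sketch: it keeps recordings in an in-memory dict instead of SQLite and JSON files, and `handle_request`, `call_api`, and `RECORDINGS` are hypothetical names, not the project's API.

```python
import hashlib
import json

RECORDINGS = {}  # request hash -> stored response (stand-in for disk storage)


def request_hash(endpoint: str, body: dict) -> str:
    raw = json.dumps({"endpoint": endpoint, "body": body}, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()


def handle_request(mode: str, endpoint: str, body: dict, call_api):
    key = request_hash(endpoint, body)  # intercept and hash
    if mode == "record":
        response = call_api(endpoint, body)  # real call is executed
        RECORDINGS[key] = json.loads(json.dumps(response))  # serialize and store
        return response  # caller still gets the original response
    if mode == "replay":
        if key not in RECORDINGS:
            raise RuntimeError(f"No recording found for {endpoint} ({key[:12]})")
        return RECORDINGS[key]  # no API call is made
    return call_api(endpoint, body)  # live: no recording or replay


def fake_api(endpoint, body):
    """Stand-in for a real provider call."""
    return {"text": body["prompt"].upper()}


def forbidden_api(endpoint, body):
    """Fails loudly if replay mode ever reaches the network."""
    raise AssertionError("replay must not reach the API")


recorded = handle_request("record", "/v1/completions", {"prompt": "hi"}, fake_api)
replayed = handle_request("replay", "/v1/completions", {"prompt": "hi"}, forbidden_api)
```

Raising on a missing recording, rather than falling back to a live call, keeps replay runs from silently depending on the network.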
|
||||
|
||||
## Streaming Support
|
||||
|
||||
Streaming APIs present a unique challenge: how do you capture an async generator?
|
||||
|
||||
### The Problem
|
||||
|
||||
```python
# How do you record this?
stream = await client.chat.completions.create(..., stream=True)
async for chunk in stream:
    process(chunk)
```
|
||||
|
||||
### The Solution
|
||||
|
||||
The system buffers the entire stream, capturing every chunk before yielding any to the caller:
|
||||
|
||||
```python
async def handle_streaming_record(response):
    # Capture complete stream first
    chunks = []
    async for chunk in response:
        chunks.append(chunk)

    # Store complete recording
    storage.store_recording(
        request_hash, request_data, {"body": chunks, "is_streaming": True}
    )

    # Return generator that replays captured chunks
    async def replay_stream():
        for chunk in chunks:
            yield chunk

    return replay_stream()
```
|
||||
|
||||
This ensures:
|
||||
- **Complete capture** - The entire stream is saved atomically
|
||||
- **Interface preservation** - The returned object behaves like the original API
|
||||
- **Deterministic replay** - Same chunks in the same order every time
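
A runnable sketch of the same buffering approach, using a fake async generator in place of a real streaming response. `fake_stream`, `record_stream`, `replay`, and the `recorded` dict are illustrative names. One consequence is visible here: the caller receives no chunk until the upstream stream has finished.

```python
import asyncio

recorded = {}  # stand-in for on-disk storage


async def fake_stream():
    """Pretend provider stream that yields three chunks."""
    for piece in ["Hel", "lo", "!"]:
        await asyncio.sleep(0)  # simulate waiting on the network
        yield piece


async def record_stream(key, response):
    chunks = [chunk async for chunk in response]  # drain the stream first
    recorded[key] = {"body": chunks, "is_streaming": True}

    async def replay_stream():
        for chunk in chunks:
            yield chunk

    return replay_stream()


async def replay(key):
    """Rebuild an async generator from a stored recording."""
    for chunk in recorded[key]["body"]:
        yield chunk


async def main():
    live = await record_stream("demo", fake_stream())
    first = [chunk async for chunk in live]  # what the caller sees while recording
    second = [chunk async for chunk in replay("demo")]  # what a later replay sees
    return first, second


first, second = asyncio.run(main())
```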
|
||||
|
||||
## Serialization
|
||||
|
||||
API responses contain complex Pydantic objects that need careful serialization:
|
||||
|
||||
```python
def _serialize_response(response):
    if hasattr(response, "model_dump"):
        # Preserve type information for proper deserialization
        return {
            "__type__": f"{response.__class__.__module__}.{response.__class__.__qualname__}",
            "__data__": response.model_dump(mode="json"),
        }
    return response
```
|
||||
|
||||
This preserves type safety - when replayed, you get the same Pydantic objects with all their validation and methods.
|
||||
|
||||
## Environment Integration
|
||||
|
||||
### Environment Variables
|
||||
|
||||
Control recording behavior globally:
|
||||
|
||||
```bash
export LLAMA_STACK_TEST_INFERENCE_MODE=replay
export LLAMA_STACK_TEST_RECORDING_DIR=/path/to/recordings
pytest tests/integration/
```
|
||||
|
||||
### Pytest Integration
|
||||
|
||||
The system integrates automatically based on environment variables, requiring no changes to test code.
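
As an illustration of how such environment-driven configuration can work, the sketch below maps the two variables to a mode and a directory. The `InferenceMode` members mirror the modes described in this document; `resolve_settings` and the fallback values (`live` and `./recordings`) are assumptions, not documented defaults.

```python
import os
from enum import Enum
from pathlib import Path


class InferenceMode(str, Enum):
    LIVE = "live"
    RECORD = "record"
    REPLAY = "replay"


def resolve_settings(env=os.environ):
    """Read the mode and recording directory from the environment."""
    raw_mode = env.get("LLAMA_STACK_TEST_INFERENCE_MODE", "live")  # assumed default
    try:
        mode = InferenceMode(raw_mode.lower())
    except ValueError:
        valid = ", ".join(m.value for m in InferenceMode)
        raise ValueError(f"Unknown inference mode {raw_mode!r}; expected one of: {valid}") from None
    directory = Path(env.get("LLAMA_STACK_TEST_RECORDING_DIR", "./recordings"))  # assumed default
    return mode, directory


mode, directory = resolve_settings(
    {
        "LLAMA_STACK_TEST_INFERENCE_MODE": "replay",
        "LLAMA_STACK_TEST_RECORDING_DIR": "/path/to/recordings",
    }
)
```

Failing fast on an unrecognized mode avoids a test run that quietly hits live APIs because of a typo in the variable.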
|
||||
|
||||
## Debugging Recordings
|
||||
|
||||
### Inspecting Storage
|
||||
|
||||
```bash
# See what's recorded
sqlite3 recordings/index.sqlite "SELECT endpoint, model, timestamp FROM recordings LIMIT 10;"

# View specific response
cat recordings/responses/abc123def456.json | jq '.response.body'

# Find recordings by endpoint
sqlite3 recordings/index.sqlite "SELECT * FROM recordings WHERE endpoint='/v1/chat/completions';"
```
|
||||
|
||||
### Common Issues
|
||||
|
||||
**Hash mismatches:** Request parameters changed slightly between record and replay
|
||||
```bash
# Compare request details
cat recordings/responses/abc123.json | jq '.request'
```
|
||||
|
||||
**Serialization errors:** Response types changed between versions
|
||||
```bash
# Re-record with updated types
rm recordings/responses/failing_hash.json
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_failing.py
```
|
||||
|
||||
**Missing recordings:** New test or changed parameters
|
||||
```bash
# Record the missing interaction
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_new.py
```
|
||||
|
||||
## Design Decisions
|
||||
|
||||
### Why Not Mocks?
|
||||
|
||||
Traditional mocking breaks down with AI APIs because:
|
||||
- Response structures are complex and evolve frequently
|
||||
- Streaming behavior is hard to mock correctly
|
||||
- Edge cases in real APIs get missed
|
||||
- Mocks become brittle maintenance burdens
|
||||
|
||||
### Why Precise Hashing?
|
||||
|
||||
Loose hashing (normalizing whitespace, rounding floats) seems convenient but hides bugs. If a test changes slightly, you want to know about it rather than accidentally getting the wrong cached response.
|
||||
|
||||
### Why JSON + SQLite?
|
||||
|
||||
- **JSON** - Human readable, diff-friendly, easy to inspect and modify
|
||||
- **SQLite** - Fast indexed lookups without loading response bodies
|
||||
- **Hybrid** - Best of both worlds for different use cases
|
||||
|
||||
This system provides reliable, fast testing against real AI APIs while maintaining the ability to debug issues when they arise.
|