Merge 053ca90ce6 into 81ecaf6221

Commit 08383c614d

5 changed files with 434 additions and 26 deletions
@@ -4,11 +4,11 @@

## Adding a New Provider

See the [Adding a New API Provider Page](new_api_provider.md) which describes how to add new API providers to the Stack.

See:

- [Adding a New API Provider Page](new_api_provider.md) which describes how to add new API providers to the Stack.
- [Vector Database Page](new_vector_database.md) which describes how to add a new vector database with Llama Stack.
- [External Provider Page](../providers/external/index.md) which describes how to add external providers to the Stack.

See the [Vector Database Page](new_vector_database.md) which describes how to add a new vector database with Llama Stack.

See the [External Provider Page](../providers/external/index.md) which describes how to add external providers to the Stack.

```{toctree}
:maxdepth: 1
:hidden:
@@ -19,11 +19,16 @@ new_vector_database

## Testing

See the [Test Page](testing.md) which describes how to test your changes.

```{include} ../../../tests/README.md
```

### Advanced Topics

For developers who need a deeper understanding of the testing system internals:

```{toctree}
:maxdepth: 1
:hidden:
:caption: Testing

testing
testing/record-replay
```

@@ -1,8 +0,0 @@

```{include} ../../../tests/README.md
```

```{include} ../../../tests/unit/README.md
```

```{include} ../../../tests/integration/README.md
```

docs/source/contributing/testing/record-replay.md (new file, 234 lines)

@@ -0,0 +1,234 @@

# Record-Replay System

Understanding how Llama Stack captures and replays API interactions for testing.

## Overview

The record-replay system solves a fundamental challenge in AI testing: how do you test against expensive, non-deterministic APIs without breaking the bank or dealing with flaky tests?

The solution: intercept API calls, store real responses, and replay them later. This gives you real API behavior without the cost or variability.

## How It Works

### Request Hashing

Every API request gets converted to a deterministic hash for lookup:

```python
def normalize_request(method: str, url: str, headers: dict, body: dict) -> str:
    normalized = {
        "method": method.upper(),
        "endpoint": urlparse(url).path,  # Just the path, not full URL
        "body": body,  # Request parameters
    }
    return hashlib.sha256(json.dumps(normalized, sort_keys=True).encode()).hexdigest()
```

**Key insight:** The hashing is intentionally precise. Different whitespace, float precision, or parameter order produces different hashes. This prevents subtle bugs from false cache hits.

```python
# These produce DIFFERENT hashes:
{"content": "Hello world"}
{"content": "Hello world\n"}
{"temperature": 0.7}
{"temperature": 0.7000001}
```

### Client Interception

The system patches OpenAI and Ollama client methods to intercept calls before they leave your application. This happens transparently; your test code doesn't change.
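
A minimal sketch of what such a patch can look like, reusing the `normalize_request` helper above. The `_handle_request` dispatcher is a hypothetical stand-in for the mode-specific record/replay logic, not the actual llama-stack internals:

```python
# Sketch only: assumes _handle_request(request_hash, kwargs, call_real_api) applies
# the active mode (live / record / replay). Not the real llama-stack implementation.
from openai.resources.chat import completions as chat_completions

_original_create = chat_completions.AsyncCompletions.create


async def _patched_create(self, *args, **kwargs):
    # Same signature as the real method, so calling test code is unaffected.
    request_hash = normalize_request("POST", "/v1/chat/completions", {}, kwargs)
    return await _handle_request(
        request_hash, kwargs, lambda: _original_create(self, *args, **kwargs)
    )


# Installing the patch: every chat completion now flows through the wrapper.
chat_completions.AsyncCompletions.create = _patched_create
```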

### Storage Architecture

Recordings use a two-tier storage system optimized for both speed and debuggability:

```
recordings/
├── index.sqlite           # Fast lookup by request hash
└── responses/
    ├── abc123def456.json  # Individual response files
    └── def789ghi012.json
```

**SQLite index** enables O(log n) hash lookups and metadata queries without loading response bodies.

**JSON files** store complete request/response pairs in human-readable format for debugging.
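
As a rough sketch of the write path for this layout (the `recordings` table, its columns, and the file-naming scheme here are illustrative assumptions, not the actual schema):

```python
import json
import sqlite3
from pathlib import Path


def store_recording(storage_dir: str, request_hash: str, request: dict, response: dict) -> None:
    # Human-readable JSON file with the full request/response pair (assumed naming scheme)
    responses_dir = Path(storage_dir) / "responses"
    responses_dir.mkdir(parents=True, exist_ok=True)
    response_file = f"{request_hash[:12]}.json"
    (responses_dir / response_file).write_text(
        json.dumps({"request": request, "response": response}, indent=2)
    )

    # Small SQLite row so replay can find the file without reading any bodies
    with sqlite3.connect(Path(storage_dir) / "index.sqlite") as index:
        index.execute(
            "CREATE TABLE IF NOT EXISTS recordings (request_hash TEXT PRIMARY KEY, response_file TEXT)"
        )
        index.execute(
            "INSERT OR REPLACE INTO recordings VALUES (?, ?)",
            (request_hash, response_file),
        )
```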

## Recording Modes

### LIVE Mode

Direct API calls with no recording or replay:

```python
with inference_recording(mode=InferenceMode.LIVE):
    response = await client.chat.completions.create(...)
```

Use for initial development and debugging against real APIs.

### RECORD Mode

Captures API interactions while passing through real responses:

```python
with inference_recording(mode=InferenceMode.RECORD, storage_dir="./recordings"):
    response = await client.chat.completions.create(...)
    # Real API call made, response captured AND returned
```

The recording process (sketched in code below):
1. Request intercepted and hashed
2. Real API call executed
3. Response captured and serialized
4. Recording stored to disk
5. Original response returned to caller
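
Put together as an illustrative sketch, where step 1 happens in the interception layer, `store_recording` is the assumed helper from the storage section above, and `_serialize_response` appears under Serialization below:

```python
# Non-streaming record path, mapped to the numbered steps above (sketch only).
async def record_and_return(storage_dir, request_hash, request_data, call_real_api):
    response = await call_real_api()            # 2. real API call executed
    serialized = _serialize_response(response)  # 3. response captured and serialized
    store_recording(storage_dir, request_hash, request_data, serialized)  # 4. stored to disk
    return response                             # 5. original response returned to caller
```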

### REPLAY Mode

Returns stored responses instead of making API calls:

```python
with inference_recording(mode=InferenceMode.REPLAY, storage_dir="./recordings"):
    response = await client.chat.completions.create(...)
    # No API call made, cached response returned instantly
```

The replay process (sketched in code below):
1. Request intercepted and hashed
2. Hash looked up in SQLite index
3. Response loaded from JSON file
4. Response deserialized and returned
5. Error if no recording found
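
Continuing the illustrative schema from the storage sketch earlier (again an assumption, not the actual code), the replay-side lookup might look like this; note that a missing recording is a hard error rather than a silent fallback to a live call:

```python
import json
import sqlite3
from pathlib import Path


def load_recording(storage_dir: str, request_hash: str) -> dict:
    with sqlite3.connect(Path(storage_dir) / "index.sqlite") as index:
        row = index.execute(
            "SELECT response_file FROM recordings WHERE request_hash = ?",
            (request_hash,),
        ).fetchone()
    if row is None:
        # Step 5: fail loudly so the test author knows a re-record is needed
        raise RuntimeError(f"No recording found for request hash {request_hash}")
    with open(Path(storage_dir) / "responses" / row[0]) as f:
        return json.load(f)["response"]
```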

## Streaming Support

Streaming APIs present a unique challenge: how do you capture an async generator?

### The Problem

```python
# How do you record this?
async for chunk in await client.chat.completions.create(stream=True):
    process(chunk)
```

### The Solution

The system captures the complete stream before yielding any chunks:

```python
async def handle_streaming_record(response):
    # Capture complete stream first
    chunks = []
    async for chunk in response:
        chunks.append(chunk)

    # Store complete recording
    storage.store_recording(request_hash, request_data, {
        "body": chunks,
        "is_streaming": True
    })

    # Return generator that replays captured chunks
    async def replay_stream():
        for chunk in chunks:
            yield chunk
    return replay_stream()
```

This ensures:
- **Complete capture** - The entire stream is saved atomically
- **Interface preservation** - The returned object behaves like the original API
- **Deterministic replay** - Same chunks in the same order every time

## Serialization

API responses contain complex Pydantic objects that need careful serialization:

```python
def _serialize_response(response):
    if hasattr(response, "model_dump"):
        # Preserve type information for proper deserialization
        return {
            "__type__": f"{response.__class__.__module__}.{response.__class__.__qualname__}",
            "__data__": response.model_dump(mode="json")
        }
    return response
```

This preserves type safety: when replayed, you get the same Pydantic objects with all their validation and methods.
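
A possible deserialization counterpart, shown only as a sketch (the helper name and the use of `importlib` here are assumptions for illustration):

```python
import importlib


def _deserialize_response(data):
    if isinstance(data, dict) and "__type__" in data:
        module_name, _, class_name = data["__type__"].rpartition(".")
        cls = getattr(importlib.import_module(module_name), class_name)
        # Re-validate into the original Pydantic model, restoring its methods and type checks
        return cls.model_validate(data["__data__"])
    return data
```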

## Environment Integration

### Environment Variables

Control recording behavior globally:

```bash
export LLAMA_STACK_TEST_INFERENCE_MODE=replay
export LLAMA_STACK_TEST_RECORDING_DIR=/path/to/recordings
pytest tests/integration/
```

### Pytest Integration

The system integrates automatically based on environment variables, requiring no changes to test code.
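
One way this kind of wiring can be sketched is an autouse fixture in `conftest.py` that reads the environment variables; this is an illustration under stated assumptions, not the project's actual fixture:

```python
# conftest.py (sketch): route every test through inference_recording() based on env vars.
import os

import pytest


@pytest.fixture(autouse=True)
def _inference_recording_mode():
    mode = os.environ.get("LLAMA_STACK_TEST_INFERENCE_MODE", "live")
    storage_dir = os.environ.get("LLAMA_STACK_TEST_RECORDING_DIR", "recordings")
    # Assumes InferenceMode accepts the lowercase values used above ("live", "record", "replay")
    with inference_recording(mode=InferenceMode(mode), storage_dir=storage_dir):
        yield
```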

## Debugging Recordings

### Inspecting Storage

```bash
# See what's recorded
sqlite3 recordings/index.sqlite "SELECT endpoint, model, timestamp FROM recordings LIMIT 10;"

# View specific response
cat recordings/responses/abc123def456.json | jq '.response.body'

# Find recordings by endpoint
sqlite3 recordings/index.sqlite "SELECT * FROM recordings WHERE endpoint='/v1/chat/completions';"
```

### Common Issues

**Hash mismatches:** Request parameters changed slightly between record and replay
```bash
# Compare request details
cat recordings/responses/abc123.json | jq '.request'
```

**Serialization errors:** Response types changed between versions
```bash
# Re-record with updated types
rm recordings/responses/failing_hash.json
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_failing.py
```

**Missing recordings:** New test or changed parameters
```bash
# Record the missing interaction
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_new.py
```

## Design Decisions

### Why Not Mocks?

Traditional mocking breaks down with AI APIs because:
- Response structures are complex and evolve frequently
- Streaming behavior is hard to mock correctly
- Edge cases in real APIs get missed
- Mocks become brittle maintenance burdens

### Why Precise Hashing?

Loose hashing (normalizing whitespace, rounding floats) seems convenient but hides bugs. If a test changes slightly, you want to know about it rather than accidentally getting the wrong cached response.

### Why JSON + SQLite?

- **JSON** - Human readable, diff-friendly, easy to inspect and modify
- **SQLite** - Fast indexed lookups without loading response bodies
- **Hybrid** - Best of both worlds for different use cases

This system provides reliable, fast testing against real AI APIs while maintaining the ability to debug issues when they arise.

@@ -1,9 +1,86 @@

# Llama Stack Tests

There are two obvious types of tests:

Llama Stack has multiple layers of testing to ensure continuous functionality and prevent regressions in the codebase.

| Type | Location | Purpose |
|------|----------|---------|
| **Unit** | [`tests/unit/`](unit/README.md) | Fast, isolated component testing |
| **Integration** | [`tests/integration/`](integration/README.md) | End-to-end workflows with record-replay |

| Testing Type | Details |
|--------------|---------|
| Unit | [unit/README.md](unit/README.md) |
| Integration | [integration/README.md](integration/README.md) |
| Verification | [verifications/README.md](verifications/README.md) |

Both have their place. For unit tests, it is important to create minimal mocks and instead rely more on "fakes". Mocks are too brittle. In either case, tests must be very fast and reliable.

### Record-replay for integration tests

Testing AI applications end-to-end creates some challenges:
- **API costs** accumulate quickly during development and CI
- **Non-deterministic responses** make tests unreliable
- **Multiple providers** require testing the same logic across different APIs

Our solution: **Record real API responses once, replay them for fast, deterministic tests.** This is better than mocking because AI APIs have complex response structures and streaming behavior. Mocks can miss edge cases that real APIs exhibit. A single test can exercise the underlying APIs in multiple complex ways, making it very hard to mock.

This gives you:
- Cost control - No repeated API calls during development
- Speed - Instant test execution with cached responses
- Reliability - Consistent results regardless of external service state
- Provider coverage - Same tests work across OpenAI, Anthropic, local models, etc.

### Testing Quick Start

You can run the unit tests with:
```bash
uv run --group unit pytest -sv tests/unit/
```

For running integration tests, you must provide a few things:

- A stack config. This is a pointer to a stack. You have a few ways to point to a stack:
  - **`server:<config>`** - automatically start a server with the given config (e.g., `server:starter`). This provides one-step testing by auto-starting the server if the port is available, or reusing an existing server if already running.
  - **`server:<config>:<port>`** - same as above but with a custom port (e.g., `server:starter:8322`)
  - a URL which points to a Llama Stack distribution server
  - a distribution name (e.g., `starter`) or a path to a `run.yaml` file
  - a comma-separated list of api=provider pairs, e.g. `inference=fireworks,safety=llama-guard,agents=meta-reference`. This is most useful for testing a single API surface.

- Whether you are using replay or live mode for inference. This is specified with the `LLAMA_STACK_TEST_INFERENCE_MODE` environment variable. The default mode currently is "live" -- that is certainly surprising, but we will fix this soon.

- Any API keys you need to use should be set in the environment, or can be passed in with the `--env` option.

You can run the integration tests in replay mode with:
```bash
# Run all tests with existing recordings
LLAMA_STACK_TEST_INFERENCE_MODE=replay \
  LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings \
  uv run --group test \
  pytest -sv tests/integration/ --stack-config=starter
```

If you don't specify `LLAMA_STACK_TEST_INFERENCE_MODE`, by default it will be in "live" mode -- that is, it will make real API calls.

```bash
# Test against live APIs
FIREWORKS_API_KEY=your_key pytest -sv tests/integration/inference --stack-config=starter
```

### Re-recording tests

If you want to re-record tests, you can do so with:

```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record \
  LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings \
  uv run --group test \
  pytest -sv tests/integration/ --stack-config=starter -k "<appropriate test name>"
```

This will record new API responses and overwrite the existing recordings.

```{warning}
You must be careful when re-recording. CI workflows assume a specific setup for running the replay-mode tests. You must re-record the tests in the same way as the CI workflows. This means:
- you need Ollama running and serving some specific models.
- you are using the `starter` distribution.
```

### Next Steps

- [Integration Testing Guide](integration/README.md) - Detailed usage and configuration
- [Unit Testing Guide](unit/README.md) - Fast component testing

@@ -1,6 +1,23 @@

# Llama Stack Integration Tests
# Integration Testing Guide

We use `pytest` for parameterizing and running tests. You can see all options with:

Integration tests verify complete workflows across different providers using Llama Stack's record-replay system.

## Quick Start

```bash
# Run all integration tests with existing recordings
uv run pytest tests/integration/

# Test against live APIs with auto-server
export FIREWORKS_API_KEY=your_key
pytest tests/integration/inference/ \
  --stack-config=server:fireworks \
  --text-model=meta-llama/Llama-3.1-8B-Instruct
```

## Configuration Options

You can see all options with:
```bash
cd tests/integration

@@ -114,3 +131,86 @@ pytest -s -v tests/integration/vector_io/ \
  --stack-config=inference=sentence-transformers,vector_io=sqlite-vec \
  --embedding-model=$EMBEDDING_MODELS
```

## Recording Modes

The testing system supports three modes controlled by environment variables:

### LIVE Mode (Default)
Tests make real API calls:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=live pytest tests/integration/
```

### RECORD Mode
Captures API interactions for later replay:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record \
  LLAMA_STACK_TEST_RECORDING_DIR=./recordings \
  pytest tests/integration/inference/test_new_feature.py
```

### REPLAY Mode
Uses cached responses instead of making API calls:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=replay \
  LLAMA_STACK_TEST_RECORDING_DIR=./recordings \
  pytest tests/integration/
```

## Managing Recordings

### Viewing Recordings
```bash
# See what's recorded
sqlite3 recordings/index.sqlite "SELECT endpoint, model, timestamp FROM recordings;"

# Inspect specific response
cat recordings/responses/abc123.json | jq '.'
```

### Re-recording Tests
```bash
# Re-record specific tests
rm -rf recordings/
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/test_modified.py
```

## Writing Tests

### Basic Test Pattern
```python
def test_basic_completion(llama_stack_client, text_model_id):
    response = llama_stack_client.inference.completion(
        model_id=text_model_id,
        content=CompletionMessage(role="user", content="Hello"),
    )

    # Test structure, not AI output quality
    assert response.completion_message is not None
    assert isinstance(response.completion_message.content, str)
    assert len(response.completion_message.content) > 0
```

### Provider-Specific Tests
```python
def test_asymmetric_embeddings(llama_stack_client, embedding_model_id):
    if embedding_model_id not in MODELS_SUPPORTING_TASK_TYPE:
        pytest.skip(f"Model {embedding_model_id} doesn't support task types")

    query_response = llama_stack_client.inference.embeddings(
        model_id=embedding_model_id,
        contents=["What is machine learning?"],
        task_type="query"
    )

    assert query_response.embeddings is not None
```

## Best Practices

- **Test API contracts, not AI output quality** - Focus on response structure, not content
- **Use existing recordings for development** - Fast iteration without API costs
- **Record new interactions only when needed** - Adding new functionality
- **Test across providers** - Ensure compatibility
- **Commit recordings to version control** - Deterministic CI builds