rewrote all slop

This commit is contained in:
Ashwin Bharambe 2025-08-14 16:51:13 -07:00
parent f4281ce66a
commit 1e2bbd08da
9 changed files with 452 additions and 930 deletions


@@ -4,11 +4,11 @@
## Adding a New Provider
See:
- [Adding a New API Provider Page](new_api_provider.md) which describes how to add new API providers to the Stack.
- [Vector Database Page](new_vector_database.md) which describes how to add a new vector database with Llama Stack.
- [External Provider Page](../providers/external/index.md) which describes how to add external providers to the Stack.
```{toctree}
:maxdepth: 1
:hidden:
@@ -19,12 +19,17 @@ new_vector_database
## Testing
Llama Stack uses a record-replay testing system for reliable, cost-effective testing. See the [Testing Documentation](testing.md) for comprehensive guides on writing and running tests.
```{include} ../../../tests/README.md
```
### Advanced Topics
For developers who need deeper understanding of the testing system internals:
```{toctree}
:maxdepth: 1
:hidden:
:caption: Testing
testing/record-replay
testing/troubleshooting
```


@@ -1,40 +0,0 @@
# Testing
Llama Stack uses a record-replay system for reliable, fast, and cost-effective testing of AI applications.
## Testing Documentation
```{toctree}
:maxdepth: 1
testing/index
testing/integration-testing
testing/record-replay
testing/writing-tests
testing/troubleshooting
```
## Quick Start
```bash
# Run tests with existing recordings
uv run pytest tests/integration/
# Test against live APIs
FIREWORKS_API_KEY=... pytest tests/integration/ --stack-config=server:fireworks
```
For detailed information, see the [Testing Overview](testing/index.md).
---
## Original Documentation
```{include} ../../../tests/README.md
```
```{include} ../../../tests/unit/README.md
```
```{include} ../../../tests/integration/README.md
```


@@ -1,103 +0,0 @@
# Testing in Llama Stack
Llama Stack uses a record-replay testing system to handle AI API costs, non-deterministic responses, and multiple provider integrations.
## Core Problems
Testing AI applications creates three challenges:
- **API costs** accumulate quickly during development and CI
- **Non-deterministic responses** make tests unreliable
- **Multiple providers** require testing the same logic across different APIs
## Solution
Record real API responses once, replay them for fast, deterministic tests.
## Architecture Overview
### Test Types
- **Unit tests** (`tests/unit/`) - Test components in isolation with mocks
- **Integration tests** (`tests/integration/`) - Test complete workflows with record-replay
### Core Components
#### Record-Replay System
Captures API calls and replays them deterministically:
```python
# Record real API responses
with inference_recording(mode=InferenceMode.RECORD, storage_dir="recordings"):
    response = await client.chat.completions.create(...)

# Replay cached responses
with inference_recording(mode=InferenceMode.REPLAY, storage_dir="recordings"):
    response = await client.chat.completions.create(...)  # No API call made
```
#### Provider Testing
Write tests once, run against any provider:
```bash
# Same test, different providers
pytest tests/integration/inference/ --stack-config=openai --text-model=gpt-4
pytest tests/integration/inference/ --stack-config=starter --text-model=llama3.2:3b
```
#### Test Parametrization
Generate test combinations from CLI arguments:
```bash
# Creates test for each model/provider combination
pytest tests/integration/ \
--stack-config=inference=fireworks \
--text-model=llama-3.1-8b,llama-3.1-70b
```
## How It Works
### Recording Storage
Recordings use SQLite for lookup and JSON for storage:
```
recordings/
├── index.sqlite # Fast lookup by request hash
└── responses/
├── abc123def456.json # Individual response files
└── def789ghi012.json
```
### Why Record-Replay?
Mocking AI APIs is brittle. Real API responses:
- Include edge cases and realistic data structures
- Preserve streaming behavior
- Can be inspected and debugged
### Why Test All Providers?
One test verifies behavior across all providers, catching integration bugs early.
## Workflow
1. **Develop tests** in `LIVE` mode against real APIs
2. **Record responses** with `RECORD` mode
3. **Commit recordings** for deterministic CI
4. **Tests replay** cached responses in CI
## Quick Start
```bash
# Run tests with existing recordings
uv run pytest tests/integration/
# Test against live APIs
FIREWORKS_API_KEY=... pytest tests/integration/ --stack-config=server:fireworks
```
See [Integration Testing](integration-testing.md) for usage details and [Record-Replay](record-replay.md) for system internals.


@@ -1,136 +0,0 @@
# Integration Testing Guide
Practical usage of Llama Stack's integration testing system.
## Basic Usage
```bash
# Run all integration tests
uv run pytest tests/integration/
# Run specific test suites
uv run pytest tests/integration/inference/
uv run pytest tests/integration/agents/
```
## Live API Testing
```bash
# Auto-start server
export FIREWORKS_API_KEY=your_key
pytest tests/integration/inference/ \
--stack-config=server:fireworks \
--text-model=meta-llama/Llama-3.1-8B-Instruct
# Library client
export TOGETHER_API_KEY=your_key
pytest tests/integration/inference/ \
--stack-config=starter \
--text-model=meta-llama/Llama-3.1-8B-Instruct
```
## Configuration
### Stack Config
```bash
--stack-config=server:fireworks # Auto-start server
--stack-config=server:together:8322 # Custom port
--stack-config=starter # Template
--stack-config=/path/to/run.yaml # Config file
--stack-config=inference=fireworks # Adhoc providers
--stack-config=http://localhost:5001 # Existing server
```
### Models
```bash
--text-model=meta-llama/Llama-3.1-8B-Instruct
--vision-model=meta-llama/Llama-3.2-11B-Vision-Instruct
--embedding-model=sentence-transformers/all-MiniLM-L6-v2
```
### Environment
```bash
--env FIREWORKS_API_KEY=your_key
--env OPENAI_BASE_URL=http://localhost:11434/v1
```
## Test Scenarios
### New Provider Testing
```bash
# Test new provider
pytest tests/integration/inference/ \
--stack-config=inference=your-new-provider \
--text-model=your-model-id
```
### Multiple Models
```bash
# Test multiple models
pytest tests/integration/inference/ \
--text-model=llama-3.1-8b,llama-3.1-70b
```
### Local Development
```bash
# Test with local Ollama
pytest tests/integration/inference/ \
--stack-config=starter \
--text-model=llama3.2:3b
```
## Recording Modes
```bash
# Live API calls (default)
LLAMA_STACK_TEST_INFERENCE_MODE=live pytest tests/integration/
# Record new responses
LLAMA_STACK_TEST_INFERENCE_MODE=record \
LLAMA_STACK_TEST_RECORDING_DIR=./recordings \
pytest tests/integration/inference/test_new.py
# Replay cached responses
LLAMA_STACK_TEST_INFERENCE_MODE=replay \
LLAMA_STACK_TEST_RECORDING_DIR=./recordings \
pytest tests/integration/
```
## Recording Management
```bash
# View recordings
sqlite3 recordings/index.sqlite "SELECT * FROM recordings;"
cat recordings/responses/abc123.json
# Re-record tests
rm -rf recordings/
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/test_specific.py
```
## Debugging
```bash
# Verbose output
pytest -vvs tests/integration/inference/
# Debug logging
LLAMA_STACK_LOG_LEVEL=DEBUG pytest tests/integration/test_failing.py
# Custom port for conflicts
pytest tests/integration/ --stack-config=server:fireworks:8322
```
## Best Practices
- Use existing recordings for development
- Record new interactions only when needed
- Test across multiple providers
- Use descriptive test names
- Commit recordings to version control


@@ -1,32 +1,46 @@
# Record-Replay System
Understanding how Llama Stack captures and replays API interactions for testing.
## Overview
The record-replay system solves a fundamental challenge in AI testing: how do you test against expensive, non-deterministic APIs without breaking the bank or dealing with flaky tests?
The solution: intercept API calls, store real responses, and replay them later. This gives you real API behavior without the cost or variability.
## How It Works
### Request Hashing
Every API request gets converted to a deterministic hash for lookup:
```python
import hashlib
import json
from urllib.parse import urlparse


def normalize_request(method: str, url: str, headers: dict, body: dict) -> str:
    normalized = {
        "method": method.upper(),
        "endpoint": urlparse(url).path,  # Just the path, not full URL
        "body": body,  # Request parameters
    }
    return hashlib.sha256(json.dumps(normalized, sort_keys=True).encode()).hexdigest()
```
**Key insight:** The hashing is intentionally precise. Different whitespace, float precision, or parameter order produces different hashes. This prevents subtle bugs from false cache hits.
```python
# These produce DIFFERENT hashes:
{"content": "Hello world"}
{"content": "Hello world\n"}
{"temperature": 0.7}
{"temperature": 0.7000001}
```
### Client Interception
The system patches OpenAI and Ollama client methods to intercept calls before they leave your application. This happens transparently - your test code doesn't change.
### Storage Architecture
Recordings use a two-tier storage system optimized for both speed and debuggability:
```
recordings/
├── index.sqlite          # Fast lookup by request hash
└── responses/
    ├── abc123def456.json # Individual response files
    └── def789ghi012.json
```
**SQLite index** enables O(log n) hash lookups and metadata queries without loading response bodies.
**JSON files** store complete request/response pairs in human-readable format for debugging.
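Conceptually, a lookup hits the index first and only then reads a single JSON file. A minimal sketch of the idea, assuming hypothetical table and column names in `index.sqlite`:
```python
import json
import sqlite3
from pathlib import Path


def load_recording(storage_dir: str, request_hash: str):
    """Illustrative two-tier lookup: SQLite finds the hash, JSON holds the body."""
    index = sqlite3.connect(str(Path(storage_dir) / "index.sqlite"))
    try:
        # Column names here are assumptions for illustration.
        row = index.execute(
            "SELECT response_file FROM recordings WHERE request_hash = ?",
            (request_hash,),
        ).fetchone()
    finally:
        index.close()
    if row is None:
        return None  # caller decides whether to record or raise
    return json.loads((Path(storage_dir) / "responses" / row[0]).read_text())
```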
## Recording Modes
### LIVE Mode
Direct API calls with no recording or replay:
```python
with inference_recording(mode=InferenceMode.LIVE):
    response = await client.chat.completions.create(...)
```
Use for initial development and debugging against real APIs.
### RECORD Mode
Captures API interactions while passing through real responses:
```python
with inference_recording(mode=InferenceMode.RECORD, storage_dir="./recordings"):
    response = await client.chat.completions.create(...)
    # Real API call made, response captured AND returned
```
The recording process:
1. Request intercepted and hashed
2. Real API call executed
3. Response captured and serialized
4. Recording stored to disk
5. Original response returned to caller
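Condensed into code, the non-streaming record path looks roughly like this (a sketch that reuses the `storage.store_recording` and `_serialize_response` helpers shown later on this page; the real signatures may differ):
```python
async def handle_record(real_call, request_hash, request_data, storage):
    # Steps 1-2: the request was already intercepted and hashed; make the real API call.
    response = await real_call()
    # Steps 3-4: serialize the response and persist it next to the request details.
    storage.store_recording(request_hash, request_data, {
        "body": _serialize_response(response),
        "is_streaming": False,
    })
    # Step 5: hand the original response back to the caller unchanged.
    return response
```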
### REPLAY Mode
Returns stored responses instead of making API calls:
```python
with inference_recording(mode=InferenceMode.REPLAY, storage_dir="./recordings"):
    response = await client.chat.completions.create(...)
    # No API call made, cached response returned instantly
```
The replay process:
1. Request intercepted and hashed
2. Hash looked up in SQLite index
3. Response loaded from JSON file
4. Response deserialized and returned
5. Error if no recording found
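In code form, the replay path is a lookup that either returns a rebuilt response or fails loudly. A sketch, assuming a `load_recording` counterpart to the storage call used elsewhere on this page and the deserializer described below:
```python
async def handle_replay(request_hash, endpoint, model, storage):
    # Steps 2-3: find the hash in the index and load the JSON body.
    recording = storage.load_recording(request_hash)
    if recording is None:
        # Step 5: this is the error surfaced in the troubleshooting guide.
        raise RuntimeError(
            f"No recorded response found for request hash: {request_hash}\n"
            f"Endpoint: {endpoint}\nModel: {model}"
        )
    # Step 4: rebuild the original response object before returning it.
    return _deserialize_response(recording["response"]["body"])
```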
## Streaming Support
Streaming APIs present a unique challenge: how do you capture an async generator?
### The Problem
```python
# How do you record this?
async for chunk in client.chat.completions.create(stream=True):
    process(chunk)
```
### The Solution
The system captures all chunks immediately before yielding any:
```python
async def handle_streaming_record(response):
    # Capture complete stream first
    chunks = []
    async for chunk in response:
        chunks.append(chunk)

    # Store complete recording
    storage.store_recording(request_hash, request_data, {
        "body": chunks,
        "is_streaming": True,
    })

    # Return generator that replays captured chunks
    async def replay_stream():
        for chunk in chunks:
            yield chunk

    return replay_stream()
```
This ensures:
- **Complete capture** - The entire stream is saved atomically
- **Interface preservation** - The returned object behaves like the original API
- **Deterministic replay** - Same chunks in the same order every time
## Serialization
API responses contain complex Pydantic objects that need careful serialization:
```python
def _serialize_response(response):
    if hasattr(response, "model_dump"):
        # Preserve type information for proper deserialization
        return {
            "__type__": f"{response.__class__.__module__}.{response.__class__.__qualname__}",
            "__data__": response.model_dump(mode="json"),
        }
    return response
```
This preserves type safety - when replayed, you get the same Pydantic objects with all their validation and methods.
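The matching deserializer is roughly the inverse (a sketch, not the actual implementation; it assumes top-level Pydantic classes reachable by their module path):
```python
import importlib


def _deserialize_response(data):
    if isinstance(data, dict) and "__type__" in data:
        module_name, _, class_name = data["__type__"].rpartition(".")
        cls = getattr(importlib.import_module(module_name), class_name)
        # model_validate re-runs Pydantic validation on the stored JSON payload.
        return cls.model_validate(data["__data__"])
    return data
```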
## Environment Integration
### Environment Variables
Control recording behavior globally:
```bash
export LLAMA_STACK_TEST_INFERENCE_MODE=replay
export LLAMA_STACK_TEST_RECORDING_DIR=/path/to/recordings
pytest tests/integration/
```
### Pytest Integration
The system integrates automatically based on environment variables, requiring no changes to test code.
## Debugging Recordings
### Inspecting Storage
```bash
# See what's recorded
sqlite3 recordings/index.sqlite "SELECT endpoint, model, timestamp FROM recordings LIMIT 10;"
# View specific response
cat recordings/responses/abc123def456.json | jq '.response.body'
# Find recordings by endpoint
sqlite3 recordings/index.sqlite "SELECT * FROM recordings WHERE endpoint='/v1/chat/completions';"
```
### Common Issues
**Hash mismatches:** Request parameters changed slightly between record and replay
```bash
# Compare request details
cat recordings/responses/abc123.json | jq '.request'
```
**Serialization errors:** Response types changed between versions
```bash
# Re-record with updated types
rm recordings/responses/failing_hash.json
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_failing.py
```
**Missing recordings:** New test or changed parameters
```bash
# Record the missing interaction
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_new.py
```
## Design Decisions
### Why Not Mocks?
Traditional mocking breaks down with AI APIs because:
- Response structures are complex and evolve frequently
- Streaming behavior is hard to mock correctly
- Edge cases in real APIs get missed
- Mocks become brittle maintenance burdens
### Why Precise Hashing?
Loose hashing (normalizing whitespace, rounding floats) seems convenient but hides bugs. If a test changes slightly, you want to know about it rather than accidentally getting the wrong cached response.
### Why JSON + SQLite?
- **JSON** - Human readable, diff-friendly, easy to inspect and modify
- **SQLite** - Fast indexed lookups without loading response bodies
- **Hybrid** - Best of both worlds for different use cases
This system provides reliable, fast testing against real AI APIs while maintaining the ability to debug issues when they arise.


@@ -1,528 +1,140 @@
# Testing Troubleshooting Guide
This guide covers common issues encountered when working with Llama Stack's testing infrastructure and how to resolve them.
## Quick Diagnosis
### Test Status Quick Check
```bash
# Check if tests can run at all
uv run pytest tests/integration/inference/test_embedding.py::test_basic_embeddings -v
# Check available models and providers
uv run llama stack list-providers
uv run llama stack list-models
# Verify server connectivity
curl http://localhost:5001/v1/health
```
## Recording and Replay Issues
### "No recorded response found for request hash"
**Symptom:**
```
RuntimeError: No recorded response found for request hash: abc123def456
Endpoint: /v1/chat/completions
Model: meta-llama/Llama-3.1-8B-Instruct
```
**Causes and Solutions:**
1. **Missing recording** - Most common cause
```bash
# Record the missing interaction
LLAMA_STACK_TEST_INFERENCE_MODE=record \
LLAMA_STACK_TEST_RECORDING_DIR=./test_recordings \
pytest tests/integration/inference/test_failing.py -v
```
2. **Request parameters changed** - Different whitespace, float precision, or parameter order produces a different hash; the hashing is intentionally precise (see the hash-check snippet after this list)
```bash
# Check what changed by comparing requests
sqlite3 test_recordings/index.sqlite \
"SELECT request_hash, endpoint, model, timestamp FROM recordings WHERE endpoint='/v1/chat/completions';"
# View specific request details
cat test_recordings/responses/abc123def456.json | jq '.request'
```
3. **Different environment/provider**
```bash
# Ensure consistent test environment
pytest tests/integration/ --stack-config=starter --text-model=llama3.2:3b
```
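If you suspect cause 2, you can recompute the hash for the request you expect to be recorded, using the same `normalize_request` helper the recorder uses (the endpoint and body below are example values), and look for it in the index:
```python
from llama_stack.testing.inference_recorder import normalize_request

request_hash = normalize_request(
    "POST",
    "http://localhost:11434/v1/chat/completions",  # example endpoint
    {},
    {"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello"}]},
)
# Compare against: sqlite3 test_recordings/index.sqlite "SELECT request_hash FROM recordings;"
print(request_hash)
```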
### Recording Failures
**Symptom:**
```
sqlite3.OperationalError: database is locked
``` ```
**Solutions:**
1. **Concurrent access** - Multiple test processes
```bash
# Run tests sequentially
pytest tests/integration/ -n 1
# Or use separate recording directories
LLAMA_STACK_TEST_RECORDING_DIR=./recordings_$(date +%s) pytest ...
```
2. **Incomplete recording cleanup**
```bash
# Clear and restart recording
rm -rf test_recordings/
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/inference/test_specific.py
```
### Serialization/Deserialization Errors
**Symptom:**
```
Failed to deserialize object of type llama_stack.apis.inference.OpenAIChatCompletion
```
**Causes and Solutions:**
1. **API response format changed**
```bash
# Re-record with updated format
rm test_recordings/responses/abc123*.json
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/inference/test_failing.py
```
2. **Missing dependencies for deserialization**
```bash
# Ensure all required packages installed
uv install --group dev
```
3. **Version mismatch between record and replay**
```bash
# Check Python environment consistency
uv run python -c "import llama_stack; print(llama_stack.__version__)"
```
## Server Connection Issues
### "Connection refused" Errors
**Symptom:**
```
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5001)
```
**Diagnosis and Solutions:**
1. **Server not running**
```bash
# Check if server is running
curl http://localhost:5001/v1/health
# Start server manually for debugging
llama stack run --template starter --port 5001
```
2. **Port conflicts**
```bash
# Check what's using the port
lsof -i :5001
# Use different port
pytest tests/integration/ --stack-config=server:starter:8322
```
3. **Server startup timeout**
```bash
# Increase startup timeout or check server logs
tail -f server.log
# Manual server management
llama stack run --template starter &
sleep 30 # Wait for startup
pytest tests/integration/
```
### Auto-Server Startup Issues
**Symptom:**
```
Server failed to respond within 30 seconds
```
**Solutions:**
1. **Check server logs**
```bash
# Server logs are written to server.log
tail -f server.log
# Look for startup errors
grep -i error server.log
```
2. **Dependencies missing**
```bash
# Ensure all dependencies installed
uv install --group dev
# Check specific provider requirements
pip list | grep -i fireworks
```
3. **Resource constraints**
```bash
# Check system resources
htop
df -h
# Use lighter config for testing
pytest tests/integration/ --stack-config=starter
```
## Provider and Model Issues
### "Model not found" Errors
**Symptom:**
```
Model 'meta-llama/Llama-3.1-8B-Instruct' not found
```
**Solutions:**
1. **Check available models**
```bash
# List models for current provider
uv run llama stack list-models
# Use available model
pytest tests/integration/ --text-model=llama3.2:3b
```
2. **Model not downloaded for local providers**
```bash
# Download missing model
ollama pull llama3.2:3b
# Verify model available
ollama list
```
3. **Provider configuration issues**
```bash
# Check provider setup
uv run llama stack list-providers
# Verify API keys set
echo $FIREWORKS_API_KEY
```
### Provider Authentication Failures
**Symptom:**
```
HTTP 401: Invalid API key
```
**Cause:** Missing or invalid API key for the provider you're testing.
**Solutions:**
1. **Missing API keys**
```bash
# Set the required API key
export FIREWORKS_API_KEY=your_key_here
export OPENAI_API_KEY=your_key_here
# Verify it's set
echo $FIREWORKS_API_KEY
```
2. **Invalid API keys**
```bash
# Test API key directly
curl -H "Authorization: Bearer $FIREWORKS_API_KEY" \
https://api.fireworks.ai/inference/v1/models
```
3. **API key environment issues**
```bash
# Pass environment explicitly
pytest tests/integration/ --env FIREWORKS_API_KEY=your_key
```
## Parametrization Issues
### "No tests ran matching the given pattern"
**Symptom:**
```
collected 0 items
```
**Causes and Solutions:**
1. **No models specified**
```bash
# Specify required models
pytest tests/integration/inference/ --text-model=llama3.2:3b
```
2. **Model/provider mismatch**
```bash
# Use compatible model for provider
pytest tests/integration/ \
--stack-config=starter \
--text-model=llama3.2:3b # Available in Ollama
```
3. **Missing fixtures**
```bash
# Check test requirements
pytest tests/integration/inference/test_embedding.py --collect-only
```
### Excessive Test Combinations
**Symptom:**
Tests run for too many parameter combinations, taking too long.
**Solutions:**
1. **Limit model combinations**
```bash
# Test single model instead of list
pytest tests/integration/ --text-model=llama3.2:3b
```
2. **Use specific test selection**
```bash
# Run specific test pattern
pytest tests/integration/ -k "basic and not vision"
```
3. **Separate test runs**
```bash
# Split by functionality
pytest tests/integration/inference/ --text-model=model1
pytest tests/integration/agents/ --text-model=model2
```
## Performance Issues
### Slow Test Execution
**Symptom:**
Tests take much longer than expected.
**Diagnosis and Solutions:**
1. **Using LIVE mode instead of REPLAY**
```bash
# Verify recording mode
echo $LLAMA_STACK_TEST_INFERENCE_MODE
# Force replay mode
LLAMA_STACK_TEST_INFERENCE_MODE=replay pytest tests/integration/
```
2. **Network latency to providers**
```bash
# Use local providers for development
pytest tests/integration/ --stack-config=starter
```
3. **Large recording files**
```bash
# Check recording directory size
du -sh test_recordings/
# Clean up old recordings
find test_recordings/ -name "*.json" -mtime +30 -delete
```
### Memory Usage Issues
**Symptom:**
```
MemoryError: Unable to allocate memory
```
**Solutions:**
1. **Large recordings in memory**
```bash
# Run tests in smaller batches
pytest tests/integration/inference/ -k "not batch"
```
2. **Model memory requirements**
```bash
# Use smaller models for testing
pytest tests/integration/ --text-model=llama3.2:3b # Instead of 70B
```
## Environment Issues
### Python Environment Problems
**Symptom:**
```
ModuleNotFoundError: No module named 'llama_stack'
```
**Solutions:**
1. **Wrong Python environment**
```bash
# Verify uv environment
uv run python -c "import llama_stack; print('OK')"
# Reinstall if needed
uv install --group dev
```
2. **Development installation issues**
```bash
# Reinstall in development mode
pip install -e .
# Verify installation
python -c "import llama_stack; print(llama_stack.__file__)"
```
### Path and Import Issues
**Symptom:**
```
ImportError: cannot import name 'LlamaStackClient'
```
**Solutions:**
1. **PYTHONPATH issues**
```bash
# Run from project root
cd /path/to/llama-stack
uv run pytest tests/integration/
```
2. **Relative import issues**
```bash
# Use absolute imports in tests
from llama_stack_client import LlamaStackClient # Not relative
```
## Debugging Techniques
### Verbose Logging
Enable detailed logging to understand what's happening:
```bash
# Enable debug logging
LLAMA_STACK_LOG_LEVEL=DEBUG pytest tests/integration/inference/test_failing.py -v -s
# Enable request/response logging
LLAMA_STACK_TEST_INFERENCE_MODE=live \
LLAMA_STACK_LOG_LEVEL=DEBUG \
pytest tests/integration/inference/test_failing.py -v -s
```
### Interactive Debugging
Drop into debugger when tests fail:
```bash
# Run with pdb on failure
pytest tests/integration/inference/test_failing.py --pdb
```
```python
# Or add a breakpoint in test code
def test_something(llama_stack_client):
    import pdb; pdb.set_trace()
    # ... test code
```
### Isolation Testing
Run tests in isolation to identify interactions:
```bash
# Run single test
pytest tests/integration/inference/test_embedding.py::test_basic_embeddings
# Run without recordings
rm -rf test_recordings/
LLAMA_STACK_TEST_INFERENCE_MODE=live pytest tests/integration/inference/test_failing.py
```
### Recording Inspection
Examine recordings to understand what's stored:
```bash
# Check recording database
sqlite3 test_recordings/index.sqlite ".tables"
sqlite3 test_recordings/index.sqlite ".schema recordings"
sqlite3 test_recordings/index.sqlite "SELECT * FROM recordings LIMIT 5;"
# Examine specific recording
find test_recordings/responses/ -name "*.json" | head -1 | xargs cat | jq '.'
# Compare request hashes
python -c "
from llama_stack.testing.inference_recorder import normalize_request
print(normalize_request('POST', 'http://localhost:11434/v1/chat/completions', {}, {'model': 'llama3.2:3b', 'messages': [{'role': 'user', 'content': 'Hello'}]}))
"
``` ```
## Getting Help
### Information to Gather
When reporting issues, include:
1. **Environment details:**
```bash
uv run python --version
uv run python -c "import llama_stack; print(llama_stack.__version__)"
uv list
```
2. **Test command and output:**
```bash
# Full command that failed
pytest tests/integration/inference/test_failing.py -v
# Error message and stack trace
```
3. **Configuration details:**
```bash
# Stack configuration used
echo $LLAMA_STACK_TEST_INFERENCE_MODE
ls -la test_recordings/
```
4. **Provider status:**
```bash
uv run llama stack list-providers
uv run llama stack list-models
```
### Common Solutions Summary
| Issue | Quick Fix |
|-------|-----------|
| Missing recordings | `LLAMA_STACK_TEST_INFERENCE_MODE=record pytest ...` |
| Connection refused | Check server: `curl http://localhost:5001/v1/health` |
| No tests collected | Add model: `--text-model=llama3.2:3b` |
| Authentication error | Set API key: `export PROVIDER_API_KEY=...` |
| Serialization error | Re-record: `rm recordings/*.json && record mode` |
| Slow tests | Use replay: `LLAMA_STACK_TEST_INFERENCE_MODE=replay` |
Most testing issues stem from configuration mismatches or missing recordings. The record-replay system is designed to be forgiving, but requires consistent environment setup for optimal performance.


@@ -1,125 +0,0 @@
# Writing Tests
How to write effective tests for Llama Stack.
## Basic Test Pattern
```python
def test_basic_completion(llama_stack_client, text_model_id):
    """Test basic text completion functionality."""
    response = llama_stack_client.inference.completion(
        model_id=text_model_id,
        content=CompletionMessage(role="user", content="Hello"),
    )

    # Test structure, not AI output quality
    assert response.completion_message is not None
    assert isinstance(response.completion_message.content, str)
    assert len(response.completion_message.content) > 0
```
## Parameterized Tests
```python
@pytest.mark.parametrize("temperature", [0.0, 0.5, 1.0])
def test_completion_temperature(llama_stack_client, text_model_id, temperature):
    response = llama_stack_client.inference.completion(
        model_id=text_model_id,
        content=CompletionMessage(role="user", content="Hello"),
        sampling_params={"temperature": temperature},
    )
    assert response.completion_message is not None
```
## Provider-Specific Tests
```python
def test_asymmetric_embeddings(llama_stack_client, embedding_model_id):
    if embedding_model_id not in MODELS_SUPPORTING_TASK_TYPE:
        pytest.skip(f"Model {embedding_model_id} doesn't support task types")

    query_response = llama_stack_client.inference.embeddings(
        model_id=embedding_model_id,
        contents=["What is machine learning?"],
        task_type="query",
    )
    passage_response = llama_stack_client.inference.embeddings(
        model_id=embedding_model_id,
        contents=["Machine learning is a subset of AI..."],
        task_type="passage",
    )

    assert query_response.embeddings != passage_response.embeddings
```
## Fixtures
```python
@pytest.fixture(scope="session")
def agent_config(llama_stack_client, text_model_id):
    """Reusable agent configuration."""
    return {
        "model": text_model_id,
        "instructions": "You are a helpful assistant",
        "tools": [],
        "enable_session_persistence": False,
    }


@pytest.fixture(scope="function")
def fresh_session(llama_stack_client):
    """Each test gets fresh state."""
    session = llama_stack_client.create_session()
    yield session
    session.delete()
```
## Common Test Patterns
### Streaming Tests
```python
def test_streaming_completion(llama_stack_client, text_model_id):
    stream = llama_stack_client.inference.completion(
        model_id=text_model_id,
        content=CompletionMessage(role="user", content="Count to 5"),
        stream=True,
    )

    chunks = list(stream)
    assert len(chunks) > 1
    assert all(hasattr(chunk, "delta") for chunk in chunks)
```
### Error Testing
```python
def test_invalid_model_error(llama_stack_client):
    with pytest.raises(Exception) as exc_info:
        llama_stack_client.inference.completion(
            model_id="nonexistent-model",
            content=CompletionMessage(role="user", content="Hello"),
        )
    assert "model" in str(exc_info.value).lower()
```
## What NOT to Test
```python
# BAD: Testing AI output quality
def test_completion_quality(llama_stack_client, text_model_id):
    response = llama_stack_client.inference.completion(...)
    assert "correct answer" in response.content  # Fragile!


# GOOD: Testing response structure
def test_completion_structure(llama_stack_client, text_model_id):
    response = llama_stack_client.inference.completion(...)
    assert isinstance(response.completion_message.content, str)
    assert len(response.completion_message.content) > 0
```
## Best Practices
- Test API contracts, not AI output quality
- Use descriptive test names
- Keep tests simple and focused
- Record new interactions only when needed
- Use appropriate fixture scopes (session vs function)


@@ -1,9 +1,64 @@
# Llama Stack Tests
There are two main types of tests:

| Type | Location | Purpose |
|------|----------|---------|
| **Unit** | [`tests/unit/`](unit/README.md) | Fast, isolated component testing |
| **Integration** | [`tests/integration/`](integration/README.md) | End-to-end workflows with record-replay |

Both have their place. For unit tests, it is important to create minimal mocks and instead rely more on "fakes". Mocks are too brittle. In either case, tests must be very fast and reliable.
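As an illustration of the "fake over mock" idea, here is a tiny in-memory stand-in for a key-value backend; the class and test are illustrative, not part of the Llama Stack codebase:
```python
class FakeKVStore:
    """In-memory stand-in for a persistent KV store: real behavior, no I/O."""

    def __init__(self):
        self._data: dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str, default=None):
        return self._data.get(key, default)


def test_session_state_round_trip():
    store = FakeKVStore()
    store.set("session:1", "state")
    # Unlike a mock, the fake enforces real get/set semantics across calls.
    assert store.get("session:1") == "state"
    assert store.get("missing") is None
```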
### Record-replay for integration tests

Testing AI applications end-to-end creates some challenges:
- **API costs** accumulate quickly during development and CI
- **Non-deterministic responses** make tests unreliable
- **Multiple providers** require testing the same logic across different APIs
Our solution: **Record real API responses once, replay them for fast, deterministic tests.** This is better than mocking because AI APIs have complex response structures and streaming behavior; mocks can miss edge cases that real APIs exhibit. A single test can also exercise the underlying APIs in several complex ways, which makes them very hard to mock.
This gives you:
- Cost control - No repeated API calls during development
- Speed - Instant test execution with cached responses
- Reliability - Consistent results regardless of external service state
- Provider coverage - Same tests work across OpenAI, Anthropic, local models, etc.
### Testing Quick Start
You can run the unit tests with:
```bash
uv run --group unit pytest -sv tests/unit/
```
For running integration tests, you must provide a few things:
- A stack config, which is a pointer to a stack. You can point to a stack in a few ways:
- **`server:<config>`** - automatically start a server with the given config (e.g., `server:starter`). This provides one-step testing by auto-starting the server if the port is available, or reusing an existing server if already running.
- **`server:<config>:<port>`** - same as above but with a custom port (e.g., `server:starter:8322`)
- a URL which points to a Llama Stack distribution server
- a distribution name (e.g., `starter`) or a path to a `run.yaml` file
- a comma-separated list of api=provider pairs, e.g. `inference=fireworks,safety=llama-guard,agents=meta-reference`. This is most useful for testing a single API surface.
- Whether you are using replay or live mode for inference, specified with the `LLAMA_STACK_TEST_INFERENCE_MODE` environment variable. The default mode is currently "live" -- that is certainly surprising, but we will fix this soon.
- Any API keys you need should be set in the environment, or can be passed in with the `--env` option.
You can run the integration tests in replay mode with:
```bash
# Run all tests with existing recordings
LLAMA_STACK_TEST_INFERENCE_MODE=replay \
LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings \
uv run --group test \
pytest -sv tests/integration/ --stack-config=starter
```
If you don't specify `LLAMA_STACK_TEST_INFERENCE_MODE`, tests run in "live" mode by default -- that is, they make real API calls.
```bash
# Test against live APIs
FIREWORKS_API_KEY=your_key pytest -sv tests/integration/inference --stack-config=starter
```
### Next Steps
- [Integration Testing Guide](integration/README.md) - Detailed usage and configuration
- [Unit Testing Guide](unit/README.md) - Fast component testing


@@ -1,6 +1,23 @@
# Integration Testing Guide
Integration tests verify complete workflows across different providers using Llama Stack's record-replay system.
## Quick Start
```bash
# Run all integration tests with existing recordings
uv run pytest tests/integration/
# Test against live APIs with auto-server
export FIREWORKS_API_KEY=your_key
pytest tests/integration/inference/ \
--stack-config=server:fireworks \
--text-model=meta-llama/Llama-3.1-8B-Instruct
```
## Configuration Options
You can see all options with:
```bash
cd tests/integration
```
@@ -114,3 +131,86 @@
```bash
pytest -s -v tests/integration/vector_io/ \
  --stack-config=inference=sentence-transformers,vector_io=sqlite-vec \
  --embedding-model=$EMBEDDING_MODELS
```
## Recording Modes
The testing system supports three modes controlled by environment variables:
### LIVE Mode (Default)
Tests make real API calls:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=live pytest tests/integration/
```
### RECORD Mode
Captures API interactions for later replay:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record \
LLAMA_STACK_TEST_RECORDING_DIR=./recordings \
pytest tests/integration/inference/test_new_feature.py
```
### REPLAY Mode
Uses cached responses instead of making API calls:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=replay \
LLAMA_STACK_TEST_RECORDING_DIR=./recordings \
pytest tests/integration/
```
## Managing Recordings
### Viewing Recordings
```bash
# See what's recorded
sqlite3 recordings/index.sqlite "SELECT endpoint, model, timestamp FROM recordings;"
# Inspect specific response
cat recordings/responses/abc123.json | jq '.'
```
### Re-recording Tests
```bash
# Re-record specific tests
rm -rf recordings/
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/test_modified.py
```
## Writing Tests
### Basic Test Pattern
```python
def test_basic_completion(llama_stack_client, text_model_id):
    response = llama_stack_client.inference.completion(
        model_id=text_model_id,
        content=CompletionMessage(role="user", content="Hello"),
    )

    # Test structure, not AI output quality
    assert response.completion_message is not None
    assert isinstance(response.completion_message.content, str)
    assert len(response.completion_message.content) > 0
```
### Provider-Specific Tests
```python
def test_asymmetric_embeddings(llama_stack_client, embedding_model_id):
    if embedding_model_id not in MODELS_SUPPORTING_TASK_TYPE:
        pytest.skip(f"Model {embedding_model_id} doesn't support task types")

    query_response = llama_stack_client.inference.embeddings(
        model_id=embedding_model_id,
        contents=["What is machine learning?"],
        task_type="query",
    )
    assert query_response.embeddings is not None
```
## Best Practices
- **Test API contracts, not AI output quality** - Focus on response structure, not content
- **Use existing recordings for development** - Fast iteration without API costs
- **Record new interactions only when needed** - Adding new functionality
- **Test across providers** - Ensure compatibility
- **Commit recordings to version control** - Deterministic CI builds