docs(tests): Add a bunch of documentation for our testing systems
This commit is contained in:
parent e1e161553c
commit f4281ce66a
7 changed files with 1006 additions and 1 deletions
@@ -19,7 +19,8 @@ new_vector_database

 ## Testing

-See the [Test Page](testing.md) which describes how to test your changes.
+Llama Stack uses a record-replay testing system for reliable, cost-effective testing. See the [Testing Documentation](testing.md) for comprehensive guides on writing and running tests.

 ```{toctree}
 :maxdepth: 1
 :hidden:
@@ -1,3 +1,35 @@
+# Testing
+
+Llama Stack uses a record-replay system for reliable, fast, and cost-effective testing of AI applications.
+
+## Testing Documentation
+
+```{toctree}
+:maxdepth: 1
+
+testing/index
+testing/integration-testing
+testing/record-replay
+testing/writing-tests
+testing/troubleshooting
+```
+
+## Quick Start
+
+```bash
+# Run tests with existing recordings
+uv run pytest tests/integration/
+
+# Test against live APIs
+FIREWORKS_API_KEY=... pytest tests/integration/ --stack-config=server:fireworks
+```
+
+For detailed information, see the [Testing Overview](testing/index.md).
+
+---
+
+## Original Documentation
+
 ```{include} ../../../tests/README.md
 ```
docs/source/contributing/testing/index.md (new file, 103 lines)

@@ -0,0 +1,103 @@
# Testing in Llama Stack

Llama Stack uses a record-replay testing system to handle AI API costs, non-deterministic responses, and multiple provider integrations.

## Core Problems

Testing AI applications creates three challenges:

- **API costs** accumulate quickly during development and CI
- **Non-deterministic responses** make tests unreliable
- **Multiple providers** require testing the same logic across different APIs

## Solution

Record real API responses once, replay them for fast, deterministic tests.

## Architecture Overview

### Test Types

- **Unit tests** (`tests/unit/`) - Test components in isolation with mocks (see the sketch below)
- **Integration tests** (`tests/integration/`) - Test complete workflows with record-replay
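
As an illustration of the unit-test style, here is a minimal sketch; `route_chat` is a hypothetical stand-in for a real component, not code from the repository:

```python
from unittest.mock import MagicMock


def route_chat(providers: dict, model_id: str, messages: list) -> dict:
    """Hypothetical component under test: dispatch a chat call to the provider that serves the model."""
    return providers[model_id].chat_completion(messages=messages)


def test_route_chat_uses_the_selected_provider():
    provider = MagicMock()
    provider.chat_completion.return_value = {"content": "hi"}

    result = route_chat({"llama3.2:3b": provider}, "llama3.2:3b", [{"role": "user", "content": "Hello"}])

    assert result == {"content": "hi"}
    provider.chat_completion.assert_called_once()
```

Integration tests, by contrast, exercise full workflows and rely on the record-replay system described below.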

### Core Components

#### Record-Replay System

Captures API calls and replays them deterministically:

```python
# Record real API responses
with inference_recording(mode=InferenceMode.RECORD, storage_dir="recordings"):
    response = await client.chat.completions.create(...)

# Replay cached responses
with inference_recording(mode=InferenceMode.REPLAY, storage_dir="recordings"):
    response = await client.chat.completions.create(...)  # No API call made
```

#### Provider Testing

Write tests once, run against any provider:

```bash
# Same test, different providers
pytest tests/integration/inference/ --stack-config=openai --text-model=gpt-4
pytest tests/integration/inference/ --stack-config=starter --text-model=llama3.2:3b
```

#### Test Parametrization

Generate test combinations from CLI arguments:

```bash
# Creates a test for each model/provider combination
pytest tests/integration/ \
  --stack-config=inference=fireworks \
  --text-model=llama-3.1-8b,llama-3.1-70b
```

## How It Works

### Recording Storage

Recordings use SQLite for lookup and JSON for storage:

```
recordings/
├── index.sqlite           # Fast lookup by request hash
└── responses/
    ├── abc123def456.json  # Individual response files
    └── def789ghi012.json
```
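
Recordings can also be inspected by hand. The rough sketch below assumes the `recordings` table and `request_hash` column shown in the troubleshooting guide; the exact schema may differ:

```python
import json
import sqlite3
from pathlib import Path

recordings = Path("recordings")
con = sqlite3.connect(str(recordings / "index.sqlite"))

# List a few recorded requests and peek at the matching JSON files.
for (request_hash,) in con.execute("SELECT request_hash FROM recordings LIMIT 5"):
    payload = json.loads((recordings / "responses" / f"{request_hash}.json").read_text())
    print(request_hash, sorted(payload.keys()))  # keys typically include 'request'; exact schema may vary
```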

### Why Record-Replay?

Mocking AI APIs is brittle. Real API responses:

- Include edge cases and realistic data structures
- Preserve streaming behavior
- Can be inspected and debugged

### Why Test All Providers?

One test verifies behavior across all providers, catching integration bugs early.

## Workflow

1. **Develop tests** in `LIVE` mode against real APIs
2. **Record responses** with `RECORD` mode
3. **Commit recordings** for deterministic CI
4. **Tests replay** cached responses in CI

## Quick Start

```bash
# Run tests with existing recordings
uv run pytest tests/integration/

# Test against live APIs
FIREWORKS_API_KEY=... pytest tests/integration/ --stack-config=server:fireworks
```

See [Integration Testing](integration-testing.md) for usage details and [Record-Replay](record-replay.md) for system internals.
docs/source/contributing/testing/integration-testing.md (new file, 136 lines)

@@ -0,0 +1,136 @@
# Integration Testing Guide

Practical usage of Llama Stack's integration testing system.

## Basic Usage

```bash
# Run all integration tests
uv run pytest tests/integration/

# Run specific test suites
uv run pytest tests/integration/inference/
uv run pytest tests/integration/agents/
```

## Live API Testing

```bash
# Auto-start server
export FIREWORKS_API_KEY=your_key
pytest tests/integration/inference/ \
  --stack-config=server:fireworks \
  --text-model=meta-llama/Llama-3.1-8B-Instruct

# Library client
export TOGETHER_API_KEY=your_key
pytest tests/integration/inference/ \
  --stack-config=starter \
  --text-model=meta-llama/Llama-3.1-8B-Instruct
```

## Configuration

### Stack Config

```bash
--stack-config=server:fireworks        # Auto-start server
--stack-config=server:together:8322    # Custom port
--stack-config=starter                 # Template
--stack-config=/path/to/run.yaml       # Config file
--stack-config=inference=fireworks     # Adhoc providers
--stack-config=http://localhost:5001   # Existing server
```

### Models

```bash
--text-model=meta-llama/Llama-3.1-8B-Instruct
--vision-model=meta-llama/Llama-3.2-11B-Vision-Instruct
--embedding-model=sentence-transformers/all-MiniLM-L6-v2
```

### Environment

```bash
--env FIREWORKS_API_KEY=your_key
--env OPENAI_BASE_URL=http://localhost:11434/v1
```

## Test Scenarios

### New Provider Testing

```bash
# Test a new provider
pytest tests/integration/inference/ \
  --stack-config=inference=your-new-provider \
  --text-model=your-model-id
```

### Multiple Models

```bash
# Test multiple models
pytest tests/integration/inference/ \
  --text-model=llama-3.1-8b,llama-3.1-70b
```

### Local Development

```bash
# Test with local Ollama
pytest tests/integration/inference/ \
  --stack-config=starter \
  --text-model=llama3.2:3b
```

## Recording Modes

```bash
# Live API calls (default)
LLAMA_STACK_TEST_INFERENCE_MODE=live pytest tests/integration/

# Record new responses
LLAMA_STACK_TEST_INFERENCE_MODE=record \
LLAMA_STACK_TEST_RECORDING_DIR=./recordings \
pytest tests/integration/inference/test_new.py

# Replay cached responses
LLAMA_STACK_TEST_INFERENCE_MODE=replay \
LLAMA_STACK_TEST_RECORDING_DIR=./recordings \
pytest tests/integration/
```

## Recording Management

```bash
# View recordings
sqlite3 recordings/index.sqlite "SELECT * FROM recordings;"
cat recordings/responses/abc123.json

# Re-record tests
rm -rf recordings/
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/test_specific.py
```

## Debugging

```bash
# Verbose output
pytest -vvs tests/integration/inference/

# Debug logging
LLAMA_STACK_LOG_LEVEL=DEBUG pytest tests/integration/test_failing.py

# Custom port for conflicts
pytest tests/integration/ --stack-config=server:fireworks:8322
```

## Best Practices

- Use existing recordings for development
- Record new interactions only when needed
- Test across multiple providers
- Use descriptive test names
- Commit recordings to version control
docs/source/contributing/testing/record-replay.md (new file, 80 lines)

@@ -0,0 +1,80 @@
# Record-Replay System

The record-replay system captures real API interactions and replays them deterministically for fast, reliable testing.

## How It Works

### Request Hashing

API requests are hashed to enable consistent lookup:

```python
import hashlib
import json
from urllib.parse import urlparse


def normalize_request(method: str, url: str, headers: dict, body: dict) -> str:
    normalized = {
        "method": method.upper(),
        "endpoint": urlparse(url).path,
        "body": body,
    }
    return hashlib.sha256(json.dumps(normalized, sort_keys=True).encode()).hexdigest()
```

Hashing is precise - different whitespace or float precision produces different hashes.
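
For example, a quick check that near-identical requests hash differently (the import path follows the troubleshooting guide; the exact signature may differ in the source):

```python
from llama_stack.testing.inference_recorder import normalize_request

url = "http://localhost:11434/v1/chat/completions"
base = {"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello"}], "temperature": 0.7}
tweaked = {**base, "temperature": 0.7000001}  # tiny float difference

print(normalize_request("POST", url, {}, base) == normalize_request("POST", url, {}, tweaked))  # False
```

If a test intentionally changes its request parameters, re-record; otherwise replay lookups will miss.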

### Client Interception

The system patches OpenAI and Ollama client methods to intercept API calls before they leave the client.
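
Conceptually, the patching looks something like the sketch below. This is illustrative only, not the recorder's actual code; the `storage` object and its `hash_request`/`load`/`save` helpers are hypothetical:

```python
import functools


def patch_method(owner: type, name: str, storage):
    """Wrap a client method so calls are recorded or replayed. Returns an unpatch function."""
    original = getattr(owner, name)

    @functools.wraps(original)
    async def wrapper(self, *args, **kwargs):
        key = storage.hash_request(name, kwargs)  # hypothetical request-hash helper
        if storage.mode == "replay":
            return storage.load(key)  # cached response, no network call
        response = await original(self, *args, **kwargs)
        if storage.mode == "record":
            storage.save(key, response)  # capture for later replay
        return response

    setattr(owner, name, wrapper)
    return lambda: setattr(owner, name, original)
```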

## Storage

Recordings use SQLite for indexing and JSON for storage:

```
recordings/
├── index.sqlite           # Fast lookup by request hash
└── responses/
    ├── abc123def456.json  # Individual response files
    └── def789ghi012.json
```

## Recording Modes

### LIVE Mode

Direct API calls, no recording/replay:

```python
with inference_recording(mode=InferenceMode.LIVE):
    response = await client.chat.completions.create(...)
```

### RECORD Mode

Captures API interactions:

```python
with inference_recording(mode=InferenceMode.RECORD, storage_dir="./recordings"):
    response = await client.chat.completions.create(...)
    # Response captured AND returned
```

### REPLAY Mode

Uses stored recordings:

```python
with inference_recording(mode=InferenceMode.REPLAY, storage_dir="./recordings"):
    response = await client.chat.completions.create(...)
    # Returns cached response, no API call
```

## Streaming Support

Streaming responses are captured completely before any chunks are yielded, then replayed as an async generator that matches the original API behavior.
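
In other words, replay looks roughly like the simplified sketch below (not the recorder's actual code), where `stored_chunks` stands in for the chunk list loaded from a recording:

```python
from typing import Any, AsyncIterator


async def replay_stream(stored_chunks: list[Any]) -> AsyncIterator[Any]:
    """Yield previously recorded chunks one at a time, mimicking a live streaming response."""
    for chunk in stored_chunks:
        yield chunk


# Consuming the replayed stream looks identical to consuming a live one:
#     async for chunk in replay_stream(chunks_loaded_from_json):
#         handle(chunk)
```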

## Environment Variables

```bash
export LLAMA_STACK_TEST_INFERENCE_MODE=replay
export LLAMA_STACK_TEST_RECORDING_DIR=/path/to/recordings
pytest tests/integration/
```

## Common Issues

- **"No recorded response found"** - Re-record with `RECORD` mode
- **Serialization errors** - Response types changed, re-record
- **Hash mismatches** - Request parameters changed slightly
docs/source/contributing/testing/troubleshooting.md (new file, 528 lines)

@@ -0,0 +1,528 @@
# Testing Troubleshooting Guide

This guide covers common issues encountered when working with Llama Stack's testing infrastructure and how to resolve them.

## Quick Diagnosis

### Test Status Quick Check

```bash
# Check if tests can run at all
uv run pytest tests/integration/inference/test_embedding.py::test_basic_embeddings -v

# Check available models and providers
uv run llama stack list-providers
uv run llama stack list-models

# Verify server connectivity
curl http://localhost:5001/v1/health
```

## Recording and Replay Issues

### "No recorded response found for request hash"

**Symptom:**
```
RuntimeError: No recorded response found for request hash: abc123def456
Endpoint: /v1/chat/completions
Model: meta-llama/Llama-3.1-8B-Instruct
```

**Causes and Solutions:**

1. **Missing recording** - Most common cause

   ```bash
   # Record the missing interaction
   LLAMA_STACK_TEST_INFERENCE_MODE=record \
   LLAMA_STACK_TEST_RECORDING_DIR=./test_recordings \
   pytest tests/integration/inference/test_failing.py -v
   ```

2. **Request parameters changed**

   ```bash
   # Check what changed by comparing requests
   sqlite3 test_recordings/index.sqlite \
     "SELECT request_hash, endpoint, model, timestamp FROM recordings WHERE endpoint='/v1/chat/completions';"

   # View specific request details
   cat test_recordings/responses/abc123def456.json | jq '.request'
   ```

3. **Different environment/provider**

   ```bash
   # Ensure a consistent test environment
   pytest tests/integration/ --stack-config=starter --text-model=llama3.2:3b
   ```

### Recording Failures

**Symptom:**
```
sqlite3.OperationalError: database is locked
```

**Solutions:**

1. **Concurrent access** - Multiple test processes

   ```bash
   # Run tests sequentially
   pytest tests/integration/ -n 1

   # Or use separate recording directories
   LLAMA_STACK_TEST_RECORDING_DIR=./recordings_$(date +%s) pytest ...
   ```

2. **Incomplete recording cleanup**

   ```bash
   # Clear and restart recording
   rm -rf test_recordings/
   LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/inference/test_specific.py
   ```

### Serialization/Deserialization Errors

**Symptom:**
```
Failed to deserialize object of type llama_stack.apis.inference.OpenAIChatCompletion
```

**Causes and Solutions:**

1. **API response format changed**

   ```bash
   # Re-record with the updated format
   rm test_recordings/responses/abc123*.json
   LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/inference/test_failing.py
   ```

2. **Missing dependencies for deserialization**

   ```bash
   # Ensure all required packages are installed
   uv sync --group dev
   ```

3. **Version mismatch between record and replay**

   ```bash
   # Check Python environment consistency
   uv run python -c "import llama_stack; print(llama_stack.__version__)"
   ```

## Server Connection Issues

### "Connection refused" Errors

**Symptom:**
```
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5001)
```

**Diagnosis and Solutions:**

1. **Server not running**

   ```bash
   # Check if the server is running
   curl http://localhost:5001/v1/health

   # Start the server manually for debugging
   llama stack run --template starter --port 5001
   ```

2. **Port conflicts**

   ```bash
   # Check what's using the port
   lsof -i :5001

   # Use a different port
   pytest tests/integration/ --stack-config=server:starter:8322
   ```

3. **Server startup timeout**

   ```bash
   # Increase the startup timeout or check server logs
   tail -f server.log

   # Manual server management
   llama stack run --template starter &
   sleep 30  # Wait for startup
   pytest tests/integration/
   ```

### Auto-Server Startup Issues

**Symptom:**
```
Server failed to respond within 30 seconds
```

**Solutions:**

1. **Check server logs**

   ```bash
   # Server logs are written to server.log
   tail -f server.log

   # Look for startup errors
   grep -i error server.log
   ```

2. **Dependencies missing**

   ```bash
   # Ensure all dependencies are installed
   uv sync --group dev

   # Check specific provider requirements
   pip list | grep -i fireworks
   ```

3. **Resource constraints**

   ```bash
   # Check system resources
   htop
   df -h

   # Use a lighter config for testing
   pytest tests/integration/ --stack-config=starter
   ```

## Provider and Model Issues

### "Model not found" Errors

**Symptom:**
```
Model 'meta-llama/Llama-3.1-8B-Instruct' not found
```

**Solutions:**

1. **Check available models**

   ```bash
   # List models for the current provider
   uv run llama stack list-models

   # Use an available model
   pytest tests/integration/ --text-model=llama3.2:3b
   ```

2. **Model not downloaded for local providers**

   ```bash
   # Download the missing model
   ollama pull llama3.2:3b

   # Verify the model is available
   ollama list
   ```

3. **Provider configuration issues**

   ```bash
   # Check provider setup
   uv run llama stack list-providers

   # Verify API keys are set
   echo $FIREWORKS_API_KEY
   ```

### Provider Authentication Failures

**Symptom:**
```
HTTP 401: Invalid API key
```

**Solutions:**

1. **Missing API keys**

   ```bash
   # Set the required API key
   export FIREWORKS_API_KEY=your_key_here
   export OPENAI_API_KEY=your_key_here

   # Verify the key is set
   echo $FIREWORKS_API_KEY
   ```

2. **Invalid API keys**

   ```bash
   # Test the API key directly
   curl -H "Authorization: Bearer $FIREWORKS_API_KEY" \
     https://api.fireworks.ai/inference/v1/models
   ```

3. **API key environment issues**

   ```bash
   # Pass the environment explicitly
   pytest tests/integration/ --env FIREWORKS_API_KEY=your_key
   ```

## Parametrization Issues

### "No tests ran matching the given pattern"

**Symptom:**
```
collected 0 items
```

**Causes and Solutions:**

1. **No models specified**

   ```bash
   # Specify the required models
   pytest tests/integration/inference/ --text-model=llama3.2:3b
   ```

2. **Model/provider mismatch**

   ```bash
   # Use a model compatible with the provider
   pytest tests/integration/ \
     --stack-config=starter \
     --text-model=llama3.2:3b  # Available in Ollama
   ```

3. **Missing fixtures**

   ```bash
   # Check test requirements
   pytest tests/integration/inference/test_embedding.py --collect-only
   ```

### Excessive Test Combinations

**Symptom:**
Tests run for too many parameter combinations, taking too long.

**Solutions:**

1. **Limit model combinations**

   ```bash
   # Test a single model instead of a list
   pytest tests/integration/ --text-model=llama3.2:3b
   ```

2. **Use specific test selection**

   ```bash
   # Run a specific test pattern
   pytest tests/integration/ -k "basic and not vision"
   ```

3. **Separate test runs**

   ```bash
   # Split by functionality
   pytest tests/integration/inference/ --text-model=model1
   pytest tests/integration/agents/ --text-model=model2
   ```

## Performance Issues

### Slow Test Execution

**Symptom:**
Tests take much longer than expected.

**Diagnosis and Solutions:**

1. **Using LIVE mode instead of REPLAY**

   ```bash
   # Verify the recording mode
   echo $LLAMA_STACK_TEST_INFERENCE_MODE

   # Force replay mode
   LLAMA_STACK_TEST_INFERENCE_MODE=replay pytest tests/integration/
   ```

2. **Network latency to providers**

   ```bash
   # Use local providers for development
   pytest tests/integration/ --stack-config=starter
   ```

3. **Large recording files**

   ```bash
   # Check the recording directory size
   du -sh test_recordings/

   # Clean up old recordings
   find test_recordings/ -name "*.json" -mtime +30 -delete
   ```

### Memory Usage Issues

**Symptom:**
```
MemoryError: Unable to allocate memory
```

**Solutions:**

1. **Large recordings in memory**

   ```bash
   # Run tests in smaller batches
   pytest tests/integration/inference/ -k "not batch"
   ```

2. **Model memory requirements**

   ```bash
   # Use smaller models for testing
   pytest tests/integration/ --text-model=llama3.2:3b  # Instead of 70B
   ```

## Environment Issues

### Python Environment Problems

**Symptom:**
```
ModuleNotFoundError: No module named 'llama_stack'
```

**Solutions:**

1. **Wrong Python environment**

   ```bash
   # Verify the uv environment
   uv run python -c "import llama_stack; print('OK')"

   # Reinstall if needed
   uv sync --group dev
   ```

2. **Development installation issues**

   ```bash
   # Reinstall in development mode
   pip install -e .

   # Verify the installation
   python -c "import llama_stack; print(llama_stack.__file__)"
   ```

### Path and Import Issues

**Symptom:**
```
ImportError: cannot import name 'LlamaStackClient'
```

**Solutions:**

1. **PYTHONPATH issues**

   ```bash
   # Run from the project root
   cd /path/to/llama-stack
   uv run pytest tests/integration/
   ```

2. **Relative import issues**

   ```python
   # Use absolute imports in tests
   from llama_stack_client import LlamaStackClient  # Not relative
   ```

## Debugging Techniques

### Verbose Logging

Enable detailed logging to understand what's happening:

```bash
# Enable debug logging
LLAMA_STACK_LOG_LEVEL=DEBUG pytest tests/integration/inference/test_failing.py -v -s

# Enable request/response logging
LLAMA_STACK_TEST_INFERENCE_MODE=live \
LLAMA_STACK_LOG_LEVEL=DEBUG \
pytest tests/integration/inference/test_failing.py -v -s
```

### Interactive Debugging

Drop into the debugger when tests fail:

```bash
# Run with pdb on failure
pytest tests/integration/inference/test_failing.py --pdb
```

```python
# Or add a breakpoint in the test code
def test_something(llama_stack_client):
    import pdb; pdb.set_trace()
    # ... test code
```

### Isolation Testing

Run tests in isolation to identify interactions:

```bash
# Run a single test
pytest tests/integration/inference/test_embedding.py::test_basic_embeddings

# Run without recordings
rm -rf test_recordings/
LLAMA_STACK_TEST_INFERENCE_MODE=live pytest tests/integration/inference/test_failing.py
```

### Recording Inspection

Examine recordings to understand what's stored:

```bash
# Check the recording database
sqlite3 test_recordings/index.sqlite ".tables"
sqlite3 test_recordings/index.sqlite ".schema recordings"
sqlite3 test_recordings/index.sqlite "SELECT * FROM recordings LIMIT 5;"

# Examine a specific recording
find test_recordings/responses/ -name "*.json" | head -1 | xargs cat | jq '.'

# Compare request hashes
python -c "
from llama_stack.testing.inference_recorder import normalize_request
print(normalize_request('POST', 'http://localhost:11434/v1/chat/completions', {}, {'model': 'llama3.2:3b', 'messages': [{'role': 'user', 'content': 'Hello'}]}))
"
```

## Getting Help

### Information to Gather

When reporting issues, include:

1. **Environment details:**

   ```bash
   uv run python --version
   uv run python -c "import llama_stack; print(llama_stack.__version__)"
   uv pip list
   ```

2. **Test command and output:**

   ```bash
   # Full command that failed
   pytest tests/integration/inference/test_failing.py -v

   # Error message and stack trace
   ```

3. **Configuration details:**

   ```bash
   # Stack configuration used
   echo $LLAMA_STACK_TEST_INFERENCE_MODE
   ls -la test_recordings/
   ```

4. **Provider status:**

   ```bash
   uv run llama stack list-providers
   uv run llama stack list-models
   ```

### Common Solutions Summary

| Issue | Quick Fix |
|-------|-----------|
| Missing recordings | `LLAMA_STACK_TEST_INFERENCE_MODE=record pytest ...` |
| Connection refused | Check the server: `curl http://localhost:5001/v1/health` |
| No tests collected | Add a model: `--text-model=llama3.2:3b` |
| Authentication error | Set the API key: `export PROVIDER_API_KEY=...` |
| Serialization error | Re-record: delete the stale `responses/*.json` files and rerun in `record` mode |
| Slow tests | Use replay: `LLAMA_STACK_TEST_INFERENCE_MODE=replay` |

Most testing issues stem from configuration mismatches or missing recordings. The record-replay system is designed to be forgiving, but it requires consistent environment setup to work well.
docs/source/contributing/testing/writing-tests.md (new file, 125 lines)

@@ -0,0 +1,125 @@
# Writing Tests

How to write effective tests for Llama Stack.

## Basic Test Pattern

```python
def test_basic_completion(llama_stack_client, text_model_id):
    """Test basic text completion functionality."""
    response = llama_stack_client.inference.completion(
        model_id=text_model_id,
        content=CompletionMessage(role="user", content="Hello"),
    )

    # Test structure, not AI output quality
    assert response.completion_message is not None
    assert isinstance(response.completion_message.content, str)
    assert len(response.completion_message.content) > 0
```

## Parameterized Tests

```python
@pytest.mark.parametrize("temperature", [0.0, 0.5, 1.0])
def test_completion_temperature(llama_stack_client, text_model_id, temperature):
    response = llama_stack_client.inference.completion(
        model_id=text_model_id,
        content=CompletionMessage(role="user", content="Hello"),
        sampling_params={"temperature": temperature},
    )
    assert response.completion_message is not None
```

## Provider-Specific Tests

```python
def test_asymmetric_embeddings(llama_stack_client, embedding_model_id):
    if embedding_model_id not in MODELS_SUPPORTING_TASK_TYPE:
        pytest.skip(f"Model {embedding_model_id} doesn't support task types")

    query_response = llama_stack_client.inference.embeddings(
        model_id=embedding_model_id,
        contents=["What is machine learning?"],
        task_type="query",
    )

    passage_response = llama_stack_client.inference.embeddings(
        model_id=embedding_model_id,
        contents=["Machine learning is a subset of AI..."],
        task_type="passage",
    )

    assert query_response.embeddings != passage_response.embeddings
```

## Fixtures

```python
@pytest.fixture(scope="session")
def agent_config(llama_stack_client, text_model_id):
    """Reusable agent configuration."""
    return {
        "model": text_model_id,
        "instructions": "You are a helpful assistant",
        "tools": [],
        "enable_session_persistence": False,
    }


@pytest.fixture(scope="function")
def fresh_session(llama_stack_client):
    """Each test gets fresh state."""
    session = llama_stack_client.create_session()
    yield session
    session.delete()
```
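
A minimal sketch of a test consuming these fixtures; it only exercises the fixture wiring (scopes and cleanup), not any real agent behavior:

```python
def test_uses_shared_config_and_fresh_session(agent_config, fresh_session):
    # agent_config is session-scoped: built once and shared across the whole test run.
    assert agent_config["model"]
    assert agent_config["enable_session_persistence"] is False

    # fresh_session is function-scoped: created for this test and deleted afterwards.
    assert fresh_session is not None
```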

## Common Test Patterns

### Streaming Tests

```python
def test_streaming_completion(llama_stack_client, text_model_id):
    stream = llama_stack_client.inference.completion(
        model_id=text_model_id,
        content=CompletionMessage(role="user", content="Count to 5"),
        stream=True,
    )

    chunks = list(stream)
    assert len(chunks) > 1
    assert all(hasattr(chunk, "delta") for chunk in chunks)
```

### Error Testing

```python
def test_invalid_model_error(llama_stack_client):
    with pytest.raises(Exception) as exc_info:
        llama_stack_client.inference.completion(
            model_id="nonexistent-model",
            content=CompletionMessage(role="user", content="Hello"),
        )
    assert "model" in str(exc_info.value).lower()
```

## What NOT to Test

```python
# BAD: Testing AI output quality
def test_completion_quality(llama_stack_client, text_model_id):
    response = llama_stack_client.inference.completion(...)
    assert "correct answer" in response.content  # Fragile!


# GOOD: Testing response structure
def test_completion_structure(llama_stack_client, text_model_id):
    response = llama_stack_client.inference.completion(...)
    assert isinstance(response.completion_message.content, str)
    assert len(response.completion_message.content) > 0
```

## Best Practices

- Test API contracts, not AI output quality
- Use descriptive test names
- Keep tests simple and focused
- Record new interactions only when needed
- Use appropriate fixture scopes (session vs. function)