# Llama Stack - Architecture Insights for Developers

## Why This Architecture Works

### Problem It Solves

Without Llama Stack, building AI applications requires:

- Learning a different API for each provider (OpenAI, Anthropic, Groq, Ollama, etc.)
- Rewriting code to switch providers
- Duplicating logic for common patterns (safety checks, vector search, etc.)
- Managing complex dependencies manually

### Solution: The Three Pillars

```
Single, Unified API Interface
            ↓
Multiple Provider Implementations
            ↓
Pre-configured Distributions
```

**Result**: Write once, run anywhere (locally, in the cloud, or on-device)

---
## The Genius of the Plugin Architecture

### How It Works

1. **Define Abstract Interface** (a Protocol in `apis/`)

   ```python
   class Inference(Protocol):
       async def post_chat_completion(...) -> AsyncIterator[...]: ...
   ```

2. **Multiple Implementations** (in `providers/`)

   - Local: Meta Reference, vLLM, Ollama
   - Cloud: OpenAI, Anthropic, Groq, Bedrock
   - Each implements the same interface

3. **Runtime Selection** (via YAML config)

   ```yaml
   providers:
     inference:
     - provider_type: remote::openai
   ```

4. **Zero Code Changes** to switch providers!

### Why This Beats Individual SDKs

- **Single SDK** vs 30+ provider SDKs
- **Same API** vs learning each provider's quirks
- **Easy migration** - change one config value
- **Testing** - the same tests work across all providers

---
## The Request Routing Intelligence

### Two Clever Routing Strategies

#### 1. Auto-Routed APIs (Smart Dispatch)

**APIs**: Inference, Safety, VectorIO, Eval, Scoring, DatasetIO, ToolRuntime

When you call:

```python
await inference.post_chat_completion(model="llama-2-7b")
```

the router automatically determines:

- "Which provider has llama-2-7b?"
- "Route this request there"
- "Stream the response back"

**Implementation**: the `routers/` directory contains the auto-routers.
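To make the dispatch concrete, here is a minimal sketch of the idea, not the actual code in `routers/`: the router holds a model-to-provider mapping built from the routing table and forwards each call. The `InferenceAutoRouter` class and `providers_by_model` mapping are illustrative names, not identifiers from the codebase.

```python
from typing import Any, AsyncIterator, Protocol


class InferenceProvider(Protocol):
    async def post_chat_completion(self, model: str, **kwargs: Any) -> AsyncIterator[Any]: ...


class InferenceAutoRouter:
    """Illustrative auto-router: dispatch each call to whichever provider serves the model."""

    def __init__(self, providers_by_model: dict[str, InferenceProvider]):
        # Mapping built at startup from the routing table, e.g. {"llama-2-7b": ollama_impl}
        self._providers_by_model = providers_by_model

    async def post_chat_completion(self, model: str, **kwargs: Any) -> AsyncIterator[Any]:
        provider = self._providers_by_model.get(model)
        if provider is None:
            raise ValueError(f"No provider registered for model '{model}'")
        # Delegate to the selected provider; streaming chunks pass straight through.
        return await provider.post_chat_completion(model=model, **kwargs)
```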
#### 2. Routing Table APIs (Registry Pattern)

**APIs**: Models, Shields, VectorStores, Datasets, Benchmarks, ToolGroups, ScoringFunctions

When you call:

```python
model_list = await models.list_models()  # Merged list from ALL providers
```

the router:

- Queries each provider
- Merges the results
- Returns a unified list

**Implementation**: the `routing_tables/` directory.
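A registry-pattern routing table can be sketched in the same spirit; again, this is illustrative rather than the real `routing_tables/` code, and the names are made up.

```python
import asyncio
from typing import Any, Protocol


class ModelsProvider(Protocol):
    async def list_models(self) -> list[Any]: ...


class ModelsRoutingTable:
    """Illustrative routing table: fan a list call out to every provider and merge the results."""

    def __init__(self, providers: dict[str, ModelsProvider]):
        # provider_id -> provider implementation, as declared in run.yaml
        self._providers = providers

    async def list_models(self) -> list[Any]:
        # Query all providers concurrently and flatten into one unified list.
        per_provider = await asyncio.gather(*(p.list_models() for p in self._providers.values()))
        return [model for models in per_provider for model in models]
```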
### Why This Matters

- **Users don't think about providers** - they just use the API
- **Multiple implementations coexist** - the router handles dispatch
- **Easy scaling** - add new providers without touching user code
- **Resource management** - the router knows what's available

---
## Configuration as a Weapon

### The Power of YAML Over Code

Traditional approach:

```python
# Code changes needed for each provider!
if use_openai:
    from openai import OpenAI
    client = OpenAI(api_key=...)
elif use_ollama:
    from ollama import Client
    client = Client(url=...)
# etc.
```

Llama Stack approach:

```yaml
# Zero code changes!
providers:
  inference:
  - provider_type: remote::openai
    config:
      api_key: ${env.OPENAI_API_KEY}
```

Then later, change to:

```yaml
providers:
  inference:
  - provider_type: remote::ollama
    config:
      host: localhost
```

**Same application code** works with both!

### Environment Variable Magic

```bash
# Change the model at runtime
INFERENCE_MODEL=llama-2-70b llama stack run starter

# No redeployment needed!
```

---
## The Distributions Strategy

### Problem: "Works on My Machine"

- Different developers need different setups
- Production needs different providers than development
- CI/CD needs lightweight dependencies

### Solution: Pre-verified Distributions

```
starter             → Works on CPU with local or hosted providers (Ollama, OpenAI)
starter-gpu         → Works on GPU machines
meta-reference-gpu  → Works with a full local setup
postgres-demo       → Production-grade with persistent storage
```

Each distribution:

- Pre-selects working providers
- Sets sensible defaults
- Bundles required dependencies
- Is tested end-to-end

**Result**: `llama stack run starter` just works for 80% of use cases.

### Why This Beats Documentation

- **No setup guides needed** - the distribution does it
- **No guessing** - curated, tested combinations
- **Reproducible** - the same distro always works the same way
- **Upgradeable** - update the distro to get improvements

---
## The Testing Genius: Record-Replay

### Traditional Testing Hell for AI

Problem:

- API calls cost money
- API responses are non-deterministic
- Each provider has a different response format
- Tests become slow and flaky

### The Record-Replay Solution

First run (record):

```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/
# Makes real API calls, saves responses to YAML
```

All subsequent runs (replay):

```bash
pytest tests/integration/
# Returns cached responses, NO API calls, instant results
```

### Why This Is Brilliant

- **Cost**: record once, replay 1000x - save thousands of dollars
- **Speed**: cached responses mean instant test execution
- **Reliability**: deterministic results (no API variability)
- **Coverage**: one test works with OpenAI, Ollama, Anthropic, etc.

**File location**: `tests/integration/[api]/cassettes/`
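Conceptually, record-replay boils down to a cache keyed by the request. The sketch below is not llama-stack's actual test harness; the cassette path, environment-variable handling, and `call_provider` hook are assumptions made purely for illustration.

```python
# Conceptual sketch of record-replay (NOT the actual llama-stack test harness).
import hashlib
import json
import os
from pathlib import Path
from typing import Any, Awaitable, Callable

import yaml

CASSETTE_DIR = Path("tests/integration/inference/cassettes")  # assumed layout


async def record_or_replay(request: dict[str, Any], call_provider: Callable[[], Awaitable[Any]]) -> Any:
    """Replay a cached response if one exists; otherwise call the provider and record it."""
    key = hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()
    cassette = CASSETTE_DIR / f"{key}.yaml"

    if os.environ.get("LLAMA_STACK_TEST_INFERENCE_MODE") != "record" and cassette.exists():
        return yaml.safe_load(cassette.read_text())["response"]

    response = await call_provider()  # real (costly, non-deterministic) API call
    cassette.parent.mkdir(parents=True, exist_ok=True)
    cassette.write_text(yaml.safe_dump({"request": request, "response": response}))
    return response
```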
---

## Core Runtime: The Stack Class

### The Elegance of Inheritance

```python
class LlamaStack(
    Inference,  # Chat completion, embeddings
    Agents,     # Multi-turn orchestration
    Safety,     # Content filtering
    VectorIO,   # Vector operations
    Tools,      # Function execution
    Eval,       # Evaluation
    Scoring,    # Response scoring
    Models,     # Model registry
    # ... 19 more APIs
):
    pass
```

A single `LlamaStack` instance:

- Implements 27 different APIs
- Is backed by 50+ providers
- Routes requests intelligently
- Manages dependencies

All from a ~400-line file plus lots of protocol definitions!
## Dependency Injection Without the Complexity

### How Providers Depend on Each Other

Problem: Agents need Inference, and Inference needs the Models registry.

```python
class AgentProvider:
    def __init__(
        self,
        inference: InferenceProvider,
        safety: SafetyProvider,
        tool_runtime: ToolRuntimeProvider,
    ):
        self.inference = inference
        self.safety = safety
        self.tool_runtime = tool_runtime
```

### How It Gets Resolved

**File**: `core/resolver.py`

1. Parse `run.yaml` - which providers are enabled?
2. Build the dependency graph - who depends on whom?
3. Topologically sort it - in what order to instantiate?
4. Instantiate in that order - each provider receives its dependencies

**Result**: Complex dependency chains are handled automatically - see the sketch below.
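The four steps map naturally onto a topological sort. The sketch below is a minimal illustration of that idea, not the real `core/resolver.py`; the `specs` shape and `resolve` function are invented for the example.

```python
# Minimal sketch of dependency resolution (illustrative; not the real core/resolver.py).
from graphlib import TopologicalSorter
from typing import Any


def resolve(specs: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Instantiate providers in dependency order.

    specs maps an API name to {"deps": [api, ...], "factory": callable(deps_dict) -> impl}.
    """
    # Steps 1-2: build the dependency graph from the parsed config.
    graph = {api: set(spec["deps"]) for api, spec in specs.items()}

    impls: dict[str, Any] = {}
    # Step 3: topological sort gives a safe instantiation order.
    for api in TopologicalSorter(graph).static_order():
        spec = specs[api]
        # Step 4: each provider is constructed with the already-built impls it depends on.
        impls[api] = spec["factory"]({dep: impls[dep] for dep in spec["deps"]})
    return impls


# Usage: inference has no deps; agents depends on inference.
impls = resolve({
    "inference": {"deps": [], "factory": lambda deps: object()},
    "agents": {"deps": ["inference"], "factory": lambda deps: {"uses": deps["inference"]}},
})
```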
---

## The Client Duality

### Two Ways to Use Llama Stack

#### 1. Library Mode (In-Process)

```python
from llama_stack import AsyncLlamaStackAsLibraryClient

client = await AsyncLlamaStackAsLibraryClient.create(run_config)
response = await client.inference.post_chat_completion(...)
```

- No HTTP overhead
- Direct Python API
- Embedded in the application
- **File**: `core/library_client.py`

#### 2. Server Mode (HTTP)

```bash
llama stack run starter  # Start the server on port 8321
```

```python
from llama_stack_client import AsyncLlamaStackClient

client = AsyncLlamaStackClient(base_url="http://localhost:8321")
response = await client.inference.post_chat_completion(...)
```

- Distributed architecture
- Share a single server across apps
- Easy deployment
- Language-agnostic clients (Python, TypeScript, Swift, Kotlin)

**Result**: Same API, different deployment strategies!

---
## The Model System Insight

### Why It Exists

Problem: model IDs differ across providers.

- HuggingFace: `meta-llama/Llama-2-7b`
- Ollama: `llama2`
- OpenAI: `gpt-4`

### Solution: Universal Model Registry

**File**: `models/llama/sku_list.py`

```python
resolve_model("meta-llama/Llama-2-7b")
# Returns a Model object with:
# - Architecture info
# - Tokenizer
# - Quantization options
# - Resource requirements
```

This allows:

- Consistent model IDs across providers
- Intelligent resource allocation
- Provider-agnostic inference

---
## The CLI Is Smart

### It Does More Than You Think

```bash
llama stack run starter
```

This command:

1. Resolves the starter distribution template
2. Merges it with environment variables
3. Creates/updates `~/.llama/distributions/starter/run.yaml`
4. Installs missing dependencies
5. Starts the HTTP server on port 8321
6. Initializes all providers
7. Registers available models
8. Is ready for requests

**No separate build step needed!** (unless building Docker images)

### Introspection Commands

```bash
llama stack list-apis          # See all 27 APIs
llama stack list-providers     # See all 50+ providers
llama stack list               # See all distributions
llama stack list-deps starter  # See what to install
```

Used for documentation, debugging, and automation.

---
## Storage: The Oft-Overlooked Component

### Three Storage Types

1. **KV Store** - metadata (models, shields)
2. **SQL Store** - structured data (conversations, datasets)
3. **Inference Store** - caching (for testing)

### Why Multiple Backends Matter

- Development: SQLite (no dependencies)
- Production: PostgreSQL (scalable)
- Distributed: Redis (shared state)
- Testing: in-memory (fast)
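The payoff of the abstraction is that backends are interchangeable behind one small interface. The sketch below is illustrative only; it does not reproduce the actual interfaces in `core/storage/datatypes.py`, and the names are made up.

```python
# Illustrative sketch of a storage abstraction with swappable backends.
import sqlite3
from typing import Protocol


class KVStore(Protocol):
    def get(self, key: str) -> str | None: ...
    def set(self, key: str, value: str) -> None: ...


class InMemoryKVStore:
    """Fast, throwaway backend for tests."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> str | None:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class SqliteKVStore:
    """Zero-dependency backend for local development."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

    def get(self, key: str) -> str | None:
        row = self._conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    def set(self, key: str, value: str) -> None:
        self._conn.execute("INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value))
        self._conn.commit()
```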
**Files**:

- `core/storage/datatypes.py` - interfaces
- `providers/utils/kvstore/` - implementations
- `providers/utils/sqlstore/` - implementations

---
## Telemetry: Built-In Observability

### What Gets Traced

- Every API call
- Token usage (if the provider supports it)
- Latency
- Errors
- Custom metrics from providers

### Integration

- OpenTelemetry compatible
- Automatic context propagation
- Works across async boundaries
- **File**: `providers/utils/telemetry/`
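Because the telemetry is OpenTelemetry compatible, application code can wrap stack calls in its own spans with the standard OTel SDK. The snippet below is plain OpenTelemetry usage, not llama-stack's internal telemetry code; the span and attribute names are made up.

```python
# Plain OpenTelemetry usage around a stack call (illustrative names).
from opentelemetry import trace

tracer = trace.get_tracer("my-app")


async def traced_chat(client, model: str, messages: list) -> object:
    # Spans emitted by the stack propagate under this parent span,
    # including across async boundaries.
    with tracer.start_as_current_span("chat_completion") as span:
        span.set_attribute("llm.model", model)
        return await client.inference.post_chat_completion(model=model, messages=messages)
```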
---

## Extension Strategy: How to Add Custom Functionality

### Adding a Custom API

1. Create a protocol in `apis/my_api/my_api.py` (see the sketch below)
2. Implement providers (inline and/or remote)
3. Register it in `core/resolver.py`
4. Add it to distributions
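A minimal custom API protocol might look like this. The `Summarization` API, its request/response models, and the module path are hypothetical; the Protocol-plus-Pydantic shape follows the pattern described in this document, though the built-in APIs may carry additional decorators and metadata.

```python
# Hypothetical custom API protocol (names and module path are illustrative).
from typing import Protocol, runtime_checkable

from pydantic import BaseModel


class SummarizeRequest(BaseModel):
    document: str
    max_sentences: int = 3


class SummarizeResponse(BaseModel):
    summary: str


@runtime_checkable
class Summarization(Protocol):
    async def summarize(self, request: SummarizeRequest) -> SummarizeResponse: ...
```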
### Adding a Custom Provider

1. Create a module in `providers/[inline|remote]/[api]/[provider]/`
2. Implement the config and adapter classes
3. Register it in `providers/registry/[api].py`
4. Use it in a distribution YAML

### Adding a Custom Distribution

1. Create a subdirectory in `distributions/[name]/`
2. Implement the template in `[name].py`
3. Register it in distribution discovery

---
## Common Misconceptions Clarified

### "APIs are HTTP endpoints"

**Wrong** - APIs are Python protocols; HTTP comes later via FastAPI.

- The "Inference" API is just a Python Protocol
- Providers implement it
- Core wraps it with HTTP for server mode
- Library mode uses it directly

### "Providers are all external services"

**Wrong** - providers can be:

- Inline (local execution): Meta Reference, FAISS, Llama Guard
- Remote (external services): OpenAI, Ollama, Qdrant

Inline providers have low latency and no dependency on external services.

### "You must run a server"

**Wrong** - there are two modes:

- Server mode: `llama stack run starter` (HTTP)
- Library mode: import and use directly in Python

### "Distributions are just Docker images"

**Wrong** - distributions are:

- Templates (which providers to use)
- Configs (how to configure them)
- Dependencies (what to install)
- And they can run as Docker containers OR local Python

---
## Performance Implications

### Inline Providers Are Fast

```
Inline (e.g., Meta Reference)
├─ 0ms network latency
├─ No HTTP serialization/deserialization
├─ Direct GPU access
└─ Fast (but high resource cost)

Remote (e.g., OpenAI)
├─ 100-500ms network latency
├─ HTTP serialization overhead
├─ Low resource cost
└─ Slower (but cheap)
```

### Streaming Is Native

```python
response = await inference.post_chat_completion(model=..., stream=True)
async for chunk in response:
    print(chunk.delta)  # Process token by token
```

Tokens arrive as they are generated; there is no waiting for the full response.

---
## Security Considerations

### API Keys Are Config

```yaml
inference:
- provider_id: openai
  config:
    api_key: ${env.OPENAI_API_KEY}  # From the environment
```

Keys are never hardcoded; they always come from environment variables.

### Access Control

**File**: `core/access_control/`

Providers can implement access rules:

- Per-user restrictions
- Per-model restrictions
- Rate limiting
- Audit logging
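As an illustration of what such a rule might look like, here is a deny-by-default check with per-model restrictions. This is not the `core/access_control/` implementation; the `User` and `AccessRule` shapes are invented for the example.

```python
# Illustrative access-rule check (not the actual core/access_control code).
from dataclasses import dataclass, field


@dataclass
class User:
    user_id: str
    roles: set[str] = field(default_factory=set)


@dataclass
class AccessRule:
    resource: str           # e.g. "model::llama-2-70b"
    allowed_roles: set[str]


def is_allowed(user: User, resource: str, rules: list[AccessRule]) -> bool:
    """Deny by default; allow only if a rule for the resource matches one of the user's roles."""
    for rule in rules:
        if rule.resource == resource and (user.roles & rule.allowed_roles):
            return True
    return False


# Example: only the "ml-team" role may use the 70B model.
rules = [AccessRule(resource="model::llama-2-70b", allowed_roles={"ml-team"})]
print(is_allowed(User("alice", {"ml-team"}), "model::llama-2-70b", rules))  # True
```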
### Sensitive Field Redaction

Config logging automatically redacts:

- API keys
- Passwords
- Tokens
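The idea behind redaction is simple enough to sketch, though this is not llama-stack's actual implementation and the set of sensitive key fragments is an assumption based on the list above.

```python
# Illustrative redaction helper (not llama-stack's actual implementation).
from typing import Any

SENSITIVE_FRAGMENTS = ("api_key", "password", "token", "secret")  # assumed list


def redact(config: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the config safe for logging, masking sensitive values recursively."""
    redacted: dict[str, Any] = {}
    for key, value in config.items():
        if isinstance(value, dict):
            redacted[key] = redact(value)
        elif any(fragment in key.lower() for fragment in SENSITIVE_FRAGMENTS):
            redacted[key] = "********"
        else:
            redacted[key] = value
    return redacted


print(redact({"provider_type": "remote::openai", "config": {"api_key": "sk-..."}}))
# {'provider_type': 'remote::openai', 'config': {'api_key': '********'}}
```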
---

## Maturity Indicators

### Signs of Production-Ready Design

1. **Separated Concerns** - APIs, Providers, Distributions
2. **Plugin Architecture** - Easy to extend
3. **Configuration Over Code** - Deploy without recompiling
4. **Comprehensive Testing** - Unit + Integration with record-replay
5. **Multiple Client Options** - Library + Server modes
6. **Storage Abstraction** - Multiple backends
7. **Dependency Management** - Automatic resolution
8. **Error Handling** - Structured, informative errors
9. **Observability** - Built-in telemetry
10. **Documentation** - Distributions + CLI introspection

Llama Stack has all 10!

---
## Key Architectural Decisions

### Why Async/Await Throughout?

- Modern Python standard
- Works well with streaming
- Natural for I/O-heavy operations (API calls, GPU operations)

### Why Pydantic for Config?

- Type validation
- Auto-documentation
- JSON schema generation
- Easy serialization
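For example, a provider config expressed as a Pydantic model gets all four benefits for free (assuming Pydantic v2; the `OpenAIConfig` class and its fields are illustrative, not the actual llama-stack definitions).

```python
# Illustrative provider config model (assumes Pydantic v2).
from pydantic import BaseModel, Field


class OpenAIConfig(BaseModel):
    api_key: str = Field(description="API key, typically injected via ${env.OPENAI_API_KEY}")
    base_url: str = "https://api.openai.com/v1"
    timeout_seconds: int = Field(default=60, ge=1)


# Type validation: bad values fail loudly at startup instead of at request time.
cfg = OpenAIConfig(api_key="sk-...", timeout_seconds=30)

# JSON schema generation: useful for documentation and validating config entries.
schema = OpenAIConfig.model_json_schema()

# Easy serialization (redact secrets before logging in real code).
print(cfg.model_dump_json())
```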
### Why Protocol Classes for APIs?

- Define the interface without an implementation
- Multiple implementations are possible
- Type hints work with duck typing
- Minimal magic

### Why YAML for Config?

- Human readable
- Environment variable support
- Comments allowed
- Wide tool support

### Why Record-Replay for Tests?

- Cost efficient
- Deterministic
- Captures real behavior
- Provider-agnostic

---
## The Learning Path for Contributors

### Understanding Order

1. **Start**: `pyproject.toml` - entry point
2. **Learn**: `core/datatypes.py` - data structures
3. **Understand**: `apis/inference/inference.py` - an example API
4. **See**: `providers/registry/inference.py` - the provider registry
5. **Read**: `providers/inline/inference/meta_reference/` - an inline provider
6. **Read**: `providers/remote/inference/openai/` - a remote provider
7. **Study**: `core/resolver.py` - how it all connects
8. **Understand**: `core/stack.py` - the main orchestrator
9. **See**: `distributions/starter/` - how to use it
10. **Run**: `tests/integration/` - how to test

Each step builds on the previous one.

---
## The Elegant Parts

### Most Elegant: The Router

The router system is beautiful:

- Transparent to users
- Automatic provider selection
- Works with 1 or 100 providers
- No hardcoding needed

### Most Flexible: YAML Config

Configuration as a first-class citizen:

- Switch providers without code changes
- Override at runtime
- Version-control friendly
- Documentation via config

### Most Useful: Record-Replay Tests

The testing pattern solves real problems:

- Cost
- Speed
- Reliability
- Coverage

### Most Scalable: Distribution Templates

Pre-configured bundles:

- One command to start
- Verified combinations
- Easy to document
- Simple to teach

---
## The Future

### What's Being Built

- More providers (Nvidia, SambaNova, etc.)
- More APIs (more task types)
- On-device execution (ExecuTorch)
- Better observability (more telemetry)
- Easier extensions (a simpler API for custom providers)

### How It Stays Maintainable

- Protocol-based design limits coupling
- Clear separation of concerns
- Comprehensive testing
- Configuration over code
- Plugin architecture

The architecture is **future-proof** by design.