# Llama Stack - Architecture Insights for Developers
## Why This Architecture Works
### Problem It Solves
Without Llama Stack, building AI applications requires:
- Learning different APIs for each provider (OpenAI, Anthropic, Groq, Ollama, etc.)
- Rewriting code to switch providers
- Duplicating logic for common patterns (safety checks, vector search, etc.)
- Managing complex dependencies manually
### Solution: The Three Pillars
```
Single, Unified API Interface
Multiple Provider Implementations
Pre-configured Distributions
```
**Result**: Write once, run anywhere (locally, in the cloud, or on-device)
---
## The Genius of the Plugin Architecture
### How It Works
1. **Define Abstract Interface** (Protocol in `apis/`)
```python
class Inference(Protocol):
    async def post_chat_completion(...) -> AsyncIterator[...]: ...
```
2. **Multiple Implementations** (in `providers/`)
- Local: Meta Reference, vLLM, Ollama
- Cloud: OpenAI, Anthropic, Groq, Bedrock
- Each implements the same interface (see the sketch after this list)
3. **Runtime Selection** (via YAML config)
```yaml
providers:
  inference:
  - provider_type: remote::openai
```
4. **Zero Code Changes** to switch providers!
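The whole pattern fits in a few lines of Python. The sketch below is illustrative only: `OpenAIInference`, `OllamaInference`, and `ask` are hypothetical stand-ins, not the real classes under `providers/`.
```python
from typing import Protocol


class Inference(Protocol):
    """The abstract interface: anything with this shape qualifies as a provider."""

    async def post_chat_completion(self, model: str, messages: list[dict]) -> dict: ...


# Hypothetical provider implementations (illustrative names, not the real code).
class OpenAIInference:
    async def post_chat_completion(self, model: str, messages: list[dict]) -> dict:
        return {"provider": "openai", "model": model, "content": "..."}


class OllamaInference:
    async def post_chat_completion(self, model: str, messages: list[dict]) -> dict:
        return {"provider": "ollama", "model": model, "content": "..."}


async def ask(backend: Inference, prompt: str, model: str) -> dict:
    # Application code depends only on the Protocol, never on a concrete SDK.
    return await backend.post_chat_completion(model, [{"role": "user", "content": prompt}])
```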
### Why This Beats Individual SDKs
- **Single SDK** vs 30+ provider SDKs
- **Same API** vs learning each provider's quirks
- **Easy migration** - change 1 config value
- **Testing** - same tests work across all providers
---
## The Request Routing Intelligence
### Two Clever Routing Strategies
#### 1. Auto-Routed APIs (Smart Dispatch)
**APIs**: Inference, Safety, VectorIO, Eval, Scoring, DatasetIO, ToolRuntime
When you call:
```python
await inference.post_chat_completion(model="llama-2-7b")
```
Router automatically determines:
- "Which provider has llama-2-7b?"
- "Route this request there"
- "Stream response back"
**Implementation**: `routers/` directory contains auto-routers
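A rough sketch of the dispatch idea, using a hypothetical `InferenceRouter` that maps model IDs to providers (the real implementation in `routers/` is more involved):
```python
# Hypothetical auto-router: look up which provider serves the model, then delegate.
class InferenceRouter:
    def __init__(self, providers_by_model: dict[str, object]):
        # e.g. {"llama-2-7b": ollama_provider, "gpt-4": openai_provider}
        self.providers_by_model = providers_by_model

    async def post_chat_completion(self, model: str, **kwargs):
        provider = self.providers_by_model.get(model)
        if provider is None:
            raise ValueError(f"No registered provider serves model {model!r}")
        return await provider.post_chat_completion(model=model, **kwargs)
```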
#### 2. Routing Table APIs (Registry Pattern)
**APIs**: Models, Shields, VectorStores, Datasets, Benchmarks, ToolGroups, ScoringFunctions
When you call:
```python
model_list = await models.list_models() # Merged list from ALL providers
```
Router:
- Queries each provider
- Merges results
- Returns unified list
**Implementation**: `routing_tables/` directory
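The registry pattern boils down to roughly the following, sketched with a hypothetical `ModelsRoutingTable` rather than the actual class in `routing_tables/`:
```python
# Hypothetical routing table: query every provider and merge the results.
class ModelsRoutingTable:
    def __init__(self, providers: list):
        self.providers = providers

    async def list_models(self) -> list:
        merged = []
        for provider in self.providers:
            merged.extend(await provider.list_models())
        return merged
```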
### Why This Matters
- **Users don't think about providers** - just use the API
- **Multiple implementations work** - router handles dispatch
- **Easy scaling** - add new providers without touching user code
- **Resource management** - router knows what's available
---
## Configuration as a Weapon
### The Power of YAML Over Code
Traditional approach:
```python
# Code changes needed for each provider!
if use_openai:
    from openai import OpenAI
    client = OpenAI(api_key=...)
elif use_ollama:
    from ollama import Client
    client = Client(url=...)
# etc.
```
Llama Stack approach:
```yaml
# Zero code changes!
providers:
  inference:
  - provider_type: remote::openai
    config:
      api_key: ${env.OPENAI_API_KEY}
```
Then later, change to:
```yaml
providers:
  inference:
  - provider_type: remote::ollama
    config:
      host: localhost
```
**Same application code** works with both!
### Environment Variable Magic
```bash
# Change the model at runtime
INFERENCE_MODEL=llama-2-70b llama stack run starter
# No redeployment needed!
```
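The `${env.VAR}` references in the YAML are resolved against the process environment. A minimal sketch of that substitution, assuming a simple regex-based loader (`substitute_env` is a hypothetical helper, not the stack's actual config code):
```python
import os
import re

# Replace ${env.VAR_NAME} placeholders with values from the environment.
_ENV_PATTERN = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(value: str) -> str:
    return _ENV_PATTERN.sub(lambda m: os.environ.get(m.group(1), ""), value)


# substitute_env("${env.OPENAI_API_KEY}") -> the contents of $OPENAI_API_KEY
```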
---
## The Distributions Strategy
### Problem: "Works on My Machine"
- Different developers need different setups
- Production needs different providers than development
- CI/CD needs lightweight dependencies
### Solution: Pre-verified Distributions
```
starter → Works on CPU with free APIs (Ollama + OpenAI)
starter-gpu → Works on GPU machines
meta-reference-gpu → Works with full local setup
postgres-demo → Production-grade with persistent storage
```
Each distribution:
- Pre-selects working providers
- Sets sensible defaults
- Bundles required dependencies
- Tested end-to-end
**Result**: `llama stack run starter` just works for 80% of use cases
### Why This Beats Documentation
- **No setup guides needed** - distribution does it
- **No guessing** - curated, tested combinations
- **Reproducible** - same distro always works same way
- **Upgradeable** - update distro = get improvements
---
## The Testing Genius: Record-Replay
### Traditional Testing Hell for AI
Problem:
- API calls cost money
- API responses are non-deterministic
- Each provider has different response formats
- Tests become slow and flaky
### The Record-Replay Solution
First run (record):
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/
# Makes real API calls, saves responses to YAML
```
All subsequent runs (replay):
```bash
pytest tests/integration/
# Returns cached responses, NO API calls, instant results
```
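Conceptually, the pattern is a cache keyed by request. The sketch below illustrates the idea with a hypothetical `chat_with_cassette` helper and a local `cassettes/` directory; it is not the project's actual test harness.
```python
import json
import os
from pathlib import Path

CASSETTE_DIR = Path("cassettes")  # hypothetical location for recorded responses


async def chat_with_cassette(client, key: str, **request):
    path = CASSETTE_DIR / f"{key}.json"
    if os.environ.get("LLAMA_STACK_TEST_INFERENCE_MODE") == "record":
        response = await client.post_chat_completion(**request)  # real API call
        CASSETTE_DIR.mkdir(exist_ok=True)
        path.write_text(json.dumps(response))
        return response
    return json.loads(path.read_text())  # replay: no network, deterministic
```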
### Why This is Brilliant
- **Cost**: Record once, replay 1000x. Save thousands of dollars
- **Speed**: Cached responses = instant test execution
- **Reliability**: Deterministic results (no API variability)
- **Coverage**: One test works with OpenAI, Ollama, Anthropic, etc.
**File location**: `tests/integration/[api]/cassettes/`
---
## Core Runtime: The Stack Class
### The Elegance of Inheritance
```python
class LlamaStack(
    Inference,   # Chat completion, embeddings
    Agents,      # Multi-turn orchestration
    Safety,      # Content filtering
    VectorIO,    # Vector operations
    Tools,       # Function execution
    Eval,        # Evaluation
    Scoring,     # Response scoring
    Models,      # Model registry
    # ... 19 more APIs
):
    pass
```
A single `LlamaStack` instance:
- Implements 27 different APIs
- Has 50+ providers backing it
- Routes requests intelligently
- Manages dependencies
All from a file of roughly 400 lines, plus the protocol definitions!
---
## Dependency Injection Without the Complexity
### How Providers Depend on Each Other
Problem: the Agents API needs Inference, and Inference needs the Models registry
```python
class AgentProvider:
    def __init__(self,
                 inference: InferenceProvider,
                 safety: SafetyProvider,
                 tool_runtime: ToolRuntimeProvider):
        self.inference = inference
        self.safety = safety
        self.tool_runtime = tool_runtime
```
### How It Gets Resolved
**File**: `core/resolver.py`
1. Parse `run.yaml` - which providers enabled?
2. Build dependency graph - who depends on whom?
3. Topological sort - what order to instantiate?
4. Instantiate in order - each gets its dependencies
**Result**: Complex dependency chains handled automatically!
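Python's standard library already provides the hard part. A minimal sketch of the resolution idea using `graphlib`, with hypothetical provider names (the real `core/resolver.py` also handles configs, registries, and error cases):
```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each provider lists what it needs.
dependencies = {
    "agents": {"inference", "safety", "tool_runtime"},
    "safety": {"inference"},
    "inference": set(),
    "tool_runtime": set(),
}

instances: dict[str, object] = {}
for name in TopologicalSorter(dependencies).static_order():
    # By the time we reach `name`, all of its dependencies are already built.
    deps = {d: instances[d] for d in dependencies[name]}
    instances[name] = {"provider": name, "deps": deps}  # stand-in for the real constructor
```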
---
## The Client Duality
### Two Ways to Use Llama Stack
#### 1. Library Mode (In-Process)
```python
from llama_stack import AsyncLlamaStackAsLibraryClient
client = await AsyncLlamaStackAsLibraryClient.create(run_config)
response = await client.inference.post_chat_completion(...)
```
- No HTTP overhead
- Direct Python API
- Embedded in application
- **File**: `core/library_client.py`
#### 2. Server Mode (HTTP)
```bash
llama stack run starter # Start server on port 8321
```
```python
from llama_stack_client import AsyncLlamaStackClient
client = AsyncLlamaStackClient(base_url="http://localhost:8321")
response = await client.inference.post_chat_completion(...)
```
- Distributed architecture
- Share single server across apps
- Easy deployment
- Language-agnostic clients (Python, TypeScript, Swift, Kotlin)
**Result**: Same API, different deployment strategies!
---
## The Model System Insight
### Why It Exists
Problem: Different model IDs across providers
- HuggingFace: `meta-llama/Llama-2-7b`
- Ollama: `llama2`
- OpenAI: `gpt-4`
### Solution: Universal Model Registry
**File**: `models/llama/sku_list.py`
```python
resolve_model("meta-llama/Llama-2-7b")
# Returns Model object with:
# - Architecture info
# - Tokenizer
# - Quantization options
# - Resource requirements
```
Allows:
- Consistent model IDs across providers
- Intelligent resource allocation
- Provider-agnostic inference
---
## The CLI Is Smart
### It Does More Than You Think
```bash
llama stack run starter
```
This command:
1. Resolves the starter distribution template
2. Merges with environment variables
3. Creates/updates `~/.llama/distributions/starter/run.yaml`
4. Installs missing dependencies
5. Starts HTTP server on port 8321
6. Initializes all providers
7. Registers available models
8. Ready for requests
**No separate build step needed!** (unless building Docker images)
### Introspection Commands
```bash
llama stack list-apis # See all 27 APIs
llama stack list-providers # See all 50+ providers
llama stack list # See all distributions
llama stack list-deps starter # See what to install
```
Used for documentation, debugging, and automation
---
## Storage: The Oft-Overlooked Component
### Three Storage Types
1. **KV Store** - Metadata (models, shields)
2. **SQL Store** - Structured (conversations, datasets)
3. **Inference Store** - Caching (for testing)
### Why Multiple Backends Matter
- Development: SQLite (no dependencies)
- Production: PostgreSQL (scalable)
- Distributed: Redis (shared state)
- Testing: In-memory (fast)
**Files**:
- `core/storage/datatypes.py` - Interfaces
- `providers/utils/kvstore/` - KV store implementations
- `providers/utils/sqlstore/` - SQL store implementations
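A sketch of what the abstraction buys you, with a hypothetical `KVStore` protocol and an in-memory backend (the real interfaces live in `core/storage/datatypes.py`):
```python
from typing import Protocol


class KVStore(Protocol):
    async def get(self, key: str) -> str | None: ...
    async def set(self, key: str, value: str) -> None: ...


class InMemoryKVStore:
    """Testing backend: fast, nothing persisted."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    async def get(self, key: str) -> str | None:
        return self._data.get(key)

    async def set(self, key: str, value: str) -> None:
        self._data[key] = value


# A SQLite- or PostgreSQL-backed class implements the same two methods,
# so callers never know which backend the distribution configured.
```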
---
## Telemetry: Built-In Observability
### What Gets Traced
- Every API call
- Token usage (if provider supports it)
- Latency
- Errors
- Custom metrics from providers
### Integration
- OpenTelemetry compatible
- Automatic context propagation
- Works across async boundaries
- **File**: `providers/utils/telemetry/`
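As a rough illustration of what "OpenTelemetry compatible" looks like from calling code, here is a hypothetical wrapper that traces one inference call. It uses the standard `opentelemetry` Python API, not the stack's internal telemetry helpers.
```python
from opentelemetry import trace

tracer = trace.get_tracer("llama_stack.example")  # hypothetical tracer name


async def traced_chat(inference, **request):
    # One span per API call; attributes become searchable metadata in your backend.
    with tracer.start_as_current_span("inference.post_chat_completion") as span:
        span.set_attribute("model", request.get("model", "unknown"))
        return await inference.post_chat_completion(**request)
```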
---
## Extension Strategy: How to Add Custom Functionality
### Adding a Custom API
1. Create protocol in `apis/my_api/my_api.py`
2. Implement providers (inline and/or remote)
3. Register in `core/resolver.py`
4. Add to distributions
### Adding a Custom Provider
1. Create module in `providers/[inline|remote]/[api]/[provider]/`
2. Implement config and adapter classes (sketched below)
3. Register in `providers/registry/[api].py`
4. Use in distribution YAML
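A hypothetical remote inference provider might look like the sketch below: a Pydantic config class plus an adapter that implements the API protocol. Names and fields are illustrative, not a template copied from the repository.
```python
from pydantic import BaseModel


class MyInferenceConfig(BaseModel):
    base_url: str
    api_key: str | None = None


class MyInferenceAdapter:
    def __init__(self, config: MyInferenceConfig) -> None:
        self.config = config

    async def post_chat_completion(self, model: str, messages: list[dict]) -> dict:
        # Forward the request to the external service at config.base_url.
        return {"model": model, "content": "..."}
```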
### Adding a Custom Distribution
1. Create subdirectory in `distributions/[name]/`
2. Implement template in `[name].py`
3. Register in distribution discovery
---
## Common Misconceptions Clarified
### "APIs are HTTP endpoints"
**Wrong** - APIs are Python protocols. HTTP comes later via FastAPI.
- The "Inference" API is just a Python Protocol
- Providers implement it
- Core wraps it with HTTP for server mode
- Library mode uses it directly
### "Providers are all external services"
**Wrong** - Providers can be:
- Inline (local execution): Meta Reference, FAISS, Llama Guard
- Remote (external services): OpenAI, Ollama, Qdrant
Inline providers have low latency and no dependency on external services.
### "You must run a server"
**Wrong** - Two modes:
- Server mode: `llama stack run starter` (HTTP)
- Library mode: Import and use directly in Python
### "Distributions are just Docker images"
**Wrong** - Distributions are:
- Templates (what providers to use)
- Configs (how to configure them)
- Dependencies (what to install)
- Can be Docker OR local Python
---
## Performance Implications
### Inline Providers Are Fast
```
Inline (e.g., Meta Reference)
├─ 0ms network latency
├─ No HTTP serialization/deserialization
├─ Direct GPU access
└─ Fast (but high resource cost)

Remote (e.g., OpenAI)
├─ 100-500ms network latency
├─ HTTP serialization overhead
├─ Low resource cost
└─ Slower (but cheap)
```
### Streaming Is Native
```python
response = await inference.post_chat_completion(model=..., stream=True)
async for chunk in response:
    print(chunk.delta)  # Process token by token
```
Tokens arrive as they are generated, with no waiting for the full response.
---
## Security Considerations
### API Keys Are Config
```yaml
inference:
- provider_id: openai
  config:
    api_key: ${env.OPENAI_API_KEY}  # From environment
```
Never hardcoded, always from env vars.
### Access Control
**File**: `core/access_control/`
Providers can implement access rules:
- Per-user restrictions
- Per-model restrictions
- Rate limiting
- Audit logging
### Sensitive Field Redaction
Config logging automatically redacts:
- API keys
- Passwords
- Tokens
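The idea is simple enough to sketch in a few lines; the key names below are assumptions, not the stack's actual redaction list.
```python
SENSITIVE_KEYS = {"api_key", "password", "token"}  # hypothetical field list


def redact(config: dict) -> dict:
    """Return a copy of the config that is safe to log, with secret values masked."""
    return {
        key: ("********" if key in SENSITIVE_KEYS
              else redact(value) if isinstance(value, dict) else value)
        for key, value in config.items()
    }


# redact({"api_key": "sk-...", "host": "localhost"})
# -> {"api_key": "********", "host": "localhost"}
```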
---
## Maturity Indicators
### Signs of Production-Ready Design
1. **Separated Concerns** - APIs, Providers, Distributions
2. **Plugin Architecture** - Easy to extend
3. **Configuration Over Code** - Deploy without recompiling
4. **Comprehensive Testing** - Unit + Integration with record-replay
5. **Multiple Client Options** - Library + Server modes
6. **Storage Abstraction** - Multiple backends
7. **Dependency Management** - Automatic resolution
8. **Error Handling** - Structured, informative errors
9. **Observability** - Built-in telemetry
10. **Documentation** - Distributions + CLI introspection
Llama Stack has all 10!
---
## Key Architectural Decisions
### Why Async/Await Throughout?
- Modern Python standard
- Works well with streaming
- Natural for I/O-heavy operations (API calls, GPU operations)
### Why Pydantic for Config?
- Type validation
- Auto-documentation
- JSON schema generation
- Easy serialization
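For example, a provider config expressed as a Pydantic model (the class and fields below are hypothetical) validates on load and can emit its own JSON schema:
```python
from pydantic import BaseModel, Field


class OpenAIProviderConfig(BaseModel):  # hypothetical config model
    api_key: str
    base_url: str = Field(default="https://api.openai.com/v1")


# OpenAIProviderConfig.model_validate({"api_key": "sk-..."}) fails fast on bad types;
# OpenAIProviderConfig.model_json_schema() generates a documentation-ready schema.
```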
### Why Protocol Classes for APIs?
- Define interface without implementation
- Multiple implementations possible
- Type hints work with duck typing
- Minimal magic
### Why YAML for Config?
- Human readable
- Environment variable support
- Comments allowed
- Wide tool support
### Why Record-Replay for Tests?
- Cost efficient
- Deterministic
- Real behavior captured
- Provider-agnostic
---
## The Learning Path for Contributors
### Understanding Order
1. **Start**: `pyproject.toml` - Entry point
2. **Learn**: `core/datatypes.py` - Data structures
3. **Understand**: `apis/inference/inference.py` - Example API
4. **See**: `providers/registry/inference.py` - Provider registry
5. **Read**: `providers/inline/inference/meta_reference/` - Inline provider
6. **Read**: `providers/remote/inference/openai/` - Remote provider
7. **Study**: `core/resolver.py` - How it all connects
8. **Understand**: `core/stack.py` - Main orchestrator
9. **See**: `distributions/starter/` - How to use it
10. **Run**: `tests/integration/` - How to test
Each step builds on previous understanding.
---
## The Elegant Parts
### Most Elegant: The Router
The router system is beautiful:
- Transparent to users
- Automatic provider selection
- Works with 1 or 100 providers
- No hardcoding needed
### Most Flexible: YAML Config
Configuration as first-class citizen:
- Switch providers without code
- Override at runtime
- Version control friendly
- Documentation via config
### Most Useful: Record-Replay Tests
Testing pattern solves real problems:
- Cost
- Speed
- Reliability
- Coverage
### Most Scalable: Distribution Templates
Pre-configured bundles:
- One command to start
- Verified combinations
- Easy to document
- Simple to teach
---
## The Future
### What's Being Built
- More providers (NVIDIA, SambaNova, etc.)
- More APIs (more task types)
- On-device execution (ExecuTorch)
- Better observability (more telemetry)
- Easier extensions (simpler API for custom providers)
### How It Stays Maintainable
- Protocol-based design limits coupling
- Clear separation of concerns
- Comprehensive testing
- Configuration over code
- Plugin architecture
The architecture is **future-proof** by design.