Llama Stack - Architecture Insights for Developers
Why This Architecture Works
Problem It Solves
Without Llama Stack, building AI applications requires:
- Learning different APIs for each provider (OpenAI, Anthropic, Groq, Ollama, etc.)
- Rewriting code to switch providers
- Duplicating logic for common patterns (safety checks, vector search, etc.)
- Managing complex dependencies manually
Solution: The Three Pillars
Single, Unified API Interface
↓
Multiple Provider Implementations
↓
Pre-configured Distributions
Result: Write once, run anywhere (locally, in the cloud, or on-device)
The Genius of the Plugin Architecture
How It Works
- Define Abstract Interface (Protocol in apis/)
  class Inference(Protocol):
      async def post_chat_completion(...) -> AsyncIterator[...]: ...
- Multiple Implementations (in providers/)
  - Local: Meta Reference, vLLM, Ollama
  - Cloud: OpenAI, Anthropic, Groq, Bedrock
  - Each implements the same interface
- Runtime Selection (via YAML config)
  providers:
    inference:
      - provider_type: remote::openai
- Zero Code Changes to switch providers!
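A minimal sketch of that pattern, with made-up class names and a simplified signature (the real protocol in apis/ is richer): one Protocol, two interchangeable providers, and application code that only ever sees the Protocol.

from typing import Protocol

class Inference(Protocol):
    async def post_chat_completion(self, model: str, messages: list[dict]) -> str: ...

class OllamaProvider:
    # Would call a local Ollama server in the real stack; stubbed for illustration
    async def post_chat_completion(self, model: str, messages: list[dict]) -> str:
        return f"(ollama) {model} reply"

class OpenAIProvider:
    # Would call the OpenAI API in the real stack; stubbed for illustration
    async def post_chat_completion(self, model: str, messages: list[dict]) -> str:
        return f"(openai) {model} reply"

# Written once against the Protocol; which concrete provider gets injected
# is decided by YAML config at startup, never by this code.
async def ask(provider: Inference, question: str) -> str:
    return await provider.post_chat_completion(
        model="llama-2-7b",
        messages=[{"role": "user", "content": question}],
    )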
Why This Beats Individual SDKs
- Single SDK vs 30+ provider SDKs
- Same API vs learning each provider's quirks
- Easy migration - change 1 config value
- Testing - same tests work across all providers
The Request Routing Intelligence
Two Clever Routing Strategies
1. Auto-Routed APIs (Smart Dispatch)
APIs: Inference, Safety, VectorIO, Eval, Scoring, DatasetIO, ToolRuntime
When you call:
await inference.post_chat_completion(model="llama-2-7b")
Router automatically determines:
- "Which provider has llama-2-7b?"
- "Route this request there"
- "Stream response back"
Implementation: routers/ directory contains auto-routers
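Roughly, the auto-router keeps a map from model ID to provider and forwards each call; a simplified sketch (not the actual code in routers/):

class InferenceRouter:
    """Dispatches each request to whichever provider serves the requested model."""

    def __init__(self, providers_by_model: dict[str, object]):
        # e.g. {"llama-2-7b": ollama_provider, "gpt-4": openai_provider}
        self.providers_by_model = providers_by_model

    async def post_chat_completion(self, model: str, **kwargs):
        provider = self.providers_by_model.get(model)
        if provider is None:
            raise ValueError(f"No provider registered for model {model!r}")
        # Callers never see which provider actually handled the request
        return await provider.post_chat_completion(model=model, **kwargs)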
2. Routing Table APIs (Registry Pattern)
APIs: Models, Shields, VectorStores, Datasets, Benchmarks, ToolGroups, ScoringFunctions
When you call:
all_models = await models.list_models() # Merged list from ALL providers
Router:
- Queries each provider
- Merges results
- Returns unified list
Implementation: routing_tables/ directory
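In the same illustrative spirit (again, not the actual routing_tables/ code), a registry-style API fans the call out to every provider and merges the results:

import asyncio

class ModelsRoutingTable:
    """Registry-style API: the answer is the union of every provider's answer."""

    def __init__(self, providers: list):
        self.providers = providers

    async def list_models(self) -> list:
        # Ask every provider concurrently, then flatten into one unified list
        per_provider = await asyncio.gather(*(p.list_models() for p in self.providers))
        return [model for models in per_provider for model in models]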
Why This Matters
- Users don't think about providers - just use the API
- Multiple implementations work - router handles dispatch
- Easy scaling - add new providers without touching user code
- Resource management - router knows what's available
Configuration as a Weapon
The Power of YAML Over Code
Traditional approach:
# Code changes needed for each provider!
if use_openai:
    from openai import OpenAI
    client = OpenAI(api_key=...)
elif use_ollama:
    from ollama import Client
    client = Client(url=...)
# etc.
Llama Stack approach:
# Zero code changes!
providers:
  inference:
    - provider_type: remote::openai
      config:
        api_key: ${env.OPENAI_API_KEY}
Then later, change to:
providers:
  inference:
    - provider_type: remote::ollama
      config:
        host: localhost
Same application code works with both!
Environment Variable Magic
# Change provider at runtime
INFERENCE_MODEL=llama-2-70b llama stack run starter
# No redeployment needed!
The Distributions Strategy
Problem: "Works on My Machine"
- Different developers need different setups
- Production needs different providers than development
- CI/CD needs lightweight dependencies
Solution: Pre-verified Distributions
starter → Works on CPU with free APIs (Ollama + OpenAI)
starter-gpu → Works on GPU machines
meta-reference-gpu → Works with full local setup
postgres-demo → Production-grade with persistent storage
Each distribution:
- Pre-selects working providers
- Sets sensible defaults
- Bundles required dependencies
- Tested end-to-end
Result: llama stack run starter just works for 80% of use cases
Why This Beats Documentation
- No setup guides needed - distribution does it
- No guessing - curated, tested combinations
- Reproducible - same distro always works same way
- Upgradeable - update distro = get improvements
The Testing Genius: Record-Replay
Traditional Testing Hell for AI
Problem:
- API calls cost money
- API responses are non-deterministic
- Each provider has different response formats
- Tests become slow and flaky
The Record-Replay Solution
First run (record):
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/
# Makes real API calls, saves responses to YAML
All subsequent runs (replay):
pytest tests/integration/
# Returns cached responses, NO API calls, instant results
Why This is Brilliant
- Cost: Record once, replay 1000x. Save thousands of dollars
- Speed: Cached responses = instant test execution
- Reliability: Deterministic results (no API variability)
- Coverage: One test works with OpenAI, Ollama, Anthropic, etc.
File location: tests/integration/[api]/cassettes/
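The mechanics can be sketched as a cache keyed by a hash of the request; the real recordings in those cassette files use a different format, so treat the file layout and environment variable handling below as assumptions:

import hashlib
import json
import os
from pathlib import Path

CASSETTE_DIR = Path("cassettes")
MODE = os.environ.get("LLAMA_STACK_TEST_INFERENCE_MODE", "replay")  # "record" or "replay"

async def cached_chat_completion(client, **request):
    # Key each cassette by a stable hash of the full request payload
    key = hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()
    cassette = CASSETTE_DIR / f"{key}.json"

    if MODE != "record" and cassette.exists():
        return json.loads(cassette.read_text())               # replay: no API call, deterministic

    response = await client.post_chat_completion(**request)   # record: real, costly call
    CASSETTE_DIR.mkdir(exist_ok=True)
    cassette.write_text(json.dumps(response))                  # save for all future replays
    return response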
Core Runtime: The Stack Class
The Elegance of Inheritance
class LlamaStack(
    Inference,  # Chat completion, embeddings
    Agents,     # Multi-turn orchestration
    Safety,     # Content filtering
    VectorIO,   # Vector operations
    Tools,      # Function execution
    Eval,       # Evaluation
    Scoring,    # Response scoring
    Models,     # Model registry
    # ... 19 more APIs
):
    pass
A single LlamaStack instance:
- Implements 27 different APIs
- Has 50+ providers backing it
- Routes requests intelligently
- Manages dependencies
All from a ~400 line file + lots of protocol definitions!
Dependency Injection Without the Complexity
How Providers Depend on Each Other
Problem: Agents need Inference, Inference needs Models registry
class AgentProvider:
    def __init__(self,
                 inference: InferenceProvider,
                 safety: SafetyProvider,
                 tool_runtime: ToolRuntimeProvider):
        self.inference = inference
        self.safety = safety
        self.tool_runtime = tool_runtime
How It Gets Resolved
File: core/resolver.py
- Parse run.yaml - which providers are enabled?
- Build dependency graph - who depends on whom?
- Topological sort - what order to instantiate?
- Instantiate in order - each gets its dependencies
Result: Complex dependency chains handled automatically!
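A condensed sketch of that resolution step (illustrative, not the actual core/resolver.py code), using the standard library's graphlib to order instantiation:

from graphlib import TopologicalSorter

# Who depends on whom: a tiny, made-up subset of the real graph
dependencies = {
    "models": [],
    "tool_runtime": [],
    "inference": ["models"],
    "safety": ["inference"],
    "agents": ["inference", "safety", "tool_runtime"],
}

def build_provider(name: str, deps: dict) -> str:
    # Stand-in for constructing the real provider class with its resolved dependencies
    return f"<{name} provider wired to {sorted(deps)}>"

instances: dict[str, str] = {}
for name in TopologicalSorter(dependencies).static_order():
    deps = {d: instances[d] for d in dependencies[name]}
    instances[name] = build_provider(name, deps)

# "agents" is built last, after everything it depends on already exists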
The Client Duality
Two Ways to Use Llama Stack
1. Library Mode (In-Process)
from llama_stack import AsyncLlamaStackAsLibraryClient
client = await AsyncLlamaStackAsLibraryClient.create(run_config)
response = await client.inference.post_chat_completion(...)
- No HTTP overhead
- Direct Python API
- Embedded in application
- File: core/library_client.py
2. Server Mode (HTTP)
llama stack run starter # Start server on port 8321
from llama_stack_client import AsyncLlamaStackClient
client = AsyncLlamaStackClient(base_url="http://localhost:8321")
response = await client.inference.post_chat_completion(...)
- Distributed architecture
- Share single server across apps
- Easy deployment
- Language-agnostic clients (Python, TypeScript, Swift, Kotlin)
Result: Same API, different deployment strategies!
The Model System Insight
Why It Exists
Problem: Different model IDs across providers
- HuggingFace: meta-llama/Llama-2-7b
- Ollama: llama2
- OpenAI: gpt-4
Solution: Universal Model Registry
File: models/llama/sku_list.py
resolve_model("meta-llama/Llama-2-7b")
# Returns Model object with:
# - Architecture info
# - Tokenizer
# - Quantization options
# - Resource requirements
Allows:
- Consistent model IDs across providers
- Intelligent resource allocation
- Provider-agnostic inference
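A toy version of such a registry (the field names are illustrative; the real entries in models/llama/sku_list.py carry far more metadata):

from dataclasses import dataclass, field

@dataclass
class Model:
    canonical_id: str
    architecture: str
    aliases: dict[str, str] = field(default_factory=dict)  # provider name -> provider-specific ID

REGISTRY = {
    "meta-llama/Llama-2-7b": Model(
        canonical_id="meta-llama/Llama-2-7b",
        architecture="llama2-7b",
        aliases={"huggingface": "meta-llama/Llama-2-7b", "ollama": "llama2"},
    ),
}

def resolve_model(model_id: str) -> Model:
    # Callers always use the canonical ID; each provider translates via its alias
    return REGISTRY[model_id]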
The CLI Is Smart
It Does More Than You Think
llama stack run starter
This command:
- Resolves the starter distribution template
- Merges with environment variables
- Creates/updates ~/.llama/distributions/starter/run.yaml
- Installs missing dependencies
- Starts HTTP server on port 8321
- Initializes all providers
- Registers available models
- Ready for requests
No separate build step needed! (unless building Docker images)
Introspection Commands
llama stack list-apis # See all 27 APIs
llama stack list-providers # See all 50+ providers
llama stack list # See all distributions
llama stack list-deps starter # See what to install
Used for documentation, debugging, and automation
Storage: The Oft-Overlooked Component
Three Storage Types
- KV Store - Metadata (models, shields)
- SQL Store - Structured (conversations, datasets)
- Inference Store - Caching (for testing)
Why Multiple Backends Matter
- Development: SQLite (no dependencies)
- Production: PostgreSQL (scalable)
- Distributed: Redis (shared state)
- Testing: In-memory (fast)
Files:
- core/storage/datatypes.py - Interfaces
- providers/utils/kvstore/ - Implementations
- providers/utils/sqlstore/ - Implementations
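The shape of the KV abstraction, sketched with made-up names (the real interfaces live in core/storage/datatypes.py): a tiny protocol that SQLite, PostgreSQL, Redis, or a plain dict can all satisfy.

from typing import Protocol

class KVStore(Protocol):
    async def get(self, key: str) -> str | None: ...
    async def set(self, key: str, value: str) -> None: ...

class InMemoryKVStore:
    """Testing backend: fast, dependency-free, nothing persisted."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    async def get(self, key: str) -> str | None:
        return self._data.get(key)

    async def set(self, key: str, value: str) -> None:
        self._data[key] = value

# A SQLite-, PostgreSQL-, or Redis-backed class with the same two methods can be
# swapped in through config without touching the code that stores metadata.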
Telemetry: Built-In Observability
What Gets Traced
- Every API call
- Token usage (if provider supports it)
- Latency
- Errors
- Custom metrics from providers
Integration
- OpenTelemetry compatible
- Automatic context propagation
- Works across async boundaries
- File: providers/utils/telemetry/
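Conceptually, every API call is wrapped in a span that records latency and errors; a bare-bones stand-in (the real code exports through OpenTelemetry rather than printing):

import time
from contextlib import asynccontextmanager

@asynccontextmanager
async def traced(span_name: str, attributes: dict | None = None):
    """Minimal span: measure duration and note whether the wrapped call raised."""
    start = time.perf_counter()
    error = None
    try:
        yield
    except Exception as exc:
        error = repr(exc)
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        print({"span": span_name, "ms": round(duration_ms, 1),
               "error": error, **(attributes or {})})

# Usage: async with traced("inference.post_chat_completion", {"model": "llama-2-7b"}): ...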
Extension Strategy: How to Add Custom Functionality
Adding a Custom API
- Create protocol in apis/my_api/my_api.py
- Implement providers (inline and/or remote)
- Register in core/resolver.py
- Add to distributions
Adding a Custom Provider
- Create module in providers/[inline|remote]/[api]/[provider]/
- Implement config and adapter classes (see the sketch after this list)
- Register in providers/registry/[api].py
- Use in distribution YAML
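Hedged against the actual provider interfaces (the class and field names below are made up), the config-and-adapter step usually amounts to a config model plus an adapter that implements the API protocol:

from pydantic import BaseModel

class MyInferenceConfig(BaseModel):
    # Whatever your backend needs; Pydantic validates it from the distribution YAML
    host: str = "localhost"
    port: int = 11434
    api_key: str | None = None

class MyInferenceAdapter:
    """Implements the Inference protocol on top of a custom backend (stubbed here)."""

    def __init__(self, config: MyInferenceConfig):
        self.config = config

    async def post_chat_completion(self, model: str, messages: list[dict]) -> str:
        # Real code would call your backend at self.config.host:self.config.port
        return f"(custom backend at {self.config.host}) {model} reply"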
Adding a Custom Distribution
- Create subdirectory in distributions/[name]/
- Implement template in [name].py
- Register in distribution discovery
Common Misconceptions Clarified
"APIs are HTTP endpoints"
Wrong - APIs are Python protocols. HTTP comes later via FastAPI.
- The "Inference" API is just a Python Protocol
- Providers implement it
- Core wraps it with HTTP for server mode
- Library mode uses it directly
"Providers are all external services"
Wrong - Providers can be:
- Inline (local execution): Meta Reference, FAISS, Llama Guard
- Remote (external services): OpenAI, Ollama, Qdrant
Inline providers have low latency and no dependency on external services.
"You must run a server"
Wrong - Two modes:
- Server mode: llama stack run starter (HTTP)
- Library mode: Import and use directly in Python
"Distributions are just Docker images"
Wrong - Distributions are:
- Templates (what providers to use)
- Configs (how to configure them)
- Dependencies (what to install)
- Can be Docker OR local Python
Performance Implications
Inline Providers Are Fast
Inline (e.g., Meta Reference)
├─ 0ms network latency
├─ No HTTP serialization/deserialization
├─ Direct GPU access
└─ Fast (but high resource cost)
Remote (e.g., OpenAI)
├─ 100-500ms network latency
├─ HTTP serialization overhead
├─ Low resource cost
└─ Slower (but cheap)
Streaming Is Native
response = await inference.post_chat_completion(model=..., stream=True)
async for chunk in response:
    print(chunk.delta)  # Process token by token
Tokens arrive as they're generated, no waiting for full response.
Security Considerations
API Keys Are Config
inference:
  - provider_id: openai
    config:
      api_key: ${env.OPENAI_API_KEY}  # From environment
Never hardcoded, always from env vars.
Access Control
File: core/access_control/
Providers can implement access rules:
- Per-user restrictions
- Per-model restrictions
- Rate limiting
- Audit logging
Sensitive Field Redaction
Config logging automatically redacts:
- API keys
- Passwords
- Tokens
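One straightforward way to implement that kind of redaction (a sketch, not the stack's actual code):

SENSITIVE_KEYS = {"api_key", "password", "token"}

def redact(config: dict) -> dict:
    """Return a copy of the config that is safe to log, with secret values masked."""
    safe = {}
    for key, value in config.items():
        if isinstance(value, dict):
            safe[key] = redact(value)              # recurse into nested sections
        elif key.lower() in SENSITIVE_KEYS:
            safe[key] = "********"
        else:
            safe[key] = value
    return safe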
Maturity Indicators
Signs of Production-Ready Design
- Separated Concerns - APIs, Providers, Distributions
- Plugin Architecture - Easy to extend
- Configuration Over Code - Deploy without recompiling
- Comprehensive Testing - Unit + Integration with record-replay
- Multiple Client Options - Library + Server modes
- Storage Abstraction - Multiple backends
- Dependency Management - Automatic resolution
- Error Handling - Structured, informative errors
- Observability - Built-in telemetry
- Documentation - Distributions + CLI introspection
Llama Stack has all 10!
Key Architectural Decisions
Why Async/Await Throughout?
- Modern Python standard
- Works well with streaming
- Natural for I/O-heavy operations (API calls, GPU operations)
Why Pydantic for Config?
- Type validation
- Auto-documentation
- JSON schema generation
- Easy serialization
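For example (illustrative field names, not an actual Llama Stack config class), a few lines of Pydantic buy validation, defaults, and a generated JSON schema:

from pydantic import BaseModel, Field

class OpenAIProviderConfig(BaseModel):
    api_key: str = Field(description="Filled from ${env.OPENAI_API_KEY} at load time")
    base_url: str = "https://api.openai.com/v1"
    timeout_seconds: int = 60

# Bad values fail fast at startup instead of deep inside a request:
# OpenAIProviderConfig(api_key="sk-...", timeout_seconds="soon") raises a ValidationError.
print(OpenAIProviderConfig.model_json_schema())  # auto-generated documentation (Pydantic v2)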
Why Protocol Classes for APIs?
- Define interface without implementation
- Multiple implementations possible
- Type hints work with duck typing
- Minimal magic
Why YAML for Config?
- Human readable
- Environment variable support
- Comments allowed
- Wide tool support
Why Record-Replay for Tests?
- Cost efficient
- Deterministic
- Real behavior captured
- Provider-agnostic
The Learning Path for Contributors
Understanding Order
- Start: pyproject.toml - Entry point
- Learn: core/datatypes.py - Data structures
- Understand: apis/inference/inference.py - Example API
- See: providers/registry/inference.py - Provider registry
- Read: providers/inline/inference/meta_reference/ - Inline provider
- Read: providers/remote/inference/openai/ - Remote provider
- Study: core/resolver.py - How it all connects
- Understand: core/stack.py - Main orchestrator
- See: distributions/starter/ - How to use it
- Run: tests/integration/ - How to test
Each step builds on previous understanding.
The Elegant Parts
Most Elegant: The Router
The router system is beautiful:
- Transparent to users
- Automatic provider selection
- Works with 1 or 100 providers
- No hardcoding needed
Most Flexible: YAML Config
Configuration as first-class citizen:
- Switch providers without code
- Override at runtime
- Version control friendly
- Documentation via config
Most Useful: Record-Replay Tests
Testing pattern solves real problems:
- Cost
- Speed
- Reliability
- Coverage
Most Scalable: Distribution Templates
Pre-configured bundles:
- One command to start
- Verified combinations
- Easy to document
- Simple to teach
The Future
What's Being Built
- More providers (Nvidia, SambaNova, etc.)
- More APIs (more task types)
- On-device execution (ExecuTorch)
- Better observability (more telemetry)
- Easier extensions (simpler API for custom providers)
How It Stays Maintainable
- Protocol-based design limits coupling
- Clear separation of concerns
- Comprehensive testing
- Configuration over code
- Plugin architecture
The architecture is future-proof by design.