# Llama Stack - Architecture Insights for Developers

## Why This Architecture Works

### Problem It Solves

Without Llama Stack, building AI applications requires:

- Learning different APIs for each provider (OpenAI, Anthropic, Groq, Ollama, etc.)
- Rewriting code to switch providers
- Duplicating logic for common patterns (safety checks, vector search, etc.)
- Managing complex dependencies manually

### Solution: The Three Pillars

```
Single, Unified API Interface
            ↓
Multiple Provider Implementations
            ↓
Pre-configured Distributions
```

**Result**: Write once, run anywhere (locally, in the cloud, on-device)

---

## The Genius of the Plugin Architecture

### How It Works

1. **Define Abstract Interface** (Protocol in `apis/`)

   ```python
   class Inference(Protocol):
       async def post_chat_completion(...) -> AsyncIterator[...]: ...
   ```

2. **Multiple Implementations** (in `providers/`)
   - Local: Meta Reference, vLLM, Ollama
   - Cloud: OpenAI, Anthropic, Groq, Bedrock
   - Each implements the same interface

3. **Runtime Selection** (via YAML config)

   ```yaml
   providers:
     inference:
       - provider_type: remote::openai
   ```

4. **Zero Code Changes** to switch providers!

### Why This Beats Individual SDKs

- **Single SDK** vs 30+ provider SDKs
- **Same API** vs learning each provider's quirks
- **Easy migration** - change one config value
- **Testing** - the same tests work across all providers

---

## The Request Routing Intelligence

### Two Clever Routing Strategies

#### 1. Auto-Routed APIs (Smart Dispatch)

**APIs**: Inference, Safety, VectorIO, Eval, Scoring, DatasetIO, ToolRuntime

When you call:

```python
await inference.post_chat_completion(model="llama-2-7b")
```

the router automatically:

- Determines which provider has `llama-2-7b`
- Routes the request there
- Streams the response back

**Implementation**: the `routers/` directory contains the auto-routers

#### 2. Routing Table APIs (Registry Pattern)

**APIs**: Models, Shields, VectorStores, Datasets, Benchmarks, ToolGroups, ScoringFunctions

When you call:

```python
models = await models.list_models()  # Merged list from ALL providers
```

the router:

- Queries each provider
- Merges the results
- Returns a unified list

**Implementation**: the `routing_tables/` directory

### Why This Matters

- **Users don't think about providers** - just use the API
- **Multiple implementations work** - the router handles dispatch
- **Easy scaling** - add new providers without touching user code
- **Resource management** - the router knows what's available

---

## Configuration as a Weapon

### The Power of YAML Over Code

Traditional approach:

```python
# Code changes needed for each provider!
if use_openai:
    from openai import OpenAI
    client = OpenAI(api_key=...)
elif use_ollama:
    from ollama import Client
    client = Client(url=...)
# etc.
```

Llama Stack approach:

```yaml
# Zero code changes!
providers:
  inference:
    - provider_type: remote::openai
      config:
        api_key: ${env.OPENAI_API_KEY}
```

Then later, change to:

```yaml
providers:
  inference:
    - provider_type: remote::ollama
      config:
        host: localhost
```

**The same application code** works with both!

### Environment Variable Magic

```bash
# Change the model at runtime
INFERENCE_MODEL=llama-2-70b llama stack run starter
# No redeployment needed!
```
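
To make the `${env.VAR}` idea concrete, here is a minimal sketch of how such placeholders might be expanded when a `run.yaml` is loaded. This illustrates the pattern only; it is not Llama Stack's actual config loader, and the `expand_env` / `load_run_config` helpers are hypothetical names.

```python
# Illustrative sketch of "configuration over code": load a run.yaml,
# expand ${env.VAR} placeholders, and hand the selected provider config
# to the application. NOT the real Llama Stack loader, just the pattern.
import os
import re

import yaml  # pip install pyyaml

_ENV_PATTERN = re.compile(r"\$\{env\.([A-Z0-9_]+)\}")


def expand_env(value: str) -> str:
    """Replace ${env.NAME} placeholders with values from the environment."""
    return _ENV_PATTERN.sub(lambda m: os.environ.get(m.group(1), ""), value)


def load_run_config(path: str) -> dict:
    with open(path) as f:
        raw = yaml.safe_load(f)

    # Walk the config tree and expand placeholders in every string leaf.
    def walk(node):
        if isinstance(node, dict):
            return {k: walk(v) for k, v in node.items()}
        if isinstance(node, list):
            return [walk(v) for v in node]
        if isinstance(node, str):
            return expand_env(node)
        return node

    return walk(raw)


config = load_run_config("run.yaml")
inference_providers = config["providers"]["inference"]
print(inference_providers[0]["provider_type"])  # e.g. "remote::openai" or "remote::ollama"
```

The application only ever reads the expanded config; which provider (and which credentials) it gets is decided entirely by YAML plus the environment.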

---

## The Distributions Strategy

### Problem: "Works on My Machine"

- Different developers need different setups
- Production needs different providers than development
- CI/CD needs lightweight dependencies

### Solution: Pre-verified Distributions

```
starter             → Works on CPU with Ollama and hosted APIs (e.g., OpenAI)
starter-gpu         → Works on GPU machines
meta-reference-gpu  → Works with a full local setup
postgres-demo       → Production-grade with persistent storage
```

Each distribution:

- Pre-selects working providers
- Sets sensible defaults
- Bundles required dependencies
- Is tested end-to-end

**Result**: `llama stack run starter` just works for 80% of use cases

### Why This Beats Documentation

- **No setup guides needed** - the distribution does it
- **No guessing** - curated, tested combinations
- **Reproducible** - the same distro always works the same way
- **Upgradeable** - update the distro to get improvements

---

## The Testing Genius: Record-Replay

### Traditional Testing Hell for AI

Problem:

- API calls cost money
- API responses are non-deterministic
- Each provider has different response formats
- Tests become slow and flaky

### The Record-Replay Solution

First run (record):

```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/
# Makes real API calls, saves responses to YAML
```

All subsequent runs (replay):

```bash
pytest tests/integration/
# Returns cached responses, NO API calls, instant results
```

### Why This Is Brilliant

- **Cost**: Record once, replay 1000x with no further API spend
- **Speed**: Cached responses = instant test execution
- **Reliability**: Deterministic results (no API variability)
- **Coverage**: One test works with OpenAI, Ollama, Anthropic, etc.

**File location**: `tests/integration/[api]/cassettes/`

---

## Core Runtime: The Stack Class

### The Elegance of Inheritance

```python
class LlamaStack(
    Inference,   # Chat completion, embeddings
    Agents,      # Multi-turn orchestration
    Safety,      # Content filtering
    VectorIO,    # Vector operations
    Tools,       # Function execution
    Eval,        # Evaluation
    Scoring,     # Response scoring
    Models,      # Model registry
    # ... 19 more APIs
):
    pass
```

A single `LlamaStack` instance:

- Implements 27 different APIs
- Has 50+ providers backing it
- Routes requests intelligently
- Manages dependencies

All from a ~400-line file plus a lot of protocol definitions!

---

## Dependency Injection Without the Complexity

### How Providers Depend on Each Other

Problem: Agents need Inference, and Inference needs the Models registry.

```python
class AgentProvider:
    def __init__(
        self,
        inference: InferenceProvider,
        safety: SafetyProvider,
        tool_runtime: ToolRuntimeProvider,
    ):
        self.inference = inference
        self.safety = safety
        self.tool_runtime = tool_runtime
```

### How It Gets Resolved

**File**: `core/resolver.py`

1. Parse `run.yaml` - which providers are enabled?
2. Build the dependency graph - who depends on whom?
3. Topological sort - in what order to instantiate?
4. Instantiate in order - each provider gets its dependencies

**Result**: Complex dependency chains are handled automatically! A simplified sketch of this resolution loop follows.
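
The sketch below shows the topological-sort idea under simplified assumptions; `ProviderSpec`, its `deps` field, and the `instantiate` hook are hypothetical stand-ins, not the real types in `core/resolver.py`.

```python
# Minimal sketch of dependency resolution via topological sort.
# The ProviderSpec shape and instantiate() hook are hypothetical; the real
# resolver is richer, but it follows the same idea.
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # stdlib since Python 3.9
from typing import Any, Callable


@dataclass
class ProviderSpec:
    api: str                                          # e.g. "inference", "agents"
    deps: list[str] = field(default_factory=list)     # APIs this provider needs
    instantiate: Callable[..., Any] = lambda **deps: object()


def resolve(specs: dict[str, ProviderSpec]) -> dict[str, Any]:
    """Instantiate providers in dependency order, injecting their dependencies."""
    sorter = TopologicalSorter({api: spec.deps for api, spec in specs.items()})
    instances: dict[str, Any] = {}
    for api in sorter.static_order():                 # dependencies come first
        spec = specs[api]
        kwargs = {dep: instances[dep] for dep in spec.deps}
        instances[api] = spec.instantiate(**kwargs)
    return instances


# Example: agents depend on inference, safety, and tool_runtime.
specs = {
    "models": ProviderSpec("models"),
    "inference": ProviderSpec("inference", deps=["models"]),
    "safety": ProviderSpec("safety"),
    "tool_runtime": ProviderSpec("tool_runtime"),
    "agents": ProviderSpec("agents", deps=["inference", "safety", "tool_runtime"]),
}
instances = resolve(specs)
print(list(instances))  # leaves such as "models" come first, "agents" comes last
```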

---

## The Client Duality

### Two Ways to Use Llama Stack

#### 1. Library Mode (In-Process)

```python
from llama_stack import AsyncLlamaStackAsLibraryClient

client = await AsyncLlamaStackAsLibraryClient.create(run_config)
response = await client.inference.post_chat_completion(...)
```

- No HTTP overhead
- Direct Python API
- Embedded in the application
- **File**: `core/library_client.py`

#### 2. Server Mode (HTTP)

```bash
llama stack run starter  # Start server on port 8321
```

```python
from llama_stack_client import AsyncLlamaStackClient

client = AsyncLlamaStackClient(base_url="http://localhost:8321")
response = await client.inference.post_chat_completion(...)
```

- Distributed architecture
- Share a single server across apps
- Easy deployment
- Language-agnostic clients (Python, TypeScript, Swift, Kotlin)

**Result**: Same API, different deployment strategies!

---

## The Model System Insight

### Why It Exists

Problem: Model IDs differ across providers:

- HuggingFace: `meta-llama/Llama-2-7b`
- Ollama: `llama2`
- OpenAI: `gpt-4`

### Solution: Universal Model Registry

**File**: `models/llama/sku_list.py`

```python
resolve_model("meta-llama/Llama-2-7b")
# Returns Model object with:
# - Architecture info
# - Tokenizer
# - Quantization options
# - Resource requirements
```

This allows:

- Consistent model IDs across providers
- Intelligent resource allocation
- Provider-agnostic inference

---

## The CLI Is Smart

### It Does More Than You Think

```bash
llama stack run starter
```

This command:

1. Resolves the starter distribution template
2. Merges it with environment variables
3. Creates/updates `~/.llama/distributions/starter/run.yaml`
4. Installs missing dependencies
5. Starts the HTTP server on port 8321
6. Initializes all providers
7. Registers available models
8. Is ready for requests

**No separate build step needed!** (unless building Docker images)

### Introspection Commands

```bash
llama stack list-apis          # See all 27 APIs
llama stack list-providers     # See all 50+ providers
llama stack list               # See all distributions
llama stack list-deps starter  # See what to install
```

These are used for documentation, debugging, and automation.

---

## Storage: The Oft-Overlooked Component

### Three Storage Types

1. **KV Store** - Metadata (models, shields)
2. **SQL Store** - Structured data (conversations, datasets)
3. **Inference Store** - Caching (for testing)

### Why Multiple Backends Matter

- Development: SQLite (no dependencies)
- Production: PostgreSQL (scalable)
- Distributed: Redis (shared state)
- Testing: In-memory (fast)

**Files**:

- `core/storage/datatypes.py` - Interfaces
- `providers/utils/kvstore/` - Implementations
- `providers/utils/sqlstore/` - Implementations

---

## Telemetry: Built-In Observability

### What Gets Traced

- Every API call
- Token usage (if the provider supports it)
- Latency
- Errors
- Custom metrics from providers

### Integration

- OpenTelemetry compatible
- Automatic context propagation
- Works across async boundaries
- **File**: `providers/utils/telemetry/`

---

## Extension Strategy: How to Add Custom Functionality

### Adding a Custom API

1. Create the protocol in `apis/my_api/my_api.py`
2. Implement providers (inline and/or remote)
3. Register it in `core/resolver.py`
4. Add it to distributions

### Adding a Custom Provider

1. Create a module in `providers/[inline|remote]/[api]/[provider]/`
2. Implement the config and adapter classes (a minimal skeleton is sketched after this section)
3. Register it in `providers/registry/[api].py`
4. Use it in the distribution YAML

### Adding a Custom Distribution

1. Create a subdirectory in `distributions/[name]/`
2. Implement the template in `[name].py`
3. Register it in distribution discovery
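
To make the "Adding a Custom Provider" steps concrete, here is a hypothetical skeleton of a remote inference provider. The simplified `Inference` protocol, `MyProviderConfig`, and `MyInferenceAdapter` names are illustrative only; the real protocol signatures, base classes, and registry entries live under `apis/` and `providers/`.

```python
# Hypothetical remote inference provider skeleton -- not real Llama Stack classes.
from typing import AsyncIterator, Protocol

import httpx
from pydantic import BaseModel


class MyProviderConfig(BaseModel):
    """Config block surfaced under `config:` in run.yaml for this provider (hypothetical)."""
    base_url: str = "http://localhost:9000"
    api_key: str | None = None


class Inference(Protocol):
    """Simplified stand-in for the real Inference protocol in apis/."""
    def post_chat_completion(self, model: str, messages: list[dict]) -> AsyncIterator[dict]: ...


class MyInferenceAdapter:
    """Implements the simplified protocol against a hypothetical remote HTTP service."""

    def __init__(self, config: MyProviderConfig):
        self.config = config

    async def post_chat_completion(self, model: str, messages: list[dict]) -> AsyncIterator[dict]:
        # Async generator: yields response chunks from the (pretend) service.
        async with httpx.AsyncClient(base_url=self.config.base_url) as client:
            resp = await client.post("/v1/chat", json={"model": model, "messages": messages})
            resp.raise_for_status()
            for chunk in resp.json()["chunks"]:   # assumes the service returns a list of chunks
                yield chunk

# Step 3 of the list above would register this adapter in providers/registry/inference.py;
# the exact provider-spec format is omitted here.
```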
- The "Inference" API is just a Python Protocol - Providers implement it - Core wraps it with HTTP for server mode - Library mode uses it directly ### "Providers are all external services" **Wrong** - Providers can be: - Inline (local execution): Meta Reference, FAISS, Llama Guard - Remote (external services): OpenAI, Ollama, Qdrant Inline providers have low latency and no dependency on external services. ### "You must run a server" **Wrong** - Two modes: - Server mode: `llama stack run starter` (HTTP) - Library mode: Import and use directly in Python ### "Distributions are just Docker images" **Wrong** - Distributions are: - Templates (what providers to use) - Configs (how to configure them) - Dependencies (what to install) - Can be Docker OR local Python --- ## Performance Implications ### Inline Providers Are Fast ``` Inline (e.g., Meta Reference) ├─ 0ms network latency ├─ No HTTP serialization/deserialization ├─ Direct GPU access └─ Fast (but high resource cost) Remote (e.g., OpenAI) ├─ 100-500ms network latency ├─ HTTP serialization overhead ├─ Low resource cost └─ Slower (but cheap) ``` ### Streaming Is Native ```python response = await inference.post_chat_completion(model=..., stream=True) async for chunk in response: print(chunk.delta) # Process token by token ``` Tokens arrive as they're generated, no waiting for full response. --- ## Security Considerations ### API Keys Are Config ```yaml inference: - provider_id: openai config: api_key: ${env.OPENAI_API_KEY} # From environment ``` Never hardcoded, always from env vars. ### Access Control **File**: `core/access_control/` Providers can implement access rules: - Per-user restrictions - Per-model restrictions - Rate limiting - Audit logging ### Sensitive Field Redaction Config logging automatically redacts: - API keys - Passwords - Tokens --- ## Maturity Indicators ### Signs of Production-Ready Design 1. **Separated Concerns** - APIs, Providers, Distributions 2. **Plugin Architecture** - Easy to extend 3. **Configuration Over Code** - Deploy without recompiling 4. **Comprehensive Testing** - Unit + Integration with record-replay 5. **Multiple Client Options** - Library + Server modes 6. **Storage Abstraction** - Multiple backends 7. **Dependency Management** - Automatic resolution 8. **Error Handling** - Structured, informative errors 9. **Observability** - Built-in telemetry 10. **Documentation** - Distributions + CLI introspection Llama Stack has all 10! --- ## Key Architectural Decisions ### Why Async/Await Throughout? - Modern Python standard - Works well with streaming - Natural for I/O-heavy operations (API calls, GPU operations) ### Why Pydantic for Config? - Type validation - Auto-documentation - JSON schema generation - Easy serialization ### Why Protocol Classes for APIs? - Define interface without implementation - Multiple implementations possible - Type hints work with duck typing - Minimal magic ### Why YAML for Config? - Human readable - Environment variable support - Comments allowed - Wide tool support ### Why Record-Replay for Tests? - Cost efficient - Deterministic - Real behavior captured - Provider-agnostic --- ## The Learning Path for Contributors ### Understanding Order 1. **Start**: `pyproject.toml` - Entry point 2. **Learn**: `core/datatypes.py` - Data structures 3. **Understand**: `apis/inference/inference.py` - Example API 4. **See**: `providers/registry/inference.py` - Provider registry 5. **Read**: `providers/inline/inference/meta_reference/` - Inline provider 6. 

---

## The Learning Path for Contributors

### Understanding Order

1. **Start**: `pyproject.toml` - Entry point
2. **Learn**: `core/datatypes.py` - Data structures
3. **Understand**: `apis/inference/inference.py` - Example API
4. **See**: `providers/registry/inference.py` - Provider registry
5. **Read**: `providers/inline/inference/meta_reference/` - Inline provider
6. **Read**: `providers/remote/inference/openai/` - Remote provider
7. **Study**: `core/resolver.py` - How it all connects
8. **Understand**: `core/stack.py` - Main orchestrator
9. **See**: `distributions/starter/` - How to use it
10. **Run**: `tests/integration/` - How to test

Each step builds on the previous understanding.

---

## The Elegant Parts

### Most Elegant: The Router

The router system is beautiful:

- Transparent to users
- Automatic provider selection
- Works with 1 or 100 providers
- No hardcoding needed

### Most Flexible: YAML Config

Configuration is a first-class citizen:

- Switch providers without code changes
- Override at runtime
- Version-control friendly
- Documentation via config

### Most Useful: Record-Replay Tests

The testing pattern solves real problems:

- Cost
- Speed
- Reliability
- Coverage

### Most Scalable: Distribution Templates

Pre-configured bundles:

- One command to start
- Verified combinations
- Easy to document
- Simple to teach

---

## The Future

### What's Being Built

- More providers (NVIDIA, SambaNova, etc.)
- More APIs (more task types)
- On-device execution (ExecuTorch)
- Better observability (more telemetry)
- Easier extensions (a simpler API for custom providers)

### How It Stays Maintainable

- Protocol-based design limits coupling
- Clear separation of concerns
- Comprehensive testing
- Configuration over code
- Plugin architecture

The architecture is **future-proof** by design.