Llama Stack - Architecture Insights for Developers

Why This Architecture Works

Problem It Solves

Without Llama Stack, building AI applications requires:

  • Learning different APIs for each provider (OpenAI, Anthropic, Groq, Ollama, etc.)
  • Rewriting code to switch providers
  • Duplicating logic for common patterns (safety checks, vector search, etc.)
  • Managing complex dependencies manually

Solution: The Three Pillars

Single, Unified API Interface
        ↓
Multiple Provider Implementations
        ↓
Pre-configured Distributions

Result: Write once, run anywhere (locally, in the cloud, or on-device)


The Genius of the Plugin Architecture

How It Works

  1. Define Abstract Interface (Protocol in apis/)

    class Inference(Protocol):
        async def post_chat_completion(...) -> AsyncIterator[...]: ...
    
  2. Multiple Implementations (in providers/)

    • Local: Meta Reference, vLLM, Ollama
    • Cloud: OpenAI, Anthropic, Groq, Bedrock
    • Each implements same interface
  3. Runtime Selection (via YAML config)

    providers:
      inference:
        - provider_type: remote::openai
    
  4. Zero Code Changes to switch providers! (See the sketch below.)
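
Putting steps 1-3 together, here is a minimal, self-contained sketch of the pattern. The class and function names are illustrative, not the actual Llama Stack code:

from typing import Protocol


# 1. Abstract interface: callers only ever see this Protocol.
class Inference(Protocol):
    async def chat_completion(self, model: str, prompt: str) -> str: ...


# 2. Multiple implementations of the same interface.
class OpenAIProvider:
    async def chat_completion(self, model: str, prompt: str) -> str:
        return f"[remote::openai] {model}: ..."


class OllamaProvider:
    async def chat_completion(self, model: str, prompt: str) -> str:
        return f"[remote::ollama] {model}: ..."


# 3. Runtime selection: the provider_type string comes from YAML, not code.
PROVIDER_TYPES = {"remote::openai": OpenAIProvider, "remote::ollama": OllamaProvider}


def build_inference(provider_type: str) -> Inference:
    return PROVIDER_TYPES[provider_type]()

Application code holds an Inference and never imports a concrete provider class, which is what makes step 4 possible.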

Why This Beats Individual SDKs

  • Single SDK vs 30+ provider SDKs
  • Same API vs learning each provider's quirks
  • Easy migration - change 1 config value
  • Testing - same tests work across all providers

The Request Routing Intelligence

Two Clever Routing Strategies

1. Auto-Routed APIs (Smart Dispatch)

APIs: Inference, Safety, VectorIO, Eval, Scoring, DatasetIO, ToolRuntime

When you call:

await inference.post_chat_completion(model="llama-2-7b")

Router automatically determines:

  • "Which provider has llama-2-7b?"
  • "Route this request there"
  • "Stream response back"

Implementation: routers/ directory contains auto-routers
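
A toy version of that dispatch logic looks something like this (illustrative only; the real routers in routers/ also handle streaming, registration, and error cases):

# Toy auto-router: look up which provider serves the requested model,
# then forward the call. Sketch only, not the real router.
class InferenceRouter:
    def __init__(self, providers, model_to_provider):
        self.providers = providers                  # provider_id -> provider instance
        self.model_to_provider = model_to_provider  # model id -> provider_id

    async def chat_completion(self, model: str, prompt: str):
        provider_id = self.model_to_provider[model]  # "which provider has this model?"
        provider = self.providers[provider_id]       # "route this request there"
        return await provider.chat_completion(model, prompt)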

2. Routing Table APIs (Registry Pattern)

APIs: Models, Shields, VectorStores, Datasets, Benchmarks, ToolGroups, ScoringFunctions

When you call:

all_models = await models.list_models()  # Merged list from ALL providers

Router:

  • Queries each provider
  • Merges results
  • Returns unified list

Implementation: routing_tables/ directory
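
In sketch form (illustrative; the real tables in routing_tables/ also handle registration and identifiers):

import asyncio


# Toy routing-table behaviour: fan out to every provider, merge the results.
async def list_models(providers: list) -> list:
    per_provider = await asyncio.gather(*(p.list_models() for p in providers))
    return [model for models in per_provider for model in models]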

Why This Matters

  • Users don't think about providers - just use the API
  • Multiple implementations work - router handles dispatch
  • Easy scaling - add new providers without touching user code
  • Resource management - router knows what's available

Configuration as a Weapon

The Power of YAML Over Code

Traditional approach:

# Code changes needed for each provider!
if use_openai:
    from openai import OpenAI
    client = OpenAI(api_key=...)
elif use_ollama:
    from ollama import Client
    client = Client(url=...)
# etc.

Llama Stack approach:

# Zero code changes!
providers:
  inference:
    - provider_type: remote::openai
      config:
        api_key: ${env.OPENAI_API_KEY}

Then later, change to:

providers:
  inference:
    - provider_type: remote::ollama
      config:
        host: localhost

Same application code works with both!

Environment Variable Magic

# Override the default model at launch
INFERENCE_MODEL=llama-2-70b llama stack run starter

# No config file edits or redeployment needed!
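
The ${env.VAR} substitution can be pictured as a simple text pass over the YAML before it is parsed (a sketch of the idea, not the actual config loader):

import os
import re


# Replace every ${env.NAME} with the value of NAME from the environment.
# Sketch only; the real loader does more.
def substitute_env(raw_yaml: str) -> str:
    return re.sub(
        r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*)\}",
        lambda m: os.environ.get(m.group(1), ""),
        raw_yaml,
    )


# substitute_env("api_key: ${env.OPENAI_API_KEY}")  ->  "api_key: sk-..."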

The Distributions Strategy

Problem: "Works on My Machine"

  • Different developers need different setups
  • Production needs different providers than development
  • CI/CD needs lightweight dependencies

Solution: Pre-verified Distributions

starter → Works on CPU with lightweight providers (Ollama + OpenAI)
starter-gpu → Works on GPU machines
meta-reference-gpu → Works with full local setup
postgres-demo → Production-grade with persistent storage

Each distribution:

  • Pre-selects working providers
  • Sets sensible defaults
  • Bundles required dependencies
  • Tested end-to-end

Result: llama stack run starter just works for 80% of use cases

Why This Beats Documentation

  • No setup guides needed - the distribution handles setup
  • No guessing - curated, tested provider combinations
  • Reproducible - the same distribution always behaves the same way
  • Upgradeable - update the distribution to pick up improvements

The Testing Genius: Record-Replay

Traditional Testing Hell for AI

Problem:

  • API calls cost money
  • API responses are non-deterministic
  • Each provider has different response formats
  • Tests become slow and flaky

The Record-Replay Solution

First run (record):

LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/
# Makes real API calls, saves responses to YAML

All subsequent runs (replay):

pytest tests/integration/
# Returns cached responses, NO API calls, instant results

Why This is Brilliant

  • Cost: Record once, replay endlessly with no further API spend
  • Speed: Cached responses = instant test execution
  • Reliability: Deterministic results (no API variability)
  • Coverage: One test works with OpenAI, Ollama, Anthropic, etc.

File location: tests/integration/[api]/cassettes/
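
The core idea fits in a few lines: hash the request, cache the response on disk, and skip the network on replay (a sketch assuming a simple JSON cache; the actual test harness is richer):

import hashlib
import json
from pathlib import Path

CASSETTE_DIR = Path("cassettes")


async def record_or_replay(mode: str, request: dict, call_api) -> dict:
    # Key the cassette on the exact request contents.
    key = hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()
    cassette = CASSETTE_DIR / f"{key}.json"
    if mode != "record" and cassette.exists():
        return json.loads(cassette.read_text())  # replay: no API call, instant
    response = await call_api(request)           # record: one real API call
    CASSETTE_DIR.mkdir(exist_ok=True)
    cassette.write_text(json.dumps(response))
    return response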


Core Runtime: The Stack Class

The Elegance of Inheritance

class LlamaStack(
    Inference,        # Chat completion, embeddings
    Agents,           # Multi-turn orchestration
    Safety,           # Content filtering
    VectorIO,         # Vector operations
    Tools,            # Function execution
    Eval,             # Evaluation
    Scoring,          # Response scoring
    Models,           # Model registry
    # ... 19 more APIs
):
    pass

A single LlamaStack instance:

  • Implements 27 different APIs
  • Has 50+ providers backing it
  • Routes requests intelligently
  • Manages dependencies

All from a ~400-line file plus the protocol definitions!


Dependency Injection Without the Complexity

How Providers Depend on Each Other

Problem: Agents need Inference, and Inference needs the Models registry

class AgentProvider:
    def __init__(self, 
                 inference: InferenceProvider,
                 safety: SafetyProvider,
                 tool_runtime: ToolRuntimeProvider):
        self.inference = inference
        self.safety = safety
        self.tool_runtime = tool_runtime

How It Gets Resolved

File: core/resolver.py

  1. Parse run.yaml - which providers enabled?
  2. Build dependency graph - who depends on whom?
  3. Topological sort - what order to instantiate?
  4. Instantiate in order - each gets its dependencies

Result: Complex dependency chains handled automatically!
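
Step 3 is an ordinary topological sort. A sketch with made-up provider names (the real logic lives in core/resolver.py):

from graphlib import TopologicalSorter

# "Who depends on whom": agents needs inference, safety, and tool_runtime;
# inference needs the models registry. Illustrative graph only.
dependencies = {
    "agents": {"inference", "safety", "tool_runtime"},
    "inference": {"models"},
    "safety": set(),
    "tool_runtime": set(),
    "models": set(),
}

# Dependencies come out before their dependents, so each provider can be
# instantiated after everything it needs already exists.
instantiation_order = list(TopologicalSorter(dependencies).static_order())
# One valid order: ['safety', 'tool_runtime', 'models', 'inference', 'agents']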


The Client Duality

Two Ways to Use Llama Stack

1. Library Mode (In-Process)

from llama_stack import AsyncLlamaStackAsLibraryClient

client = await AsyncLlamaStackAsLibraryClient.create(run_config)
response = await client.inference.post_chat_completion(...)
  • No HTTP overhead
  • Direct Python API
  • Embedded in application
  • File: core/library_client.py

2. Server Mode (HTTP)

llama stack run starter  # Start server on port 8321

from llama_stack_client import AsyncLlamaStackClient

client = AsyncLlamaStackClient(base_url="http://localhost:8321")
response = await client.inference.post_chat_completion(...)
  • Distributed architecture
  • Share single server across apps
  • Easy deployment
  • Language-agnostic clients (Python, TypeScript, Swift, Kotlin)

Result: Same API, different deployment strategies!


The Model System Insight

Why It Exists

Problem: Different model IDs across providers

  • HuggingFace: meta-llama/Llama-2-7b
  • Ollama: llama2
  • OpenAI: gpt-4

Solution: Universal Model Registry

File: models/llama/sku_list.py

resolve_model("meta-llama/Llama-2-7b")
# Returns Model object with:
# - Architecture info
# - Tokenizer
# - Quantization options
# - Resource requirements

Allows:

  • Consistent model IDs across providers
  • Intelligent resource allocation
  • Provider-agnostic inference
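
One way to picture the registry: provider-specific aliases map onto a single canonical descriptor (hypothetical data below, not the real contents of sku_list.py):

from dataclasses import dataclass


@dataclass
class ModelDescriptor:
    canonical_id: str
    context_length: int
    quantization_options: tuple


# Hypothetical entries; the real descriptors carry much more metadata.
REGISTRY = {
    "meta-llama/Llama-2-7b": ModelDescriptor("meta-llama/Llama-2-7b", 4096, ("fp16", "int8")),
}
ALIASES = {"llama2": "meta-llama/Llama-2-7b"}  # e.g. an Ollama-style alias


def resolve_model(model_id: str) -> ModelDescriptor:
    return REGISTRY[ALIASES.get(model_id, model_id)]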

The CLI Is Smart

It Does More Than You Think

llama stack run starter

This command:

  1. Resolves the starter distribution template
  2. Merges with environment variables
  3. Creates/updates ~/.llama/distributions/starter/run.yaml
  4. Installs missing dependencies
  5. Starts HTTP server on port 8321
  6. Initializes all providers
  7. Registers available models
  8. Ready for requests

No separate build step needed! (unless building Docker images)

Introspection Commands

llama stack list-apis           # See all 27 APIs
llama stack list-providers      # See all 50+ providers
llama stack list                # See all distributions
llama stack list-deps starter   # See what to install

Used for documentation, debugging, and automation


Storage: The Oft-Overlooked Component

Three Storage Types

  1. KV Store - Metadata (models, shields)
  2. SQL Store - Structured (conversations, datasets)
  3. Inference Store - Caching (for testing)

Why Multiple Backends Matter

  • Development: SQLite (no dependencies)
  • Production: PostgreSQL (scalable)
  • Distributed: Redis (shared state)
  • Testing: In-memory (fast)

Files:

  • core/storage/datatypes.py - Interfaces
  • providers/utils/kvstore/ - Implementations
  • providers/utils/sqlstore/ - Implementations
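
A stripped-down version of the KV store abstraction with two swappable backends (illustrative; the actual interfaces live in core/storage/datatypes.py):

import sqlite3
from typing import Optional, Protocol


class KVStore(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


class InMemoryKVStore:
    # Fast, dependency-free backend, e.g. for tests.
    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class SqliteKVStore:
    # Persistent backend backed by a single local file.
    def __init__(self, path: str) -> None:
        self._db = sqlite3.connect(path)
        self._db.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

    def get(self, key: str) -> Optional[str]:
        row = self._db.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    def set(self, key: str, value: str) -> None:
        self._db.execute("INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value))
        self._db.commit()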

Telemetry: Built-In Observability

What Gets Traced

  • Every API call
  • Token usage (if provider supports it)
  • Latency
  • Errors
  • Custom metrics from providers

Integration

  • OpenTelemetry compatible
  • Automatic context propagation
  • Works across async boundaries
  • File: providers/utils/telemetry/
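
In OpenTelemetry terms, tracing an API call amounts to wrapping it in a span (illustrative sketch; not the actual code in providers/utils/telemetry/):

from opentelemetry import trace

tracer = trace.get_tracer("llama_stack.example")


async def traced_chat_completion(inference, model: str, prompt: str):
    # Open a span around the call; attributes become searchable metadata.
    with tracer.start_as_current_span("inference.chat_completion") as span:
        span.set_attribute("model", model)
        response = await inference.chat_completion(model, prompt)
        span.set_attribute("prompt.length", len(prompt))
        return response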

Extension Strategy: How to Add Custom Functionality

Adding a Custom API

  1. Create protocol in apis/my_api/my_api.py
  2. Implement providers (inline and/or remote)
  3. Register in core/resolver.py
  4. Add to distributions

Adding a Custom Provider

  1. Create module in providers/[inline|remote]/[api]/[provider]/
  2. Implement config and adapter classes (see the sketch below)
  3. Register in providers/registry/[api].py
  4. Use in distribution YAML
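
Step 2 usually comes down to a config class plus an adapter that implements the API protocol (hypothetical names throughout; a real adapter implements the full Inference surface):

from pydantic import BaseModel


class MyProviderConfig(BaseModel):
    base_url: str
    api_key: str


class MyProviderInferenceAdapter:
    def __init__(self, config: MyProviderConfig) -> None:
        self.config = config

    async def chat_completion(self, model: str, prompt: str) -> str:
        # A real adapter would call the remote service at self.config.base_url.
        return f"response from {self.config.base_url} for {model}"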

Adding a Custom Distribution

  1. Create subdirectory in distributions/[name]/
  2. Implement template in [name].py
  3. Register in distribution discovery

Common Misconceptions Clarified

"APIs are HTTP endpoints"

Wrong - APIs are Python protocols. HTTP comes later via FastAPI.

  • The "Inference" API is just a Python Protocol
  • Providers implement it
  • Core wraps it with HTTP for server mode
  • Library mode uses it directly

"Providers are all external services"

Wrong - Providers can be:

  • Inline (local execution): Meta Reference, FAISS, Llama Guard
  • Remote (external services): OpenAI, Ollama, Qdrant

Inline providers have low latency and no dependency on external services.

"You must run a server"

Wrong - Two modes:

  • Server mode: llama stack run starter (HTTP)
  • Library mode: Import and use directly in Python

"Distributions are just Docker images"

Wrong - Distributions are:

  • Templates (what providers to use)
  • Configs (how to configure them)
  • Dependencies (what to install)
  • Can be Docker OR local Python

Performance Implications

Inline Providers Are Fast

Inline (e.g., Meta Reference)
├─ 0ms network latency
├─ No HTTP serialization/deserialization
├─ Direct GPU access
└─ Fast (but high resource cost)

Remote (e.g., OpenAI)
├─ 100-500ms network latency
├─ HTTP serialization overhead
├─ Low local resource usage
└─ Slower (but light on local hardware)

Streaming Is Native

response = await inference.post_chat_completion(model=..., stream=True)
async for chunk in response:
    print(chunk.delta)  # Process token by token

Tokens arrive as they're generated; there's no waiting for the full response.


Security Considerations

API Keys Are Config

inference:
  - provider_id: openai
    config:
      api_key: ${env.OPENAI_API_KEY}  # From environment

Never hardcoded, always from env vars.

Access Control

File: core/access_control/

Providers can implement access rules:

  • Per-user restrictions
  • Per-model restrictions
  • Rate limiting
  • Audit logging

Sensitive Field Redaction

Config logging automatically redacts:

  • API keys
  • Passwords
  • Tokens
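
A simple version of that redaction pass (illustrative, not the actual implementation):

SENSITIVE_KEYS = {"api_key", "password", "token"}


def redact(config: dict) -> dict:
    # Recursively mask sensitive values before the config is logged.
    redacted = {}
    for key, value in config.items():
        if isinstance(value, dict):
            redacted[key] = redact(value)
        elif key.lower() in SENSITIVE_KEYS:
            redacted[key] = "********"
        else:
            redacted[key] = value
    return redacted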

Maturity Indicators

Signs of Production-Ready Design

  1. Separated Concerns - APIs, Providers, Distributions
  2. Plugin Architecture - Easy to extend
  3. Configuration Over Code - Deploy without recompiling
  4. Comprehensive Testing - Unit + Integration with record-replay
  5. Multiple Client Options - Library + Server modes
  6. Storage Abstraction - Multiple backends
  7. Dependency Management - Automatic resolution
  8. Error Handling - Structured, informative errors
  9. Observability - Built-in telemetry
  10. Documentation - Distributions + CLI introspection

Llama Stack has all 10!


Key Architectural Decisions

Why Async/Await Throughout?

  • Modern Python standard
  • Works well with streaming
  • Natural for I/O-heavy operations (API calls, GPU operations)

Why Pydantic for Config?

  • Type validation
  • Auto-documentation
  • JSON schema generation
  • Easy serialization

Why Protocol Classes for APIs?

  • Define interface without implementation
  • Multiple implementations possible
  • Type hints work with duck typing
  • Minimal magic

Why YAML for Config?

  • Human readable
  • Environment variable support
  • Comments allowed
  • Wide tool support

Why Record-Replay for Tests?

  • Cost efficient
  • Deterministic
  • Real behavior captured
  • Provider-agnostic

The Learning Path for Contributors

Understanding Order

  1. Start: pyproject.toml - Entry point
  2. Learn: core/datatypes.py - Data structures
  3. Understand: apis/inference/inference.py - Example API
  4. See: providers/registry/inference.py - Provider registry
  5. Read: providers/inline/inference/meta_reference/ - Inline provider
  6. Read: providers/remote/inference/openai/ - Remote provider
  7. Study: core/resolver.py - How it all connects
  8. Understand: core/stack.py - Main orchestrator
  9. See: distributions/starter/ - How to use it
  10. Run: tests/integration/ - How to test

Each step builds on previous understanding.


The Elegant Parts

Most Elegant: The Router

The router system is beautiful:

  • Transparent to users
  • Automatic provider selection
  • Works with 1 or 100 providers
  • No hardcoding needed

Most Flexible: YAML Config

Configuration as first-class citizen:

  • Switch providers without code
  • Override at runtime
  • Version control friendly
  • Documentation via config

Most Useful: Record-Replay Tests

Testing pattern solves real problems:

  • Cost
  • Speed
  • Reliability
  • Coverage

Most Scalable: Distribution Templates

Pre-configured bundles:

  • One command to start
  • Verified combinations
  • Easy to document
  • Simple to teach

The Future

What's Being Built

  • More providers (NVIDIA, SambaNova, etc.)
  • More APIs (more task types)
  • On-device execution (ExecuTorch)
  • Better observability (more telemetry)
  • Easier extensions (simpler API for custom providers)

How It Stays Maintainable

  • Protocol-based design limits coupling
  • Clear separation of concerns
  • Comprehensive testing
  • Configuration over code
  • Plugin architecture

The architecture is future-proof by design.