Llama Stack - Architecture Insights for Developers

Why This Architecture Works

Problem It Solves

Without Llama Stack, building AI applications requires:

  • Learning different APIs for each provider (OpenAI, Anthropic, Groq, Ollama, etc.)
  • Rewriting code to switch providers
  • Duplicating logic for common patterns (safety checks, vector search, etc.)
  • Managing complex dependencies manually

Solution: The Three Pillars

Single, Unified API Interface
        ↓
Multiple Provider Implementations
        ↓
Pre-configured Distributions

Result: Write once, run anywhere (locally, in the cloud, or on-device)


The Genius of the Plugin Architecture

How It Works

  1. Define Abstract Interface (Protocol in apis/)

    class Inference(Protocol):
        async def post_chat_completion(...) -> AsyncIterator[...]: ...
    
  2. Multiple Implementations (in providers/)

    • Local: Meta Reference, vLLM, Ollama
    • Cloud: OpenAI, Anthropic, Groq, Bedrock
    • Each implements same interface
  3. Runtime Selection (via YAML config)

    providers:
      inference:
        - provider_type: remote::openai
    
  4. Zero Code Changes to switch providers! (See the sketch below.)
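
Putting steps 1-3 together, here is a minimal, self-contained sketch of the pattern. The class and function names are illustrative, not the actual Llama Stack code:

from typing import Protocol


# 1. Abstract interface: callers only ever see this Protocol.
class Inference(Protocol):
    async def chat_completion(self, model: str, prompt: str) -> str: ...


# 2. Multiple implementations of the same interface.
class OpenAIProvider:
    async def chat_completion(self, model: str, prompt: str) -> str:
        return f"[remote::openai] {model}: ..."


class OllamaProvider:
    async def chat_completion(self, model: str, prompt: str) -> str:
        return f"[remote::ollama] {model}: ..."


# 3. Runtime selection: the provider_type string comes from YAML, not code.
PROVIDER_TYPES = {"remote::openai": OpenAIProvider, "remote::ollama": OllamaProvider}


def build_inference(provider_type: str) -> Inference:
    return PROVIDER_TYPES[provider_type]()

Application code holds an Inference and never imports a concrete provider class, which is what makes step 4 possible.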

Why This Beats Individual SDKs

  • Single SDK vs 30+ provider SDKs
  • Same API vs learning each provider's quirks
  • Easy migration - change 1 config value
  • Testing - same tests work across all providers

The Request Routing Intelligence

Two Clever Routing Strategies

1. Auto-Routed APIs (Smart Dispatch)

APIs: Inference, Safety, VectorIO, Eval, Scoring, DatasetIO, ToolRuntime

When you call:

await inference.post_chat_completion(model="llama-2-7b")

Router automatically determines:

  • "Which provider has llama-2-7b?"
  • "Route this request there"
  • "Stream response back"

Implementation: routers/ directory contains auto-routers
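
A toy version of that dispatch logic looks something like this (illustrative only; the real routers in routers/ also handle streaming, registration, and error cases):

# Toy auto-router: look up which provider serves the requested model,
# then forward the call. Sketch only, not the real router.
class InferenceRouter:
    def __init__(self, providers, model_to_provider):
        self.providers = providers                  # provider_id -> provider instance
        self.model_to_provider = model_to_provider  # model id -> provider_id

    async def chat_completion(self, model: str, prompt: str):
        provider_id = self.model_to_provider[model]  # "which provider has this model?"
        provider = self.providers[provider_id]       # "route this request there"
        return await provider.chat_completion(model, prompt)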

2. Routing Table APIs (Registry Pattern)

APIs: Models, Shields, VectorStores, Datasets, Benchmarks, ToolGroups, ScoringFunctions

When you call:

all_models = await models.list_models()  # Merged list from ALL providers

Router:

  • Queries each provider
  • Merges results
  • Returns unified list

Implementation: routing_tables/ directory
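
In sketch form (illustrative; the real tables in routing_tables/ also handle registration and identifiers):

import asyncio


# Toy routing-table behaviour: fan out to every provider, merge the results.
async def list_models(providers: list) -> list:
    per_provider = await asyncio.gather(*(p.list_models() for p in providers))
    return [model for models in per_provider for model in models]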

Why This Matters

  • Users don't think about providers - just use the API
  • Multiple implementations work - router handles dispatch
  • Easy scaling - add new providers without touching user code
  • Resource management - router knows what's available

Configuration as a Weapon

The Power of YAML Over Code

Traditional approach:

# Code changes needed for each provider!
if use_openai:
    from openai import OpenAI
    client = OpenAI(api_key=...)
elif use_ollama:
    from ollama import Client
    client = Client(url=...)
# etc.

Llama Stack approach:

# Zero code changes!
providers:
  inference:
    - provider_type: remote::openai
      config:
        api_key: ${env.OPENAI_API_KEY}

Then later, change to:

providers:
  inference:
    - provider_type: remote::ollama
      config:
        host: localhost

Same application code works with both!

Environment Variable Magic

# Override the default model at launch
INFERENCE_MODEL=llama-2-70b llama stack run starter

# No config file edits or redeployment needed!
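
The ${env.VAR} substitution can be pictured as a simple text pass over the YAML before it is parsed (a sketch of the idea, not the actual config loader):

import os
import re


# Replace every ${env.NAME} with the value of NAME from the environment.
# Sketch only; the real loader does more.
def substitute_env(raw_yaml: str) -> str:
    return re.sub(
        r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*)\}",
        lambda m: os.environ.get(m.group(1), ""),
        raw_yaml,
    )


# substitute_env("api_key: ${env.OPENAI_API_KEY}")  ->  "api_key: sk-..."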

The Distributions Strategy

Problem: "Works on My Machine"

  • Different developers need different setups
  • Production needs different providers than development
  • CI/CD needs lightweight dependencies

Solution: Pre-verified Distributions

starter → Works on CPU with lightweight providers (Ollama + OpenAI)
starter-gpu → Works on GPU machines
meta-reference-gpu → Works with full local setup
postgres-demo → Production-grade with persistent storage

Each distribution:

  • Pre-selects working providers
  • Sets sensible defaults
  • Bundles required dependencies
  • Tested end-to-end

Result: llama stack run starter just works for 80% of use cases

Why This Beats Documentation

  • No setup guides needed - the distribution handles setup
  • No guessing - curated, tested provider combinations
  • Reproducible - the same distribution always behaves the same way
  • Upgradeable - update the distribution to pick up improvements

The Testing Genius: Record-Replay

Traditional Testing Hell for AI

Problem:

  • API calls cost money
  • API responses are non-deterministic
  • Each provider has different response formats
  • Tests become slow and flaky

The Record-Replay Solution

First run (record):

LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/
# Makes real API calls, saves responses to YAML

All subsequent runs (replay):

pytest tests/integration/
# Returns cached responses, NO API calls, instant results

Why This is Brilliant

  • Cost: Record once, replay endlessly with no further API spend
  • Speed: Cached responses = instant test execution
  • Reliability: Deterministic results (no API variability)
  • Coverage: One test works with OpenAI, Ollama, Anthropic, etc.

File location: tests/integration/[api]/cassettes/
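
The core idea fits in a few lines: hash the request, cache the response on disk, and skip the network on replay (a sketch assuming a simple JSON cache; the actual test harness is richer):

import hashlib
import json
from pathlib import Path

CASSETTE_DIR = Path("cassettes")


async def record_or_replay(mode: str, request: dict, call_api) -> dict:
    # Key the cassette on the exact request contents.
    key = hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()
    cassette = CASSETTE_DIR / f"{key}.json"
    if mode != "record" and cassette.exists():
        return json.loads(cassette.read_text())  # replay: no API call, instant
    response = await call_api(request)           # record: one real API call
    CASSETTE_DIR.mkdir(exist_ok=True)
    cassette.write_text(json.dumps(response))
    return response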


Core Runtime: The Stack Class

The Elegance of Inheritance

class LlamaStack(
    Inference,        # Chat completion, embeddings
    Agents,           # Multi-turn orchestration
    Safety,           # Content filtering
    VectorIO,         # Vector operations
    Tools,            # Function execution
    Eval,             # Evaluation
    Scoring,          # Response scoring
    Models,           # Model registry
    # ... 19 more APIs
):
    pass

A single LlamaStack instance:

  • Implements 27 different APIs
  • Has 50+ providers backing it
  • Routes requests intelligently
  • Manages dependencies

All from a ~400-line file plus the protocol definitions!


Dependency Injection Without the Complexity

How Providers Depend on Each Other

Problem: Agents need Inference, and Inference needs the Models registry

class AgentProvider:
    def __init__(self, 
                 inference: InferenceProvider,
                 safety: SafetyProvider,
                 tool_runtime: ToolRuntimeProvider):
        self.inference = inference
        self.safety = safety
        self.tool_runtime = tool_runtime

How It Gets Resolved

File: core/resolver.py

  1. Parse run.yaml - which providers enabled?
  2. Build dependency graph - who depends on whom?
  3. Topological sort - what order to instantiate?
  4. Instantiate in order - each gets its dependencies

Result: Complex dependency chains handled automatically!
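
Step 3 is an ordinary topological sort. A sketch with made-up provider names (the real logic lives in core/resolver.py):

from graphlib import TopologicalSorter

# "Who depends on whom": agents needs inference, safety, and tool_runtime;
# inference needs the models registry. Illustrative graph only.
dependencies = {
    "agents": {"inference", "safety", "tool_runtime"},
    "inference": {"models"},
    "safety": set(),
    "tool_runtime": set(),
    "models": set(),
}

# Dependencies come out before their dependents, so each provider can be
# instantiated after everything it needs already exists.
instantiation_order = list(TopologicalSorter(dependencies).static_order())
# One valid order: ['safety', 'tool_runtime', 'models', 'inference', 'agents']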


The Client Duality

Two Ways to Use Llama Stack

1. Library Mode (In-Process)

from llama_stack import AsyncLlamaStackAsLibraryClient

client = await AsyncLlamaStackAsLibraryClient.create(run_config)
response = await client.inference.post_chat_completion(...)
  • No HTTP overhead
  • Direct Python API
  • Embedded in application
  • File: core/library_client.py

2. Server Mode (HTTP)

llama stack run starter  # Start server on port 8321

from llama_stack_client import AsyncLlamaStackClient

client = AsyncLlamaStackClient(base_url="http://localhost:8321")
response = await client.inference.post_chat_completion(...)
  • Distributed architecture
  • Share single server across apps
  • Easy deployment
  • Language-agnostic clients (Python, TypeScript, Swift, Kotlin)

Result: Same API, different deployment strategies!


The Model System Insight

Why It Exists

Problem: Different model IDs across providers

  • HuggingFace: meta-llama/Llama-2-7b
  • Ollama: llama2
  • OpenAI: gpt-4

Solution: Universal Model Registry

File: models/llama/sku_list.py

resolve_model("meta-llama/Llama-2-7b")
# Returns Model object with:
# - Architecture info
# - Tokenizer
# - Quantization options
# - Resource requirements

Allows:

  • Consistent model IDs across providers
  • Intelligent resource allocation
  • Provider-agnostic inference
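
One way to picture the registry: provider-specific aliases map onto a single canonical descriptor (hypothetical data below, not the real contents of sku_list.py):

from dataclasses import dataclass


@dataclass
class ModelDescriptor:
    canonical_id: str
    context_length: int
    quantization_options: tuple


# Hypothetical entries; the real descriptors carry much more metadata.
REGISTRY = {
    "meta-llama/Llama-2-7b": ModelDescriptor("meta-llama/Llama-2-7b", 4096, ("fp16", "int8")),
}
ALIASES = {"llama2": "meta-llama/Llama-2-7b"}  # e.g. an Ollama-style alias


def resolve_model(model_id: str) -> ModelDescriptor:
    return REGISTRY[ALIASES.get(model_id, model_id)]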

The CLI Is Smart

It Does More Than You Think

llama stack run starter

This command:

  1. Resolves the starter distribution template
  2. Merges with environment variables
  3. Creates/updates ~/.llama/distributions/starter/run.yaml
  4. Installs missing dependencies
  5. Starts HTTP server on port 8321
  6. Initializes all providers
  7. Registers available models
  8. Ready for requests

No separate build step needed! (unless building Docker images)

Introspection Commands

llama stack list-apis           # See all 27 APIs
llama stack list-providers      # See all 50+ providers
llama stack list                # See all distributions
llama stack list-deps starter   # See what to install

Used for documentation, debugging, and automation


Storage: The Oft-Overlooked Component

Three Storage Types

  1. KV Store - Metadata (models, shields)
  2. SQL Store - Structured (conversations, datasets)
  3. Inference Store - Caching (for testing)

Why Multiple Backends Matter

  • Development: SQLite (no dependencies)
  • Production: PostgreSQL (scalable)
  • Distributed: Redis (shared state)
  • Testing: In-memory (fast)

Files:

  • core/storage/datatypes.py - Interfaces
  • providers/utils/kvstore/ - Implementations
  • providers/utils/sqlstore/ - Implementations
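
A stripped-down version of the KV store abstraction with two swappable backends (illustrative; the actual interfaces live in core/storage/datatypes.py):

import sqlite3
from typing import Optional, Protocol


class KVStore(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


class InMemoryKVStore:
    # Fast, dependency-free backend, e.g. for tests.
    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class SqliteKVStore:
    # Persistent backend backed by a single local file.
    def __init__(self, path: str) -> None:
        self._db = sqlite3.connect(path)
        self._db.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

    def get(self, key: str) -> Optional[str]:
        row = self._db.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    def set(self, key: str, value: str) -> None:
        self._db.execute("INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value))
        self._db.commit()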

Telemetry: Built-In Observability

What Gets Traced

  • Every API call
  • Token usage (if provider supports it)
  • Latency
  • Errors
  • Custom metrics from providers

Integration

  • OpenTelemetry compatible
  • Automatic context propagation
  • Works across async boundaries
  • File: providers/utils/telemetry/
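
In OpenTelemetry terms, tracing an API call amounts to wrapping it in a span (illustrative sketch; not the actual code in providers/utils/telemetry/):

from opentelemetry import trace

tracer = trace.get_tracer("llama_stack.example")


async def traced_chat_completion(inference, model: str, prompt: str):
    # Open a span around the call; attributes become searchable metadata.
    with tracer.start_as_current_span("inference.chat_completion") as span:
        span.set_attribute("model", model)
        response = await inference.chat_completion(model, prompt)
        span.set_attribute("prompt.length", len(prompt))
        return response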

Extension Strategy: How to Add Custom Functionality

Adding a Custom API

  1. Create protocol in apis/my_api/my_api.py
  2. Implement providers (inline and/or remote)
  3. Register in core/resolver.py
  4. Add to distributions

Adding a Custom Provider

  1. Create module in providers/[inline|remote]/[api]/[provider]/
  2. Implement config and adapter classes (see the sketch below)
  3. Register in providers/registry/[api].py
  4. Use in distribution YAML
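
Step 2 usually comes down to a config class plus an adapter that implements the API protocol (hypothetical names throughout; a real adapter implements the full Inference surface):

from pydantic import BaseModel


class MyProviderConfig(BaseModel):
    base_url: str
    api_key: str


class MyProviderInferenceAdapter:
    def __init__(self, config: MyProviderConfig) -> None:
        self.config = config

    async def chat_completion(self, model: str, prompt: str) -> str:
        # A real adapter would call the remote service at self.config.base_url.
        return f"response from {self.config.base_url} for {model}"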

Adding a Custom Distribution

  1. Create subdirectory in distributions/[name]/
  2. Implement template in [name].py
  3. Register in distribution discovery

Common Misconceptions Clarified

"APIs are HTTP endpoints"

Wrong - APIs are Python protocols. HTTP comes later via FastAPI.

  • The "Inference" API is just a Python Protocol
  • Providers implement it
  • Core wraps it with HTTP for server mode
  • Library mode uses it directly

"Providers are all external services"

Wrong - Providers can be:

  • Inline (local execution): Meta Reference, FAISS, Llama Guard
  • Remote (external services): OpenAI, Ollama, Qdrant

Inline providers have low latency and no dependency on external services.

"You must run a server"

Wrong - Two modes:

  • Server mode: llama stack run starter (HTTP)
  • Library mode: Import and use directly in Python

"Distributions are just Docker images"

Wrong - Distributions are:

  • Templates (what providers to use)
  • Configs (how to configure them)
  • Dependencies (what to install)
  • Can be Docker OR local Python

Performance Implications

Inline Providers Are Fast

Inline (e.g., Meta Reference)
├─ 0ms network latency
├─ No HTTP serialization/deserialization
├─ Direct GPU access
└─ Fast (but high resource cost)

Remote (e.g., OpenAI)
├─ 100-500ms network latency
├─ HTTP serialization overhead
├─ Low local resource usage
└─ Slower (but light on local hardware)

Streaming Is Native

response = await inference.post_chat_completion(model=..., stream=True)
async for chunk in response:
    print(chunk.delta)  # Process token by token

Tokens arrive as they're generated; there's no waiting for the full response.


Security Considerations

API Keys Are Config

inference:
  - provider_id: openai
    config:
      api_key: ${env.OPENAI_API_KEY}  # From environment

Never hardcoded, always from env vars.

Access Control

File: core/access_control/

Providers can implement access rules:

  • Per-user restrictions
  • Per-model restrictions
  • Rate limiting
  • Audit logging

Sensitive Field Redaction

Config logging automatically redacts:

  • API keys
  • Passwords
  • Tokens
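
A simple version of that redaction pass (illustrative, not the actual implementation):

SENSITIVE_KEYS = {"api_key", "password", "token"}


def redact(config: dict) -> dict:
    # Recursively mask sensitive values before the config is logged.
    redacted = {}
    for key, value in config.items():
        if isinstance(value, dict):
            redacted[key] = redact(value)
        elif key.lower() in SENSITIVE_KEYS:
            redacted[key] = "********"
        else:
            redacted[key] = value
    return redacted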

Maturity Indicators

Signs of Production-Ready Design

  1. Separated Concerns - APIs, Providers, Distributions
  2. Plugin Architecture - Easy to extend
  3. Configuration Over Code - Deploy without recompiling
  4. Comprehensive Testing - Unit + Integration with record-replay
  5. Multiple Client Options - Library + Server modes
  6. Storage Abstraction - Multiple backends
  7. Dependency Management - Automatic resolution
  8. Error Handling - Structured, informative errors
  9. Observability - Built-in telemetry
  10. Documentation - Distributions + CLI introspection

Llama Stack has all 10!


Key Architectural Decisions

Why Async/Await Throughout?

  • Modern Python standard
  • Works well with streaming
  • Natural for I/O-heavy operations (API calls, GPU operations)

Why Pydantic for Config?

  • Type validation
  • Auto-documentation
  • JSON schema generation
  • Easy serialization

Why Protocol Classes for APIs?

  • Define interface without implementation
  • Multiple implementations possible
  • Type hints work with duck typing
  • Minimal magic

Why YAML for Config?

  • Human readable
  • Environment variable support
  • Comments allowed
  • Wide tool support

Why Record-Replay for Tests?

  • Cost efficient
  • Deterministic
  • Real behavior captured
  • Provider-agnostic

The Learning Path for Contributors

Understanding Order

  1. Start: pyproject.toml - Entry point
  2. Learn: core/datatypes.py - Data structures
  3. Understand: apis/inference/inference.py - Example API
  4. See: providers/registry/inference.py - Provider registry
  5. Read: providers/inline/inference/meta_reference/ - Inline provider
  6. Read: providers/remote/inference/openai/ - Remote provider
  7. Study: core/resolver.py - How it all connects
  8. Understand: core/stack.py - Main orchestrator
  9. See: distributions/starter/ - How to use it
  10. Run: tests/integration/ - How to test

Each step builds on previous understanding.


The Elegant Parts

Most Elegant: The Router

The router system is beautiful:

  • Transparent to users
  • Automatic provider selection
  • Works with 1 or 100 providers
  • No hardcoding needed

Most Flexible: YAML Config

Configuration as first-class citizen:

  • Switch providers without code
  • Override at runtime
  • Version control friendly
  • Documentation via config

Most Useful: Record-Replay Tests

Testing pattern solves real problems:

  • Cost
  • Speed
  • Reliability
  • Coverage

Most Scalable: Distribution Templates

Pre-configured bundles:

  • One command to start
  • Verified combinations
  • Easy to document
  • Simple to teach

The Future

What's Being Built

  • More providers (NVIDIA, SambaNova, etc.)
  • More APIs (more task types)
  • On-device execution (ExecuTorch)
  • Better observability (more telemetry)
  • Easier extensions (simpler API for custom providers)

How It Stays Maintainable

  • Protocol-based design limits coupling
  • Clear separation of concerns
  • Comprehensive testing
  • Configuration over code
  • Plugin architecture

The architecture is future-proof by design.