Llama Stack Architecture - Comprehensive Overview
Executive Summary
Llama Stack is a comprehensive framework for building AI applications with Llama models. It provides a unified API layer with a plugin architecture for providers, allowing developers to seamlessly switch between local and cloud-hosted implementations without changing application code. The system is organized around three main pillars: APIs (abstract interfaces), Providers (concrete implementations), and Distributions (pre-configured bundles).
1. Core Architecture Philosophy
Separation of Concerns
- APIs: Define abstract interfaces for functionality (e.g., Inference, Safety, VectorIO)
- Providers: Implement those interfaces (inline for local, remote for external services)
- Distributions: Pre-configure and bundle providers for specific deployment scenarios
Key Design Patterns
- Plugin Architecture: Dynamically load providers based on configuration
- Dependency Injection: Providers declare dependencies on other APIs/providers
- Routing: Smart routing directs requests to appropriate provider implementations
- Configuration-Driven: YAML-based configuration enables flexibility without code changes
2. Directory Structure (llama_stack/)
llama_stack/
├── apis/ # Abstract API definitions (27 APIs total)
│ ├── inference/ # LLM inference interface
│ ├── agents/ # Agent orchestration
│ ├── safety/ # Content filtering & safety
│ ├── vector_io/ # Vector database operations
│ ├── tools/ # Tool/function calling runtime
│ ├── scoring/ # Response scoring
│ ├── eval/ # Evaluation framework
│ ├── post_training/ # Fine-tuning & training
│ ├── datasetio/ # Dataset loading/management
│ ├── conversations/ # Conversation management
│ ├── common/ # Shared datatypes (SamplingParams, etc.)
│ └── [22 more...] # Models, Shields, Benchmarks, etc.
│
├── providers/ # Provider implementations (inline & remote)
│ ├── inline/ # In-process implementations
│ │ ├── inference/ # Meta Reference, Sentence Transformers
│ │ ├── agents/ # Agent orchestration implementations
│ │ ├── safety/ # Llama Guard, Code Scanner
│ │ ├── vector_io/ # FAISS, SQLite-vec, Milvus
│ │ ├── post_training/ # TorchTune
│ │ ├── eval/ # Evaluation implementations
│ │ ├── tool_runtime/ # RAG runtime, MCP protocol
│ │ └── [more...]
│ │
│ ├── remote/ # External service adapters
│ │ ├── inference/ # OpenAI, Anthropic, Groq, Ollama, vLLM, TGI, etc.
│ │ ├── vector_io/ # ChromaDB, Qdrant, Weaviate, Postgres
│ │ ├── safety/ # Bedrock, SambaNova, Nvidia
│ │ ├── agents/ # Sample implementations
│ │ ├── tool_runtime/ # Brave Search, Tavily, Wolfram Alpha
│ │ └── [more...]
│ │
│ ├── registry/ # Provider discovery/registration (inference.py, agents.py, etc.)
│ │ └── [One file per API with all providers for that API]
│ │
│ ├── utils/ # Shared provider utilities
│ │ ├── inference/ # Embedding mixin, OpenAI compat
│ │ ├── kvstore/ # Key-value store abstractions
│ │ ├── sqlstore/ # SQL storage abstractions
│ │ ├── telemetry/ # Tracing, metrics
│ │ └── [more...]
│ │
│ └── datatypes.py # ProviderSpec, InlineProviderSpec, RemoteProviderSpec
│
├── core/ # Core runtime & orchestration
│ ├── stack.py # Main LlamaStack class (implements all APIs)
│ ├── datatypes.py # Config models (StackRunConfig, Provider, etc.)
│ ├── resolver.py # Provider resolution & dependency injection
│ ├── library_client.py # In-process client for library usage
│ ├── build.py # Distribution building
│ ├── configure.py # Configuration handling
│ ├── distribution.py # Distribution management
│ ├── routers/ # Auto-routed API implementations (infer route based on routing key)
│ ├── routing_tables/ # Manual routing tables (e.g., Models, Shields, VectorStores)
│ ├── server/ # FastAPI HTTP server setup
│ ├── storage/ # Backend storage abstractions (KVStore, SqlStore)
│ ├── utils/ # Config resolution, dynamic imports
│ └── conversations/ # Conversation service implementation
│
├── cli/ # Command-line interface
│ ├── llama.py # Main entry point
│ └── stack/ # Stack management commands
│ ├── run.py # Start a distribution
│ ├── list_apis.py # List available APIs
│ ├── list_providers.py # List providers
│ ├── list_deps.py # List dependencies
│ └── [more...]
│
├── distributions/ # Pre-configured distribution templates
│ ├── starter/ # CPU-friendly multi-provider starter
│ ├── starter-gpu/ # GPU-optimized starter
│ ├── meta-reference-gpu/ # Full-featured Meta reference
│ ├── postgres-demo/ # PostgreSQL-based demo
│ ├── template.py # Distribution template base class
│ └── [more...]
│
├── models/ # Llama model implementations
│ └── llama/
│ ├── llama3/ # Llama 3 implementation
│ ├── llama4/ # Llama 4 implementation
│ ├── sku_list.py # Model registry (maps model IDs to implementations)
│ ├── checkpoint.py # Model checkpoint handling
│ ├── datatypes.py # ToolDefinition, StopReason, etc.
│ └── [more...]
│
├── testing/ # Testing utilities
│ └── api_recorder.py # Record/replay infrastructure for integration tests
│
└── ui/ # Web UI (Streamlit-based)
├── app/
├── components/
├── pages/
└── [React/TypeScript frontend]
3. API Layer (27 APIs)
What is an API?
Each API is an abstract protocol (Python Protocol class) that defines an interface. APIs are located in llama_stack/apis/ with a structure like:
apis/inference/
├── __init__.py # Exports the Inference protocol
├── inference.py # Full API definition (300+ lines)
└── event_logger.py # Supporting types
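For orientation, here is a simplified sketch of what such a protocol can look like. The request and chunk types are placeholders; the real definitions in inference.py are far richer.
# Simplified sketch of an API protocol; the real Inference protocol in
# llama_stack/apis/inference/inference.py has more methods and richer types.
from typing import AsyncIterator, Protocol


class Inference(Protocol):
    async def post_chat_completion(
        self,
        model: str,
        request: dict,  # placeholder for the real request model
    ) -> AsyncIterator[dict]:  # placeholder for streamed chunk objects
        """Stream chat-completion chunks from whichever provider owns the model."""
        ...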
Key APIs
Core Inference API
- Path: llama_stack/apis/inference/inference.py
- Methods: post_chat_completion(), post_completion(), post_embedding(), get_models()
- Types: SamplingParams, SamplingStrategy (greedy/top-p/top-k), OpenAIChatCompletion
- Providers: 30+ (OpenAI, Claude, Ollama, vLLM, TGI, Fireworks, etc.)
Agents API
- Path: llama_stack/apis/agents/agents.py
- Methods: create_agent(), update_agent(), create_session(), agentic_loop_turn()
- Features: Multi-turn conversations, tool calling, streaming
- Providers: Meta Reference (inline), Fireworks, Together
Safety API
- Path: llama_stack/apis/safety/safety.py
- Methods: run_shields() - filter content before/after inference
- Providers: Llama Guard (inline), AWS Bedrock, SambaNova, Nvidia
Vector IO API
- Path: llama_stack/apis/vector_io/vector_io.py
- Methods: insert(), query(), delete() - vector database operations
- Providers: FAISS, SQLite-vec, Milvus (inline); ChromaDB, Qdrant, Weaviate, PG Vector (remote)
Tools / Tool Runtime API
- Path: llama_stack/apis/tools/tool_runtime.py
- Methods: execute_tool() - execute functions during agent loops
- Providers: RAG runtime (inline), Brave Search, Tavily, Wolfram Alpha, Model Context Protocol
Other Major APIs
- Post Training: Fine-tuning & model training (HuggingFace, TorchTune, Nvidia)
- Eval: Evaluation frameworks (Meta Reference with autoevals)
- Scoring: Response scoring (Basic, LLM-as-Judge, Braintrust)
- Datasets: Dataset management
- DatasetIO: Dataset loading from HuggingFace, Nvidia, local files
- Conversations: Multi-turn conversation state management
- Vector Stores: Vector store metadata & configuration
- Shields: Shield (safety filter) registry
- Models: Model registry management
- Batches: Batch processing
- Prompts: Prompt templates & management
- Telemetry: Tracing & metrics collection
- Inspect: Introspection & debugging
4. Provider System
Provider Types
1. Inline Providers (InlineProviderSpec)
- Run in-process (same Python process as server)
- High performance, low latency
- No network overhead
- Heavier resource requirements
- Examples: Meta Reference (inference), Llama Guard (safety), FAISS (vector IO)
Structure:
InlineProviderSpec(
    api=Api.inference,
    provider_type="inline::meta-reference",
    module="llama_stack.providers.inline.inference.meta_reference",
    config_class="...MetaReferenceInferenceConfig",
    pip_packages=[...],
    container_image="..."  # Optional, for containerization
)
2. Remote Providers (RemoteProviderSpec)
- Connect to external services via HTTP/API
- Lower resource requirements
- Network latency
- Cloud-based (OpenAI, Anthropic, Groq) or self-hosted (Ollama, vLLM, Qdrant)
- Examples: OpenAI, Anthropic, Groq, Ollama, Qdrant, ChromaDB
Structure:
RemoteProviderSpec(
    api=Api.inference,
    adapter_type="openai",
    provider_type="remote::openai",
    module="llama_stack.providers.remote.inference.openai",
    config_class="...OpenAIInferenceConfig",
    pip_packages=[...]
)
Provider Registration
Providers are registered in registry files (llama_stack/providers/registry/):
- inference.py - All inference providers (30+)
- agents.py - All agent providers
- safety.py - All safety providers
- vector_io.py - All vector IO providers
- tool_runtime.py - All tool runtime providers
- [etc.]
Each registry file has an available_providers() function returning a list of ProviderSpec.
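As a rough sketch, a registry module has this shape. The import path is inferred from the providers/datatypes.py layout above, and the specs simply reuse the structures shown earlier in this section, with config_class values left elided just as they are above.
# Sketch of a registry module such as llama_stack/providers/registry/inference.py.
# Import path assumed from the providers/datatypes.py layout described above.
from llama_stack.providers.datatypes import Api, InlineProviderSpec, RemoteProviderSpec


def available_providers():
    """Return every provider spec registered for the Inference API."""
    return [
        InlineProviderSpec(
            api=Api.inference,
            provider_type="inline::meta-reference",
            module="llama_stack.providers.inline.inference.meta_reference",
            config_class="...MetaReferenceInferenceConfig",  # elided, as above
            pip_packages=[...],  # provider-specific dependencies
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="openai",
            provider_type="remote::openai",
            module="llama_stack.providers.remote.inference.openai",
            config_class="...OpenAIInferenceConfig",  # elided, as above
            pip_packages=[...],
        ),
    ]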
Provider Config
Each provider has a config class (e.g., MetaReferenceInferenceConfig):
from pydantic import BaseModel


class MetaReferenceInferenceConfig(BaseModel):
    max_batch_size: int = 1
    enable_pydantic_sampling: bool = True
    # sample_run_config() - provides default values for testing
    # pip_packages() - lists dependencies
Provider Implementation
Inline providers look like:
class MetaReferenceInferenceImpl(InferenceProvider):
    async def post_chat_completion(
        self,
        model: str,
        request: OpenAIChatCompletionRequestWithExtraBody,
    ) -> AsyncIterator[OpenAIChatCompletionChunk]:
        # Load model, run inference, yield streaming results
        ...
Remote providers implement HTTP adapters:
class OllamaInferenceImpl(InferenceProvider):
    async def post_chat_completion(...):
        # Make HTTP requests to Ollama server
        ...
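As a rough illustration (not the actual OllamaInferenceImpl), a streaming HTTP adapter could be sketched like this, assuming httpx as the HTTP client and Ollama's public /api/chat endpoint:
# Simplified sketch of a remote adapter; not the actual OllamaInferenceImpl.
import json
from typing import AsyncIterator

import httpx


class OllamaChatAdapter:
    def __init__(self, host: str = "localhost", port: int = 11434):
        self.base_url = f"http://{host}:{port}"

    async def post_chat_completion(
        self, model: str, messages: list[dict]
    ) -> AsyncIterator[dict]:
        # Stream newline-delimited JSON chunks from the Ollama server.
        async with httpx.AsyncClient(base_url=self.base_url) as client:
            async with client.stream(
                "POST",
                "/api/chat",
                json={"model": model, "messages": messages, "stream": True},
                timeout=None,
            ) as response:
                response.raise_for_status()
                async for line in response.aiter_lines():
                    if line.strip():
                        yield json.loads(line)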
5. Core Runtime & Resolution
Stack Resolution Process
File: llama_stack/core/resolver.py
- Load Configuration → Parse run.yaml with enabled providers
- Resolve Dependencies → Build dependency graph (e.g., agents may depend on inference)
- Instantiate Providers → Create provider instances with configs
- Create Router/Routed Impls → Set up request routing
- Register Resources → Register models, shields, datasets, etc.
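The dependency-resolution step can be pictured as instantiating providers in topological order of their declared dependencies. A simplified sketch of the idea (not the actual resolver.py code; the dependency map here is illustrative):
# Simplified illustration of dependency-ordered provider instantiation.
# The real resolver in llama_stack/core/resolver.py works over ProviderSpec
# objects and routing; this sketch only shows the ordering idea.
from graphlib import TopologicalSorter

# api -> the APIs it depends on (e.g., agents needs inference and safety)
declared_deps = {
    "inference": set(),
    "safety": {"inference"},
    "tool_runtime": set(),
    "agents": {"inference", "safety", "tool_runtime"},
}

instantiated = {}
for api in TopologicalSorter(declared_deps).static_order():
    # Instantiate each provider only after everything it depends on exists,
    # then inject the dependencies (see Pattern 2: dependency injection).
    deps = {name: instantiated[name] for name in declared_deps[api]}
    instantiated[api] = f"<{api} provider wired with {sorted(deps)}>"

print(instantiated["agents"])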
The LlamaStack Class
File: llama_stack/core/stack.py
class LlamaStack(
    Providers,   # Meta API for provider management
    Inference,   # LLM inference
    Agents,      # Agent orchestration
    Safety,      # Content safety
    VectorIO,    # Vector operations
    Tools,       # Tool runtime
    Eval,        # Evaluation
    # ... 15 more APIs ...
):
    pass
This class inherits from all APIs, making a single LlamaStack instance support all functionality.
Two Client Modes
1. Library Client (In-Process)
from llama_stack import AsyncLlamaStackAsLibraryClient
client = await AsyncLlamaStackAsLibraryClient.create(run_config)
response = await client.inference.post_chat_completion(...)
File: llama_stack/core/library_client.py
2. Server Client (HTTP)
from llama_stack_client import AsyncLlamaStackClient
client = AsyncLlamaStackClient(base_url="http://localhost:8321")
response = await client.inference.post_chat_completion(...)
Uses the separate llama-stack-client package.
6. Request Routing
Two Routing Strategies
1. Auto-Routed APIs (e.g., Inference, Safety, VectorIO)
- Routing key = provider instance
- Router automatically selects provider based on resource ID
- Implementation: AutoRoutedProviderSpec → routers/ directory
# inference.post_chat_completion(model_id="meta-llama/Llama-2-7b")
# Router selects provider based on which provider has that model
Routed APIs:
- Inference, Safety, VectorIO, DatasetIO, Scoring, Eval, ToolRuntime
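Conceptually, auto-routing is a lookup from routing key (here, the model ID) to the provider that registered it. A minimal sketch, not the real router implementation:
# Illustrative-only auto-routing by resource ID (model -> provider).
from typing import Any


class InferenceRouter:
    def __init__(self, providers_by_model: dict[str, Any]):
        # routing key: which provider registered which model
        self.providers_by_model = providers_by_model

    async def post_chat_completion(self, model: str, request: dict):
        provider = self.providers_by_model.get(model)
        if provider is None:
            raise ValueError(f"No provider registered for model {model!r}")
        # Delegate to the provider that owns this model.
        return await provider.post_chat_completion(model=model, request=request)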
2. Routing Table APIs (e.g., Models, Shields, VectorStores)
- Registry APIs that list/register resources
- Implementation: RoutingTableProviderSpec → routing_tables/ directory
# models.list_models() → merged list from all providers
# models.register_model(...) → router selects provider
Registry APIs:
- Models, Shields, VectorStores, Datasets, ScoringFunctions, Benchmarks, ToolGroups
7. Distributions
What is a Distribution?
A Distribution is a pre-configured, verified bundle of providers for a specific deployment scenario.
File: llama_stack/distributions/template.py (base) → specific distros in subdirectories
Example: Starter Distribution
File: llama_stack/distributions/starter/starter.py
def get_distribution_template(name: str = "starter"):
    providers = {
        "inference": [
            "remote::ollama",
            "remote::vllm",
            "remote::openai",
            # ... others ...
        ],
        "vector_io": [
            "inline::faiss",
            "inline::sqlite-vec",
            "remote::qdrant",
            # ... others ...
        ],
        "safety": [
            "inline::llama-guard",
            "inline::code-scanner",
        ],
        # ... other APIs ...
    }
    return DistributionTemplate(
        name="starter",
        providers=providers,
        run_configs={
            "run.yaml": RunConfigSettings(...),
        },
    )
Built-in Distributions
- starter: CPU-only, multi-provider (Ollama, OpenAI, etc.)
- starter-gpu: GPU-optimized version
- meta-reference-gpu: Full Meta reference implementation
- postgres-demo: PostgreSQL-backed version
- watsonx: IBM Watson X integration
- nvidia: NVIDIA-specific optimizations
- open-benchmark: For benchmarking
Distribution Lifecycle
llama stack run starter
↓
Resolve starter distribution template
↓
Merge with run.yaml config & environment variables
↓
Build/install dependencies (if needed)
↓
Start HTTP server (Uvicorn)
↓
Initialize all providers
↓
Register resources (models, shields, etc.)
↓
Ready for requests
8. CLI Architecture
File: llama_stack/cli/
Entry Point
$ llama [subcommand] [args]
Maps to pyproject.toml:
[project.scripts]
llama = "llama_stack.cli.llama:main"
Subcommands
llama stack [command]
├── run [distro|config] [--port PORT] # Start a distribution
├── list-deps [distro] # Show dependencies to install
├── list-apis # Show all APIs
├── list-providers # Show all providers
└── list [NAME] # Show distributions
Architecture:
- llama.py - Main parser with subcommands
- stack/stack.py - Stack subcommand router
- stack/run.py - Implementation of llama stack run
- stack/list_deps.py - Dependency resolution & display
9. Testing Architecture
Location: tests/ directory
Test Types
1. Unit Tests (tests/unit/)
- Fast, isolated component testing
- Mock external dependencies
- Run with: uv run --group unit pytest tests/unit/
- Examples:
  - core/test_stack_validation.py - Config validation
  - distribution/test_distribution.py - Distribution loading
  - core/routers/test_vector_io.py - Routing logic
2. Integration Tests (tests/integration/)
- End-to-end workflows
- Record-Replay pattern: Record real API responses once, replay for fast/cheap testing
- Run with: uv run --group test pytest tests/integration/ --stack-config=starter
- Structure:
tests/integration/
├── agents/
│   ├── test_agents.py
│   ├── test_persistence.py
│   └── cassettes/        # Recorded API responses (YAML)
├── inference/
├── safety/
├── vector_io/
└── [more...]
Record-Replay System
File: llama_stack/testing/api_recorder.py
Benefits:
- Cost Control: Record real API calls once, replay thousands of times
- Speed: Cached responses = instant test execution
- Reliability: Deterministic results (no API variability)
- Provider Coverage: Same test works with OpenAI, Anthropic, Ollama, etc.
How it works:
- First run (with LLAMA_STACK_TEST_INFERENCE_MODE=record): Real API calls saved to YAML
- Subsequent runs: Load YAML and return matching responses
- CI automatically re-records when needed
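A hedged sketch of this record-replay flow (the real api_recorder.py keys and stores recordings differently; PyYAML and a per-request hash are assumptions made here for illustration):
# Simplified record-replay wrapper, for illustration only.
# Assumes recordings are keyed by a hash of the request and stored as YAML.
import hashlib
import json
import os
from pathlib import Path

import yaml

CASSETTE_DIR = Path("cassettes")
MODE = os.environ.get("LLAMA_STACK_TEST_INFERENCE_MODE", "replay")


async def record_or_replay(request: dict, call_real_api):
    key = hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()
    cassette = CASSETTE_DIR / f"{key}.yaml"
    if MODE == "record":
        response = await call_real_api(request)        # hit the real provider
        CASSETTE_DIR.mkdir(exist_ok=True)
        cassette.write_text(yaml.safe_dump(response))  # save for later replays
        return response
    return yaml.safe_load(cassette.read_text())        # deterministic replay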
Test Organization
- Common utilities: tests/common/
- External provider tests: tests/external/ (test external APIs)
- Container tests: tests/containers/ (test Docker integration)
- Conftest: pytest fixtures in each directory
10. Key Design Patterns
Pattern 1: Protocol-Based Abstraction
# API definition (protocol)
class Inference(Protocol):
    async def post_chat_completion(...) -> AsyncIterator[...]: ...

# Provider implementation
class InferenceProvider:
    async def post_chat_completion(...): ...
Pattern 2: Dependency Injection
class AgentProvider:
    def __init__(self, inference: InferenceProvider, safety: SafetyProvider):
        self.inference = inference
        self.safety = safety
Pattern 3: Configuration-Driven Instantiation
# run.yaml
agents:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      max_depth: 5
Pattern 4: Routing by Resource
# Request: inference.post_chat_completion(model="llama-2-7b")
# Router finds which provider has "llama-2-7b" and routes there
Pattern 5: Registry Pattern for Resources
# Register at startup
await models.register_model(Model(
    identifier="llama-2-7b",
    provider_id="inference::meta-reference",
    ...
))
# Later, query or filter
models_list = await models.list_models()
11. Configuration Management
Config Files
1. run.yaml - Runtime Configuration
Location: ~/.llama/distributions/{name}/run.yaml
version: 2
providers:
  inference:
    - provider_id: ollama
      provider_type: remote::ollama
      config:
        host: localhost
        port: 11434
  safety:
    - provider_id: llama-guard
      provider_type: inline::llama-guard
      config: {}
default_models:
  - identifier: llama-2-7b
    provider_id: ollama
vector_stores_config:
  default_provider_id: faiss
2. build.yaml - Build Configuration
Specifies which providers to install.
3. Environment Variables
Override config values at runtime:
INFERENCE_MODEL=llama-2-70b SAFETY_MODEL=llama-guard llama stack run starter
Config Resolution
File: llama_stack/core/utils/config_resolution.py
Order of precedence:
- Environment variables (highest)
- Runtime config (run.yaml)
- Distribution template defaults (lowest)
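A tiny sketch of that precedence rule (illustrative only; the real resolution in config_resolution.py operates on full config models):
# Illustrative precedence: env var > run.yaml > distribution template default.
import os


def resolve_value(key: str, run_yaml: dict, template_defaults: dict, env_var: str):
    if env_var in os.environ:              # 1. environment variable wins
        return os.environ[env_var]
    if key in run_yaml:                    # 2. then the runtime run.yaml
        return run_yaml[key]
    return template_defaults.get(key)      # 3. finally the template default


model = resolve_value(
    key="inference_model",
    run_yaml={"inference_model": "llama-2-7b"},
    template_defaults={"inference_model": "llama-3-8b"},
    env_var="INFERENCE_MODEL",
)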
12. Extension Points for Developers
Adding a Custom Provider
1. Create provider module:
llama_stack/providers/remote/inference/my_provider/
├── __init__.py
├── config.py         # MyProviderConfig
└── my_provider.py    # MyProviderImpl(InferenceProvider)

2. Register in registry:
# llama_stack/providers/registry/inference.py
RemoteProviderSpec(
    api=Api.inference,
    adapter_type="my_provider",
    provider_type="remote::my_provider",
    config_class="...MyProviderConfig",
    module="llama_stack.providers.remote.inference.my_provider",
)

3. Use in distribution:
providers:
  inference:
    - provider_id: my_provider
      provider_type: remote::my_provider
      config: {...}
Adding a Custom API
- Define protocol in llama_stack/apis/my_api/my_api.py
- Implement providers
- Register in resolver and distributions
- Add CLI support if needed
13. Storage & Persistence
Storage Backends
File: llama_stack/core/storage/datatypes.py
KV Store (Key-Value)
- Store metadata: models, shields, vector stores
- Backends: SQLite (inline), Redis, Postgres
SQL Store
- Store structured data: conversations, datasets
- Backends: SQLite (inline), Postgres
Inference Store
- Cache inference results for recording/replay
- Used in testing
Storage Configuration
storage:
  type: sqlite
  config:
    dir: ~/.llama/distributions/starter
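As a purely hypothetical illustration of what a backend-agnostic key-value abstraction provides (this is not the actual KVStore interface in core/storage/; the method names and the toy backend are assumptions):
# Hypothetical key-value abstraction, for illustration only; the real
# llama_stack KVStore interface differs.
from typing import Protocol


class KVStore(Protocol):
    async def get(self, key: str) -> str | None: ...
    async def set(self, key: str, value: str) -> None: ...


class InMemoryKVStore:
    """Toy in-memory backend standing in for SQLite/Redis/Postgres."""

    def __init__(self):
        self._data: dict[str, str] = {}

    async def get(self, key: str) -> str | None:
        return self._data.get(key)

    async def set(self, key: str, value: str) -> None:
        self._data[key] = value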
14. Telemetry & Tracing
Tracing System
File: llama_stack/providers/utils/telemetry/
- Automatic request tracing with OpenTelemetry
- Trace context propagation across async calls
- Integration with OpenTelemetry collectors
Telemetry API
Providers can implement the Telemetry API to collect metrics:
- Token usage
- Latency
- Error rates
- Custom metrics
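For example, a provider call can be wrapped in an OpenTelemetry span like this (generic OpenTelemetry usage, not the project's own telemetry helpers; the dict-shaped response with a usage field is an assumption):
# Generic OpenTelemetry tracing example, not llama_stack's internal helpers.
from opentelemetry import trace

tracer = trace.get_tracer("llama_stack.example")


async def traced_chat_completion(provider, model: str, request: dict):
    # Each request becomes a span; attributes carry model and token usage.
    with tracer.start_as_current_span("inference.post_chat_completion") as span:
        span.set_attribute("model", model)
        response = await provider.post_chat_completion(model=model, request=request)
        usage = response.get("usage", {})  # assumed response shape
        span.set_attribute("completion_tokens", usage.get("completion_tokens", 0))
        return response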
15. Model System
Model Registry
File: llama_stack/models/llama/sku_list.py
resolve_model("meta-llama/Llama-2-7b")
→ Llama2Model(...)
Maps model IDs to their:
- Architecture
- Tokenizer
- Quantization options
- Required resources
Supported Models
- Llama 3 - Full architecture support
- Llama 3.1 - Extended context
- Llama 3.2 - Multimodal support
- Llama 4 - Latest generation
- Custom models - Via provider registration
Model Quantization
- int8, int4
- GPTQ
- Hadamard transform
- Custom quantizers
16. Key Files to Understand
For Understanding Core Concepts
- llama_stack/core/datatypes.py - Configuration data types
- llama_stack/providers/datatypes.py - Provider specs
- llama_stack/apis/inference/inference.py - Example API
For Understanding Runtime
- llama_stack/core/stack.py - Main runtime class
- llama_stack/core/resolver.py - Dependency resolution
- llama_stack/core/library_client.py - In-process client
For Understanding Providers
- llama_stack/providers/registry/inference.py - Inference provider registry
- llama_stack/providers/inline/inference/meta_reference/inference.py - Example inline provider
- llama_stack/providers/remote/inference/openai/openai.py - Example remote provider
For Understanding Distributions
- llama_stack/distributions/template.py - Distribution template
- llama_stack/distributions/starter/starter.py - Starter distro
- llama_stack/cli/stack/run.py - Distribution startup
17. Development Workflow
Running Locally
# Install dependencies
uv sync --all-groups
# Run a distribution (auto-starts server)
llama stack run starter
# In another terminal, interact with it
curl http://localhost:8321/health
Testing
# Unit tests (fast, no external dependencies)
uv run --group unit pytest tests/unit/
# Integration tests (with record-replay)
uv run --group test pytest tests/integration/ --stack-config=starter
# Re-record integration tests (record real API calls)
LLAMA_STACK_TEST_INFERENCE_MODE=record \
uv run --group test pytest tests/integration/ --stack-config=starter
Building Distributions
# Build Starter distribution
llama stack build starter --name my-starter
# Run it
llama stack run my-starter
18. Notable Implementation Details
Async-First Architecture
- All I/O is async (using asyncio)
- Streaming responses with AsyncIterator
- FastAPI for HTTP server (built on Starlette)
Streaming Support
- Inference responses stream tokens
- Agents stream turn-by-turn updates
- Proper async context preservation
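A hedged sketch of consuming such a stream with the library client shown in section 5 (the messages and stream parameters and the chunk format are assumptions, not the exact signature):
# Consuming a streamed chat completion; chunk handling is illustrative.
import asyncio

from llama_stack import AsyncLlamaStackAsLibraryClient


async def main():
    client = await AsyncLlamaStackAsLibraryClient.create(run_config="run.yaml")  # placeholder config path
    stream = await client.inference.post_chat_completion(
        model="llama-2-7b",
        messages=[{"role": "user", "content": "Hello!"}],  # assumed parameter shape
        stream=True,
    )
    async for chunk in stream:   # AsyncIterator of completion chunks
        print(chunk, flush=True)


asyncio.run(main())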
Error Handling
- Structured errors with detailed messages
- Graceful degradation when dependencies unavailable
- Provider health checks
Extensibility
- External providers via module import
- Custom APIs via ExternalApiSpec
- Plugin discovery via provider registry
19. Typical Request Flow
User Request (e.g., chat completion)
↓
CLI or SDK Client
↓
HTTP Request → FastAPI Server (port 8321)
↓
Route Handler (e.g., /inference/chat-completion)
↓
Router (Auto-Routed API)
→ Determine which provider has the model
↓
Provider Implementation (e.g., OpenAI, Ollama, Meta Reference)
↓
External Service or Local Execution
↓
Response (streaming or complete)
↓
Send back to Client
20. Key Takeaways
- Unified APIs: Single abstraction for 27+ AI capabilities
- Pluggable Providers: 50+ implementations (inline & remote)
- Configuration-Driven: Switch providers via YAML, not code
- Distributions: Pre-verified bundles for common scenarios
- Record-Replay Testing: Cost-effective integration tests
- Two Client Modes: Library (in-process) or HTTP (distributed)
- Smart Routing: Automatic request routing to appropriate providers
- Async-First: Native streaming and concurrent request handling
- Extensible: Custom APIs and providers easily added
- Production-Ready: Health checks, telemetry, access control, storage
Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│ Client Applications │
│ (CLI, SDK, Web UI, Custom Apps) │
└────────────────────┬────────────────────────────────────────┘
│
┌───────────┴────────────┐
│ │
┌────▼────────┐ ┌───────▼──────┐
│ Library │ │ HTTP Server │
│ Client │ │ (FastAPI) │
└────┬────────┘ └───────┬──────┘
│ │
└───────────┬───────────┘
│
┌──────────▼──────────┐
│ LlamaStack Class │
│ (implements all │
│ 27 APIs) │
└──────────┬──────────┘
│
┌──────────────┼──────────────┐
│ │ │
│ Router │ Routing │ Resource
│ (Auto- │ Tables │ Registries
│ routed │ (Models, │ (Models,
│ APIs) │ Shields) │ Shields,
│ │ │ etc.)
└──────────────┼──────────────┘
│
┌────────────┴──────────────┐
│ │
┌────▼──────────┐ ┌──────────▼─────┐
│ Inline │ │ Remote │
│ Providers │ │ Providers │
│ │ │ │
│ • Meta Ref │ │ • OpenAI │
│ • FAISS │ │ • Ollama │
│ • Llama Guard │ │ • Qdrant │
│ • etc. │ │ • etc. │
│ │ │ │
└───────────────┘ └─────────────────┘
│ │
│ │
Local Execution External Services
(GPUs/CPUs) (APIs/Servers)