Llama Stack Architecture - Comprehensive Overview
Executive Summary
Llama Stack is a comprehensive framework for building AI applications with Llama models. It provides a unified API layer with a plugin architecture for providers, allowing developers to seamlessly switch between local and cloud-hosted implementations without changing application code. The system is organized around three main pillars: APIs (abstract interfaces), Providers (concrete implementations), and Distributions (pre-configured bundles).
1. Core Architecture Philosophy
Separation of Concerns
- APIs: Define abstract interfaces for functionality (e.g., Inference, Safety, VectorIO)
- Providers: Implement those interfaces (inline for local, remote for external services)
- Distributions: Pre-configure and bundle providers for specific deployment scenarios
Key Design Patterns
- Plugin Architecture: Dynamically load providers based on configuration
- Dependency Injection: Providers declare dependencies on other APIs/providers
- Routing: Smart routing directs requests to appropriate provider implementations
- Configuration-Driven: YAML-based configuration enables flexibility without code changes
2. Directory Structure (llama_stack/)
llama_stack/
├── apis/ # Abstract API definitions (27 APIs total)
│ ├── inference/ # LLM inference interface
│ ├── agents/ # Agent orchestration
│ ├── safety/ # Content filtering & safety
│ ├── vector_io/ # Vector database operations
│ ├── tools/ # Tool/function calling runtime
│ ├── scoring/ # Response scoring
│ ├── eval/ # Evaluation framework
│ ├── post_training/ # Fine-tuning & training
│ ├── datasetio/ # Dataset loading/management
│ ├── conversations/ # Conversation management
│ ├── common/ # Shared datatypes (SamplingParams, etc.)
│ └── [22 more...] # Models, Shields, Benchmarks, etc.
│
├── providers/ # Provider implementations (inline & remote)
│ ├── inline/ # In-process implementations
│ │ ├── inference/ # Meta Reference, Sentence Transformers
│ │ ├── agents/ # Agent orchestration implementations
│ │ ├── safety/ # Llama Guard, Code Scanner
│ │ ├── vector_io/ # FAISS, SQLite-vec, Milvus
│ │ ├── post_training/ # TorchTune
│ │ ├── eval/ # Evaluation implementations
│ │ ├── tool_runtime/ # RAG runtime, MCP protocol
│ │ └── [more...]
│ │
│ ├── remote/ # External service adapters
│ │ ├── inference/ # OpenAI, Anthropic, Groq, Ollama, vLLM, TGI, etc.
│ │ ├── vector_io/ # ChromaDB, Qdrant, Weaviate, Postgres
│ │ ├── safety/ # Bedrock, SambaNova, Nvidia
│ │ ├── agents/ # Sample implementations
│ │ ├── tool_runtime/ # Brave Search, Tavily, Wolfram Alpha
│ │ └── [more...]
│ │
│ ├── registry/ # Provider discovery/registration (inference.py, agents.py, etc.)
│ │ └── [One file per API with all providers for that API]
│ │
│ ├── utils/ # Shared provider utilities
│ │ ├── inference/ # Embedding mixin, OpenAI compat
│ │ ├── kvstore/ # Key-value store abstractions
│ │ ├── sqlstore/ # SQL storage abstractions
│ │ ├── telemetry/ # Tracing, metrics
│ │ └── [more...]
│ │
│ └── datatypes.py # ProviderSpec, InlineProviderSpec, RemoteProviderSpec
│
├── core/ # Core runtime & orchestration
│ ├── stack.py # Main LlamaStack class (implements all APIs)
│ ├── datatypes.py # Config models (StackRunConfig, Provider, etc.)
│ ├── resolver.py # Provider resolution & dependency injection
│ ├── library_client.py # In-process client for library usage
│ ├── build.py # Distribution building
│ ├── configure.py # Configuration handling
│ ├── distribution.py # Distribution management
│ ├── routers/ # Auto-routed API implementations (infer route based on routing key)
│ ├── routing_tables/ # Manual routing tables (e.g., Models, Shields, VectorStores)
│ ├── server/ # FastAPI HTTP server setup
│ ├── storage/ # Backend storage abstractions (KVStore, SqlStore)
│ ├── utils/ # Config resolution, dynamic imports
│ └── conversations/ # Conversation service implementation
│
├── cli/ # Command-line interface
│ ├── llama.py # Main entry point
│ └── stack/ # Stack management commands
│ ├── run.py # Start a distribution
│ ├── list_apis.py # List available APIs
│ ├── list_providers.py # List providers
│ ├── list_deps.py # List dependencies
│ └── [more...]
│
├── distributions/ # Pre-configured distribution templates
│ ├── starter/ # CPU-friendly multi-provider starter
│ ├── starter-gpu/ # GPU-optimized starter
│ ├── meta-reference-gpu/ # Full-featured Meta reference
│ ├── postgres-demo/ # PostgreSQL-based demo
│ ├── template.py # Distribution template base class
│ └── [more...]
│
├── models/ # Llama model implementations
│ └── llama/
│ ├── llama3/ # Llama 3 implementation
│ ├── llama4/ # Llama 4 implementation
│ ├── sku_list.py # Model registry (maps model IDs to implementations)
│ ├── checkpoint.py # Model checkpoint handling
│ ├── datatypes.py # ToolDefinition, StopReason, etc.
│ └── [more...]
│
├── testing/ # Testing utilities
│ └── api_recorder.py # Record/replay infrastructure for integration tests
│
└── ui/ # Web UI (Streamlit-based)
├── app/
├── components/
├── pages/
└── [React/TypeScript frontend]
3. API Layer (27 APIs)
What is an API?
Each API is an abstract protocol (Python Protocol class) that defines an interface. APIs are located in llama_stack/apis/ with a structure like:
apis/inference/
├── __init__.py # Exports the Inference protocol
├── inference.py # Full API definition (300+ lines)
└── event_logger.py # Supporting types
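For orientation, here is a simplified sketch of what such a protocol can look like. The request and chunk types are placeholders; the real definitions in inference.py are far richer.
# Simplified sketch of an API protocol; the real Inference protocol in
# llama_stack/apis/inference/inference.py has more methods and richer types.
from typing import AsyncIterator, Protocol


class Inference(Protocol):
    async def post_chat_completion(
        self,
        model: str,
        request: dict,  # placeholder for the real request model
    ) -> AsyncIterator[dict]:  # placeholder for streamed chunk objects
        """Stream chat-completion chunks from whichever provider owns the model."""
        ...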
Key APIs
Core Inference API
- Path: llama_stack/apis/inference/inference.py
- Methods: post_chat_completion(), post_completion(), post_embedding(), get_models()
- Types: SamplingParams, SamplingStrategy (greedy/top-p/top-k), OpenAIChatCompletion
- Providers: 30+ (OpenAI, Claude, Ollama, vLLM, TGI, Fireworks, etc.)
Agents API
- Path: llama_stack/apis/agents/agents.py
- Methods: create_agent(), update_agent(), create_session(), agentic_loop_turn()
- Features: Multi-turn conversations, tool calling, streaming
- Providers: Meta Reference (inline), Fireworks, Together
Safety API
- Path: llama_stack/apis/safety/safety.py
- Methods: run_shields() - filter content before/after inference
- Providers: Llama Guard (inline), AWS Bedrock, SambaNova, Nvidia
Vector IO API
- Path: llama_stack/apis/vector_io/vector_io.py
- Methods: insert(), query(), delete() - vector database operations
- Providers: FAISS, SQLite-vec, Milvus (inline); ChromaDB, Qdrant, Weaviate, PG Vector (remote)
Tools / Tool Runtime API
- Path: llama_stack/apis/tools/tool_runtime.py
- Methods: execute_tool() - execute functions during agent loops
- Providers: RAG runtime (inline), Brave Search, Tavily, Wolfram Alpha, Model Context Protocol
Other Major APIs
- Post Training: Fine-tuning & model training (HuggingFace, TorchTune, Nvidia)
- Eval: Evaluation frameworks (Meta Reference with autoevals)
- Scoring: Response scoring (Basic, LLM-as-Judge, Braintrust)
- Datasets: Dataset management
- DatasetIO: Dataset loading from HuggingFace, Nvidia, local files
- Conversations: Multi-turn conversation state management
- Vector Stores: Vector store metadata & configuration
- Shields: Shield (safety filter) registry
- Models: Model registry management
- Batches: Batch processing
- Prompts: Prompt templates & management
- Telemetry: Tracing & metrics collection
- Inspect: Introspection & debugging
4. Provider System
Provider Types
1. Inline Providers (InlineProviderSpec)
- Run in-process (same Python process as server)
- High performance, low latency
- No network overhead
- Heavier resource requirements
- Examples: Meta Reference (inference), Llama Guard (safety), FAISS (vector IO)
Structure:
InlineProviderSpec(
    api=Api.inference,
    provider_type="inline::meta-reference",
    module="llama_stack.providers.inline.inference.meta_reference",
    config_class="...MetaReferenceInferenceConfig",
    pip_packages=[...],
    container_image="..."  # Optional, for containerization
)
2. Remote Providers (RemoteProviderSpec)
- Connect to external services via HTTP/API
- Lower resource requirements
- Network latency
- Cloud-based (OpenAI, Anthropic, Groq) or self-hosted (Ollama, vLLM, Qdrant)
- Examples: OpenAI, Anthropic, Groq, Ollama, Qdrant, ChromaDB
Structure:
RemoteProviderSpec(
    api=Api.inference,
    adapter_type="openai",
    provider_type="remote::openai",
    module="llama_stack.providers.remote.inference.openai",
    config_class="...OpenAIInferenceConfig",
    pip_packages=[...]
)
Provider Registration
Providers are registered in registry files (llama_stack/providers/registry/):
- inference.py - All inference providers (30+)
- agents.py - All agent providers
- safety.py - All safety providers
- vector_io.py - All vector IO providers
- tool_runtime.py - All tool runtime providers
- [etc.]
Each registry file has an available_providers() function returning a list of ProviderSpec.
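As a rough sketch, a registry module has this shape. The import path is inferred from the providers/datatypes.py layout above, and the specs simply reuse the structures shown earlier in this section, with config_class values left elided just as they are above.
# Sketch of a registry module such as llama_stack/providers/registry/inference.py.
# Import path assumed from the providers/datatypes.py layout described above.
from llama_stack.providers.datatypes import Api, InlineProviderSpec, RemoteProviderSpec


def available_providers():
    """Return every provider spec registered for the Inference API."""
    return [
        InlineProviderSpec(
            api=Api.inference,
            provider_type="inline::meta-reference",
            module="llama_stack.providers.inline.inference.meta_reference",
            config_class="...MetaReferenceInferenceConfig",  # elided, as above
            pip_packages=[...],  # provider-specific dependencies
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="openai",
            provider_type="remote::openai",
            module="llama_stack.providers.remote.inference.openai",
            config_class="...OpenAIInferenceConfig",  # elided, as above
            pip_packages=[...],
        ),
    ]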
Provider Config
Each provider has a config class (e.g., MetaReferenceInferenceConfig):
from pydantic import BaseModel


class MetaReferenceInferenceConfig(BaseModel):
    max_batch_size: int = 1
    enable_pydantic_sampling: bool = True
    # sample_run_config() - provides default values for testing
    # pip_packages() - lists dependencies
Provider Implementation
Inline providers look like:
class MetaReferenceInferenceImpl(InferenceProvider):
    async def post_chat_completion(
        self,
        model: str,
        request: OpenAIChatCompletionRequestWithExtraBody,
    ) -> AsyncIterator[OpenAIChatCompletionChunk]:
        # Load model, run inference, yield streaming results
        ...
Remote providers implement HTTP adapters:
class OllamaInferenceImpl(InferenceProvider):
    async def post_chat_completion(...):
        # Make HTTP requests to Ollama server
        ...
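As a rough illustration (not the actual OllamaInferenceImpl), a streaming HTTP adapter could be sketched like this, assuming httpx as the HTTP client and Ollama's public /api/chat endpoint:
# Simplified sketch of a remote adapter; not the actual OllamaInferenceImpl.
import json
from typing import AsyncIterator

import httpx


class OllamaChatAdapter:
    def __init__(self, host: str = "localhost", port: int = 11434):
        self.base_url = f"http://{host}:{port}"

    async def post_chat_completion(
        self, model: str, messages: list[dict]
    ) -> AsyncIterator[dict]:
        # Stream newline-delimited JSON chunks from the Ollama server.
        async with httpx.AsyncClient(base_url=self.base_url) as client:
            async with client.stream(
                "POST",
                "/api/chat",
                json={"model": model, "messages": messages, "stream": True},
                timeout=None,
            ) as response:
                response.raise_for_status()
                async for line in response.aiter_lines():
                    if line.strip():
                        yield json.loads(line)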
5. Core Runtime & Resolution
Stack Resolution Process
File: llama_stack/core/resolver.py
- Load Configuration → Parse run.yaml with enabled providers
- Resolve Dependencies → Build dependency graph (e.g., agents may depend on inference)
- Instantiate Providers → Create provider instances with configs
- Create Router/Routed Impls → Set up request routing
- Register Resources → Register models, shields, datasets, etc.
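The dependency-resolution step can be pictured as instantiating providers in topological order of their declared dependencies. A simplified sketch of the idea (not the actual resolver.py code; the dependency map here is illustrative):
# Simplified illustration of dependency-ordered provider instantiation.
# The real resolver in llama_stack/core/resolver.py works over ProviderSpec
# objects and routing; this sketch only shows the ordering idea.
from graphlib import TopologicalSorter

# api -> the APIs it depends on (e.g., agents needs inference and safety)
declared_deps = {
    "inference": set(),
    "safety": {"inference"},
    "tool_runtime": set(),
    "agents": {"inference", "safety", "tool_runtime"},
}

instantiated = {}
for api in TopologicalSorter(declared_deps).static_order():
    # Instantiate each provider only after everything it depends on exists,
    # then inject the dependencies (see Pattern 2: dependency injection).
    deps = {name: instantiated[name] for name in declared_deps[api]}
    instantiated[api] = f"<{api} provider wired with {sorted(deps)}>"

print(instantiated["agents"])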
The LlamaStack Class
File: llama_stack/core/stack.py
class LlamaStack(
    Providers,   # Meta API for provider management
    Inference,   # LLM inference
    Agents,      # Agent orchestration
    Safety,      # Content safety
    VectorIO,    # Vector operations
    Tools,       # Tool runtime
    Eval,        # Evaluation
    # ... 15 more APIs ...
):
    pass
This class inherits from all APIs, making a single LlamaStack instance support all functionality.
Two Client Modes
1. Library Client (In-Process)
from llama_stack import AsyncLlamaStackAsLibraryClient
client = await AsyncLlamaStackAsLibraryClient.create(run_config)
response = await client.inference.post_chat_completion(...)
File: llama_stack/core/library_client.py
2. Server Client (HTTP)
from llama_stack_client import AsyncLlamaStackClient
client = AsyncLlamaStackClient(base_url="http://localhost:8321")
response = await client.inference.post_chat_completion(...)
Uses the separate llama-stack-client package.
6. Request Routing
Two Routing Strategies
1. Auto-Routed APIs (e.g., Inference, Safety, VectorIO)
- Routing key = provider instance
- Router automatically selects provider based on resource ID
- Implementation: AutoRoutedProviderSpec → routers/ directory
# inference.post_chat_completion(model_id="meta-llama/Llama-2-7b")
# Router selects provider based on which provider has that model
Routed APIs:
- Inference, Safety, VectorIO, DatasetIO, Scoring, Eval, ToolRuntime
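Conceptually, auto-routing is a lookup from routing key (here, the model ID) to the provider that registered it. A minimal sketch, not the real router implementation:
# Illustrative-only auto-routing by resource ID (model -> provider).
from typing import Any


class InferenceRouter:
    def __init__(self, providers_by_model: dict[str, Any]):
        # routing key: which provider registered which model
        self.providers_by_model = providers_by_model

    async def post_chat_completion(self, model: str, request: dict):
        provider = self.providers_by_model.get(model)
        if provider is None:
            raise ValueError(f"No provider registered for model {model!r}")
        # Delegate to the provider that owns this model.
        return await provider.post_chat_completion(model=model, request=request)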
2. Routing Table APIs (e.g., Models, Shields, VectorStores)
- Registry APIs that list/register resources
- Implementation: RoutingTableProviderSpec → routing_tables/ directory
# models.list_models() → merged list from all providers
# models.register_model(...) → router selects provider
Registry APIs:
- Models, Shields, VectorStores, Datasets, ScoringFunctions, Benchmarks, ToolGroups
7. Distributions
What is a Distribution?
A Distribution is a pre-configured, verified bundle of providers for a specific deployment scenario.
File: llama_stack/distributions/template.py (base) → specific distros in subdirectories
Example: Starter Distribution
File: llama_stack/distributions/starter/starter.py
def get_distribution_template(name: str = "starter"):
    providers = {
        "inference": [
            "remote::ollama",
            "remote::vllm",
            "remote::openai",
            # ... others ...
        ],
        "vector_io": [
            "inline::faiss",
            "inline::sqlite-vec",
            "remote::qdrant",
            # ... others ...
        ],
        "safety": [
            "inline::llama-guard",
            "inline::code-scanner",
        ],
        # ... other APIs ...
    }
    return DistributionTemplate(
        name="starter",
        providers=providers,
        run_configs={
            "run.yaml": RunConfigSettings(...),
        },
    )
Built-in Distributions
- starter: CPU-only, multi-provider (Ollama, OpenAI, etc.)
- starter-gpu: GPU-optimized version
- meta-reference-gpu: Full Meta reference implementation
- postgres-demo: PostgreSQL-backed version
- watsonx: IBM Watson X integration
- nvidia: NVIDIA-specific optimizations
- open-benchmark: For benchmarking
Distribution Lifecycle
llama stack run starter
↓
Resolve starter distribution template
↓
Merge with run.yaml config & environment variables
↓
Build/install dependencies (if needed)
↓
Start HTTP server (Uvicorn)
↓
Initialize all providers
↓
Register resources (models, shields, etc.)
↓
Ready for requests
8. CLI Architecture
File: llama_stack/cli/
Entry Point
$ llama [subcommand] [args]
Maps to pyproject.toml:
[project.scripts]
llama = "llama_stack.cli.llama:main"
Subcommands
llama stack [command]
├── run [distro|config] [--port PORT] # Start a distribution
├── list-deps [distro] # Show dependencies to install
├── list-apis # Show all APIs
├── list-providers # Show all providers
└── list [NAME] # Show distributions
Architecture:
- llama.py - Main parser with subcommands
- stack/stack.py - Stack subcommand router
- stack/run.py - Implementation of llama stack run
- stack/list_deps.py - Dependency resolution & display
9. Testing Architecture
Location: tests/ directory
Test Types
1. Unit Tests (tests/unit/)
- Fast, isolated component testing
- Mock external dependencies
- Run with: uv run --group unit pytest tests/unit/
- Examples:
  - core/test_stack_validation.py - Config validation
  - distribution/test_distribution.py - Distribution loading
  - core/routers/test_vector_io.py - Routing logic
2. Integration Tests (tests/integration/)
- End-to-end workflows
- Record-Replay pattern: Record real API responses once, replay for fast/cheap testing
- Run with: uv run --group test pytest tests/integration/ --stack-config=starter
- Structure:
tests/integration/
├── agents/
│   ├── test_agents.py
│   ├── test_persistence.py
│   └── cassettes/        # Recorded API responses (YAML)
├── inference/
├── safety/
├── vector_io/
└── [more...]
Record-Replay System
File: llama_stack/testing/api_recorder.py
Benefits:
- Cost Control: Record real API calls once, replay thousands of times
- Speed: Cached responses = instant test execution
- Reliability: Deterministic results (no API variability)
- Provider Coverage: Same test works with OpenAI, Anthropic, Ollama, etc.
How it works:
- First run (with LLAMA_STACK_TEST_INFERENCE_MODE=record): Real API calls saved to YAML
- Subsequent runs: Load YAML and return matching responses
- CI automatically re-records when needed
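A hedged sketch of this record-replay flow (the real api_recorder.py keys and stores recordings differently; PyYAML and a per-request hash are assumptions made here for illustration):
# Simplified record-replay wrapper, for illustration only.
# Assumes recordings are keyed by a hash of the request and stored as YAML.
import hashlib
import json
import os
from pathlib import Path

import yaml

CASSETTE_DIR = Path("cassettes")
MODE = os.environ.get("LLAMA_STACK_TEST_INFERENCE_MODE", "replay")


async def record_or_replay(request: dict, call_real_api):
    key = hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()
    cassette = CASSETTE_DIR / f"{key}.yaml"
    if MODE == "record":
        response = await call_real_api(request)        # hit the real provider
        CASSETTE_DIR.mkdir(exist_ok=True)
        cassette.write_text(yaml.safe_dump(response))  # save for later replays
        return response
    return yaml.safe_load(cassette.read_text())        # deterministic replay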
Test Organization
- Common utilities: tests/common/
- External provider tests: tests/external/ (test external APIs)
- Container tests: tests/containers/ (test Docker integration)
- Conftest: pytest fixtures in each directory
10. Key Design Patterns
Pattern 1: Protocol-Based Abstraction
# API definition (protocol)
class Inference(Protocol):
    async def post_chat_completion(...) -> AsyncIterator[...]: ...

# Provider implementation
class InferenceProvider:
    async def post_chat_completion(...): ...
Pattern 2: Dependency Injection
class AgentProvider:
    def __init__(self, inference: InferenceProvider, safety: SafetyProvider):
        self.inference = inference
        self.safety = safety
Pattern 3: Configuration-Driven Instantiation
# run.yaml
agents:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      max_depth: 5
Pattern 4: Routing by Resource
# Request: inference.post_chat_completion(model="llama-2-7b")
# Router finds which provider has "llama-2-7b" and routes there
Pattern 5: Registry Pattern for Resources
# Register at startup
await models.register_model(Model(
    identifier="llama-2-7b",
    provider_id="inference::meta-reference",
    ...
))
# Later, query or filter
models_list = await models.list_models()
11. Configuration Management
Config Files
1. run.yaml - Runtime Configuration
Location: ~/.llama/distributions/{name}/run.yaml
version: 2
providers:
  inference:
    - provider_id: ollama
      provider_type: remote::ollama
      config:
        host: localhost
        port: 11434
  safety:
    - provider_id: llama-guard
      provider_type: inline::llama-guard
      config: {}
default_models:
  - identifier: llama-2-7b
    provider_id: ollama
vector_stores_config:
  default_provider_id: faiss
2. build.yaml - Build Configuration
Specifies which providers to install.
3. Environment Variables
Override config values at runtime:
INFERENCE_MODEL=llama-2-70b SAFETY_MODEL=llama-guard llama stack run starter
Config Resolution
File: llama_stack/core/utils/config_resolution.py
Order of precedence:
- Environment variables (highest)
- Runtime config (run.yaml)
- Distribution template defaults (lowest)
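A tiny sketch of that precedence rule (illustrative only; the real resolution in config_resolution.py operates on full config models):
# Illustrative precedence: env var > run.yaml > distribution template default.
import os


def resolve_value(key: str, run_yaml: dict, template_defaults: dict, env_var: str):
    if env_var in os.environ:              # 1. environment variable wins
        return os.environ[env_var]
    if key in run_yaml:                    # 2. then the runtime run.yaml
        return run_yaml[key]
    return template_defaults.get(key)      # 3. finally the template default


model = resolve_value(
    key="inference_model",
    run_yaml={"inference_model": "llama-2-7b"},
    template_defaults={"inference_model": "llama-3-8b"},
    env_var="INFERENCE_MODEL",
)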
12. Extension Points for Developers
Adding a Custom Provider
1. Create provider module:
llama_stack/providers/remote/inference/my_provider/
├── __init__.py
├── config.py         # MyProviderConfig
└── my_provider.py    # MyProviderImpl(InferenceProvider)

2. Register in registry:
# llama_stack/providers/registry/inference.py
RemoteProviderSpec(
    api=Api.inference,
    adapter_type="my_provider",
    provider_type="remote::my_provider",
    config_class="...MyProviderConfig",
    module="llama_stack.providers.remote.inference.my_provider",
)

3. Use in distribution:
providers:
  inference:
    - provider_id: my_provider
      provider_type: remote::my_provider
      config: {...}
Adding a Custom API
- Define protocol in llama_stack/apis/my_api/my_api.py
- Implement providers
- Register in resolver and distributions
- Add CLI support if needed
13. Storage & Persistence
Storage Backends
File: llama_stack/core/storage/datatypes.py
KV Store (Key-Value)
- Store metadata: models, shields, vector stores
- Backends: SQLite (inline), Redis, Postgres
SQL Store
- Store structured data: conversations, datasets
- Backends: SQLite (inline), Postgres
Inference Store
- Cache inference results for recording/replay
- Used in testing
Storage Configuration
storage:
  type: sqlite
  config:
    dir: ~/.llama/distributions/starter
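As a purely hypothetical illustration of what a backend-agnostic key-value abstraction provides (this is not the actual KVStore interface in core/storage/; the method names and the toy backend are assumptions):
# Hypothetical key-value abstraction, for illustration only; the real
# llama_stack KVStore interface differs.
from typing import Protocol


class KVStore(Protocol):
    async def get(self, key: str) -> str | None: ...
    async def set(self, key: str, value: str) -> None: ...


class InMemoryKVStore:
    """Toy in-memory backend standing in for SQLite/Redis/Postgres."""

    def __init__(self):
        self._data: dict[str, str] = {}

    async def get(self, key: str) -> str | None:
        return self._data.get(key)

    async def set(self, key: str, value: str) -> None:
        self._data[key] = value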
14. Telemetry & Tracing
Tracing System
File: llama_stack/providers/utils/telemetry/
- Automatic request tracing with OpenTelemetry
- Trace context propagation across async calls
- Integration with OpenTelemetry collectors
Telemetry API
Providers can implement the Telemetry API to collect metrics:
- Token usage
- Latency
- Error rates
- Custom metrics
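For example, a provider call can be wrapped in an OpenTelemetry span like this (generic OpenTelemetry usage, not the project's own telemetry helpers; the dict-shaped response with a usage field is an assumption):
# Generic OpenTelemetry tracing example, not llama_stack's internal helpers.
from opentelemetry import trace

tracer = trace.get_tracer("llama_stack.example")


async def traced_chat_completion(provider, model: str, request: dict):
    # Each request becomes a span; attributes carry model and token usage.
    with tracer.start_as_current_span("inference.post_chat_completion") as span:
        span.set_attribute("model", model)
        response = await provider.post_chat_completion(model=model, request=request)
        usage = response.get("usage", {})  # assumed response shape
        span.set_attribute("completion_tokens", usage.get("completion_tokens", 0))
        return response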
15. Model System
Model Registry
File: llama_stack/models/llama/sku_list.py
resolve_model("meta-llama/Llama-2-7b")
→ Llama2Model(...)
Maps model IDs to their:
- Architecture
- Tokenizer
- Quantization options
- Required resources
Supported Models
- Llama 3 - Full architecture support
- Llama 3.1 - Extended context
- Llama 3.2 - Multimodal support
- Llama 4 - Latest generation
- Custom models - Via provider registration
Model Quantization
- int8, int4
- GPTQ
- Hadamard transform
- Custom quantizers
16. Key Files to Understand
For Understanding Core Concepts
- llama_stack/core/datatypes.py - Configuration data types
- llama_stack/providers/datatypes.py - Provider specs
- llama_stack/apis/inference/inference.py - Example API
For Understanding Runtime
- llama_stack/core/stack.py - Main runtime class
- llama_stack/core/resolver.py - Dependency resolution
- llama_stack/core/library_client.py - In-process client
For Understanding Providers
- llama_stack/providers/registry/inference.py - Inference provider registry
- llama_stack/providers/inline/inference/meta_reference/inference.py - Example inline provider
- llama_stack/providers/remote/inference/openai/openai.py - Example remote provider
For Understanding Distributions
- llama_stack/distributions/template.py - Distribution template
- llama_stack/distributions/starter/starter.py - Starter distro
- llama_stack/cli/stack/run.py - Distribution startup
17. Development Workflow
Running Locally
# Install dependencies
uv sync --all-groups
# Run a distribution (auto-starts server)
llama stack run starter
# In another terminal, interact with it
curl http://localhost:8321/health
Testing
# Unit tests (fast, no external dependencies)
uv run --group unit pytest tests/unit/
# Integration tests (with record-replay)
uv run --group test pytest tests/integration/ --stack-config=starter
# Re-record integration tests (record real API calls)
LLAMA_STACK_TEST_INFERENCE_MODE=record \
uv run --group test pytest tests/integration/ --stack-config=starter
Building Distributions
# Build Starter distribution
llama stack build starter --name my-starter
# Run it
llama stack run my-starter
18. Notable Implementation Details
Async-First Architecture
- All I/O is async (using asyncio)
- Streaming responses with AsyncIterator
- FastAPI for HTTP server (built on Starlette)
Streaming Support
- Inference responses stream tokens
- Agents stream turn-by-turn updates
- Proper async context preservation
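A hedged sketch of consuming such a stream with the library client shown in section 5 (the messages and stream parameters and the chunk format are assumptions, not the exact signature):
# Consuming a streamed chat completion; chunk handling is illustrative.
import asyncio

from llama_stack import AsyncLlamaStackAsLibraryClient


async def main():
    client = await AsyncLlamaStackAsLibraryClient.create(run_config="run.yaml")  # placeholder config path
    stream = await client.inference.post_chat_completion(
        model="llama-2-7b",
        messages=[{"role": "user", "content": "Hello!"}],  # assumed parameter shape
        stream=True,
    )
    async for chunk in stream:   # AsyncIterator of completion chunks
        print(chunk, flush=True)


asyncio.run(main())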
Error Handling
- Structured errors with detailed messages
- Graceful degradation when dependencies unavailable
- Provider health checks
Extensibility
- External providers via module import
- Custom APIs via ExternalApiSpec
- Plugin discovery via provider registry
19. Typical Request Flow
User Request (e.g., chat completion)
↓
CLI or SDK Client
↓
HTTP Request → FastAPI Server (port 8321)
↓
Route Handler (e.g., /inference/chat-completion)
↓
Router (Auto-Routed API)
→ Determine which provider has the model
↓
Provider Implementation (e.g., OpenAI, Ollama, Meta Reference)
↓
External Service or Local Execution
↓
Response (streaming or complete)
↓
Send back to Client
20. Key Takeaways
- Unified APIs: Single abstraction for 27+ AI capabilities
- Pluggable Providers: 50+ implementations (inline & remote)
- Configuration-Driven: Switch providers via YAML, not code
- Distributions: Pre-verified bundles for common scenarios
- Record-Replay Testing: Cost-effective integration tests
- Two Client Modes: Library (in-process) or HTTP (distributed)
- Smart Routing: Automatic request routing to appropriate providers
- Async-First: Native streaming and concurrent request handling
- Extensible: Custom APIs and providers easily added
- Production-Ready: Health checks, telemetry, access control, storage
Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│ Client Applications │
│ (CLI, SDK, Web UI, Custom Apps) │
└────────────────────┬────────────────────────────────────────┘
│
┌───────────┴────────────┐
│ │
┌────▼────────┐ ┌───────▼──────┐
│ Library │ │ HTTP Server │
│ Client │ │ (FastAPI) │
└────┬────────┘ └───────┬──────┘
│ │
└───────────┬───────────┘
│
┌──────────▼──────────┐
│ LlamaStack Class │
│ (implements all │
│ 27 APIs) │
└──────────┬──────────┘
│
┌──────────────┼──────────────┐
│ │ │
│ Router │ Routing │ Resource
│ (Auto- │ Tables │ Registries
│ routed │ (Models, │ (Models,
│ APIs) │ Shields) │ Shields,
│ │ │ etc.)
└──────────────┼──────────────┘
│
┌────────────┴──────────────┐
│ │
┌────▼──────────┐ ┌──────────▼─────┐
│ Inline │ │ Remote │
│ Providers │ │ Providers │
│ │ │ │
│ • Meta Ref │ │ • OpenAI │
│ • FAISS │ │ • Ollama │
│ • Llama Guard │ │ • Qdrant │
│ • etc. │ │ • etc. │
│ │ │ │
└───────────────┘ └─────────────────┘
│ │
│ │
Local Execution External Services
(GPUs/CPUs) (APIs/Servers)