# Llama Stack Architecture - Comprehensive Overview

## Executive Summary

Llama Stack is a comprehensive framework for building AI applications with Llama models. It provides a **unified API layer** with a **plugin architecture for providers**, allowing developers to seamlessly switch between local and cloud-hosted implementations without changing application code. The system is organized around three main pillars: APIs (abstract interfaces), Providers (concrete implementations), and Distributions (pre-configured bundles).

---
## 1. Core Architecture Philosophy

### Separation of Concerns

- **APIs**: Define abstract interfaces for functionality (e.g., Inference, Safety, VectorIO)
- **Providers**: Implement those interfaces (inline for local, remote for external services)
- **Distributions**: Pre-configure and bundle providers for specific deployment scenarios

### Key Design Patterns

- **Plugin Architecture**: Dynamically load providers based on configuration (see the sketch after this list)
- **Dependency Injection**: Providers declare dependencies on other APIs/providers
- **Routing**: Smart routing directs requests to appropriate provider implementations
- **Configuration-Driven**: YAML-based configuration enables flexibility without code changes

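A minimal sketch of what plugin-style, configuration-driven loading amounts to (illustrative only; the real logic lives in `core/resolver.py` and `core/utils/`, and the config shape, module mapping, and factory name below are simplified assumptions):

```python
# Toy illustration of configuration-driven plugin loading: the provider is
# chosen by a string in config and its module is imported at runtime.
import importlib

RUN_CONFIG = {
    "inference": {
        # Swapping "remote::ollama" for "remote::vllm" here is the only change
        # needed to switch providers -- application code stays the same.
        "provider_type": "remote::ollama",
        "config": {"host": "localhost", "port": 11434},
    }
}

# Hypothetical mapping from provider_type to an importable module path.
PROVIDER_MODULES = {
    "remote::ollama": "llama_stack.providers.remote.inference.ollama",
    "remote::vllm": "llama_stack.providers.remote.inference.vllm",
}


def load_inference_provider(run_config: dict):
    entry = run_config["inference"]
    module = importlib.import_module(PROVIDER_MODULES[entry["provider_type"]])
    # Assumption: the provider module exposes a factory function; the real
    # entry-point name and signature differ.
    return module.get_provider_impl(entry["config"])
```
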
---
## 2. Directory Structure (`llama_stack/`)

```
llama_stack/
├── apis/                      # Abstract API definitions (27 APIs total)
│   ├── inference/             # LLM inference interface
│   ├── agents/                # Agent orchestration
│   ├── safety/                # Content filtering & safety
│   ├── vector_io/             # Vector database operations
│   ├── tools/                 # Tool/function calling runtime
│   ├── scoring/               # Response scoring
│   ├── eval/                  # Evaluation framework
│   ├── post_training/         # Fine-tuning & training
│   ├── datasetio/             # Dataset loading/management
│   ├── conversations/         # Conversation management
│   ├── common/                # Shared datatypes (SamplingParams, etc.)
│   └── [22 more...]           # Models, Shields, Benchmarks, etc.
│
├── providers/                 # Provider implementations (inline & remote)
│   ├── inline/                # In-process implementations
│   │   ├── inference/         # Meta Reference, Sentence Transformers
│   │   ├── agents/            # Agent orchestration implementations
│   │   ├── safety/            # Llama Guard, Code Scanner
│   │   ├── vector_io/         # FAISS, SQLite-vec, Milvus
│   │   ├── post_training/     # TorchTune
│   │   ├── eval/              # Evaluation implementations
│   │   ├── tool_runtime/      # RAG runtime, MCP protocol
│   │   └── [more...]
│   │
│   ├── remote/                # External service adapters
│   │   ├── inference/         # OpenAI, Anthropic, Groq, Ollama, vLLM, TGI, etc.
│   │   ├── vector_io/         # ChromaDB, Qdrant, Weaviate, Postgres
│   │   ├── safety/            # Bedrock, SambaNova, Nvidia
│   │   ├── agents/            # Sample implementations
│   │   ├── tool_runtime/      # Brave Search, Tavily, Wolfram Alpha
│   │   └── [more...]
│   │
│   ├── registry/              # Provider discovery/registration (inference.py, agents.py, etc.)
│   │   └── [one file per API listing all providers for that API]
│   │
│   ├── utils/                 # Shared provider utilities
│   │   ├── inference/         # Embedding mixin, OpenAI compat
│   │   ├── kvstore/           # Key-value store abstractions
│   │   ├── sqlstore/          # SQL storage abstractions
│   │   ├── telemetry/         # Tracing, metrics
│   │   └── [more...]
│   │
│   └── datatypes.py           # ProviderSpec, InlineProviderSpec, RemoteProviderSpec
│
├── core/                      # Core runtime & orchestration
│   ├── stack.py               # Main LlamaStack class (implements all APIs)
│   ├── datatypes.py           # Config models (StackRunConfig, Provider, etc.)
│   ├── resolver.py            # Provider resolution & dependency injection
│   ├── library_client.py      # In-process client for library usage
│   ├── build.py               # Distribution building
│   ├── configure.py           # Configuration handling
│   ├── distribution.py        # Distribution management
│   ├── routers/               # Auto-routed API implementations (route chosen by routing key)
│   ├── routing_tables/        # Manual routing tables (e.g., Models, Shields, VectorStores)
│   ├── server/                # FastAPI HTTP server setup
│   ├── storage/               # Backend storage abstractions (KVStore, SqlStore)
│   ├── utils/                 # Config resolution, dynamic imports
│   └── conversations/         # Conversation service implementation
│
├── cli/                       # Command-line interface
│   ├── llama.py               # Main entry point
│   └── stack/                 # Stack management commands
│       ├── run.py             # Start a distribution
│       ├── list_apis.py       # List available APIs
│       ├── list_providers.py  # List providers
│       ├── list_deps.py       # List dependencies
│       └── [more...]
│
├── distributions/             # Pre-configured distribution templates
│   ├── starter/               # CPU-friendly multi-provider starter
│   ├── starter-gpu/           # GPU-optimized starter
│   ├── meta-reference-gpu/    # Full-featured Meta reference
│   ├── postgres-demo/         # PostgreSQL-based demo
│   ├── template.py            # Distribution template base class
│   └── [more...]
│
├── models/                    # Llama model implementations
│   └── llama/
│       ├── llama3/            # Llama 3 implementation
│       ├── llama4/            # Llama 4 implementation
│       ├── sku_list.py        # Model registry (maps model IDs to implementations)
│       ├── checkpoint.py      # Model checkpoint handling
│       ├── datatypes.py       # ToolDefinition, StopReason, etc.
│       └── [more...]
│
├── testing/                   # Testing utilities
│   └── api_recorder.py        # Record/replay infrastructure for integration tests
│
└── ui/                        # Web UI (React/TypeScript frontend)
    ├── app/
    ├── components/
    └── pages/
```

---
## 3. API Layer (27 APIs)

### What is an API?

Each API is an abstract **protocol** (a Python `Protocol` class) that defines an interface. APIs are located in `llama_stack/apis/` with a structure like:

```
apis/inference/
├── __init__.py        # Exports the Inference protocol
├── inference.py       # Full API definition (300+ lines)
└── event_logger.py    # Supporting types
```

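To make the protocol idea concrete, here is a minimal, hypothetical sketch of the pattern (the names `ExampleInference`, `ChatMessage`, and `ChatChunk` are invented for illustration; the real protocol in `inference.py` is much larger and uses different types):

```python
# Minimal sketch of the protocol pattern used by every API. Illustrative only.
from typing import AsyncIterator, Protocol, runtime_checkable

from pydantic import BaseModel


class ChatMessage(BaseModel):
    role: str
    content: str


class ChatChunk(BaseModel):
    delta: str


@runtime_checkable
class ExampleInference(Protocol):
    """Abstract interface: providers implement it, callers depend only on it."""

    async def chat_completion(
        self, model_id: str, messages: list[ChatMessage]
    ) -> AsyncIterator[ChatChunk]:
        """Stream response chunks for the given model and messages."""
        ...
```
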
### Key APIs

#### Core Inference API
- **Path**: `llama_stack/apis/inference/inference.py`
- **Methods**: `post_chat_completion()`, `post_completion()`, `post_embedding()`, `get_models()`
- **Types**: `SamplingParams`, `SamplingStrategy` (greedy/top-p/top-k), `OpenAIChatCompletion`
- **Providers**: 30+ (OpenAI, Anthropic, Ollama, vLLM, TGI, Fireworks, etc.)

#### Agents API
- **Path**: `llama_stack/apis/agents/agents.py`
- **Methods**: `create_agent()`, `update_agent()`, `create_session()`, `agentic_loop_turn()`
- **Features**: Multi-turn conversations, tool calling, streaming
- **Providers**: Meta Reference (inline), Fireworks, Together

#### Safety API
- **Path**: `llama_stack/apis/safety/safety.py`
- **Methods**: `run_shields()` - filter content before/after inference
- **Providers**: Llama Guard (inline), AWS Bedrock, SambaNova, Nvidia

#### Vector IO API
- **Path**: `llama_stack/apis/vector_io/vector_io.py`
- **Methods**: `insert()`, `query()`, `delete()` - vector database operations
- **Providers**: FAISS, SQLite-vec, Milvus (inline); ChromaDB, Qdrant, Weaviate, PG Vector (remote)

#### Tools / Tool Runtime API
- **Path**: `llama_stack/apis/tools/tool_runtime.py`
- **Methods**: `execute_tool()` - execute functions during agent loops
- **Providers**: RAG runtime (inline), Brave Search, Tavily, Wolfram Alpha, Model Context Protocol

#### Other Major APIs
- **Post Training**: Fine-tuning & model training (HuggingFace, TorchTune, Nvidia)
- **Eval**: Evaluation frameworks (Meta Reference with autoevals)
- **Scoring**: Response scoring (Basic, LLM-as-Judge, Braintrust)
- **Datasets**: Dataset management
- **DatasetIO**: Dataset loading from HuggingFace, Nvidia, local files
- **Conversations**: Multi-turn conversation state management
- **Vector Stores**: Vector store metadata & configuration
- **Shields**: Shield (safety filter) registry
- **Models**: Model registry management
- **Batches**: Batch processing
- **Prompts**: Prompt templates & management
- **Telemetry**: Tracing & metrics collection
- **Inspect**: Introspection & debugging

---
## 4. Provider System

### Provider Types

#### 1. **Inline Providers** (`InlineProviderSpec`)
- Run in-process (same Python process as the server)
- High performance, low latency
- No network overhead
- Heavier resource requirements
- Examples: Meta Reference (inference), Llama Guard (safety), FAISS (vector IO)

**Structure**:

```python
InlineProviderSpec(
    api=Api.inference,
    provider_type="inline::meta-reference",
    module="llama_stack.providers.inline.inference.meta_reference",
    config_class="...MetaReferenceInferenceConfig",
    pip_packages=[...],
    container_image="...",  # optional, for containerization
)
```

#### 2. **Remote Providers** (`RemoteProviderSpec`)
- Connect to external services via HTTP/API
- Lower resource requirements
- Add network latency
- Cloud-based (OpenAI, Anthropic, Groq) or self-hosted (Ollama, vLLM, Qdrant)
- Examples: OpenAI, Anthropic, Groq, Ollama, Qdrant, ChromaDB

**Structure**:

```python
RemoteProviderSpec(
    api=Api.inference,
    adapter_type="openai",
    provider_type="remote::openai",
    module="llama_stack.providers.remote.inference.openai",
    config_class="...OpenAIInferenceConfig",
    pip_packages=[...],
)
```

### Provider Registration

Providers are registered in **registry files** (`llama_stack/providers/registry/`):
- `inference.py` - All inference providers (30+)
- `agents.py` - All agent providers
- `safety.py` - All safety providers
- `vector_io.py` - All vector IO providers
- `tool_runtime.py` - All tool runtime providers
- [etc.]

Each registry file has an `available_providers()` function returning a list of `ProviderSpec`.

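As a rough illustration (not a verbatim copy of the real `registry/inference.py`; the import paths follow the directory layout described above, and the `config_class` strings and `pip_packages` lists are placeholders):

```python
# Illustrative sketch of a registry module.
from llama_stack.providers.datatypes import (
    Api,
    InlineProviderSpec,
    ProviderSpec,
    RemoteProviderSpec,
)


def available_providers() -> list[ProviderSpec]:
    return [
        InlineProviderSpec(
            api=Api.inference,
            provider_type="inline::meta-reference",
            module="llama_stack.providers.inline.inference.meta_reference",
            config_class="...MetaReferenceInferenceConfig",
            pip_packages=["torch"],  # assumption: the real package list differs
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="openai",
            provider_type="remote::openai",
            module="llama_stack.providers.remote.inference.openai",
            config_class="...OpenAIInferenceConfig",
            pip_packages=["openai"],
        ),
    ]
```
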
### Provider Config

Each provider has a config class (e.g., `MetaReferenceInferenceConfig`):

```python
class MetaReferenceInferenceConfig(BaseModel):
    max_batch_size: int = 1
    enable_pydantic_sampling: bool = True

    # sample_run_config() - provides default values for testing
    # pip_packages() - lists dependencies
```

### Provider Implementation

Inline providers look like:

```python
class MetaReferenceInferenceImpl(InferenceProvider):
    async def post_chat_completion(
        self,
        model: str,
        request: OpenAIChatCompletionRequestWithExtraBody,
    ) -> AsyncIterator[OpenAIChatCompletionChunk]:
        # Load the model, run inference, yield streaming results
        ...
```

Remote providers implement HTTP adapters:

```python
class OllamaInferenceImpl(InferenceProvider):
    async def post_chat_completion(...):
        # Make HTTP requests to the Ollama server
        ...
```

---
## 5. Core Runtime & Resolution

### Stack Resolution Process

**File**: `llama_stack/core/resolver.py`

1. **Load Configuration** → Parse `run.yaml` with enabled providers
2. **Resolve Dependencies** → Build dependency graph (e.g., agents may depend on inference)
3. **Instantiate Providers** → Create provider instances with configs (see the sketch after this list)
4. **Create Router/Routed Impls** → Set up request routing
5. **Register Resources** → Register models, shields, datasets, etc.

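A rough, simplified sketch of what this resolution loop does (the real `resolver.py` also handles routing specs, optional APIs, multiple providers per API, and error cases; `instantiate_provider` and the dict shapes below are illustrative assumptions):

```python
# Simplified illustration of dependency-ordered provider instantiation.
from graphlib import TopologicalSorter


async def resolve_impls(run_config, provider_registry):
    # 1. Collect the providers enabled in run.yaml, keyed by API
    #    (assumes one provider per API for brevity).
    enabled = {
        api: provider_registry[api][p.provider_type]
        for api, providers in run_config.providers.items()
        for p in providers
    }

    # 2. Order APIs so that dependencies are instantiated first
    #    (e.g. agents after inference and safety).
    order = TopologicalSorter(
        {api: set(spec.api_dependencies) for api, spec in enabled.items()}
    ).static_order()

    # 3. Instantiate each provider, injecting the already-built dependencies.
    impls = {}
    for api in order:
        spec = enabled[api]
        deps = {dep: impls[dep] for dep in spec.api_dependencies}
        impls[api] = await instantiate_provider(spec, deps)  # hypothetical helper
    return impls
```
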
### The LlamaStack Class

**File**: `llama_stack/core/stack.py`

```python
class LlamaStack(
    Providers,   # Meta API for provider management
    Inference,   # LLM inference
    Agents,      # Agent orchestration
    Safety,      # Content safety
    VectorIO,    # Vector operations
    Tools,       # Tool runtime
    Eval,        # Evaluation
    # ... 15 more APIs ...
):
    pass
```

This class **inherits from all APIs**, making a single `LlamaStack` instance support all functionality.

### Two Client Modes

#### 1. **Library Client** (In-Process)

```python
from llama_stack import AsyncLlamaStackAsLibraryClient

client = await AsyncLlamaStackAsLibraryClient.create(run_config)
response = await client.inference.post_chat_completion(...)
```

**File**: `llama_stack/core/library_client.py`

#### 2. **Server Client** (HTTP)

```python
from llama_stack_client import AsyncLlamaStackClient

client = AsyncLlamaStackClient(base_url="http://localhost:8321")
response = await client.inference.post_chat_completion(...)
```

Uses the separate `llama-stack-client` package.

---
## 6. Request Routing

### Two Routing Strategies

#### 1. **Auto-Routed APIs** (e.g., Inference, Safety, VectorIO)
- Routing key = resource ID (e.g., a model or shield identifier), which maps to a provider instance
- The router automatically selects the provider that owns that resource (see the sketch below)
- **Implementation**: `AutoRoutedProviderSpec` → `routers/` directory

```python
# inference.post_chat_completion(model_id="meta-llama/Llama-2-7b")
# Router selects the provider based on which provider has that model
```

**Routed APIs**:
- Inference, Safety, VectorIO, DatasetIO, Scoring, Eval, ToolRuntime

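The auto-routing idea, as a minimal illustrative sketch (the real routers in `core/routers/` are generated per API and also handle registration, streaming, and error paths; `get_provider_impl` on the routing table is an assumption here):

```python
# Illustrative auto-router: look up the provider that owns the routing key
# (e.g. a model ID) and forward the call to it. Not the real implementation.
class ExampleInferenceRouter:
    def __init__(self, routing_table):
        # routing_table maps resource IDs (model IDs) to provider impls
        self.routing_table = routing_table

    async def post_chat_completion(self, model_id: str, **kwargs):
        provider = await self.routing_table.get_provider_impl(model_id)
        return await provider.post_chat_completion(model_id=model_id, **kwargs)
```
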
#### 2. **Routing Table APIs** (e.g., Models, Shields, VectorStores)
- Registry APIs that list/register resources
- **Implementation**: `RoutingTableProviderSpec` → `routing_tables/` directory

```python
# models.list_models()        → merged list from all providers
# models.register_model(...)  → router selects the provider
```

**Registry APIs**:
- Models, Shields, VectorStores, Datasets, ScoringFunctions, Benchmarks, ToolGroups

---
## 7. Distributions

### What is a Distribution?

A **Distribution** is a pre-configured, verified bundle of providers for a specific deployment scenario.

**File**: `llama_stack/distributions/template.py` (base class) → specific distros in subdirectories

### Example: Starter Distribution

**File**: `llama_stack/distributions/starter/starter.py`

```python
def get_distribution_template(name: str = "starter"):
    providers = {
        "inference": [
            "remote::ollama",
            "remote::vllm",
            "remote::openai",
            # ... others ...
        ],
        "vector_io": [
            "inline::faiss",
            "inline::sqlite-vec",
            "remote::qdrant",
            # ... others ...
        ],
        "safety": [
            "inline::llama-guard",
            "inline::code-scanner",
        ],
        # ... other APIs ...
    }
    return DistributionTemplate(
        name="starter",
        providers=providers,
        run_configs={
            "run.yaml": RunConfigSettings(...),
        },
    )
```

### Built-in Distributions

1. **starter**: CPU-only, multi-provider (Ollama, OpenAI, etc.)
2. **starter-gpu**: GPU-optimized version
3. **meta-reference-gpu**: Full Meta reference implementation
4. **postgres-demo**: PostgreSQL-backed version
5. **watsonx**: IBM watsonx integration
6. **nvidia**: NVIDIA-specific optimizations
7. **open-benchmark**: For benchmarking

### Distribution Lifecycle

```
llama stack run starter
        ↓
Resolve starter distribution template
        ↓
Merge with run.yaml config & environment variables
        ↓
Build/install dependencies (if needed)
        ↓
Start HTTP server (Uvicorn)
        ↓
Initialize all providers
        ↓
Register resources (models, shields, etc.)
        ↓
Ready for requests
```

---
## 8. CLI Architecture

**Location**: `llama_stack/cli/`

### Entry Point

```bash
$ llama [subcommand] [args]
```

Maps to **pyproject.toml**:

```toml
[project.scripts]
llama = "llama_stack.cli.llama:main"
```

### Subcommands

```
llama stack [command]
├── run [distro|config] [--port PORT]   # Start a distribution
├── list-deps [distro]                  # Show dependencies to install
├── list-apis                           # Show all APIs
├── list-providers                      # Show all providers
└── list [NAME]                         # Show distributions
```

**Architecture**:
- `llama.py` - Main parser with subcommands
- `stack/stack.py` - Stack subcommand router
- `stack/run.py` - Implementation of `llama stack run`
- `stack/list_deps.py` - Dependency resolution & display

---
## 9. Testing Architecture

**Location**: `tests/` directory

### Test Types

#### 1. **Unit Tests** (`tests/unit/`)
- Fast, isolated component testing
- Mock external dependencies
- **Run with**: `uv run --group unit pytest tests/unit/`
- **Examples**:
  - `core/test_stack_validation.py` - Config validation
  - `distribution/test_distribution.py` - Distribution loading
  - `core/routers/test_vector_io.py` - Routing logic

#### 2. **Integration Tests** (`tests/integration/`)
- End-to-end workflows
- **Record-replay pattern**: record real API responses once, replay them for fast/cheap testing
- **Run with**: `uv run --group test pytest tests/integration/ --stack-config=starter`
- **Structure**:

```
tests/integration/
├── agents/
│   ├── test_agents.py
│   ├── test_persistence.py
│   └── cassettes/        # Recorded API responses (YAML)
├── inference/
├── safety/
├── vector_io/
└── [more...]
```

### Record-Replay System

**File**: `llama_stack/testing/api_recorder.py`

**Benefits**:
- **Cost control**: Record real API calls once, replay them thousands of times
- **Speed**: Cached responses = instant test execution
- **Reliability**: Deterministic results (no API variability)
- **Provider coverage**: The same test works with OpenAI, Anthropic, Ollama, etc.

**How it works**:
1. First run (with `LLAMA_STACK_TEST_INFERENCE_MODE=record`): real API calls are saved to YAML
2. Subsequent runs: the recordings are loaded and matching responses are returned (see the sketch below)
3. CI automatically re-records when needed

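Conceptually, the recorder wraps provider calls roughly like this (an illustrative sketch only; the real `api_recorder.py` patches the provider clients, keys recordings differently, and stores richer metadata; `CASSETTE_DIR` and the hashing scheme are assumptions):

```python
# Illustrative record/replay wrapper, not the real api_recorder implementation.
import hashlib
import json
import os
from pathlib import Path

import yaml

CASSETTE_DIR = Path("cassettes")  # assumption: real storage layout differs
CASSETTE_DIR.mkdir(exist_ok=True)


async def call_with_recording(provider_call, request: dict):
    mode = os.environ.get("LLAMA_STACK_TEST_INFERENCE_MODE", "replay")
    key = hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()
    cassette = CASSETTE_DIR / f"{key}.yaml"

    if mode == "record":
        response = await provider_call(**request)      # hit the real API once
        cassette.write_text(yaml.safe_dump(response))  # persist for later runs
        return response

    # replay: return the previously recorded response without any network call
    return yaml.safe_load(cassette.read_text())
```
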
### Test Organization

- **Common utilities**: `tests/common/`
- **External provider tests**: `tests/external/` (test external APIs)
- **Container tests**: `tests/containers/` (test Docker integration)
- **Conftest**: pytest fixtures in each directory

---
## 10. Key Design Patterns

### Pattern 1: Protocol-Based Abstraction

```python
# API definition (protocol)
class Inference(Protocol):
    async def post_chat_completion(...) -> AsyncIterator[...]: ...

# Provider implementation
class InferenceProvider:
    async def post_chat_completion(...): ...
```

### Pattern 2: Dependency Injection

```python
class AgentProvider:
    def __init__(self, inference: InferenceProvider, safety: SafetyProvider):
        self.inference = inference
        self.safety = safety
```

### Pattern 3: Configuration-Driven Instantiation

```yaml
# run.yaml
agents:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      max_depth: 5
```

### Pattern 4: Routing by Resource

```python
# Request: inference.post_chat_completion(model="llama-2-7b")
# Router finds which provider has "llama-2-7b" and routes there
```

### Pattern 5: Registry Pattern for Resources

```python
# Register at startup
await models.register_model(Model(
    identifier="llama-2-7b",
    provider_id="inference::meta-reference",
    ...
))

# Later, query or filter
models_list = await models.list_models()
```

---
## 11. Configuration Management

### Config Files

#### 1. **run.yaml** - Runtime Configuration
Location: `~/.llama/distributions/{name}/run.yaml`

```yaml
version: 2
providers:
  inference:
    - provider_id: ollama
      provider_type: remote::ollama
      config:
        host: localhost
        port: 11434
  safety:
    - provider_id: llama-guard
      provider_type: inline::llama-guard
      config: {}
default_models:
  - identifier: llama-2-7b
    provider_id: ollama
vector_stores_config:
  default_provider_id: faiss
```

#### 2. **build.yaml** - Build Configuration
Specifies which providers to install.

#### 3. Environment Variables
Override config values at runtime:

```bash
INFERENCE_MODEL=llama-2-70b SAFETY_MODEL=llama-guard llama stack run starter
```

### Config Resolution

**File**: `llama_stack/core/utils/config_resolution.py`

Order of precedence (a minimal sketch follows this list):
1. Environment variables (highest)
2. Runtime config (`run.yaml`)
3. Distribution template defaults (lowest)

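The precedence rule amounts to a layered lookup, roughly like this (illustrative only; the real `config_resolution.py` resolves whole config files and paths, not single keys, and the example values are invented):

```python
# Illustrative precedence: env var > run.yaml value > template default.
import os


def resolve_setting(key: str, run_config: dict, template_defaults: dict):
    env_value = os.environ.get(key.upper())
    if env_value is not None:
        return env_value                   # 1. environment variable wins
    if key in run_config:
        return run_config[key]             # 2. then run.yaml
    return template_defaults.get(key)      # 3. finally template defaults


# Example: INFERENCE_MODEL=llama-2-70b overrides whatever run.yaml specifies.
model = resolve_setting(
    "inference_model",
    run_config={"inference_model": "llama-2-7b"},
    template_defaults={"inference_model": "llama-2-7b"},
)
```
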
---
## 12. Extension Points for Developers

### Adding a Custom Provider

1. **Create the provider module**:

```
llama_stack/providers/remote/inference/my_provider/
├── __init__.py
├── config.py          # MyProviderConfig
└── my_provider.py     # MyProviderImpl(InferenceProvider)
```

2. **Register it in the registry**:

```python
# llama_stack/providers/registry/inference.py
RemoteProviderSpec(
    api=Api.inference,
    adapter_type="my_provider",
    provider_type="remote::my_provider",
    config_class="...MyProviderConfig",
    module="llama_stack.providers.remote.inference.my_provider",
)
```

3. **Use it in a distribution**:

```yaml
providers:
  inference:
    - provider_id: my_provider
      provider_type: remote::my_provider
      config: {...}
```

### Adding a Custom API

1. Define the protocol in `llama_stack/apis/my_api/my_api.py` (see the sketch below)
2. Implement providers for it
3. Register it in the resolver and in distributions
4. Add CLI support if needed

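Step 1 might look roughly like this (a hypothetical `my_api` used purely for illustration; decorators and registration hooks used by the real APIs are omitted):

```python
# llama_stack/apis/my_api/my_api.py (hypothetical example API)
from typing import Protocol, runtime_checkable

from pydantic import BaseModel


class SummarizeRequest(BaseModel):
    document: str
    max_sentences: int = 3


class SummarizeResponse(BaseModel):
    summary: str


@runtime_checkable
class MyApi(Protocol):
    """Custom API protocol; providers implement this interface."""

    async def summarize(self, request: SummarizeRequest) -> SummarizeResponse:
        ...
```
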
---
## 13. Storage & Persistence

### Storage Backends

**File**: `llama_stack/core/storage/datatypes.py`

#### KV Store (Key-Value)
- Stores metadata: models, shields, vector stores (a minimal interface sketch follows below)
- Backends: SQLite (inline), Redis, Postgres

#### SQL Store
- Stores structured data: conversations, datasets
- Backends: SQLite (inline), Postgres

#### Inference Store
- Caches inference results for recording/replay
- Used in testing

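The KV store abstraction boils down to an interface along these lines (a minimal sketch; the actual protocol under `providers/utils/kvstore/` has more methods and different signatures, and the in-memory backend here is a toy):

```python
# Minimal key-value store interface sketch with a toy in-memory backend.
from typing import Protocol


class KVStore(Protocol):
    async def set(self, key: str, value: str) -> None: ...
    async def get(self, key: str) -> str | None: ...
    async def delete(self, key: str) -> None: ...


class InMemoryKVStore:
    """Toy backend used only to show how the interface is satisfied."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    async def set(self, key: str, value: str) -> None:
        self._data[key] = value

    async def get(self, key: str) -> str | None:
        return self._data.get(key)

    async def delete(self, key: str) -> None:
        self._data.pop(key, None)
```
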
### Storage Configuration

```yaml
storage:
  type: sqlite
  config:
    dir: ~/.llama/distributions/starter
```

---
## 14. Telemetry & Tracing

### Tracing System

**Location**: `llama_stack/providers/utils/telemetry/`

- Automatic request tracing with OpenTelemetry
- Trace context propagation across async calls
- Integration with OpenTelemetry collectors

### Telemetry API

Providers can implement the Telemetry API to collect metrics:
- Token usage
- Latency
- Error rates
- Custom metrics

---
## 15. Model System

### Model Registry

**File**: `llama_stack/models/llama/sku_list.py`

```python
resolve_model("meta-llama/Llama-2-7b")
# → Llama2Model(...)
```

Maps model IDs to their:
- Architecture
- Tokenizer
- Quantization options
- Required resources

### Supported Models

- **Llama 3** - Full architecture support
- **Llama 3.1** - Extended context
- **Llama 3.2** - Multimodal support
- **Llama 4** - Latest generation
- **Custom models** - Via provider registration

### Model Quantization

- int8, int4
- GPTQ
- Hadamard transform
- Custom quantizers

---
## 16. Key Files to Understand

### For Understanding Core Concepts
1. `llama_stack/core/datatypes.py` - Configuration data types
2. `llama_stack/providers/datatypes.py` - Provider specs
3. `llama_stack/apis/inference/inference.py` - Example API

### For Understanding Runtime
1. `llama_stack/core/stack.py` - Main runtime class
2. `llama_stack/core/resolver.py` - Dependency resolution
3. `llama_stack/core/library_client.py` - In-process client

### For Understanding Providers
1. `llama_stack/providers/registry/inference.py` - Inference provider registry
2. `llama_stack/providers/inline/inference/meta_reference/inference.py` - Example inline provider
3. `llama_stack/providers/remote/inference/openai/openai.py` - Example remote provider

### For Understanding Distributions
1. `llama_stack/distributions/template.py` - Distribution template
2. `llama_stack/distributions/starter/starter.py` - Starter distro
3. `llama_stack/cli/stack/run.py` - Distribution startup

---
## 17. Development Workflow

### Running Locally

```bash
# Install dependencies
uv sync --all-groups

# Run a distribution (auto-starts the server)
llama stack run starter

# In another terminal, interact with it
curl http://localhost:8321/health
```

### Testing

```bash
# Unit tests (fast, no external dependencies)
uv run --group unit pytest tests/unit/

# Integration tests (with record-replay)
uv run --group test pytest tests/integration/ --stack-config=starter

# Re-record integration tests (makes real API calls)
LLAMA_STACK_TEST_INFERENCE_MODE=record \
  uv run --group test pytest tests/integration/ --stack-config=starter
```

### Building Distributions

```bash
# Build the starter distribution
llama stack build starter --name my-starter

# Run it
llama stack run my-starter
```

---
## 18. Notable Implementation Details

### Async-First Architecture
- All I/O is async (using `asyncio`)
- Streaming responses with `AsyncIterator`
- FastAPI for the HTTP server (built on Starlette)

### Streaming Support
- Inference responses stream tokens
- Agents stream turn-by-turn updates
- Async context is preserved across streamed responses

### Error Handling
- Structured errors with detailed messages
- Graceful degradation when dependencies are unavailable
- Provider health checks

### Extensibility
- External providers via module import
- Custom APIs via `ExternalApiSpec`
- Plugin discovery via the provider registry

---
## 19. Typical Request Flow

```
User Request (e.g., chat completion)
        ↓
CLI or SDK Client
        ↓
HTTP Request → FastAPI Server (port 8321)
        ↓
Route Handler (e.g., /inference/chat-completion)
        ↓
Router (Auto-Routed API)
  → Determine which provider has the model
        ↓
Provider Implementation (e.g., OpenAI, Ollama, Meta Reference)
        ↓
External Service or Local Execution
        ↓
Response (streaming or complete)
        ↓
Send back to Client
```

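From the caller's side, the flow above looks roughly like the snippet below, using the HTTP client from Section 5 (the method name and streaming shape follow this document's examples and are assumptions; the shipped `llama-stack-client` SDK may expose different names and parameters, and a server started with `llama stack run starter` on port 8321 is assumed to be running):

```python
# Illustrative end-to-end call through the HTTP server.
import asyncio

from llama_stack_client import AsyncLlamaStackClient


async def main() -> None:
    client = AsyncLlamaStackClient(base_url="http://localhost:8321")

    # The router resolves "llama-2-7b" to whichever provider registered it,
    # then the provider's (possibly streaming) response comes back over HTTP.
    stream = await client.inference.post_chat_completion(
        model_id="llama-2-7b",
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    async for chunk in stream:
        print(chunk)


asyncio.run(main())
```
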
---
## 20. Key Takeaways

1. **Unified APIs**: A single abstraction for 27+ AI capabilities
2. **Pluggable Providers**: 50+ implementations (inline & remote)
3. **Configuration-Driven**: Switch providers via YAML, not code
4. **Distributions**: Pre-verified bundles for common scenarios
5. **Record-Replay Testing**: Cost-effective integration tests
6. **Two Client Modes**: Library (in-process) or HTTP (distributed)
7. **Smart Routing**: Automatic request routing to the appropriate provider
8. **Async-First**: Native streaming and concurrent request handling
9. **Extensible**: Custom APIs and providers are easily added
10. **Production-Ready**: Health checks, telemetry, access control, storage

---
## Architecture Diagram

```
┌────────────────────────────────────────┐
│          Client Applications           │
│    (CLI, SDK, Web UI, Custom Apps)     │
└───────────────────┬────────────────────┘
                    │
         ┌──────────┴──────────┐
         │                     │
┌────────▼────────┐   ┌────────▼────────┐
│     Library     │   │   HTTP Server   │
│     Client      │   │    (FastAPI)    │
└────────┬────────┘   └────────┬────────┘
         │                     │
         └──────────┬──────────┘
                    │
          ┌─────────▼─────────┐
          │  LlamaStack Class │
          │  (implements all  │
          │      27 APIs)     │
          └─────────┬─────────┘
                    │
        ┌───────────┼────────────┐
        │           │            │
     Routers     Routing     Resource
  (auto-routed   Tables     Registries
      APIs)     (Models,     (Models,
                Shields)  Shields, etc.)
        │           │            │
        └───────────┼────────────┘
                    │
         ┌──────────┴──────────┐
         │                     │
┌────────▼────────┐   ┌────────▼────────┐
│     Inline      │   │     Remote      │
│    Providers    │   │    Providers    │
│                 │   │                 │
│  • Meta Ref     │   │  • OpenAI       │
│  • FAISS        │   │  • Ollama       │
│  • Llama Guard  │   │  • Qdrant       │
│  • etc.         │   │  • etc.         │
└────────┬────────┘   └────────┬────────┘
         │                     │
  Local Execution      External Services
    (GPUs/CPUs)          (APIs/Servers)
```