# Llama Stack Architecture - Comprehensive Overview

## Executive Summary

Llama Stack is a framework for building AI applications with Llama models. It provides a **unified API layer** with a **plugin architecture for providers**, so developers can switch between local and cloud-hosted implementations without changing application code. The system is organized around three pillars: APIs (abstract interfaces), Providers (concrete implementations), and Distributions (pre-configured bundles).

---

## 1. Core Architecture Philosophy

### Separation of Concerns
- **APIs**: Define abstract interfaces for functionality (e.g., Inference, Safety, VectorIO)
- **Providers**: Implement those interfaces (inline for local, remote for external services)
- **Distributions**: Pre-configure and bundle providers for specific deployment scenarios

### Key Design Patterns
- **Plugin Architecture**: Dynamically load providers based on configuration
- **Dependency Injection**: Providers declare dependencies on other APIs/providers
- **Routing**: Smart routing directs requests to appropriate provider implementations
- **Configuration-Driven**: YAML-based configuration enables flexibility without code changes

---
## 2. Directory Structure (`llama_stack/`)

```
llama_stack/
├── apis/                     # Abstract API definitions (27 APIs total)
│   ├── inference/            # LLM inference interface
│   ├── agents/               # Agent orchestration
│   ├── safety/               # Content filtering & safety
│   ├── vector_io/            # Vector database operations
│   ├── tools/                # Tool/function calling runtime
│   ├── scoring/              # Response scoring
│   ├── eval/                 # Evaluation framework
│   ├── post_training/        # Fine-tuning & training
│   ├── datasetio/            # Dataset loading/management
│   ├── conversations/        # Conversation management
│   ├── common/               # Shared datatypes (SamplingParams, etc.)
│   └── [22 more...]          # Models, Shields, Benchmarks, etc.
│
├── providers/                # Provider implementations (inline & remote)
│   ├── inline/               # In-process implementations
│   │   ├── inference/        # Meta Reference, Sentence Transformers
│   │   ├── agents/           # Agent orchestration implementations
│   │   ├── safety/           # Llama Guard, Code Scanner
│   │   ├── vector_io/        # FAISS, SQLite-vec, Milvus
│   │   ├── post_training/    # TorchTune
│   │   ├── eval/             # Evaluation implementations
│   │   ├── tool_runtime/     # RAG runtime, MCP protocol
│   │   └── [more...]
│   │
│   ├── remote/               # External service adapters
│   │   ├── inference/        # OpenAI, Anthropic, Groq, Ollama, vLLM, TGI, etc.
│   │   ├── vector_io/        # ChromaDB, Qdrant, Weaviate, Postgres
│   │   ├── safety/           # Bedrock, SambaNova, Nvidia
│   │   ├── agents/           # Sample implementations
│   │   ├── tool_runtime/     # Brave Search, Tavily, Wolfram Alpha
│   │   └── [more...]
│   │
│   ├── registry/             # Provider discovery/registration (inference.py, agents.py, etc.)
│   │   └── [One file per API with all providers for that API]
│   │
│   ├── utils/                # Shared provider utilities
│   │   ├── inference/        # Embedding mixin, OpenAI compat
│   │   ├── kvstore/          # Key-value store abstractions
│   │   ├── sqlstore/         # SQL storage abstractions
│   │   ├── telemetry/        # Tracing, metrics
│   │   └── [more...]
│   │
│   └── datatypes.py          # ProviderSpec, InlineProviderSpec, RemoteProviderSpec
│
├── core/                     # Core runtime & orchestration
│   ├── stack.py              # Main LlamaStack class (implements all APIs)
│   ├── datatypes.py          # Config models (StackRunConfig, Provider, etc.)
│   ├── resolver.py           # Provider resolution & dependency injection
│   ├── library_client.py     # In-process client for library usage
│   ├── build.py              # Distribution building
│   ├── configure.py          # Configuration handling
│   ├── distribution.py       # Distribution management
│   ├── routers/              # Auto-routed API implementations (route inferred from routing key)
│   ├── routing_tables/       # Manual routing tables (e.g., Models, Shields, VectorStores)
│   ├── server/               # FastAPI HTTP server setup
│   ├── storage/              # Backend storage abstractions (KVStore, SqlStore)
│   ├── utils/                # Config resolution, dynamic imports
│   └── conversations/        # Conversation service implementation
│
├── cli/                      # Command-line interface
│   ├── llama.py              # Main entry point
│   └── stack/                # Stack management commands
│       ├── run.py            # Start a distribution
│       ├── list_apis.py      # List available APIs
│       ├── list_providers.py # List providers
│       ├── list_deps.py      # List dependencies
│       └── [more...]
│
├── distributions/            # Pre-configured distribution templates
│   ├── starter/              # CPU-friendly multi-provider starter
│   ├── starter-gpu/          # GPU-optimized starter
│   ├── meta-reference-gpu/   # Full-featured Meta reference
│   ├── postgres-demo/        # PostgreSQL-based demo
│   ├── template.py           # Distribution template base class
│   └── [more...]
│
├── models/                   # Llama model implementations
│   └── llama/
│       ├── llama3/           # Llama 3 implementation
│       ├── llama4/           # Llama 4 implementation
│       ├── sku_list.py       # Model registry (maps model IDs to implementations)
│       ├── checkpoint.py     # Model checkpoint handling
│       ├── datatypes.py      # ToolDefinition, StopReason, etc.
│       └── [more...]
│
├── testing/                  # Testing utilities
│   └── api_recorder.py       # Record/replay infrastructure for integration tests
│
└── ui/                       # Web UI
    ├── app/
    ├── components/
    ├── pages/
    └── [React/TypeScript frontend]
```

---
## 3. API Layer (27 APIs)

### What is an API?
Each API is an abstract **protocol** (a Python `Protocol` class) that defines an interface. APIs are located in `llama_stack/apis/` with a structure like:

```
apis/inference/
├── __init__.py       # Exports the Inference protocol
├── inference.py      # Full API definition (300+ lines)
└── event_logger.py   # Supporting types
```
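To make the protocol idea concrete, here is a minimal sketch of what an API definition of this shape can look like. The method name mirrors the Inference methods listed below, but the message and chunk types are invented for illustration; the real types and signatures in `inference.py` are richer.

```python
# Hypothetical, simplified sketch of a protocol-style API definition.
# The real Inference protocol in llama_stack/apis/inference/inference.py
# has more methods and richer request/response types.
from collections.abc import AsyncIterator
from typing import Protocol

from pydantic import BaseModel


class ChatMessage(BaseModel):
    role: str      # "user", "assistant", "system"
    content: str


class ChatCompletionChunk(BaseModel):
    delta: str     # incremental text for streaming responses


class Inference(Protocol):
    """Abstract interface; providers implement it inline or remotely."""

    def post_chat_completion(
        self,
        model: str,
        messages: list[ChatMessage],
        stream: bool = True,
    ) -> AsyncIterator[ChatCompletionChunk]:
        """Implementations are typically async generators yielding chunks."""
        ...
```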
### Key APIs

#### Core Inference API
- **Path**: `llama_stack/apis/inference/inference.py`
- **Methods**: `post_chat_completion()`, `post_completion()`, `post_embedding()`, `get_models()`
- **Types**: `SamplingParams`, `SamplingStrategy` (greedy/top-p/top-k), `OpenAIChatCompletion`
- **Providers**: 30+ (OpenAI, Claude, Ollama, vLLM, TGI, Fireworks, etc.)

#### Agents API
- **Path**: `llama_stack/apis/agents/agents.py`
- **Methods**: `create_agent()`, `update_agent()`, `create_session()`, `agentic_loop_turn()`
- **Features**: Multi-turn conversations, tool calling, streaming
- **Providers**: Meta Reference (inline), Fireworks, Together

#### Safety API
- **Path**: `llama_stack/apis/safety/safety.py`
- **Methods**: `run_shields()` - filter content before/after inference
- **Providers**: Llama Guard (inline), AWS Bedrock, SambaNova, Nvidia

#### Vector IO API
- **Path**: `llama_stack/apis/vector_io/vector_io.py`
- **Methods**: `insert()`, `query()`, `delete()` - vector database operations
- **Providers**: FAISS, SQLite-vec, Milvus (inline), ChromaDB, Qdrant, Weaviate, PG Vector (remote)

#### Tools / Tool Runtime API
- **Path**: `llama_stack/apis/tools/tool_runtime.py`
- **Methods**: `execute_tool()` - execute functions during agent loops
- **Providers**: RAG runtime (inline), Brave Search, Tavily, Wolfram Alpha, Model Context Protocol

#### Other Major APIs
- **Post Training**: Fine-tuning & model training (HuggingFace, TorchTune, Nvidia)
- **Eval**: Evaluation frameworks (Meta Reference with autoevals)
- **Scoring**: Response scoring (Basic, LLM-as-Judge, Braintrust)
- **Datasets**: Dataset management
- **DatasetIO**: Dataset loading from HuggingFace, Nvidia, local files
- **Conversations**: Multi-turn conversation state management
- **Vector Stores**: Vector store metadata & configuration
- **Shields**: Shield (safety filter) registry
- **Models**: Model registry management
- **Batches**: Batch processing
- **Prompts**: Prompt templates & management
- **Telemetry**: Tracing & metrics collection
- **Inspect**: Introspection & debugging

---
## 4. Provider System

### Provider Types

#### 1. **Inline Providers** (`InlineProviderSpec`)
- Run in-process (same Python process as the server)
- High performance, low latency
- No network overhead
- Heavier resource requirements
- Examples: Meta Reference (inference), Llama Guard (safety), FAISS (vector IO)

**Structure**:
```python
InlineProviderSpec(
    api=Api.inference,
    provider_type="inline::meta-reference",
    module="llama_stack.providers.inline.inference.meta_reference",
    config_class="...MetaReferenceInferenceConfig",
    pip_packages=[...],
    container_image="..."  # Optional for containerization
)
```

#### 2. **Remote Providers** (`RemoteProviderSpec`)
- Connect to external services via HTTP/API
- Lower resource requirements
- Network latency
- Cloud-based (OpenAI, Anthropic, Groq) or self-hosted (Ollama, vLLM, Qdrant)
- Examples: OpenAI, Anthropic, Groq, Ollama, Qdrant, ChromaDB

**Structure**:
```python
RemoteProviderSpec(
    api=Api.inference,
    adapter_type="openai",
    provider_type="remote::openai",
    module="llama_stack.providers.remote.inference.openai",
    config_class="...OpenAIInferenceConfig",
    pip_packages=[...]
)
```

### Provider Registration

Providers are registered in **registry files** (`llama_stack/providers/registry/`):
- `inference.py` - All inference providers (30+)
- `agents.py` - All agent providers
- `safety.py` - All safety providers
- `vector_io.py` - All vector IO providers
- `tool_runtime.py` - All tool runtime providers
- [etc.]

Each registry file has an `available_providers()` function returning a list of `ProviderSpec`.
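As an illustration of that registry shape, here is a hedged sketch of what an `available_providers()` function could look like. The spec fields mirror the `InlineProviderSpec`/`RemoteProviderSpec` examples above, but the import path, fully qualified config classes, and `pip_packages` values are assumptions; the real entries in `llama_stack/providers/registry/inference.py` are far more numerous and may use different arguments.

```python
# Hypothetical, trimmed-down registry file for the Inference API.
# Import path follows the providers/datatypes.py layout described above.
from llama_stack.providers.datatypes import (
    Api,
    InlineProviderSpec,
    ProviderSpec,
    RemoteProviderSpec,
)


def available_providers() -> list[ProviderSpec]:
    # One entry per provider; the resolver picks the ones named in run.yaml.
    return [
        InlineProviderSpec(
            api=Api.inference,
            provider_type="inline::meta-reference",
            module="llama_stack.providers.inline.inference.meta_reference",
            config_class="llama_stack.providers.inline.inference.meta_reference.MetaReferenceInferenceConfig",
            pip_packages=["torch"],          # illustrative dependency list
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="ollama",
            provider_type="remote::ollama",
            module="llama_stack.providers.remote.inference.ollama",
            config_class="llama_stack.providers.remote.inference.ollama.OllamaImplConfig",
            pip_packages=["ollama"],         # illustrative dependency list
        ),
    ]
```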
### Provider Config

Each provider has a config class (e.g., `MetaReferenceInferenceConfig`):
```python
class MetaReferenceInferenceConfig(BaseModel):
    max_batch_size: int = 1
    enable_pydantic_sampling: bool = True

    # sample_run_config() - provides default values for testing
    # pip_packages() - lists dependencies
```

### Provider Implementation

Inline providers look like:
```python
class MetaReferenceInferenceImpl(InferenceProvider):
    async def post_chat_completion(
        self,
        model: str,
        request: OpenAIChatCompletionRequestWithExtraBody,
    ) -> AsyncIterator[OpenAIChatCompletionChunk]:
        # Load the model, run inference, yield streaming results
        ...
```

Remote providers implement HTTP adapters:
```python
class OllamaInferenceImpl(InferenceProvider):
    async def post_chat_completion(self, model, request):
        # Make HTTP requests to the Ollama server
        ...
```

---
## 5. Core Runtime & Resolution

### Stack Resolution Process

**File**: `llama_stack/core/resolver.py`

The resolver proceeds in five steps (steps 2-3 are sketched below):
1. **Load Configuration** → Parse `run.yaml` with enabled providers
2. **Resolve Dependencies** → Build dependency graph (e.g., agents may depend on inference)
3. **Instantiate Providers** → Create provider instances with configs
4. **Create Router/Routed Impls** → Set up request routing
5. **Register Resources** → Register models, shields, datasets, etc.
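Dependency resolution amounts to instantiating providers so that each one's dependencies already exist when it is constructed. Below is a hedged sketch of that idea using a simple topological sort; it is not the code in `llama_stack/core/resolver.py`, and the `deps`/`factories` shapes are assumptions made for illustration.

```python
# Illustrative only: build providers in dependency order, roughly what
# a resolver must do before wiring up routers and registering resources.
from graphlib import TopologicalSorter
from typing import Any, Callable

# api name -> list of api names it depends on (assumed example data)
deps: dict[str, list[str]] = {
    "inference": [],
    "tool_runtime": [],
    "safety": ["inference"],
    "agents": ["inference", "safety", "tool_runtime"],
}

# api name -> factory that receives its already-built dependencies (assumed)
factories: dict[str, Callable[[dict[str, Any]], Any]] = {
    api: (lambda resolved, api=api: f"<{api} impl>") for api in deps
}

impls: dict[str, Any] = {}
for api in TopologicalSorter(deps).static_order():
    # Pass only the dependencies this provider declared.
    resolved = {dep: impls[dep] for dep in deps[api]}
    impls[api] = factories[api](resolved)

print(impls)  # inference and tool_runtime come first, agents last
```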
### The LlamaStack Class

**File**: `llama_stack/core/stack.py`

```python
class LlamaStack(
    Providers,   # Meta API for provider management
    Inference,   # LLM inference
    Agents,      # Agent orchestration
    Safety,      # Content safety
    VectorIO,    # Vector operations
    Tools,       # Tool runtime
    Eval,        # Evaluation
    # ... 15 more APIs ...
):
    pass
```

This class **inherits from all APIs**, so a single `LlamaStack` instance supports all functionality.

### Two Client Modes

#### 1. **Library Client** (In-Process)
```python
from llama_stack import AsyncLlamaStackAsLibraryClient

client = await AsyncLlamaStackAsLibraryClient.create(run_config)
response = await client.inference.post_chat_completion(...)
```
**File**: `llama_stack/core/library_client.py`

#### 2. **Server Client** (HTTP)
```python
from llama_stack_client import AsyncLlamaStackClient

client = AsyncLlamaStackClient(base_url="http://localhost:8321")
response = await client.inference.post_chat_completion(...)
```
Uses the separate `llama-stack-client` package.

---

## 6. Request Routing

### Two Routing Strategies

#### 1. **Auto-Routed APIs** (e.g., Inference, Safety, VectorIO)
- The routing key (a resource ID such as a model name) maps to a provider instance
- The router automatically selects a provider based on that resource ID (see the sketch after this list)
- **Implementation**: `AutoRoutedProviderSpec` → `routers/` directory

```python
# inference.post_chat_completion(model_id="meta-llama/Llama-2-7b")
# Router selects provider based on which provider has that model
```

**Routed APIs**:
- Inference, Safety, VectorIO, DatasetIO, Scoring, Eval, ToolRuntime
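To make auto-routing concrete, here is a minimal sketch of a router that keeps a routing table from model ID to provider and forwards calls accordingly. This is an illustration of the pattern, not the implementation in `core/routers/`; the class name and `register_model` helper are assumptions, and only `post_chat_completion` comes from the API described above.

```python
# Illustrative auto-router: delegate to the provider that owns the routing key.
from typing import Any


class InferenceRouter:
    def __init__(self) -> None:
        # routing key (model ID) -> provider implementation
        self._routing_table: dict[str, Any] = {}

    def register_model(self, model_id: str, provider: Any) -> None:
        self._routing_table[model_id] = provider

    async def post_chat_completion(self, model_id: str, request: Any) -> Any:
        try:
            provider = self._routing_table[model_id]
        except KeyError:
            raise ValueError(f"No provider registered for model {model_id!r}")
        # Forward the call to whichever provider registered this model.
        return await provider.post_chat_completion(model_id, request)
```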
#### 2. **Routing Table APIs** (e.g., Models, Shields, VectorStores)
- Registry APIs that list/register resources
- **Implementation**: `RoutingTableProviderSpec` → `routing_tables/` directory

```python
# models.list_models() → merged list from all providers
# models.register_model(...) → router selects provider
```

**Registry APIs**:
- Models, Shields, VectorStores, Datasets, ScoringFunctions, Benchmarks, ToolGroups

---
## 7. Distributions

### What is a Distribution?

A **Distribution** is a pre-configured, verified bundle of providers for a specific deployment scenario.

**File**: `llama_stack/distributions/template.py` (base) → specific distros in subdirectories

### Example: Starter Distribution

**File**: `llama_stack/distributions/starter/starter.py`

```python
def get_distribution_template(name: str = "starter"):
    providers = {
        "inference": [
            "remote::ollama",
            "remote::vllm",
            "remote::openai",
            # ... others ...
        ],
        "vector_io": [
            "inline::faiss",
            "inline::sqlite-vec",
            "remote::qdrant",
            # ... others ...
        ],
        "safety": [
            "inline::llama-guard",
            "inline::code-scanner",
        ],
        # ... other APIs ...
    }
    return DistributionTemplate(
        name="starter",
        providers=providers,
        run_configs={
            "run.yaml": RunConfigSettings(...),
        },
    )
```

### Built-in Distributions

1. **starter**: CPU-only, multi-provider (Ollama, OpenAI, etc.)
2. **starter-gpu**: GPU-optimized version
3. **meta-reference-gpu**: Full Meta reference implementation
4. **postgres-demo**: PostgreSQL-backed version
5. **watsonx**: IBM watsonx integration
6. **nvidia**: NVIDIA-specific optimizations
7. **open-benchmark**: For benchmarking

### Distribution Lifecycle

```
llama stack run starter
        ↓
Resolve starter distribution template
        ↓
Merge with run.yaml config & environment variables
        ↓
Build/install dependencies (if needed)
        ↓
Start HTTP server (Uvicorn)
        ↓
Initialize all providers
        ↓
Register resources (models, shields, etc.)
        ↓
Ready for requests
```

---
## 8. CLI Architecture

**File**: `llama_stack/cli/`

### Entry Point

```bash
$ llama [subcommand] [args]
```

Maps to **pyproject.toml**:
```toml
[project.scripts]
llama = "llama_stack.cli.llama:main"
```

### Subcommands

```
llama stack [command]
├── run [distro|config] [--port PORT]   # Start a distribution
├── list-deps [distro]                  # Show dependencies to install
├── list-apis                           # Show all APIs
├── list-providers                      # Show all providers
└── list [NAME]                         # Show distributions
```

**Architecture**:
- `llama.py` - Main parser with subcommands
- `stack/stack.py` - Stack subcommand router
- `stack/run.py` - Implementation of `llama stack run`
- `stack/list_deps.py` - Dependency resolution & display

---
## 9. Testing Architecture

**Location**: `tests/` directory

### Test Types

#### 1. **Unit Tests** (`tests/unit/`)
- Fast, isolated component testing
- Mock external dependencies
- **Run with**: `uv run --group unit pytest tests/unit/`
- **Examples**:
  - `core/test_stack_validation.py` - Config validation
  - `distribution/test_distribution.py` - Distribution loading
  - `core/routers/test_vector_io.py` - Routing logic

#### 2. **Integration Tests** (`tests/integration/`)
- End-to-end workflows
- **Record-replay pattern**: Record real API responses once, replay for fast/cheap testing
- **Run with**: `uv run --group test pytest tests/integration/ --stack-config=starter`
- **Structure**:
  ```
  tests/integration/
  ├── agents/
  │   ├── test_agents.py
  │   ├── test_persistence.py
  │   └── cassettes/        # Recorded API responses (YAML)
  ├── inference/
  ├── safety/
  ├── vector_io/
  └── [more...]
  ```

### Record-Replay System

**File**: `llama_stack/testing/api_recorder.py`

**Benefits**:
- **Cost control**: Record real API calls once, replay thousands of times
- **Speed**: Cached responses = instant test execution
- **Reliability**: Deterministic results (no API variability)
- **Provider coverage**: The same test works with OpenAI, Anthropic, Ollama, etc.

**How it works** (sketched below):
1. First run (with `LLAMA_STACK_TEST_INFERENCE_MODE=record`): real API calls are saved to YAML
2. Subsequent runs: load the YAML and return matching responses
3. CI automatically re-records when needed
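As a rough illustration of the record-replay idea (not the actual `api_recorder.py` implementation), the sketch below caches responses keyed by a hash of the request and switches behavior on the environment variable mentioned above. The file layout, key format, function names, and use of JSON (the real cassettes are YAML per the structure shown earlier) are all simplifying assumptions.

```python
# Illustrative record/replay cache for API calls, keyed by request hash.
import hashlib
import json
import os
from pathlib import Path
from typing import Any, Callable

CASSETTE_DIR = Path("cassettes")  # assumed location for recorded responses


def _key(request: dict[str, Any]) -> str:
    canonical = json.dumps(request, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def call_with_recording(
    request: dict[str, Any],
    do_real_call: Callable[[dict[str, Any]], Any],
) -> Any:
    mode = os.environ.get("LLAMA_STACK_TEST_INFERENCE_MODE", "replay")
    path = CASSETTE_DIR / f"{_key(request)}.json"

    if mode == "record":
        response = do_real_call(request)           # hit the real provider once
        CASSETTE_DIR.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(response))
        return response

    # replay: return the previously recorded response, deterministically
    return json.loads(path.read_text())
```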
### Test Organization

- **Common utilities**: `tests/common/`
- **External provider tests**: `tests/external/` (test external APIs)
- **Container tests**: `tests/containers/` (test Docker integration)
- **Conftest**: pytest fixtures in each directory

---
## 10. Key Design Patterns

### Pattern 1: Protocol-Based Abstraction
```python
# API definition (protocol)
class Inference(Protocol):
    async def post_chat_completion(self, request) -> AsyncIterator[...]: ...

# Provider implementation
class InferenceProvider:
    async def post_chat_completion(self, request): ...
```

### Pattern 2: Dependency Injection
```python
class AgentProvider:
    def __init__(self, inference: InferenceProvider, safety: SafetyProvider):
        self.inference = inference
        self.safety = safety
```

### Pattern 3: Configuration-Driven Instantiation
```yaml
# run.yaml
agents:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      max_depth: 5
```

### Pattern 4: Routing by Resource
```python
# Request: inference.post_chat_completion(model="llama-2-7b")
# Router finds which provider has "llama-2-7b" and routes there
```

### Pattern 5: Registry Pattern for Resources
```python
# Register at startup
await models.register_model(Model(
    identifier="llama-2-7b",
    provider_id="inference::meta-reference",
    # ...
))

# Later, query or filter
models_list = await models.list_models()
```

---
## 11. Configuration Management

### Config Files

#### 1. **run.yaml** - Runtime Configuration
Location: `~/.llama/distributions/{name}/run.yaml`

```yaml
version: 2
providers:
  inference:
    - provider_id: ollama
      provider_type: remote::ollama
      config:
        host: localhost
        port: 11434
  safety:
    - provider_id: llama-guard
      provider_type: inline::llama-guard
      config: {}
default_models:
  - identifier: llama-2-7b
    provider_id: ollama
vector_stores_config:
  default_provider_id: faiss
```

#### 2. **build.yaml** - Build Configuration
Specifies which providers to install.

#### 3. Environment Variables
Override config values at runtime:
```bash
INFERENCE_MODEL=llama-2-70b SAFETY_MODEL=llama-guard llama stack run starter
```

### Config Resolution

**File**: `llama_stack/core/utils/config_resolution.py`

Order of precedence (the merge is sketched below):
1. Environment variables (highest)
2. Runtime config (run.yaml)
3. Distribution template defaults (lowest)
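Here is a hedged sketch of that precedence expressed as a dictionary merge; it only illustrates the ordering described above, not the real logic in `config_resolution.py`, and the key names and the `LLAMA_STACK_PORT` variable are assumptions.

```python
# Illustrative precedence merge: later sources override earlier ones.
import os

template_defaults = {"inference_model": "llama-2-7b", "port": 8321}  # lowest
run_yaml_config = {"inference_model": "llama-2-13b"}                 # middle

# Highest: environment variables (only the keys that are actually set).
env_overrides = {
    key: os.environ[var]
    for key, var in {
        "inference_model": "INFERENCE_MODEL",
        "port": "LLAMA_STACK_PORT",
    }.items()
    if var in os.environ
}

effective = {**template_defaults, **run_yaml_config, **env_overrides}
print(effective)  # env beats run.yaml, which beats the template defaults
```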
---

## 12. Extension Points for Developers

### Adding a Custom Provider

1. **Create provider module**:
   ```
   llama_stack/providers/remote/inference/my_provider/
   ├── __init__.py
   ├── config.py         # MyProviderConfig
   └── my_provider.py    # MyProviderImpl(InferenceProvider)
   ```

2. **Register in registry**:
   ```python
   # llama_stack/providers/registry/inference.py
   RemoteProviderSpec(
       api=Api.inference,
       adapter_type="my_provider",
       provider_type="remote::my_provider",
       config_class="...MyProviderConfig",
       module="llama_stack.providers.remote.inference.my_provider",
   )
   ```

3. **Use in a distribution**:
   ```yaml
   providers:
     inference:
       - provider_id: my_provider
         provider_type: remote::my_provider
         config: {...}
   ```

### Adding a Custom API

1. Define the protocol in `llama_stack/apis/my_api/my_api.py`
2. Implement providers
3. Register in the resolver and distributions
4. Add CLI support if needed

---
## 13. Storage & Persistence

### Storage Backends

**File**: `llama_stack/core/storage/datatypes.py`

#### KV Store (Key-Value)
- Stores metadata: models, shields, vector stores (interface sketched below)
- Backends: SQLite (inline), Redis, Postgres

#### SQL Store
- Stores structured data: conversations, datasets
- Backends: SQLite (inline), Postgres

#### Inference Store
- Caches inference results for recording/replay
- Used in testing
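To illustrate what a key-value storage abstraction of this kind typically looks like, here is a minimal sketch with one in-memory backend. It is not the actual interface in `llama_stack/core/storage/` or `providers/utils/kvstore/`; the method names and types are assumptions.

```python
# Illustrative key-value store abstraction with a trivial in-memory backend.
from typing import Optional, Protocol


class KVStore(Protocol):
    async def get(self, key: str) -> Optional[str]: ...
    async def set(self, key: str, value: str) -> None: ...
    async def delete(self, key: str) -> None: ...


class InMemoryKVStore:
    """Backend used only for illustration; real backends include SQLite, Redis, Postgres."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    async def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    async def set(self, key: str, value: str) -> None:
        self._data[key] = value

    async def delete(self, key: str) -> None:
        self._data.pop(key, None)
```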
### Storage Configuration

```yaml
storage:
  type: sqlite
  config:
    dir: ~/.llama/distributions/starter
```

---

## 14. Telemetry & Tracing

### Tracing System

**File**: `llama_stack/providers/utils/telemetry/`

- Automatic request tracing with OpenTelemetry
- Trace context propagation across async calls
- Integration with OpenTelemetry collectors

### Telemetry API

Providers can implement the Telemetry API to collect metrics (a tracing sketch follows below):
- Token usage
- Latency
- Error rates
- Custom metrics
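As a small illustration of the OpenTelemetry-style tracing mentioned above (not Llama Stack's own telemetry utilities), a provider call can be wrapped in a span and annotated with attributes; the span name, attribute keys, and the wrapper function are assumptions.

```python
# Illustrative tracing of an inference call with the OpenTelemetry API.
from opentelemetry import trace

tracer = trace.get_tracer("llama_stack.example")


async def traced_chat_completion(provider, model_id: str, request):
    # The span context propagates across awaits, so nested provider calls
    # show up as children of this span when exported to a collector.
    with tracer.start_as_current_span("inference.post_chat_completion") as span:
        span.set_attribute("model_id", model_id)
        response = await provider.post_chat_completion(model_id, request)
        span.set_attribute("completion.received", True)  # illustrative attribute
        return response
```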
---

## 15. Model System

### Model Registry

**File**: `llama_stack/models/llama/sku_list.py`

```python
resolve_model("meta-llama/Llama-2-7b")
# → Llama2Model(...)
```

Maps model IDs (sketched below) to their:
- Architecture
- Tokenizer
- Quantization options
- Required resources
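A hedged sketch of what such a registry lookup can look like follows; the dataclass fields and the example values are invented for illustration and do not mirror the real entries in `sku_list.py`.

```python
# Illustrative model registry: map a model ID to its metadata.
from dataclasses import dataclass


@dataclass
class ModelSku:
    model_id: str
    architecture: str
    tokenizer: str
    quantization_options: tuple[str, ...]
    min_gpu_memory_gb: int  # stand-in for "required resources"


_SKU_LIST = [
    ModelSku("meta-llama/Llama-2-7b", "llama2", "sentencepiece", ("int8", "int4"), 16),
    ModelSku("meta-llama/Llama-3.1-8B", "llama3", "tiktoken", ("int8",), 20),
]


def resolve_model(model_id: str) -> ModelSku | None:
    """Return the registry entry for a model ID, or None if unknown."""
    return next((sku for sku in _SKU_LIST if sku.model_id == model_id), None)
```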
### Supported Models

- **Llama 3** - Full architecture support
- **Llama 3.1** - Extended context
- **Llama 3.2** - Multimodal support
- **Llama 4** - Latest generation
- **Custom models** - Via provider registration

### Model Quantization

- int8, int4
- GPTQ
- Hadamard transform
- Custom quantizers

---
## 16. Key Files to Understand

### For Understanding Core Concepts
1. `llama_stack/core/datatypes.py` - Configuration data types
2. `llama_stack/providers/datatypes.py` - Provider specs
3. `llama_stack/apis/inference/inference.py` - Example API

### For Understanding Runtime
1. `llama_stack/core/stack.py` - Main runtime class
2. `llama_stack/core/resolver.py` - Dependency resolution
3. `llama_stack/core/library_client.py` - In-process client

### For Understanding Providers
1. `llama_stack/providers/registry/inference.py` - Inference provider registry
2. `llama_stack/providers/inline/inference/meta_reference/inference.py` - Example inline provider
3. `llama_stack/providers/remote/inference/openai/openai.py` - Example remote provider

### For Understanding Distributions
1. `llama_stack/distributions/template.py` - Distribution template
2. `llama_stack/distributions/starter/starter.py` - Starter distro
3. `llama_stack/cli/stack/run.py` - Distribution startup

---
## 17. Development Workflow

### Running Locally

```bash
# Install dependencies
uv sync --all-groups

# Run a distribution (auto-starts server)
llama stack run starter

# In another terminal, interact with it
curl http://localhost:8321/health
```

### Testing

```bash
# Unit tests (fast, no external dependencies)
uv run --group unit pytest tests/unit/

# Integration tests (with record-replay)
uv run --group test pytest tests/integration/ --stack-config=starter

# Re-record integration tests (record real API calls)
LLAMA_STACK_TEST_INFERENCE_MODE=record \
  uv run --group test pytest tests/integration/ --stack-config=starter
```

### Building Distributions

```bash
# Build the starter distribution
llama stack build starter --name my-starter

# Run it
llama stack run my-starter
```

---
## 18. Notable Implementation Details

### Async-First Architecture
- All I/O is async (using `asyncio`)
- Streaming responses with `AsyncIterator`
- FastAPI for the HTTP server (built on Starlette)

### Streaming Support
- Inference responses stream tokens
- Agents stream turn-by-turn updates
- Async context is preserved across the stream (see the sketch below)
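As a brief illustration of consuming such an `AsyncIterator` of streamed chunks, here is a hedged sketch. The `post_chat_completion` call mirrors the client examples earlier in this document, but the `stream` parameter and the chunk's `delta` attribute are assumptions.

```python
# Illustrative consumption of a streaming chat completion.
async def print_stream(client, model_id: str, messages) -> str:
    """Accumulate streamed deltas into the full response text."""
    full_text = []
    # The call returns an async iterator that yields chunks as tokens arrive.
    async for chunk in await client.inference.post_chat_completion(
        model=model_id, messages=messages, stream=True
    ):
        full_text.append(chunk.delta)          # attribute name is illustrative
        print(chunk.delta, end="", flush=True)
    return "".join(full_text)
```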
### Error Handling
- Structured errors with detailed messages
- Graceful degradation when dependencies are unavailable
- Provider health checks

### Extensibility
- External providers via module import
- Custom APIs via `ExternalApiSpec`
- Plugin discovery via the provider registry

---
## 19. Typical Request Flow

```
User Request (e.g., chat completion)
        ↓
CLI or SDK Client
        ↓
HTTP Request → FastAPI Server (port 8321)
        ↓
Route Handler (e.g., /inference/chat-completion)
        ↓
Router (Auto-Routed API)
  → Determine which provider has the model
        ↓
Provider Implementation (e.g., OpenAI, Ollama, Meta Reference)
        ↓
External Service or Local Execution
        ↓
Response (streaming or complete)
        ↓
Send back to Client
```
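Tracing that flow from the client side, a hedged end-to-end sketch using the HTTP client from section 5 might look like the following; the message format is an assumption and the exact method signature of the installed `llama-stack-client` package may differ from the `post_chat_completion` name used throughout this document.

```python
# Illustrative end-to-end request against a running distribution on port 8321.
import asyncio

from llama_stack_client import AsyncLlamaStackClient


async def main() -> None:
    client = AsyncLlamaStackClient(base_url="http://localhost:8321")

    # The server routes this to whichever provider serves the requested model.
    response = await client.inference.post_chat_completion(
        model="llama-2-7b",
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(response)


if __name__ == "__main__":
    asyncio.run(main())
```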
---

## 20. Key Takeaways

1. **Unified APIs**: Single abstraction for 27+ AI capabilities
2. **Pluggable Providers**: 50+ implementations (inline & remote)
3. **Configuration-Driven**: Switch providers via YAML, not code
4. **Distributions**: Pre-verified bundles for common scenarios
5. **Record-Replay Testing**: Cost-effective integration tests
6. **Two Client Modes**: Library (in-process) or HTTP (distributed)
7. **Smart Routing**: Automatic request routing to appropriate providers
8. **Async-First**: Native streaming and concurrent request handling
9. **Extensible**: Custom APIs and providers easily added
10. **Production-Ready**: Health checks, telemetry, access control, storage

---

## Architecture Diagram

```
┌───────────────────────────────────────────────┐
│              Client Applications              │
│        (CLI, SDK, Web UI, Custom Apps)        │
└───────────────────────┬───────────────────────┘
                        │
            ┌───────────┴────────────┐
            │                        │
    ┌───────▼───────┐        ┌───────▼───────┐
    │    Library    │        │  HTTP Server  │
    │    Client     │        │   (FastAPI)   │
    └───────┬───────┘        └───────┬───────┘
            │                        │
            └───────────┬────────────┘
                        │
             ┌──────────▼──────────┐
             │  LlamaStack Class   │
             │  (implements all    │
             │      27 APIs)       │
             └──────────┬──────────┘
                        │
         ┌──────────────┼──────────────┐
         │              │              │
      Routers      Routing Tables    Resource
   (auto-routed      (Models,       Registries
       APIs)         Shields)     (Models, etc.)
         │              │              │
         └──────────────┼──────────────┘
                        │
            ┌───────────┴────────────┐
            │                        │
    ┌───────▼───────┐        ┌───────▼───────┐
    │    Inline     │        │    Remote     │
    │   Providers   │        │   Providers   │
    │               │        │               │
    │ • Meta Ref    │        │ • OpenAI      │
    │ • FAISS       │        │ • Ollama      │
    │ • Llama Guard │        │ • Qdrant      │
    │ • etc.        │        │ • etc.        │
    └───────┬───────┘        └───────┬───────┘
            │                        │
     Local Execution         External Services
       (GPUs/CPUs)            (APIs/Servers)
```