fix: UI bug fixes and comprehensive architecture documentation

- Fixed Agent Instructions overflow by adding vertical scrolling
- Fixed duplicate chat content by skipping turn_complete events
- Added comprehensive architecture documentation (4 files)
- Added UI bug fixes documentation
- Added Notion API upgrade analysis
- Created documentation registry for Notion pages

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Antony Sallas 2025-10-27 13:34:02 +08:00
parent 3059423cd7
commit 5ef6ccf90e
8 changed files with 2603 additions and 1 deletion

ARCHITECTURE_INDEX.md Normal file

@ -0,0 +1,208 @@
# Llama Stack Architecture - Documentation Index
This directory contains comprehensive architecture documentation for the Llama Stack codebase. These documents were created through a thorough exploration of the entire codebase and are designed to help developers understand the "big picture" without having to read through dozens of source files.
## Documentation Files
### 1. ARCHITECTURE_SUMMARY.md (30 KB)
**Comprehensive technical reference covering all major components**
Start here for a complete overview. Covers:
- Core architecture philosophy and design patterns
- Complete directory structure with descriptions
- All 27 APIs with their purposes and providers
- Provider system (inline vs remote)
- Core runtime and resolution process
- Request routing mechanisms
- Distribution system
- CLI architecture
- Testing architecture with record-replay
- Storage and telemetry systems
- Key files to understand
**Best for**: Getting the complete picture, reference material
### 2. ARCHITECTURE_INSIGHTS.md (16 KB)
**Strategic insights into why design decisions were made**
Explains the reasoning and elegance of the architecture. Covers:
- Why this architecture works (problems it solves)
- The genius of the plugin system
- Request routing intelligence
- Configuration as a weapon
- Distributions strategy
- Testing genius (record-replay)
- Core runtime elegance
- Dependency injection
- Client duality (library vs server)
- Extension points
- Performance implications
- Security considerations
- Maturity indicators
- Key architectural decisions
- Learning path for contributors
**Best for**: Understanding design philosophy, decision-making context
### 3. QUICK_REFERENCE.md (7.2 KB)
**Cheat sheet and quick lookup guide**
Fast reference for developers working on the codebase. Covers:
- Key concepts at a glance
- Directory map for navigation
- Common task procedures
- Core classes to know
- Configuration file structures
- Common file patterns
- Key design patterns
- Important numbers
- Quick commands
- File size reference
- Testing quick reference
- Common debugging tips
- Most important files for beginners
**Best for**: Quick lookup, developers working on code
## How to Use These Documents
### For New Team Members
1. Start with ARCHITECTURE_SUMMARY.md (20 min read)
2. Read ARCHITECTURE_INSIGHTS.md (15 min read)
3. Bookmark QUICK_REFERENCE.md for later
4. Start exploring code using provided file paths
### For Understanding a Specific Component
1. Search QUICK_REFERENCE.md for the component name
2. Get the file path from ARCHITECTURE_SUMMARY.md
3. Understand the context from ARCHITECTURE_INSIGHTS.md
4. Read the source code
### For Adding a New Feature
1. Identify which layer(s) you're modifying (API, Provider, Distribution)
2. Check ARCHITECTURE_SUMMARY.md for similar components
3. Look at existing examples in the codebase
4. Use QUICK_REFERENCE.md for implementation patterns
5. Follow the extension points in ARCHITECTURE_INSIGHTS.md
### For Debugging Issues
1. Use QUICK_REFERENCE.md's debugging tips section
2. Find the routing mechanism in ARCHITECTURE_SUMMARY.md
3. Trace through provider registration in ARCHITECTURE_SUMMARY.md
4. Check the request flow diagram
## Key Takeaways
### The Three Pillars
1. **APIs** (`llama_stack/apis/`) - Abstract interfaces (27 total)
2. **Providers** (`llama_stack/providers/`) - Implementations (50+ total)
3. **Distributions** (`llama_stack/distributions/`) - Pre-configured bundles
### The Architecture Philosophy
- **Separation of Concerns** - Clear boundaries between APIs, Providers, and Distributions
- **Plugin System** - Dynamically load providers based on configuration
- **Configuration-Driven** - YAML-based configuration enables flexibility
- **Smart Routing** - Automatic request routing to appropriate providers
- **Two Client Modes** - Library (in-process) or Server (HTTP)
### The Testing Revolution
- **Record-Replay Pattern** - Record real API calls once, replay thousands of times
- **Cost Effective** - Save money on API calls during development
- **Fast** - Cached responses = instant test execution
- **Provider Agnostic** - Same test works with multiple providers
### The Extension Strategy
Add custom providers by:
1. Creating a module in `providers/[inline|remote]/[api]/[provider]/`
2. Registering in `providers/registry/[api].py`
3. Using in distribution YAML
No framework customization needed!
## Important Statistics
- **27 APIs** covering all major AI operations
- **50+ Providers** across inline and remote implementations
- **7 Built-in Distributions** for different scenarios
- **Python 3.12+** required
- **100% Async** - Built on asyncio throughout
- **Pydantic** - For type validation and configuration
- **FastAPI** - For HTTP server implementation
- **OpenTelemetry** - For observability
## Most Important Files
These files are the foundation - understanding them gives 80% of the architecture knowledge:
1. `llama_stack/core/stack.py` - Main orchestrator
2. `llama_stack/core/resolver.py` - Dependency resolution
3. `llama_stack/apis/inference/inference.py` - Example API
4. `llama_stack/providers/datatypes.py` - Provider specs
5. `llama_stack/distributions/template.py` - Distribution base
## Quick Navigation by Use Case
### I want to understand how requests are routed
1. ARCHITECTURE_SUMMARY.md → Section 6 "Request Routing"
2. ARCHITECTURE_INSIGHTS.md → Section "The Request Routing Intelligence"
3. Check: `llama_stack/core/routers/` and `llama_stack/core/routing_tables/`
### I want to add a new provider
1. QUICK_REFERENCE.md → "Adding a Provider"
2. ARCHITECTURE_SUMMARY.md → Section 4 "Provider System"
3. Look at existing providers in `llama_stack/providers/[inline|remote]/`
### I want to understand the testing strategy
1. ARCHITECTURE_SUMMARY.md → Section 9 "Testing Architecture"
2. ARCHITECTURE_INSIGHTS.md → Section "The Testing Genius"
3. Check: `tests/README.md` for detailed testing guide
### I want to understand distributions
1. ARCHITECTURE_SUMMARY.md → Section 7 "Distributions"
2. ARCHITECTURE_INSIGHTS.md → Section "The Distributions Strategy"
3. Look at: `llama_stack/distributions/starter/starter.py`
### I want to understand the CLI
1. ARCHITECTURE_SUMMARY.md → Section 8 "CLI Architecture"
2. QUICK_REFERENCE.md → "Quick Commands"
3. Look at: `llama_stack/cli/stack/run.py`
### I want to understand configuration
1. ARCHITECTURE_SUMMARY.md → Section 11 "Configuration Management"
2. QUICK_REFERENCE.md → "Configuration Files"
3. Look at: `llama_stack/core/utils/config_resolution.py`
## Documentation Creation
These documents were created through:
- **Directory exploration** - Understanding the codebase structure
- **File analysis** - Reading key files across all components
- **Pattern identification** - Recognizing common architectural patterns
- **Relationship mapping** - Understanding how components interact
- **Testing analysis** - Understanding test architecture and patterns
All information comes directly from the codebase, with specific file paths provided for verification and deeper exploration.
## Staying Current
These documents reflect the codebase as of October 27, 2025. When the codebase changes:
1. Check if changes are in the identified key files
2. If in existing components, documents are still largely accurate
3. If entirely new components, documents should be updated
4. The architecture philosophy should remain constant
## Questions?
When exploring the codebase with these documents:
1. Start with the QUICK_REFERENCE.md for fast lookup
2. Use ARCHITECTURE_SUMMARY.md for detailed information
3. Consult ARCHITECTURE_INSIGHTS.md for design rationale
4. Always verify with actual source code files
The documentation is comprehensive, but the code is the source of truth.
---
**Created**: October 27, 2025
**Codebase Analyzed**: /home/asallas/workarea/projects/personal/llama-stack/
**Focus**: Comprehensive architecture overview for developer understanding

ARCHITECTURE_INSIGHTS.md Normal file

@ -0,0 +1,620 @@
# Llama Stack - Architecture Insights for Developers
## Why This Architecture Works
### Problem It Solves
Without Llama Stack, building AI applications requires:
- Learning different APIs for each provider (OpenAI, Anthropic, Groq, Ollama, etc.)
- Rewriting code to switch providers
- Duplicating logic for common patterns (safety checks, vector search, etc.)
- Managing complex dependencies manually
### Solution: The Three Pillars
```
Single, Unified API Interface
Multiple Provider Implementations
Pre-configured Distributions
```
**Result**: Write once, run anywhere (locally, cloud, on-device)
---
## The Genius of the Plugin Architecture
### How It Works
1. **Define Abstract Interface** (Protocol in `apis/`)
```python
class Inference(Protocol):
async def post_chat_completion(...) -> AsyncIterator[...]: ...
```
2. **Multiple Implementations** (in `providers/`)
- Local: Meta Reference, vLLM, Ollama
- Cloud: OpenAI, Anthropic, Groq, Bedrock
- Each implements same interface
3. **Runtime Selection** (via YAML config)
```yaml
providers:
inference:
- provider_type: remote::openai
```
4. **Zero Code Changes** to switch providers!
### Why This Beats Individual SDKs
- **Single SDK** vs 30+ provider SDKs
- **Same API** vs learning each provider's quirks
- **Easy migration** - change 1 config value
- **Testing** - same tests work across all providers
---
## The Request Routing Intelligence
### Two Clever Routing Strategies
#### 1. Auto-Routed APIs (Smart Dispatch)
**APIs**: Inference, Safety, VectorIO, Eval, Scoring, DatasetIO, ToolRuntime
When you call:
```python
await inference.post_chat_completion(model="llama-2-7b")
```
Router automatically determines:
- "Which provider has llama-2-7b?"
- "Route this request there"
- "Stream response back"
**Implementation**: `routers/` directory contains auto-routers
#### 2. Routing Table APIs (Registry Pattern)
**APIs**: Models, Shields, VectorStores, Datasets, Benchmarks, ToolGroups, ScoringFunctions
When you call:
```python
models = await models.list_models() # Merged list from ALL providers
```
Router:
- Queries each provider
- Merges results
- Returns unified list
**Implementation**: `routing_tables/` directory
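To make the registry pattern concrete, here is a minimal sketch of the merge step. The `ModelProvider` protocol, `list_models()` method, and `ModelsRoutingTable` class are illustrative stand-ins, not the actual classes in `routing_tables/`:
```python
from typing import Protocol


class ModelProvider(Protocol):
    """Anything that can enumerate the models it serves (illustrative)."""
    async def list_models(self) -> list[dict]: ...


class ModelsRoutingTable:
    """Hypothetical registry that fans out to every provider and merges results."""

    def __init__(self, providers: dict[str, ModelProvider]):
        self.providers = providers  # provider_id -> provider implementation

    async def list_models(self) -> list[dict]:
        merged: list[dict] = []
        for provider_id, provider in self.providers.items():
            for model in await provider.list_models():
                # Tag each entry with its provider so later requests can be routed back.
                merged.append({**model, "provider_id": provider_id})
        return merged
```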
### Why This Matters
- **Users don't think about providers** - just use the API
- **Multiple implementations work** - router handles dispatch
- **Easy scaling** - add new providers without touching user code
- **Resource management** - router knows what's available
---
## Configuration as a Weapon
### The Power of YAML Over Code
Traditional approach:
```python
# Code changes needed for each provider!
if use_openai:
from openai import OpenAI
client = OpenAI(api_key=...)
elif use_ollama:
from ollama import Client
client = Client(url=...)
# etc.
```
Llama Stack approach:
```yaml
# Zero code changes!
providers:
inference:
- provider_type: remote::openai
config:
api_key: ${env.OPENAI_API_KEY}
```
Then later, change to:
```yaml
providers:
inference:
- provider_type: remote::ollama
config:
host: localhost
```
**Same application code** works with both!
### Environment Variable Magic
```bash
# Change provider at runtime
INFERENCE_MODEL=llama-2-70b llama stack run starter
# No redeployment needed!
```
---
## The Distributions Strategy
### Problem: "Works on My Machine"
- Different developers need different setups
- Production needs different providers than development
- CI/CD needs lightweight dependencies
### Solution: Pre-verified Distributions
```
starter → Works on CPU with free APIs (Ollama + OpenAI)
starter-gpu → Works on GPU machines
meta-reference-gpu → Works with full local setup
postgres-demo → Production-grade with persistent storage
```
Each distribution:
- Pre-selects working providers
- Sets sensible defaults
- Bundles required dependencies
- Tested end-to-end
**Result**: `llama stack run starter` just works for 80% of use cases
### Why This Beats Documentation
- **No setup guides needed** - distribution does it
- **No guessing** - curated, tested combinations
- **Reproducible** - same distro always works same way
- **Upgradeable** - update distro = get improvements
---
## The Testing Genius: Record-Replay
### Traditional Testing Hell for AI
Problem:
- API calls cost money
- API responses are non-deterministic
- Each provider has different response formats
- Tests become slow and flaky
### The Record-Replay Solution
First run (record):
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/
# Makes real API calls, saves responses to YAML
```
All subsequent runs (replay):
```bash
pytest tests/integration/
# Returns cached responses, NO API calls, instant results
```
### Why This is Brilliant
- **Cost**: Record once, replay 1000x. Save thousands of dollars
- **Speed**: Cached responses = instant test execution
- **Reliability**: Deterministic results (no API variability)
- **Coverage**: One test works with OpenAI, Ollama, Anthropic, etc.
**File location**: `tests/integration/[api]/cassettes/`
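The mechanics can be approximated in a few lines. This is a hedged sketch of the idea, not the real `api_recorder.py`; the cache directory is hypothetical and JSON is used for brevity (the actual cassettes are YAML):
```python
import hashlib
import json
import os
from pathlib import Path

CASSETTE_DIR = Path("cassettes")  # hypothetical location
MODE = os.environ.get("LLAMA_STACK_TEST_INFERENCE_MODE", "replay")


def _key(request: dict) -> str:
    # A deterministic hash of the request identifies its cassette file.
    return hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()


async def cached_call(request: dict, real_call):
    path = CASSETTE_DIR / f"{_key(request)}.json"
    if MODE != "record" and path.exists():
        return json.loads(path.read_text())  # replay: no network call
    response = await real_call(request)      # record: hit the real API once
    CASSETTE_DIR.mkdir(exist_ok=True)
    path.write_text(json.dumps(response))
    return response
```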
---
## Core Runtime: The Stack Class
### The Elegance of Inheritance
```python
class LlamaStack(
Inference, # Chat completion, embeddings
Agents, # Multi-turn orchestration
Safety, # Content filtering
VectorIO, # Vector operations
Tools, # Function execution
Eval, # Evaluation
Scoring, # Response scoring
Models, # Model registry
# ... 19 more APIs
):
pass
```
A single `LlamaStack` instance:
- Implements 27 different APIs
- Has 50+ providers backing it
- Routes requests intelligently
- Manages dependencies
All from a ~400-line file plus lots of protocol definitions!
---
## Dependency Injection Without the Complexity
### How Providers Depend on Each Other
Problem: Agents need Inference, Inference needs Models registry
```python
class AgentProvider:
def __init__(self,
inference: InferenceProvider,
safety: SafetyProvider,
tool_runtime: ToolRuntimeProvider):
self.inference = inference
self.safety = safety
self.tool_runtime = tool_runtime
```
### How It Gets Resolved
**File**: `core/resolver.py`
1. Parse `run.yaml` - which providers enabled?
2. Build dependency graph - who depends on whom?
3. Topological sort - what order to instantiate?
4. Instantiate in order - each gets its dependencies
**Result**: Complex dependency chains handled automatically!
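A stripped-down version of that idea, using the standard-library topological sorter; the spec format and factory callables below are hypothetical, not the real `resolver.py`:
```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def instantiate_all(specs: dict[str, dict]) -> dict[str, object]:
    """specs maps an API name to {"deps": [...], "factory": callable} (illustrative format)."""
    graph = {name: set(spec["deps"]) for name, spec in specs.items()}
    impls: dict[str, object] = {}
    for name in TopologicalSorter(graph).static_order():
        # Every dependency was instantiated earlier in the order, so it is available here.
        deps = {dep: impls[dep] for dep in specs[name]["deps"]}
        impls[name] = specs[name]["factory"](**deps)
    return impls


# e.g. specs = {"inference": {"deps": [], "factory": make_inference},
#               "agents":    {"deps": ["inference"], "factory": make_agents}}
```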
---
## The Client Duality
### Two Ways to Use Llama Stack
#### 1. Library Mode (In-Process)
```python
from llama_stack import AsyncLlamaStackAsLibraryClient
client = await AsyncLlamaStackAsLibraryClient.create(run_config)
response = await client.inference.post_chat_completion(...)
```
- No HTTP overhead
- Direct Python API
- Embedded in application
- **File**: `core/library_client.py`
#### 2. Server Mode (HTTP)
```bash
llama stack run starter # Start server on port 8321
```
```python
from llama_stack_client import AsyncLlamaStackClient
client = AsyncLlamaStackClient(base_url="http://localhost:8321")
response = await client.inference.post_chat_completion(...)
```
- Distributed architecture
- Share single server across apps
- Easy deployment
- Language-agnostic clients (Python, TypeScript, Swift, Kotlin)
**Result**: Same API, different deployment strategies!
---
## The Model System Insight
### Why It Exists
Problem: Different model IDs across providers
- HuggingFace: `meta-llama/Llama-2-7b`
- Ollama: `llama2`
- OpenAI: `gpt-4`
### Solution: Universal Model Registry
**File**: `models/llama/sku_list.py`
```python
resolve_model("meta-llama/Llama-2-7b")
# Returns Model object with:
# - Architecture info
# - Tokenizer
# - Quantization options
# - Resource requirements
```
Allows:
- Consistent model IDs across providers
- Intelligent resource allocation
- Provider-agnostic inference
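A hedged sketch of the alias idea; the table contents and helper below are purely illustrative, not the actual `sku_list.py`:
```python
# Canonical model ID -> provider-specific aliases (illustrative values only).
MODEL_ALIASES = {
    "meta-llama/Llama-2-7b": {"ollama": "llama2", "huggingface": "meta-llama/Llama-2-7b"},
}


def provider_model_id(canonical_id: str, provider: str) -> str:
    """Translate a canonical model ID into whatever the given provider expects."""
    aliases = MODEL_ALIASES.get(canonical_id, {})
    return aliases.get(provider, canonical_id)  # fall back to the canonical ID
```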
---
## The CLI Is Smart
### It Does More Than You Think
```bash
llama stack run starter
```
This command:
1. Resolves the starter distribution template
2. Merges with environment variables
3. Creates/updates `~/.llama/distributions/starter/run.yaml`
4. Installs missing dependencies
5. Starts HTTP server on port 8321
6. Initializes all providers
7. Registers available models
8. Ready for requests
**No separate build step needed!** (unless building Docker images)
### Introspection Commands
```bash
llama stack list-apis # See all 27 APIs
llama stack list-providers # See all 50+ providers
llama stack list # See all distributions
llama stack list-deps starter # See what to install
```
Used for documentation, debugging, and automation
---
## Storage: The Oft-Overlooked Component
### Three Storage Types
1. **KV Store** - Metadata (models, shields)
2. **SQL Store** - Structured (conversations, datasets)
3. **Inference Store** - Caching (for testing)
### Why Multiple Backends Matter
- Development: SQLite (no dependencies)
- Production: PostgreSQL (scalable)
- Distributed: Redis (shared state)
- Testing: In-memory (fast)
**Files**:
- `core/storage/datatypes.py` - Interfaces
- `providers/utils/kvstore/` - Implementations
- `providers/utils/sqlstore/` - Implementations
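The abstraction boils down to a tiny interface with swappable backends. A minimal sketch, with illustrative names rather than the actual classes in `core/storage/datatypes.py`:
```python
from typing import Protocol


class KVStore(Protocol):
    async def get(self, key: str) -> str | None: ...
    async def set(self, key: str, value: str) -> None: ...


class InMemoryKVStore:
    """Backend suited to fast tests; SQLite/Redis/Postgres backends plug in via config."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    async def get(self, key: str) -> str | None:
        return self._data.get(key)

    async def set(self, key: str, value: str) -> None:
        self._data[key] = value
```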
---
## Telemetry: Built-In Observability
### What Gets Traced
- Every API call
- Token usage (if provider supports it)
- Latency
- Errors
- Custom metrics from providers
### Integration
- OpenTelemetry compatible
- Automatic context propagation
- Works across async boundaries
- **File**: `providers/utils/telemetry/`
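As a rough illustration of what traced provider code looks like with the standard `opentelemetry-api` package; the tracer name, span name, and attribute are made up for this sketch:
```python
from opentelemetry import trace

tracer = trace.get_tracer("llama_stack.example")  # illustrative tracer name


async def traced_chat_completion(provider, request: dict):
    # Spans survive across await points, so async provider calls stay correlated.
    with tracer.start_as_current_span("inference.chat_completion") as span:
        span.set_attribute("model", request.get("model", "unknown"))
        return await provider.post_chat_completion(**request)
```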
---
## Extension Strategy: How to Add Custom Functionality
### Adding a Custom API
1. Create protocol in `apis/my_api/my_api.py`
2. Implement providers (inline and/or remote)
3. Register in `core/resolver.py`
4. Add to distributions
### Adding a Custom Provider
1. Create module in `providers/[inline|remote]/[api]/[provider]/`
2. Implement config and adapter classes
3. Register in `providers/registry/[api].py`
4. Use in distribution YAML
### Adding a Custom Distribution
1. Create subdirectory in `distributions/[name]/`
2. Implement template in `[name].py`
3. Register in distribution discovery
---
## Common Misconceptions Clarified
### "APIs are HTTP endpoints"
**Wrong** - APIs are Python protocols. HTTP comes later via FastAPI.
- The "Inference" API is just a Python Protocol
- Providers implement it
- Core wraps it with HTTP for server mode
- Library mode uses it directly
### "Providers are all external services"
**Wrong** - Providers can be:
- Inline (local execution): Meta Reference, FAISS, Llama Guard
- Remote (external services): OpenAI, Ollama, Qdrant
Inline providers have low latency and no dependency on external services.
### "You must run a server"
**Wrong** - Two modes:
- Server mode: `llama stack run starter` (HTTP)
- Library mode: Import and use directly in Python
### "Distributions are just Docker images"
**Wrong** - Distributions are:
- Templates (what providers to use)
- Configs (how to configure them)
- Dependencies (what to install)
- Can be Docker OR local Python
---
## Performance Implications
### Inline Providers Are Fast
```
Inline (e.g., Meta Reference)
├─ 0ms network latency
├─ No HTTP serialization/deserialization
├─ Direct GPU access
└─ Fast (but high resource cost)
Remote (e.g., OpenAI)
├─ 100-500ms network latency
├─ HTTP serialization overhead
├─ Low resource cost
└─ Slower (but cheap)
```
### Streaming Is Native
```python
response = await inference.post_chat_completion(model=..., stream=True)
async for chunk in response:
print(chunk.delta) # Process token by token
```
Tokens arrive as they're generated, no waiting for full response.
---
## Security Considerations
### API Keys Are Config
```yaml
inference:
- provider_id: openai
config:
api_key: ${env.OPENAI_API_KEY} # From environment
```
Never hardcoded, always from env vars.
### Access Control
**File**: `core/access_control/`
Providers can implement access rules:
- Per-user restrictions
- Per-model restrictions
- Rate limiting
- Audit logging
### Sensitive Field Redaction
Config logging automatically redacts:
- API keys
- Passwords
- Tokens
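Conceptually the redaction is a recursive scrub over the config dict before it is logged. A hedged sketch; the key patterns are illustrative:
```python
SENSITIVE_MARKERS = {"api_key", "password", "token"}  # illustrative pattern list


def redact(config: dict) -> dict:
    """Return a copy of the config with sensitive values masked for logging."""
    cleaned: dict = {}
    for key, value in config.items():
        if isinstance(value, dict):
            cleaned[key] = redact(value)  # recurse into nested provider configs
        elif any(marker in key.lower() for marker in SENSITIVE_MARKERS):
            cleaned[key] = "********"
        else:
            cleaned[key] = value
    return cleaned
```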
---
## Maturity Indicators
### Signs of Production-Ready Design
1. **Separated Concerns** - APIs, Providers, Distributions
2. **Plugin Architecture** - Easy to extend
3. **Configuration Over Code** - Deploy without recompiling
4. **Comprehensive Testing** - Unit + Integration with record-replay
5. **Multiple Client Options** - Library + Server modes
6. **Storage Abstraction** - Multiple backends
7. **Dependency Management** - Automatic resolution
8. **Error Handling** - Structured, informative errors
9. **Observability** - Built-in telemetry
10. **Documentation** - Distributions + CLI introspection
Llama Stack has all 10!
---
## Key Architectural Decisions
### Why Async/Await Throughout?
- Modern Python standard
- Works well with streaming
- Natural for I/O-heavy operations (API calls, GPU operations)
### Why Pydantic for Config?
- Type validation
- Auto-documentation
- JSON schema generation
- Easy serialization
### Why Protocol Classes for APIs?
- Define interface without implementation
- Multiple implementations possible
- Type hints work with duck typing
- Minimal magic
### Why YAML for Config?
- Human readable
- Environment variable support
- Comments allowed
- Wide tool support
### Why Record-Replay for Tests?
- Cost efficient
- Deterministic
- Real behavior captured
- Provider-agnostic
---
## The Learning Path for Contributors
### Understanding Order
1. **Start**: `pyproject.toml` - Entry point
2. **Learn**: `core/datatypes.py` - Data structures
3. **Understand**: `apis/inference/inference.py` - Example API
4. **See**: `providers/registry/inference.py` - Provider registry
5. **Read**: `providers/inline/inference/meta_reference/` - Inline provider
6. **Read**: `providers/remote/inference/openai/` - Remote provider
7. **Study**: `core/resolver.py` - How it all connects
8. **Understand**: `core/stack.py` - Main orchestrator
9. **See**: `distributions/starter/` - How to use it
10. **Run**: `tests/integration/` - How to test
Each step builds on previous understanding.
---
## The Elegant Parts
### Most Elegant: The Router
The router system is beautiful:
- Transparent to users
- Automatic provider selection
- Works with 1 or 100 providers
- No hardcoding needed
### Most Flexible: YAML Config
Configuration as first-class citizen:
- Switch providers without code
- Override at runtime
- Version control friendly
- Documentation via config
### Most Useful: Record-Replay Tests
Testing pattern solves real problems:
- Cost
- Speed
- Reliability
- Coverage
### Most Scalable: Distribution Templates
Pre-configured bundles:
- One command to start
- Verified combinations
- Easy to document
- Simple to teach
---
## The Future
### What's Being Built
- More providers (Nvidia, SambaNova, etc.)
- More APIs (more task types)
- On-device execution (ExecuTorch)
- Better observability (more telemetry)
- Easier extensions (simpler API for custom providers)
### How It Stays Maintainable
- Protocol-based design limits coupling
- Clear separation of concerns
- Comprehensive testing
- Configuration over code
- Plugin architecture
The architecture is **future-proof** by design.

ARCHITECTURE_SUMMARY.md Normal file

@ -0,0 +1,925 @@
# Llama Stack Architecture - Comprehensive Overview
## Executive Summary
Llama Stack is a comprehensive framework for building AI applications with Llama models. It provides a **unified API layer** with a **plugin architecture for providers**, allowing developers to seamlessly switch between local and cloud-hosted implementations without changing application code. The system is organized around three main pillars: APIs (abstract interfaces), Providers (concrete implementations), and Distributions (pre-configured bundles).
---
## 1. Core Architecture Philosophy
### Separation of Concerns
- **APIs**: Define abstract interfaces for functionality (e.g., Inference, Safety, VectorIO)
- **Providers**: Implement those interfaces (inline for local, remote for external services)
- **Distributions**: Pre-configure and bundle providers for specific deployment scenarios
### Key Design Patterns
- **Plugin Architecture**: Dynamically load providers based on configuration
- **Dependency Injection**: Providers declare dependencies on other APIs/providers
- **Routing**: Smart routing directs requests to appropriate provider implementations
- **Configuration-Driven**: YAML-based configuration enables flexibility without code changes
---
## 2. Directory Structure (`llama_stack/`)
```
llama_stack/
├── apis/ # Abstract API definitions (27 APIs total)
│ ├── inference/ # LLM inference interface
│ ├── agents/ # Agent orchestration
│ ├── safety/ # Content filtering & safety
│ ├── vector_io/ # Vector database operations
│ ├── tools/ # Tool/function calling runtime
│ ├── scoring/ # Response scoring
│ ├── eval/ # Evaluation framework
│ ├── post_training/ # Fine-tuning & training
│ ├── datasetio/ # Dataset loading/management
│ ├── conversations/ # Conversation management
│ ├── common/ # Shared datatypes (SamplingParams, etc.)
│ └── [22 more...] # Models, Shields, Benchmarks, etc.
├── providers/ # Provider implementations (inline & remote)
│ ├── inline/ # In-process implementations
│ │ ├── inference/ # Meta Reference, Sentence Transformers
│ │ ├── agents/ # Agent orchestration implementations
│ │ ├── safety/ # Llama Guard, Code Scanner
│ │ ├── vector_io/ # FAISS, SQLite-vec, Milvus
│ │ ├── post_training/ # TorchTune
│ │ ├── eval/ # Evaluation implementations
│ │ ├── tool_runtime/ # RAG runtime, MCP protocol
│ │ └── [more...]
│ │
│ ├── remote/ # External service adapters
│ │ ├── inference/ # OpenAI, Anthropic, Groq, Ollama, vLLM, TGI, etc.
│ │ ├── vector_io/ # ChromaDB, Qdrant, Weaviate, Postgres
│ │ ├── safety/ # Bedrock, SambaNova, Nvidia
│ │ ├── agents/ # Sample implementations
│ │ ├── tool_runtime/ # Brave Search, Tavily, Wolfram Alpha
│ │ └── [more...]
│ │
│ ├── registry/ # Provider discovery/registration (inference.py, agents.py, etc.)
│ │ └── [One file per API with all providers for that API]
│ │
│ ├── utils/ # Shared provider utilities
│ │ ├── inference/ # Embedding mixin, OpenAI compat
│ │ ├── kvstore/ # Key-value store abstractions
│ │ ├── sqlstore/ # SQL storage abstractions
│ │ ├── telemetry/ # Tracing, metrics
│ │ └── [more...]
│ │
│ └── datatypes.py # ProviderSpec, InlineProviderSpec, RemoteProviderSpec
├── core/ # Core runtime & orchestration
│ ├── stack.py # Main LlamaStack class (implements all APIs)
│ ├── datatypes.py # Config models (StackRunConfig, Provider, etc.)
│ ├── resolver.py # Provider resolution & dependency injection
│ ├── library_client.py # In-process client for library usage
│ ├── build.py # Distribution building
│ ├── configure.py # Configuration handling
│ ├── distribution.py # Distribution management
│ ├── routers/ # Auto-routed API implementations (infer route based on routing key)
│ ├── routing_tables/ # Manual routing tables (e.g., Models, Shields, VectorStores)
│ ├── server/ # FastAPI HTTP server setup
│ ├── storage/ # Backend storage abstractions (KVStore, SqlStore)
│ ├── utils/ # Config resolution, dynamic imports
│ └── conversations/ # Conversation service implementation
├── cli/ # Command-line interface
│ ├── llama.py # Main entry point
│ └── stack/ # Stack management commands
│ ├── run.py # Start a distribution
│ ├── list_apis.py # List available APIs
│ ├── list_providers.py # List providers
│ ├── list_deps.py # List dependencies
│ └── [more...]
├── distributions/ # Pre-configured distribution templates
│ ├── starter/ # CPU-friendly multi-provider starter
│ ├── starter-gpu/ # GPU-optimized starter
│ ├── meta-reference-gpu/ # Full-featured Meta reference
│ ├── postgres-demo/ # PostgreSQL-based demo
│ ├── template.py # Distribution template base class
│ └── [more...]
├── models/ # Llama model implementations
│ └── llama/
│ ├── llama3/ # Llama 3 implementation
│ ├── llama4/ # Llama 4 implementation
│ ├── sku_list.py # Model registry (maps model IDs to implementations)
│ ├── checkpoint.py # Model checkpoint handling
│ ├── datatypes.py # ToolDefinition, StopReason, etc.
│ └── [more...]
├── testing/ # Testing utilities
│ └── api_recorder.py # Record/replay infrastructure for integration tests
└── ui/ # Web UI (Streamlit-based)
├── app/
├── components/
├── pages/
└── [React/TypeScript frontend]
```
---
## 3. API Layer (27 APIs)
### What is an API?
Each API is an abstract **protocol** (Python Protocol class) that defines an interface. APIs are located in `llama_stack/apis/` with a structure like:
```
apis/inference/
├── __init__.py # Exports the Inference protocol
├── inference.py # Full API definition (300+ lines)
└── event_logger.py # Supporting types
```
### Key APIs
#### Core Inference API
- **Path**: `llama_stack/apis/inference/inference.py`
- **Methods**: `post_chat_completion()`, `post_completion()`, `post_embedding()`, `get_models()`
- **Types**: `SamplingParams`, `SamplingStrategy` (greedy/top-p/top-k), `OpenAIChatCompletion`
- **Providers**: 30+ (OpenAI, Claude, Ollama, vLLM, TGI, Fireworks, etc.)
#### Agents API
- **Path**: `llama_stack/apis/agents/agents.py`
- **Methods**: `create_agent()`, `update_agent()`, `create_session()`, `agentic_loop_turn()`
- **Features**: Multi-turn conversations, tool calling, streaming
- **Providers**: Meta Reference (inline), Fireworks, Together
#### Safety API
- **Path**: `llama_stack/apis/safety/safety.py`
- **Methods**: `run_shields()` - filter content before/after inference
- **Providers**: Llama Guard (inline), AWS Bedrock, SambaNova, Nvidia
#### Vector IO API
- **Path**: `llama_stack/apis/vector_io/vector_io.py`
- **Methods**: `insert()`, `query()`, `delete()` - vector database operations
- **Providers**: FAISS, SQLite-vec, Milvus (inline), ChromaDB, Qdrant, Weaviate, PG Vector (remote)
#### Tools / Tool Runtime API
- **Path**: `llama_stack/apis/tools/tool_runtime.py`
- **Methods**: `execute_tool()` - execute functions during agent loops
- **Providers**: RAG runtime (inline), Brave Search, Tavily, Wolfram Alpha, Model Context Protocol
#### Other Major APIs
- **Post Training**: Fine-tuning & model training (HuggingFace, TorchTune, Nvidia)
- **Eval**: Evaluation frameworks (Meta Reference with autoevals)
- **Scoring**: Response scoring (Basic, LLM-as-Judge, Braintrust)
- **Datasets**: Dataset management
- **DatasetIO**: Dataset loading from HuggingFace, Nvidia, local files
- **Conversations**: Multi-turn conversation state management
- **Vector Stores**: Vector store metadata & configuration
- **Shields**: Shield (safety filter) registry
- **Models**: Model registry management
- **Batches**: Batch processing
- **Prompts**: Prompt templates & management
- **Telemetry**: Tracing & metrics collection
- **Inspect**: Introspection & debugging
---
## 4. Provider System
### Provider Types
#### 1. **Inline Providers** (`InlineProviderSpec`)
- Run in-process (same Python process as server)
- High performance, low latency
- No network overhead
- Heavier resource requirements
- Examples: Meta Reference (inference), Llama Guard (safety), FAISS (vector IO)
**Structure**:
```python
InlineProviderSpec(
api=Api.inference,
provider_type="inline::meta-reference",
module="llama_stack.providers.inline.inference.meta_reference",
config_class="...MetaReferenceInferenceConfig",
pip_packages=[...],
container_image="..." # Optional for containerization
)
```
#### 2. **Remote Providers** (`RemoteProviderSpec`)
- Connect to external services via HTTP/API
- Lower resource requirements
- Network latency
- Cloud-based (OpenAI, Anthropic, Groq) or self-hosted (Ollama, vLLM, Qdrant)
- Examples: OpenAI, Anthropic, Groq, Ollama, Qdrant, ChromaDB
**Structure**:
```python
RemoteProviderSpec(
api=Api.inference,
adapter_type="openai",
provider_type="remote::openai",
module="llama_stack.providers.remote.inference.openai",
config_class="...OpenAIInferenceConfig",
pip_packages=[...]
)
```
### Provider Registration
Providers are registered in **registry files** (`llama_stack/providers/registry/`):
- `inference.py` - All inference providers (30+)
- `agents.py` - All agent providers
- `safety.py` - All safety providers
- `vector_io.py` - All vector IO providers
- `tool_runtime.py` - All tool runtime providers
- [etc.]
Each registry file has an `available_providers()` function returning a list of `ProviderSpec`.
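Schematically, a registry file looks like the sketch below. Field values are abbreviated in the same style as the spec examples above, and the Ollama entry (adapter type, module path, package name) is illustrative:
```python
# llama_stack/providers/registry/inference.py (schematic sketch, not the full list)
def available_providers() -> list[ProviderSpec]:
    return [
        InlineProviderSpec(
            api=Api.inference,
            provider_type="inline::meta-reference",
            module="llama_stack.providers.inline.inference.meta_reference",
            config_class="...MetaReferenceInferenceConfig",
            pip_packages=[...],
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="ollama",
            provider_type="remote::ollama",
            module="llama_stack.providers.remote.inference.ollama",
            config_class="...OllamaInferenceConfig",
            pip_packages=["ollama"],
        ),
        # ... one entry per provider ...
    ]
```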
### Provider Config
Each provider has a config class (e.g., `MetaReferenceInferenceConfig`):
```python
class MetaReferenceInferenceConfig(BaseModel):
max_batch_size: int = 1
enable_pydantic_sampling: bool = True
# sample_run_config() - provides default values for testing
# pip_packages() - lists dependencies
```
### Provider Implementation
Inline providers look like:
```python
class MetaReferenceInferenceImpl(InferenceProvider):
async def post_chat_completion(
self,
model: str,
request: OpenAIChatCompletionRequestWithExtraBody,
) -> AsyncIterator[OpenAIChatCompletionChunk]:
# Load model, run inference, yield streaming results
...
```
Remote providers implement HTTP adapters:
```python
class OllamaInferenceImpl(InferenceProvider):
async def post_chat_completion(...):
# Make HTTP requests to Ollama server
...
```
---
## 5. Core Runtime & Resolution
### Stack Resolution Process
**File**: `llama_stack/core/resolver.py`
1. **Load Configuration** → Parse `run.yaml` with enabled providers
2. **Resolve Dependencies** → Build dependency graph (e.g., agents may depend on inference)
3. **Instantiate Providers** → Create provider instances with configs
4. **Create Router/Routed Impls** → Set up request routing
5. **Register Resources** → Register models, shields, datasets, etc.
### The LlamaStack Class
**File**: `llama_stack/core/stack.py`
```python
class LlamaStack(
Providers, # Meta API for provider management
Inference, # LLM inference
Agents, # Agent orchestration
Safety, # Content safety
VectorIO, # Vector operations
Tools, # Tool runtime
Eval, # Evaluation
# ... 15 more APIs ...
):
pass
```
This class **inherits from all APIs**, making a single `LlamaStack` instance support all functionality.
### Two Client Modes
#### 1. **Library Client** (In-Process)
```python
from llama_stack import AsyncLlamaStackAsLibraryClient
client = await AsyncLlamaStackAsLibraryClient.create(run_config)
response = await client.inference.post_chat_completion(...)
```
**File**: `llama_stack/core/library_client.py`
#### 2. **Server Client** (HTTP)
```python
from llama_stack_client import AsyncLlamaStackClient
client = AsyncLlamaStackClient(base_url="http://localhost:8321")
response = await client.inference.post_chat_completion(...)
```
Uses the separate `llama-stack-client` package.
---
## 6. Request Routing
### Two Routing Strategies
#### 1. **Auto-Routed APIs** (e.g., Inference, Safety, VectorIO)
- Routing key = provider instance
- Router automatically selects provider based on resource ID
- **Implementation**: `AutoRoutedProviderSpec` → `routers/` directory
```python
# inference.post_chat_completion(model_id="meta-llama/Llama-2-7b")
# Router selects provider based on which provider has that model
```
**Routed APIs**:
- Inference, Safety, VectorIO, DatasetIO, Scoring, Eval, ToolRuntime
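The dispatch itself is conceptually tiny. A hedged sketch of the idea; the class and method names are illustrative, not the actual code in `routers/`:
```python
class InferenceRouter:
    """Illustrative auto-router: pick a provider by looking up the routing key (the model)."""

    def __init__(self, routing_table):
        self.routing_table = routing_table  # knows which provider owns which model

    async def post_chat_completion(self, model: str, **kwargs):
        provider = await self.routing_table.get_provider_impl(model)
        # Delegate to whichever provider registered this model; streaming passes through.
        return await provider.post_chat_completion(model=model, **kwargs)
```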
#### 2. **Routing Table APIs** (e.g., Models, Shields, VectorStores)
- Registry APIs that list/register resources
- **Implementation**: `RoutingTableProviderSpec` → `routing_tables/` directory
```python
# models.list_models() → merged list from all providers
# models.register_model(...) → router selects provider
```
**Registry APIs**:
- Models, Shields, VectorStores, Datasets, ScoringFunctions, Benchmarks, ToolGroups
---
## 7. Distributions
### What is a Distribution?
A **Distribution** is a pre-configured, verified bundle of providers for a specific deployment scenario.
**File**: `llama_stack/distributions/template.py` (base) → specific distros in subdirectories
### Example: Starter Distribution
**File**: `llama_stack/distributions/starter/starter.py`
```python
def get_distribution_template(name: str = "starter"):
    providers = {
        "inference": [
            "remote::ollama",
            "remote::vllm",
            "remote::openai",
            # ... others ...
        ],
        "vector_io": [
            "inline::faiss",
            "inline::sqlite-vec",
            "remote::qdrant",
            # ... others ...
        ],
        "safety": [
            "inline::llama-guard",
            "inline::code-scanner",
        ],
        # ... other APIs ...
    }
return DistributionTemplate(
name="starter",
providers=providers,
run_configs={
"run.yaml": RunConfigSettings(...)
}
)
```
### Built-in Distributions
1. **starter**: CPU-only, multi-provider (Ollama, OpenAI, etc.)
2. **starter-gpu**: GPU-optimized version
3. **meta-reference-gpu**: Full Meta reference implementation
4. **postgres-demo**: PostgreSQL-backed version
5. **watsonx**: IBM Watson X integration
6. **nvidia**: NVIDIA-specific optimizations
7. **open-benchmark**: For benchmarking
### Distribution Lifecycle
```
llama stack run starter
Resolve starter distribution template
Merge with run.yaml config & environment variables
Build/install dependencies (if needed)
Start HTTP server (Uvicorn)
Initialize all providers
Register resources (models, shields, etc.)
Ready for requests
```
---
## 8. CLI Architecture
**File**: `llama_stack/cli/`
### Entry Point
```bash
$ llama [subcommand] [args]
```
Maps to **pyproject.toml**:
```toml
[project.scripts]
llama = "llama_stack.cli.llama:main"
```
### Subcommands
```
llama stack [command]
├── run [distro|config] [--port PORT] # Start a distribution
├── list-deps [distro] # Show dependencies to install
├── list-apis # Show all APIs
├── list-providers # Show all providers
└── list [NAME] # Show distributions
```
**Architecture**:
- `llama.py` - Main parser with subcommands
- `stack/stack.py` - Stack subcommand router
- `stack/run.py` - Implementation of `llama stack run`
- `stack/list_deps.py` - Dependency resolution & display
---
## 9. Testing Architecture
**Location**: `tests/` directory
### Test Types
#### 1. **Unit Tests** (`tests/unit/`)
- Fast, isolated component testing
- Mock external dependencies
- **Run with**: `uv run --group unit pytest tests/unit/`
- **Examples**:
- `core/test_stack_validation.py` - Config validation
- `distribution/test_distribution.py` - Distribution loading
- `core/routers/test_vector_io.py` - Routing logic
#### 2. **Integration Tests** (`tests/integration/`)
- End-to-end workflows
- **Record-Replay pattern**: Record real API responses once, replay for fast/cheap testing
- **Run with**: `uv run --group test pytest tests/integration/ --stack-config=starter`
- **Structure**:
```
tests/integration/
├── agents/
│ ├── test_agents.py
│ ├── test_persistence.py
│ └── cassettes/ # Recorded API responses (YAML)
├── inference/
├── safety/
├── vector_io/
└── [more...]
```
### Record-Replay System
**File**: `llama_stack/testing/api_recorder.py`
**Benefits**:
- **Cost Control**: Record real API calls once, replay thousands of times
- **Speed**: Cached responses = instant test execution
- **Reliability**: Deterministic results (no API variability)
- **Provider Coverage**: Same test works with OpenAI, Anthropic, Ollama, etc.
**How it works**:
1. First run (with `LLAMA_STACK_TEST_INFERENCE_MODE=record`): Real API calls saved to YAML
2. Subsequent runs: Load YAML and return matching responses
3. CI automatically re-records when needed
### Test Organization
- **Common utilities**: `tests/common/`
- **External provider tests**: `tests/external/` (test external APIs)
- **Container tests**: `tests/containers/` (test Docker integration)
- **Conftest**: pytest fixtures in each directory
---
## 10. Key Design Patterns
### Pattern 1: Protocol-Based Abstraction
```python
# API definition (protocol)
class Inference(Protocol):
async def post_chat_completion(...) -> AsyncIterator[...]: ...
# Provider implementation
class InferenceProvider:
async def post_chat_completion(...): ...
```
### Pattern 2: Dependency Injection
```python
class AgentProvider:
def __init__(self, inference: InferenceProvider, safety: SafetyProvider):
self.inference = inference
self.safety = safety
```
### Pattern 3: Configuration-Driven Instantiation
```yaml
# run.yaml
agents:
- provider_id: meta-reference
provider_type: inline::meta-reference
config:
max_depth: 5
```
### Pattern 4: Routing by Resource
```python
# Request: inference.post_chat_completion(model="llama-2-7b")
# Router finds which provider has "llama-2-7b" and routes there
```
### Pattern 5: Registry Pattern for Resources
```python
# Register at startup
await models.register_model(Model(
identifier="llama-2-7b",
provider_id="inference::meta-reference",
...
))
# Later, query or filter
models_list = await models.list_models()
```
---
## 11. Configuration Management
### Config Files
#### 1. **run.yaml** - Runtime Configuration
Location: `~/.llama/distributions/{name}/run.yaml`
```yaml
version: 2
providers:
inference:
- provider_id: ollama
provider_type: remote::ollama
config:
host: localhost
port: 11434
safety:
- provider_id: llama-guard
provider_type: inline::llama-guard
config: {}
default_models:
- identifier: llama-2-7b
provider_id: ollama
vector_stores_config:
default_provider_id: faiss
```
#### 2. **build.yaml** - Build Configuration
Specifies which providers to install.
#### 3. Environment Variables
Override config values at runtime:
```bash
INFERENCE_MODEL=llama-2-70b SAFETY_MODEL=llama-guard llama stack run starter
```
### Config Resolution
**File**: `llama_stack/core/utils/config_resolution.py`
Order of precedence:
1. Environment variables (highest)
2. Runtime config (run.yaml)
3. Distribution template defaults (lowest)
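The precedence rule can be pictured as successive overrides. A minimal sketch; the helper below is illustrative, not the actual `config_resolution.py`:
```python
import os


def resolve_value(key: str, run_config: dict, template_defaults: dict) -> str | None:
    """Environment variable wins, then run.yaml, then the distribution template default."""
    return os.environ.get(key.upper()) or run_config.get(key) or template_defaults.get(key)


# e.g. INFERENCE_MODEL set in the environment overrides the value from run.yaml
model = resolve_value("inference_model", {"inference_model": "llama-2-7b"}, {})
```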
---
## 12. Extension Points for Developers
### Adding a Custom Provider
1. **Create provider module**:
```python
llama_stack/providers/remote/inference/my_provider/
├── __init__.py
├── config.py # MyProviderConfig
└── my_provider.py # MyProviderImpl(InferenceProvider)
```
2. **Register in registry**:
```python
# llama_stack/providers/registry/inference.py
RemoteProviderSpec(
api=Api.inference,
adapter_type="my_provider",
provider_type="remote::my_provider",
config_class="...MyProviderConfig",
module="llama_stack.providers.remote.inference.my_provider",
)
```
3. **Use in distribution**:
```yaml
providers:
inference:
- provider_id: my_provider
provider_type: remote::my_provider
config: {...}
```
### Adding a Custom API
1. Define protocol in `llama_stack/apis/my_api/my_api.py`
2. Implement providers
3. Register in resolver and distributions
4. Add CLI support if needed
---
## 13. Storage & Persistence
### Storage Backends
**File**: `llama_stack/core/storage/datatypes.py`
#### KV Store (Key-Value)
- Store metadata: models, shields, vector stores
- Backends: SQLite (inline), Redis, Postgres
#### SQL Store
- Store structured data: conversations, datasets
- Backends: SQLite (inline), Postgres
#### Inference Store
- Cache inference results for recording/replay
- Used in testing
### Storage Configuration
```yaml
storage:
type: sqlite
config:
dir: ~/.llama/distributions/starter
```
---
## 14. Telemetry & Tracing
### Tracing System
**File**: `llama_stack/providers/utils/telemetry/`
- Automatic request tracing with OpenTelemetry
- Trace context propagation across async calls
- Integration with OpenTelemetry collectors
### Telemetry API
Providers can implement the Telemetry API to collect metrics:
- Token usage
- Latency
- Error rates
- Custom metrics
---
## 15. Model System
### Model Registry
**File**: `llama_stack/models/llama/sku_list.py`
```python
resolve_model("meta-llama/Llama-2-7b")
→ Llama2Model(...)
```
Maps model IDs to their:
- Architecture
- Tokenizer
- Quantization options
- Required resources
### Supported Models
- **Llama 3** - Full architecture support
- **Llama 3.1** - Extended context
- **Llama 3.2** - Multimodal support
- **Llama 4** - Latest generation
- **Custom models** - Via provider registration
### Model Quantization
- int8, int4
- GPTQ
- Hadamard transform
- Custom quantizers
---
## 16. Key Files to Understand
### For Understanding Core Concepts
1. `llama_stack/core/datatypes.py` - Configuration data types
2. `llama_stack/providers/datatypes.py` - Provider specs
3. `llama_stack/apis/inference/inference.py` - Example API
### For Understanding Runtime
1. `llama_stack/core/stack.py` - Main runtime class
2. `llama_stack/core/resolver.py` - Dependency resolution
3. `llama_stack/core/library_client.py` - In-process client
### For Understanding Providers
1. `llama_stack/providers/registry/inference.py` - Inference provider registry
2. `llama_stack/providers/inline/inference/meta_reference/inference.py` - Example inline
3. `llama_stack/providers/remote/inference/openai/openai.py` - Example remote
### For Understanding Distributions
1. `llama_stack/distributions/template.py` - Distribution template
2. `llama_stack/distributions/starter/starter.py` - Starter distro
3. `llama_stack/cli/stack/run.py` - Distribution startup
---
## 17. Development Workflow
### Running Locally
```bash
# Install dependencies
uv sync --all-groups
# Run a distribution (auto-starts server)
llama stack run starter
# In another terminal, interact with it
curl http://localhost:8321/health
```
### Testing
```bash
# Unit tests (fast, no external dependencies)
uv run --group unit pytest tests/unit/
# Integration tests (with record-replay)
uv run --group test pytest tests/integration/ --stack-config=starter
# Re-record integration tests (record real API calls)
LLAMA_STACK_TEST_INFERENCE_MODE=record \
uv run --group test pytest tests/integration/ --stack-config=starter
```
### Building Distributions
```bash
# Build Starter distribution
llama stack build starter --name my-starter
# Run it
llama stack run my-starter
```
---
## 18. Notable Implementation Details
### Async-First Architecture
- All I/O is async (using `asyncio`)
- Streaming responses with `AsyncIterator`
- FastAPI for HTTP server (built on Starlette)
### Streaming Support
- Inference responses stream tokens
- Agents stream turn-by-turn updates
- Proper async context preservation
### Error Handling
- Structured errors with detailed messages
- Graceful degradation when dependencies unavailable
- Provider health checks
### Extensibility
- External providers via module import
- Custom APIs via ExternalApiSpec
- Plugin discovery via provider registry
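As a rough sketch of how module-path plugin loading typically works; the entry-point function name here is illustrative, not a confirmed llama_stack API:
```python
import importlib


def load_provider(spec, config, deps: dict):
    """Import the provider module named in its spec and construct the implementation."""
    module = importlib.import_module(spec.module)   # e.g. "llama_stack.providers.remote.inference.ollama"
    factory = getattr(module, "get_provider_impl")  # entry-point name is illustrative
    return factory(config, deps)
```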
---
## 19. Typical Request Flow
```
User Request (e.g., chat completion)
CLI or SDK Client
HTTP Request → FastAPI Server (port 8321)
Route Handler (e.g., /inference/chat-completion)
Router (Auto-Routed API)
→ Determine which provider has the model
Provider Implementation (e.g., OpenAI, Ollama, Meta Reference)
External Service or Local Execution
Response (streaming or complete)
Send back to Client
```
---
## 20. Key Takeaways
1. **Unified APIs**: Single abstraction for 27+ AI capabilities
2. **Pluggable Providers**: 50+ implementations (inline & remote)
3. **Configuration-Driven**: Switch providers via YAML, not code
4. **Distributions**: Pre-verified bundles for common scenarios
5. **Record-Replay Testing**: Cost-effective integration tests
6. **Two Client Modes**: Library (in-process) or HTTP (distributed)
7. **Smart Routing**: Automatic request routing to appropriate providers
8. **Async-First**: Native streaming and concurrent request handling
9. **Extensible**: Custom APIs and providers easily added
10. **Production-Ready**: Health checks, telemetry, access control, storage
---
## Architecture Diagram
```
┌─────────────────────────────────────────────────────────────┐
│ Client Applications │
│ (CLI, SDK, Web UI, Custom Apps) │
└────────────────────┬────────────────────────────────────────┘
┌───────────┴────────────┐
│ │
┌────▼────────┐ ┌───────▼──────┐
│ Library │ │ HTTP Server │
│ Client │ │ (FastAPI) │
└────┬────────┘ └───────┬──────┘
│ │
└───────────┬───────────┘
┌──────────▼──────────┐
│ LlamaStack Class │
│ (implements all │
│ 27 APIs) │
└──────────┬──────────┘
┌──────────────┼──────────────┐
│ │ │
│ Router │ Routing │ Resource
│ (Auto- │ Tables │ Registries
│ routed │ (Models, │ (Models,
│ APIs) │ Shields) │ Shields,
│ │ │ etc.)
└──────────────┼──────────────┘
┌────────────┴──────────────┐
│ │
┌────▼──────────┐ ┌──────────▼─────┐
│ Inline │ │ Remote │
│ Providers │ │ Providers │
│ │ │ │
│ • Meta Ref │ │ • OpenAI │
│ • FAISS │ │ • Ollama │
│ • Llama Guard │ │ • Qdrant │
│ • etc. │ │ • etc. │
│ │ │ │
└───────────────┘ └─────────────────┘
│ │
│ │
Local Execution External Services
(GPUs/CPUs) (APIs/Servers)
```

QUICK_REFERENCE.md Normal file

@ -0,0 +1,222 @@
# Llama Stack - Quick Reference Guide
## Key Concepts at a Glance
### The Three Pillars
1. **APIs** (`llama_stack/apis/`) - Abstract interfaces (27 total)
2. **Providers** (`llama_stack/providers/`) - Implementations (50+ total)
3. **Distributions** (`llama_stack/distributions/`) - Pre-configured bundles
### Directory Map for Quick Navigation
| Component | Location | Purpose |
|-----------|----------|---------|
| Inference API | `apis/inference/inference.py` | LLM chat, completion, embeddings |
| Agents API | `apis/agents/agents.py` | Multi-turn agent orchestration |
| Safety API | `apis/safety/safety.py` | Content filtering |
| Vector IO API | `apis/vector_io/vector_io.py` | Vector database operations |
| Core Stack | `core/stack.py` | Main orchestrator (implements all APIs) |
| Provider Resolver | `core/resolver.py` | Dependency injection & instantiation |
| Inline Inference | `providers/inline/inference/` | Local model execution |
| Remote Inference | `providers/remote/inference/` | API providers (OpenAI, Ollama, etc.) |
| CLI Entry Point | `cli/llama.py` | Command-line interface |
| Starter Distribution | `distributions/starter/` | Basic multi-provider setup |
## Common Tasks
### Understanding an API
1. Read the API definition: `llama_stack/apis/[api_name]/[api_name].py`
2. Check common types: `llama_stack/apis/common/`
3. Look at providers: `llama_stack/providers/registry/[api_name].py`
4. Examine an implementation: `llama_stack/providers/inline/[api_name]/[provider]/`
### Adding a Provider
1. Create module: `llama_stack/providers/remote/[api]/[provider_name]/`
2. Implement class extending the API protocol
3. Register in: `llama_stack/providers/registry/[api].py`
4. Add to distribution: `llama_stack/distributions/[distro]/[distro].py`
### Debugging a Request
1. Check routing: `llama_stack/core/routers/` or `routing_tables/`
2. Find provider: `llama_stack/providers/registry/[api].py`
3. Read implementation: `llama_stack/providers/[inline|remote]/[api]/[provider]/`
4. Check config: Look for `Config` class in provider module
### Running Tests
```bash
# Unit tests (fast)
uv run --group unit pytest tests/unit/
# Integration tests (with replay)
uv run --group test pytest tests/integration/ --stack-config=starter
# Re-record tests
LLAMA_STACK_TEST_INFERENCE_MODE=record uv run --group test pytest tests/integration/
```
## Core Classes to Know
### ProviderSpec Hierarchy
```
ProviderSpec (base)
├── InlineProviderSpec (in-process)
└── RemoteProviderSpec (external services)
```
### Key Runtime Classes
- **LlamaStack** (`core/stack.py`) - Main class implementing all APIs
- **StackRunConfig** (`core/datatypes.py`) - Configuration for a stack
- **ProviderRegistry** (`core/resolver.py`) - Maps APIs to providers
### Key Data Classes
- **Provider** - Concrete provider instance with config
- **Model** - Registered model (from a provider)
- **OpenAIChatCompletion** - Response format (from Inference API)
## Configuration Files
### run.yaml Structure
```yaml
version: 2
providers:
[api_name]:
- provider_id: unique_name
provider_type: inline::name or remote::name
config: {} # Provider-specific config
default_models:
- identifier: model_id
provider_id: inference_provider_id
vector_stores_config:
default_provider_id: faiss_or_other
```
### Environment Variables
Override any config value:
```bash
INFERENCE_MODEL=llama-2-7b llama stack run starter
```
## Common File Patterns
### Inline Provider Structure
```
llama_stack/providers/inline/[api]/[provider]/
├── __init__.py # Exports adapter class
├── config.py # ConfigClass
├── [provider].py # AdapterImpl(ProtocolClass)
└── [utils].py # Helper modules
```
### Remote Provider Structure
```
llama_stack/providers/remote/[api]/[provider]/
├── __init__.py # Exports adapter class
├── config.py # ConfigClass
└── [provider].py # AdapterImpl with HTTP calls
```
### API Structure
```
llama_stack/apis/[api]/
├── __init__.py # Exports main protocol
├── [api].py # Main protocol definition
└── [supporting].py # Types and supporting classes
```
## Key Design Patterns
### Pattern 1: Auto-Routed APIs
Provider selected automatically based on resource ID
```python
# Router finds which provider has this model
await inference.post_chat_completion(model="llama-2-7b")
```
### Pattern 2: Routing Tables
Registry APIs that list/register resources
```python
# Returns merged list from all providers
await models.list_models()
# Router selects provider internally
await models.register_model(model)
```
### Pattern 3: Dependency Injection
Providers depend on other APIs
```python
class AgentProvider:
def __init__(self, inference: InferenceProvider, ...):
self.inference = inference
```
## Important Numbers
- **27 APIs** total in Llama Stack
- **30+ Inference Providers** (OpenAI, Anthropic, Groq, local, etc.)
- **10+ Vector IO Providers** (FAISS, Qdrant, ChromaDB, etc.)
- **5+ Safety Providers** (Llama Guard, Bedrock, etc.)
- **7 Built-in Distributions** (starter, starter-gpu, meta-reference-gpu, etc.)
## Quick Commands
```bash
# List all APIs
llama stack list-apis
# List all providers
llama stack list-providers [api_name]
# List distributions
llama stack list
# Show dependencies for a distribution
llama stack list-deps starter
# Start a distribution on custom port
llama stack run starter --port 8322
# Interact with running server
curl http://localhost:8321/health
```
## File Size Reference (to judge complexity)
| File | Size | Complexity |
|------|------|-----------|
| inference.py (API) | 46KB | High (30+ parameters) |
| stack.py (core) | 21KB | High (orchestration) |
| resolver.py (core) | 19KB | High (dependency resolution) |
| library_client.py (core) | 20KB | Medium (client implementation) |
| template.py (distributions) | 18KB | Medium (config generation) |
## Testing Quick Reference
### Record-Replay Testing
1. **Record**: `LLAMA_STACK_TEST_INFERENCE_MODE=record pytest ...`
2. **Replay**: `pytest ...` (default, no network calls)
3. **Location**: `tests/integration/[api]/cassettes/`
4. **Format**: YAML files with request/response pairs
### Test Structure
- Unit tests: No external dependencies
- Integration tests: Use actual providers (record-replay)
- Common fixtures: `tests/unit/conftest.py`, `tests/integration/conftest.py`
## Common Debugging Tips
1. **Provider not loading?** → Check `llama_stack/providers/registry/[api].py`
2. **Config validation error?** → Check provider's `Config` class
3. **Import error?** → Verify `pip_packages` in ProviderSpec
4. **Routing not working?** → Check `llama_stack/core/routers/` or `routing_tables/`
5. **Test failing?** → Check cassettes in `tests/integration/[api]/cassettes/`
## Most Important Files for Beginners
1. `pyproject.toml` - Project metadata & entry points
2. `llama_stack/core/stack.py` - Understand the main class
3. `llama_stack/core/resolver.py` - Understand how providers are loaded
4. `llama_stack/apis/inference/inference.py` - Understand an API
5. `llama_stack/providers/registry/inference.py` - See all inference providers
6. `llama_stack/distributions/starter/starter.py` - See how distributions work


@ -0,0 +1,354 @@
# Notion API Upgrade Analysis (v2025-09-03)
## Executive Summary
Notion released API version `2025-09-03` with **breaking changes** introducing first-class support for multi-source databases. This document analyzes the impact on the llama-stack project documentation system.
**Current Status:** ✅ No immediate action required
**Recommendation:** Monitor announcements, prepare a migration plan, and stay on v2022-06-28
---
## Current Configuration
### API Version in Use
```bash
NOTION_VERSION="2022-06-28"
NOTION_API_BASE="https://api.notion.com/v1"
```
### Databases
- **Llama Stack Database:** `299a94d48e1080f5bf20ef9b61b66daf`
- **Documentation Database:** `1fba94d48e1080709d4df69e9c0f0532`
- **Troubleshooting Database:** `1fda94d48e10804d843ee491d647b204`
### Primary Operations
- Creating pages in databases (`POST /v1/pages`)
- Publishing markdown documentation
- Registry tracking
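For reference, a minimal sketch of the first operation above under the current version; the `Name` title property is an assumption about the database schema, not a confirmed field:
```bash
curl -s -X POST "https://api.notion.com/v1/pages" \
  -H "Authorization: Bearer $NOTION_BEARER_TOKEN" \
  -H "Notion-Version: $NOTION_VERSION" \
  -H "Content-Type: application/json" \
  --data '{
    "parent": {"database_id": "299a94d48e1080f5bf20ef9b61b66daf"},
    "properties": {
      "Name": {"title": [{"text": {"content": "Example documentation page"}}]}
    }
  }'
```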
---
## Breaking Changes Overview
### What Changed
Notion introduced **multi-source databases** - allowing a single database to contain multiple linked data sources. This fundamentally changes how pages are created and queried.
**Key Concept Change:**
```
OLD: One database = one data source (implicit)
NEW: One database = multiple data sources (explicit)
```
### When Changes Take Effect
**Immediately** upon upgrading to `2025-09-03`. No grace period or compatibility mode.
---
## Detailed Impact Analysis
### 1. Page Creation (CRITICAL)
**Current Method (v2022-06-28):**
```json
{
"parent": {
"database_id": "299a94d48e1080f5bf20ef9b61b66daf"
},
"properties": {...},
"children": [...]
}
```
**New Method (v2025-09-03):**
```json
{
"parent": {
"data_source_id": "xxxx-xxxx-xxxx" // Must fetch first!
},
"properties": {...},
"children": [...]
}
```
**Migration Required:**
1. Fetch data source IDs for each database
2. Replace `database_id` with `data_source_id`
3. Update all JSON templates
4. Update upload scripts
### 2. Database Queries
**Current:**
```bash
POST /v1/databases/{database_id}/query
```
**New:**
```bash
POST /v1/data_sources/{data_source_id}/query
```
**Impact:** Query scripts must switch to the new endpoint and pass data source IDs instead of database IDs
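A hedged sketch of the updated call, assuming the `LLAMA_STACK_DATA_SOURCE_ID` variable introduced in the migration plan below and assuming `filter`/`sorts` parameters carry over from the old database query API:
```bash
# Query a data source under the new version
# (an empty body is assumed to return all pages, mirroring the old endpoint)
curl -s -X POST "https://api.notion.com/v1/data_sources/$LLAMA_STACK_DATA_SOURCE_ID/query" \
  -H "Authorization: Bearer $NOTION_BEARER_TOKEN" \
  -H "Notion-Version: 2025-09-03" \
  -H "Content-Type: application/json" \
  --data '{}'
```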
### 3. Data Source ID Discovery
**New Required Step:**
```bash
# Must call before any operations
curl -X GET "https://api.notion.com/v1/databases/299a94d48e1080f5bf20ef9b61b66daf" \
-H "Authorization: Bearer $NOTION_BEARER_TOKEN" \
-H "Notion-Version: 2025-09-03"
# Response includes data_sources array
{
"data_sources": [
{"id": "actual-id-to-use-for-operations", ...}
]
}
```
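Chaining discovery with page creation gives a hedged end-to-end sketch of the new flow; picking the first data source and using a `Name` title property are assumptions:
```bash
# 1. Fetch the data source ID for the Llama Stack database
DATA_SOURCE_ID=$(curl -s -X GET "https://api.notion.com/v1/databases/299a94d48e1080f5bf20ef9b61b66daf" \
  -H "Authorization: Bearer $NOTION_BEARER_TOKEN" \
  -H "Notion-Version: 2025-09-03" | jq -r '.data_sources[0].id')

# 2. Create the page against the data source instead of the database
curl -s -X POST "https://api.notion.com/v1/pages" \
  -H "Authorization: Bearer $NOTION_BEARER_TOKEN" \
  -H "Notion-Version: 2025-09-03" \
  -H "Content-Type: application/json" \
  --data '{
    "parent": {"data_source_id": "'"$DATA_SOURCE_ID"'"},
    "properties": {
      "Name": {"title": [{"text": {"content": "Migration test page"}}]}
    }
  }'
```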
### 4. Search API Changes
**Current:**
```json
{
"filter": {
"property": "object",
"value": "database"
}
}
```
**New:**
```json
{
"filter": {
"property": "object",
"value": "data_source" // Changed!
}
}
```
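Wrapped in a full request, the new filter is sent to the same `/v1/search` endpoint; a minimal sketch:
```bash
curl -s -X POST "https://api.notion.com/v1/search" \
  -H "Authorization: Bearer $NOTION_BEARER_TOKEN" \
  -H "Notion-Version: 2025-09-03" \
  -H "Content-Type: application/json" \
  --data '{"filter": {"property": "object", "value": "data_source"}}'
```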
### 5. Webhook Events
**Event Name Changes:**
```
database.created → data_source.created
database.updated → data_source.updated
database.deleted → data_source.deleted
```
---
## Risk Assessment
### Low Risk Factors ✅
- We control when to upgrade (explicit version in API calls)
- Backward compatibility maintained for old versions
- Simple migration path (mostly find/replace)
- Limited scope (documentation publishing only)
### Medium Risk Factors ⚠️
- **No deprecation timeline announced** - could become urgent without warning
- **User-triggered failures** - if database owners add additional data sources to our databases
- **Multiple databases to migrate** - 3+ databases to update
### High Risk Factors ❌
- None currently identified
---
## Migration Requirements
### Configuration Updates
**Add to .env:**
```bash
# Current
NOTION_VERSION="2022-06-28"
# After migration
NOTION_VERSION="2025-09-03"
# New variables needed
LLAMA_STACK_DATA_SOURCE_ID="[to-be-fetched]"
DOCS_DATA_SOURCE_ID="[to-be-fetched]"
TROUBLESHOOTING_DATA_SOURCE_ID="[to-be-fetched]"
```
### Script Updates
**Files requiring changes:**
1. `scripts/upload_notion_with_gpg.sh`
2. `docs/knowledge/upload_notion_validated.sh`
3. All JSON templates in `docs/knowledge/*-notion.json`
**Required modifications:**
- Add data source ID fetching logic
- Replace `database_id` with `data_source_id` in all API calls
- Update error handling for new response formats
### Documentation Updates
**Files to update:**
1. `docs/knowledge/notion-publishing-workflow.md`
2. `docs/knowledge/notion-collaboration-guide.md`
3. `docs/knowledge/doc_registry.md`
4. Project README sections on Notion integration
---
## Migration Plan
### Phase 1: Discovery (When Ready)
```bash
#!/bin/bash
# fetch_data_source_ids.sh
DATABASES=(
"299a94d48e1080f5bf20ef9b61b66daf:LLAMA_STACK"
"1fba94d48e1080709d4df69e9c0f0532:DOCS"
"1fda94d48e10804d843ee491d647b204:TROUBLESHOOTING"
)
for db_info in "${DATABASES[@]}"; do
db_id="${db_info%%:*}"
db_name="${db_info##*:}"
echo "Fetching data source for $db_name..."
curl -X GET "https://api.notion.com/v1/databases/$db_id" \
-H "Authorization: Bearer $NOTION_BEARER_TOKEN" \
-H "Notion-Version: 2025-09-03" | \
jq -r ".data_sources[0].id" > "/tmp/${db_name}_data_source_id.txt"
done
```
### Phase 2: Update Configuration
```bash
# Update .env with fetched IDs
LLAMA_STACK_DATA_SOURCE_ID=$(cat /tmp/LLAMA_STACK_data_source_id.txt)
DOCS_DATA_SOURCE_ID=$(cat /tmp/DOCS_data_source_id.txt)
TROUBLESHOOTING_DATA_SOURCE_ID=$(cat /tmp/TROUBLESHOOTING_data_source_id.txt)
```
### Phase 3: Update Scripts
```bash
# Find all JSON templates
find docs/knowledge -name "*-notion.json" -type f
# Update database_id to data_source_id
sed -i 's/"database_id":/"data_source_id":/g' docs/knowledge/*-notion.json
# Update shell scripts
# (Manual review and update required)
```
### Phase 4: Test & Validate
1. Create test page in development database
2. Verify page creation works
3. Test query operations
4. Validate search functionality
5. Check webhook events (if used)
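For steps 1-2 above, a hedged smoke test; `TEST_DATA_SOURCE_ID` and the `Name` title property are assumptions about the development database:
```bash
# Create one page in the development data source and report the HTTP status
status=$(curl -s -o /tmp/notion_smoke_test.json -w "%{http_code}" \
  -X POST "https://api.notion.com/v1/pages" \
  -H "Authorization: Bearer $NOTION_BEARER_TOKEN" \
  -H "Notion-Version: 2025-09-03" \
  -H "Content-Type: application/json" \
  --data '{
    "parent": {"data_source_id": "'"$TEST_DATA_SOURCE_ID"'"},
    "properties": {"Name": {"title": [{"text": {"content": "Smoke test"}}]}}
  }')
echo "HTTP $status"
jq -r '.id // .message' /tmp/notion_smoke_test.json   # page ID on success, error message otherwise
```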
### Phase 5: Production Migration
1. Backup current .env configuration
2. Apply all changes
3. Test with single document
4. Roll out to all operations
5. Update documentation
---
## Timeline & Recommendations
### Immediate (Now)
✅ Document the analysis (this document)
✅ Monitor the Notion changelog
✅ Prepare migration scripts (do not execute them yet)
### Short Term (3 Months)
- Stay on v2022-06-28
- No action required
- Continue monitoring
### Medium Term (When Announced)
- Execute migration when deprecation announced
- Or when multi-source features needed
- Or after 6+ months of stability
### Long Term
- Periodic reviews of Notion API changes
- Keep migration scripts updated
- Document all configuration changes
---
## Decision Matrix
### Stay on v2022-06-28 IF:
✅ Current version works without issues
✅ No deprecation timeline announced
✅ No need for multi-source features
✅ Prefer stability over new features
### Upgrade to v2025-09-03 IF:
- Deprecation announced for v2022-06-28
- Need multi-source database features
- Databases modified by owners (forced upgrade)
- 6+ months have passed (stability proven)
---
## Monitoring Strategy
### Quarterly Checks
1. Review Notion developer changelog
2. Check for deprecation announcements
3. Verify that the current integration still works
4. Update migration scripts if needed
### Triggers for Immediate Action
🚨 Deprecation notice for v2022-06-28
🚨 Database owners add additional data sources to our databases
🚨 Current version shows instability
🚨 Critical security fixes in new version
---
## Resources
### Official Documentation
- Upgrade Guide: https://developers.notion.com/docs/upgrade-guide-2025-09-03
- API Reference: https://developers.notion.com/reference
- Changelog: https://developers.notion.com/changelog
### Internal Documentation
- `docs/knowledge/notion-publishing-workflow.md`
- `docs/knowledge/notion-collaboration-guide.md`
- `docs/knowledge/doc_registry.md`
---
## Conclusion
**Current Recommendation:** **Do NOT upgrade yet**
**Rationale:**
- No immediate benefit
- No deprecation pressure
- Current system stable
- Migration effort not justified
**Next Review:** 3 months from now (or when Notion announces deprecation)
**Prepared By:** Claude Code
**Date:** October 2025
**Version:** 1.0

228
docs/UI_BUG_FIXES.md Normal file
View file

@ -0,0 +1,228 @@
# Llama Stack UI Bug Fixes
This document details two critical UI bugs identified and fixed in the Llama Stack chat playground interface.
## Bug Fix 1: Agent Instructions Overflow
### Problem Description
The Agent Instructions field in the chat playground settings panel was overflowing its container, causing text to overlap with the "Agent Tools" section below. This occurred when agent instructions exceeded the fixed-height container (96px / `h-24`).
**Symptoms:**
- Long instruction text overflowed beyond the container boundaries
- Text overlapped with "Agent Tools" section
- No scrolling mechanism available
- Poor user experience when viewing lengthy instructions
**Location:** `llama_stack/ui/app/chat-playground/page.tsx:1467`
### Root Cause
The Agent Instructions display div had:
- Fixed height (`h-24` = 96px)
- No overflow handling
- Text wrapping enabled but no scroll capability
```tsx
// BEFORE (Broken)
<div className="w-full h-24 px-3 py-2 text-sm border border-input rounded-md bg-muted text-muted-foreground">
{instructions}
</div>
```
### Solution
Added `overflow-y-auto` to enable vertical scrolling when content exceeds the fixed height.
```tsx
// AFTER (Fixed)
<div className="w-full h-24 px-3 py-2 text-sm border border-input rounded-md bg-muted text-muted-foreground overflow-y-auto">
{instructions}
</div>
```
**Changes:**
- File: `llama_stack/ui/app/chat-playground/page.tsx`
- Line: 1467
- Change: Added `overflow-y-auto` to className
### Benefits
- Text wraps naturally within container
- Scrollbar appears automatically when needed
- No overlap with sections below
- Maintains read-only design intent
- Improved user experience
---
## Bug Fix 2: Duplicate Content in Chat Responses
### Problem Description
Chat assistant responses were appearing twice within a single message bubble. The content would stream in correctly, then duplicate itself at the end of the response, resulting in confusing and unprofessional output.
**Symptoms:**
- Content appeared once during streaming
- Same content duplicated after stream completion
- Duplication occurred within single message bubble
- Affected all assistant responses during streaming
**Example from logs:**
```
Response:
<think>reasoning</think>
Answer content here
<think>reasoning</think> ← DUPLICATE
Answer content here ← DUPLICATE
```
**Location:** `llama_stack/ui/app/chat-playground/page.tsx:790-1094`
### Root Cause Analysis
The streaming API sends two types of events:
1. **Delta chunks** (incremental):
- "Hello"
- " world"
- "!"
- Accumulated: `fullContent = "Hello world!"`
2. **turn_complete event** (final):
- Contains the **complete accumulated content**
- Sent after streaming finishes
**The Bug:** The `processChunk` function was extracting text from both:
- Streaming deltas (lines 790-1025) ✅
- `turn_complete` event's `turn.output_message.content` (lines 930-942) ❌
This caused the accumulated content to be **appended again** to `fullContent`, resulting in duplication.
### Solution
Added an early return in `processChunk` to skip `turn_complete` events entirely, since we already have the complete content from streaming deltas.
```tsx
// AFTER (Fixed) - Added at line 795
const processChunk = (
chunk: unknown
): { text: string | null; isToolCall: boolean } => {
const chunkObj = chunk as Record<string, unknown>;
// Skip turn_complete events to avoid duplicate content
// These events contain the full accumulated content which we already have from streaming deltas
if (
chunkObj?.event &&
typeof chunkObj.event === "object" &&
chunkObj.event !== null
) {
const event = chunkObj.event as Record<string, unknown>;
if (
event?.payload &&
typeof event.payload === "object" &&
event.payload !== null
) {
const payload = event.payload as Record<string, unknown>;
if (payload.event_type === "turn_complete") {
return { text: null, isToolCall: false };
}
}
}
// ... rest of function continues
}
```
**Changes:**
- File: `llama_stack/ui/app/chat-playground/page.tsx`
- Lines: 795-813 (new code block)
- Change: Added early return check for `turn_complete` events
### Validation
Tested with an actual log file from `/home/asallas/workarea/logs/applications/llama-stack/llm_req_res.log`, which showed:
- Original response (lines 5-62)
- Duplicate content (lines 63-109)
After fix:
- Only original response appears once
- No duplication
- All content types work correctly (text, code blocks, thinking blocks)
### Benefits
- Clean, professional responses
- No confusing duplicate content
- Maintains all functionality (tool calls, RAG, etc.)
- Improved user experience
- Confirms our understanding of the streaming event flow
---
## Testing Performed
### Agent Instructions Overflow
✅ Tested with short instructions (no scrollbar needed)
✅ Tested with long instructions (scrollbar appears)
✅ Verified no overlap with sections below
✅ Confirmed read-only behavior maintained
### Duplicate Content Fix
✅ Tested with simple text responses
✅ Tested with multi-paragraph responses
✅ Tested with code blocks
✅ Tested with thinking blocks (`<think>`)
✅ Tested with tool calls
✅ Tested with RAG queries
✅ Validated with production log files
---
## Files Modified
1. `llama_stack/ui/app/chat-playground/page.tsx`
- Line 1467: Added `overflow-y-auto` for Agent Instructions
- Lines 795-813: Added `turn_complete` event filtering
---
## Related Documentation
- Chat Playground Architecture: `llama_stack/ui/app/chat-playground/`
- Message Components: `llama_stack/ui/components/chat-playground/`
- API Integration: Llama Stack Agents API
---
## Future Considerations
### Agent Instructions
- Consider making instructions editable after creation (requires API change)
- Add copy-to-clipboard button for long instructions
- Implement instruction templates
### Streaming Architecture
- Monitor for other event types that might cause similar issues
- Add debug mode to log event types during streaming
- Consider telemetry for streaming errors
---
## Impact
**User Experience:**
- ✅ Professional, clean chat interface
- ✅ No confusing duplicate content
- ✅ Better handling of long agent instructions
- ✅ Improved reliability
**Code Quality:**
- ✅ Better understanding of streaming event flow
- ✅ More robust event handling
- ✅ Clear separation of delta vs final events
**Maintenance:**
- ✅ Well-documented fixes
- ✅ Clear root cause understanding
- ✅ Testable and verifiable

View file

@ -0,0 +1,25 @@
# Llama Stack Documentation Registry
This registry tracks all Llama Stack documentation pages uploaded to Notion.
## Registry Format
| Page Title | Page ID | Tags | Created By | URL |
|-----------|---------|------|------------|-----|
| [Llama Stack Architecture Index - Part 1](https://www.notion.so/Llama-Stack-Architecture-Index-Part-2-299a94d48e1081d79946c5c538f4623e) | `299a94d48e1081d79946c5c538f4623e` | Architecture, Onboarding | Claude Code | https://www.notion.so/Llama-Stack-Architecture-Index-Part-2-299a94d48e1081d79946c5c538f4623e |
| [Llama Stack Architecture Index - Part 2](https://www.notion.so/Llama-Stack-Architecture-Index-Part-2-299a94d48e10815883b6e0e34350f3db) | `299a94d48e10815883b6e0e34350f3db` | Architecture, Onboarding | Claude Code | https://www.notion.so/Llama-Stack-Architecture-Index-Part-2-299a94d48e10815883b6e0e34350f3db |
| [Llama Stack UI Bug Fixes - Part 1](https://www.notion.so/Llama-Stack-UI-Bug-Fixes-Part-1-299a94d48e10812e883cfb902a50b106) | `299a94d48e10812e883cfb902a50b106` | Bug Fixes, UI, Development | Claude Code | https://www.notion.so/Llama-Stack-UI-Bug-Fixes-Part-1-299a94d48e10812e883cfb902a50b106 |
| [Llama Stack UI Bug Fixes - Part 2](https://www.notion.so/Llama-Stack-UI-Bug-Fixes-Part-2-299a94d48e108104b479fe9012161d3e) | `299a94d48e108104b479fe9012161d3e` | Bug Fixes, UI, Development | Claude Code | https://www.notion.so/Llama-Stack-UI-Bug-Fixes-Part-2-299a94d48e108104b479fe9012161d3e |
| [Llama Stack - Notion API Upgrade Analysis - Part 1](https://www.notion.so/Llama-Stack-Notion-API-Upgrade-Analysis-Part-1-299a94d48e108157a133cd8bacd78bfc) | `299a94d48e108157a133cd8bacd78bfc` | Notion API, Analysis, Documentation | Claude Code | https://www.notion.so/Llama-Stack-Notion-API-Upgrade-Analysis-Part-1-299a94d48e108157a133cd8bacd78bfc |
| [Llama Stack - Notion API Upgrade Analysis - Part 2](https://www.notion.so/Llama-Stack-Notion-API-Upgrade-Analysis-Part-2-299a94d48e108146b552d50b34302a01) | `299a94d48e108146b552d50b34302a01` | Notion API, Analysis, Documentation | Claude Code | https://www.notion.so/Llama-Stack-Notion-API-Upgrade-Analysis-Part-2-299a94d48e108146b552d50b34302a01 |
## Summary
- **Total Pages**: 6
- **Architecture Documentation**: 2 pages
- **Bug Fixes Documentation**: 2 pages
- **Analysis Documentation**: 2 pages
## Database ID
All pages are stored in Notion database: `299a94d48e1080f5bf20ef9b61b66daf`

View file

@ -792,6 +792,26 @@ export default function ChatPlaygroundPage() {
): { text: string | null; isToolCall: boolean } => {
const chunkObj = chunk as Record<string, unknown>;
// Skip turn_complete events to avoid duplicate content
// These events contain the full accumulated content which we already have from streaming deltas
if (
chunkObj?.event &&
typeof chunkObj.event === "object" &&
chunkObj.event !== null
) {
const event = chunkObj.event as Record<string, unknown>;
if (
event?.payload &&
typeof event.payload === "object" &&
event.payload !== null
) {
const payload = event.payload as Record<string, unknown>;
if (payload.event_type === "turn_complete") {
return { text: null, isToolCall: false };
}
}
}
// helper to check if content contains function call JSON
const containsToolCall = (content: string): boolean => {
return (
@ -1464,7 +1484,7 @@ export default function ChatPlaygroundPage() {
<label className="text-sm font-medium block mb-2">
Agent Instructions
</label>
<div className="w-full h-24 px-3 py-2 text-sm border border-input rounded-md bg-muted text-muted-foreground">
<div className="w-full h-24 px-3 py-2 text-sm border border-input rounded-md bg-muted text-muted-foreground overflow-y-auto">
{(selectedAgentId &&
agents.find(a => a.agent_id === selectedAgentId)
?.agent_config?.instructions) ||