fix: UI bug fixes and comprehensive architecture documentation
- Fixed Agent Instructions overflow by adding vertical scrolling
- Fixed duplicate chat content by skipping turn_complete events
- Added comprehensive architecture documentation (4 files)
- Added UI bug fixes documentation
- Added Notion API upgrade analysis
- Created documentation registry for Notion pages

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
This commit is contained in:
parent
3059423cd7
commit
5ef6ccf90e
8 changed files with 2603 additions and 1 deletion
620 ARCHITECTURE_INSIGHTS.md Normal file
@@ -0,0 +1,620 @@
# Llama Stack - Architecture Insights for Developers

## Why This Architecture Works

### Problem It Solves

Without Llama Stack, building AI applications requires:

- Learning different APIs for each provider (OpenAI, Anthropic, Groq, Ollama, etc.)
- Rewriting code to switch providers
- Duplicating logic for common patterns (safety checks, vector search, etc.)
- Managing complex dependencies manually

### Solution: The Three Pillars

```
Single, Unified API Interface
            ↓
Multiple Provider Implementations
            ↓
Pre-configured Distributions
```

**Result**: Write once, run anywhere (locally, cloud, on-device)

---
## The Genius of the Plugin Architecture

### How It Works

1. **Define Abstract Interface** (Protocol in `apis/`)

```python
class Inference(Protocol):
    async def post_chat_completion(...) -> AsyncIterator[...]: ...
```

2. **Multiple Implementations** (in `providers/`)
   - Local: Meta Reference, vLLM, Ollama
   - Cloud: OpenAI, Anthropic, Groq, Bedrock
   - Each implements the same interface

3. **Runtime Selection** (via YAML config)

```yaml
providers:
  inference:
    - provider_type: remote::openai
```

4. **Zero Code Changes** to switch providers!

### Why This Beats Individual SDKs

- **Single SDK** vs. 30+ provider SDKs
- **Same API** vs. learning each provider's quirks
- **Easy migration** - change one config value
- **Testing** - the same tests work across all providers

---
## The Request Routing Intelligence

### Two Clever Routing Strategies

#### 1. Auto-Routed APIs (Smart Dispatch)

**APIs**: Inference, Safety, VectorIO, Eval, Scoring, DatasetIO, ToolRuntime

When you call:

```python
await inference.post_chat_completion(model="llama-2-7b")
```

the router automatically:

- figures out which provider serves `llama-2-7b`
- routes the request there
- streams the response back

**Implementation**: the `routers/` directory contains the auto-routers
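To make the dispatch concrete, here is a minimal sketch of the idea, not the actual code in `routers/`; the `InferenceRouter` class and its `model_to_provider` registry are illustrative assumptions:

```python
# Minimal sketch of auto-routing (illustrative; not the actual routers/ code).
from typing import Any, Protocol


class InferenceProvider(Protocol):
    async def post_chat_completion(self, model: str, **kwargs: Any) -> Any: ...


class InferenceRouter:
    """Dispatches each request to whichever provider registered the model."""

    def __init__(self, model_to_provider: dict[str, InferenceProvider]) -> None:
        # Hypothetical registry: model ID -> provider instance.
        self.model_to_provider = model_to_provider

    async def post_chat_completion(self, model: str, **kwargs: Any) -> Any:
        provider = self.model_to_provider.get(model)
        if provider is None:
            raise ValueError(f"No provider registered for model {model!r}")
        # Same signature as the provider, so callers never see the dispatch.
        return await provider.post_chat_completion(model=model, **kwargs)
```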
#### 2. Routing Table APIs (Registry Pattern)

**APIs**: Models, Shields, VectorStores, Datasets, Benchmarks, ToolGroups, ScoringFunctions

When you call:

```python
all_models = await models.list_models()  # Merged list from ALL providers
```

the router:

- Queries each provider
- Merges the results
- Returns one unified list

**Implementation**: the `routing_tables/` directory
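And a matching sketch of the registry pattern; the `ModelsRoutingTable` class and the `list_models` provider method below are assumed for illustration, not taken from `routing_tables/`:

```python
# Minimal sketch of a routing table that merges registry listings
# across providers (illustrative; not the actual routing_tables/ code).
import asyncio
from typing import Any, Protocol


class ModelsProvider(Protocol):
    async def list_models(self) -> list[Any]: ...


class ModelsRoutingTable:
    def __init__(self, providers: list[ModelsProvider]) -> None:
        self.providers = providers

    async def list_models(self) -> list[Any]:
        # Query every provider concurrently and return one merged list.
        results = await asyncio.gather(*(p.list_models() for p in self.providers))
        return [model for provider_models in results for model in provider_models]
```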
### Why This Matters

- **Users don't think about providers** - just use the API
- **Multiple implementations work** - router handles dispatch
- **Easy scaling** - add new providers without touching user code
- **Resource management** - router knows what's available

---
## Configuration as a Weapon

### The Power of YAML Over Code

Traditional approach:

```python
# Code changes needed for each provider!
if use_openai:
    from openai import OpenAI
    client = OpenAI(api_key=...)
elif use_ollama:
    from ollama import Client
    client = Client(url=...)
# etc.
```

Llama Stack approach:

```yaml
# Zero code changes!
providers:
  inference:
    - provider_type: remote::openai
      config:
        api_key: ${env.OPENAI_API_KEY}
```

Then later, change to:

```yaml
providers:
  inference:
    - provider_type: remote::ollama
      config:
        host: localhost
```

**Same application code** works with both!

### Environment Variable Magic

```bash
# Change provider at runtime
INFERENCE_MODEL=llama-2-70b llama stack run starter

# No redeployment needed!
```

---
## The Distributions Strategy

### Problem: "Works on My Machine"

- Different developers need different setups
- Production needs different providers than development
- CI/CD needs lightweight dependencies

### Solution: Pre-verified Distributions

```
starter             → Works on CPU with free APIs (Ollama + OpenAI)
starter-gpu         → Works on GPU machines
meta-reference-gpu  → Works with a full local setup
postgres-demo       → Production-grade with persistent storage
```

Each distribution:

- Pre-selects working providers
- Sets sensible defaults
- Bundles required dependencies
- Is tested end-to-end

**Result**: `llama stack run starter` just works for 80% of use cases.

### Why This Beats Documentation

- **No setup guides needed** - the distribution does the setup
- **No guessing** - curated, tested combinations
- **Reproducible** - the same distro always works the same way
- **Upgradeable** - updating the distro brings improvements

---
## The Testing Genius: Record-Replay

### Traditional Testing Hell for AI

Problem:

- API calls cost money
- API responses are non-deterministic
- Each provider has a different response format
- Tests become slow and flaky

### The Record-Replay Solution

First run (record):

```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/
# Makes real API calls, saves responses to YAML
```

All subsequent runs (replay):

```bash
pytest tests/integration/
# Returns cached responses, NO API calls, instant results
```

### Why This is Brilliant

- **Cost**: record once, replay thousands of times without paying for API calls again
- **Speed**: cached responses mean near-instant test execution
- **Reliability**: deterministic results (no API variability)
- **Coverage**: one test works with OpenAI, Ollama, Anthropic, etc.

**File location**: `tests/integration/[api]/cassettes/`
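A minimal sketch of the record-replay idea, assuming a request-hash keyed YAML cassette; the real test harness is more involved, and the `cached_call` helper, `mode` flag, and file layout here are illustrative only:

```python
# Minimal sketch of record-replay caching (illustrative, not the real harness).
import hashlib
import json
from pathlib import Path
from typing import Any, Awaitable, Callable

import yaml  # assumes PyYAML is installed


async def cached_call(
    request: dict[str, Any],
    make_real_call: Callable[[dict[str, Any]], Awaitable[dict[str, Any]]],
    cassette_dir: Path,
    mode: str = "replay",  # "record" or "replay"
) -> dict[str, Any]:
    # Key the cassette by a stable hash of the request payload.
    key = hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()
    cassette = cassette_dir / f"{key}.yaml"

    if mode == "replay" and cassette.exists():
        return yaml.safe_load(cassette.read_text())  # no network, deterministic

    response = await make_real_call(request)  # real API call, costs money
    cassette_dir.mkdir(parents=True, exist_ok=True)
    cassette.write_text(yaml.safe_dump(response))
    return response
```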
---
## Core Runtime: The Stack Class

### The Elegance of Inheritance

```python
class LlamaStack(
    Inference,   # Chat completion, embeddings
    Agents,      # Multi-turn orchestration
    Safety,      # Content filtering
    VectorIO,    # Vector operations
    Tools,       # Function execution
    Eval,        # Evaluation
    Scoring,     # Response scoring
    Models,      # Model registry
    # ... 19 more APIs
):
    pass
```

A single `LlamaStack` instance:

- Implements 27 different APIs
- Has 50+ providers backing it
- Routes requests intelligently
- Manages dependencies

All from a ~400-line file plus the protocol definitions!

---
## Dependency Injection Without the Complexity

### How Providers Depend on Each Other

Problem: Agents need Inference, and Inference needs the Models registry.

```python
class AgentProvider:
    def __init__(self,
                 inference: InferenceProvider,
                 safety: SafetyProvider,
                 tool_runtime: ToolRuntimeProvider):
        self.inference = inference
        self.safety = safety
        self.tool_runtime = tool_runtime
```

### How It Gets Resolved

**File**: `core/resolver.py`

1. Parse `run.yaml` - which providers are enabled?
2. Build the dependency graph - who depends on whom?
3. Topological sort - in what order to instantiate?
4. Instantiate in order - each provider receives its dependencies

**Result**: Complex dependency chains are handled automatically! A sketch of the idea follows.
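Here is a compact sketch of steps 2-4 above, using Python's standard `graphlib`; the `specs` shape and factory signatures are assumptions for illustration, not the actual structures in `core/resolver.py`:

```python
# Minimal sketch of dependency-ordered provider instantiation
# (illustrative; the real logic lives in core/resolver.py).
from graphlib import TopologicalSorter
from typing import Any, Callable

# Hypothetical provider specs: name -> (dependencies, factory taking resolved deps).
specs: dict[str, tuple[list[str], Callable[..., Any]]] = {
    "models": ([], lambda: "models-registry"),
    "inference": (["models"], lambda models: f"inference(using {models})"),
    "safety": (["inference"], lambda inference: f"safety(using {inference})"),
    "agents": (["inference", "safety"], lambda inference, safety: "agents"),
}

# Build the graph and topologically sort it so dependencies come first.
graph = {name: set(deps) for name, (deps, _) in specs.items()}
instances: dict[str, Any] = {}
for name in TopologicalSorter(graph).static_order():
    deps, factory = specs[name]
    instances[name] = factory(*(instances[d] for d in deps))

print(instances["agents"])  # every provider got its dependencies injected
```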
---
## The Client Duality

### Two Ways to Use Llama Stack

#### 1. Library Mode (In-Process)

```python
from llama_stack import AsyncLlamaStackAsLibraryClient

client = await AsyncLlamaStackAsLibraryClient.create(run_config)
response = await client.inference.post_chat_completion(...)
```

- No HTTP overhead
- Direct Python API
- Embedded in application
- **File**: `core/library_client.py`

#### 2. Server Mode (HTTP)

```bash
llama stack run starter  # Start server on port 8321
```

```python
from llama_stack_client import AsyncLlamaStackClient

client = AsyncLlamaStackClient(base_url="http://localhost:8321")
response = await client.inference.post_chat_completion(...)
```

- Distributed architecture
- Share a single server across apps
- Easy deployment
- Language-agnostic clients (Python, TypeScript, Swift, Kotlin)

**Result**: Same API, different deployment strategies!

---
## The Model System Insight

### Why It Exists

Problem: different model IDs across providers

- HuggingFace: `meta-llama/Llama-2-7b`
- Ollama: `llama2`
- OpenAI: `gpt-4`

### Solution: Universal Model Registry

**File**: `models/llama/sku_list.py`

```python
resolve_model("meta-llama/Llama-2-7b")
# Returns a Model object with:
# - Architecture info
# - Tokenizer
# - Quantization options
# - Resource requirements
```

This allows:

- Consistent model IDs across providers (illustrated below)
- Intelligent resource allocation
- Provider-agnostic inference
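To illustrate why one canonical registry helps, a hypothetical alias-normalization sketch follows; the `CANONICAL_ALIASES` table and `resolve_alias` helper are made up for this example and are not part of `sku_list.py`:

```python
# Hypothetical sketch: normalize provider-specific IDs to one canonical model ID.
CANONICAL_ALIASES: dict[str, str] = {
    "meta-llama/Llama-2-7b": "meta-llama/Llama-2-7b",  # HuggingFace-style ID
    "llama2": "meta-llama/Llama-2-7b",                 # Ollama-style ID
}


def resolve_alias(model_id: str) -> str:
    """Map any known provider-specific ID to the canonical registry ID."""
    try:
        return CANONICAL_ALIASES[model_id]
    except KeyError:
        raise ValueError(f"Unknown model ID: {model_id}") from None


assert resolve_alias("llama2") == "meta-llama/Llama-2-7b"
```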
---
## The CLI Is Smart

### It Does More Than You Think

```bash
llama stack run starter
```

This command:

1. Resolves the starter distribution template
2. Merges with environment variables
3. Creates/updates `~/.llama/distributions/starter/run.yaml`
4. Installs missing dependencies
5. Starts the HTTP server on port 8321
6. Initializes all providers
7. Registers available models
8. Ready for requests

**No separate build step needed!** (unless building Docker images)

### Introspection Commands

```bash
llama stack list-apis          # See all 27 APIs
llama stack list-providers     # See all 50+ providers
llama stack list               # See all distributions
llama stack list-deps starter  # See what to install
```

Used for documentation, debugging, and automation.

---
## Storage: The Oft-Overlooked Component

### Three Storage Types

1. **KV Store** - metadata (models, shields)
2. **SQL Store** - structured data (conversations, datasets)
3. **Inference Store** - caching (for testing)

### Why Multiple Backends Matter

- Development: SQLite (no dependencies)
- Production: PostgreSQL (scalable)
- Distributed: Redis (shared state)
- Testing: in-memory (fast)

**Files**:

- `core/storage/datatypes.py` - interfaces (sketched below)
- `providers/utils/kvstore/` - implementations
- `providers/utils/sqlstore/` - implementations
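A hedged sketch of the interface-plus-backends idea; the `KVStore` protocol and the two toy backends below are illustrative, not the actual definitions in `core/storage/datatypes.py`:

```python
# Illustrative sketch of a swappable KV-store interface with two backends.
import sqlite3
from typing import Protocol


class KVStore(Protocol):
    def get(self, key: str) -> str | None: ...
    def set(self, key: str, value: str) -> None: ...


class InMemoryKVStore:
    """Fast, ephemeral backend - handy for tests."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> str | None:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class SqliteKVStore:
    """Durable single-file backend - handy for local development."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

    def get(self, key: str) -> str | None:
        row = self._conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    def set(self, key: str, value: str) -> None:
        self._conn.execute("INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value))
        self._conn.commit()


def make_kvstore(backend: str) -> KVStore:
    # Backend selection mirrors how a distribution config picks a store.
    return SqliteKVStore("kv.db") if backend == "sqlite" else InMemoryKVStore()
```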
---
## Telemetry: Built-In Observability

### What Gets Traced

- Every API call
- Token usage (if the provider reports it)
- Latency
- Errors
- Custom metrics from providers

### Integration

- OpenTelemetry compatible (see the sketch below)
- Automatic context propagation
- Works across async boundaries
- **File**: `providers/utils/telemetry/`
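As a rough illustration, the sketch below wraps an inference call with the standard `opentelemetry-api` package; Llama Stack's own helpers in `providers/utils/telemetry/` are structured differently, and the span and attribute names here are assumptions:

```python
# Illustrative OpenTelemetry-style tracing around an inference call
# (uses opentelemetry-api; not Llama Stack's own telemetry helpers).
from opentelemetry import trace

tracer = trace.get_tracer("llama_stack.example")


async def traced_chat_completion(inference, model: str, messages: list) -> object:
    # start_as_current_span propagates context across awaits automatically.
    with tracer.start_as_current_span("inference.chat_completion") as span:
        span.set_attribute("model", model)
        try:
            return await inference.post_chat_completion(model=model, messages=messages)
        except Exception as exc:
            span.record_exception(exc)  # errors show up on the trace
            raise
```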
---
## Extension Strategy: How to Add Custom Functionality

### Adding a Custom API

1. Create a protocol in `apis/my_api/my_api.py`
2. Implement providers (inline and/or remote)
3. Register it in `core/resolver.py`
4. Add it to distributions

### Adding a Custom Provider

1. Create a module in `providers/[inline|remote]/[api]/[provider]/`
2. Implement the config and adapter classes (sketched below)
3. Register it in `providers/registry/[api].py`
4. Use it in the distribution YAML
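A hedged skeleton of step 2 for a remote inference provider; the class names, fields, and method set below are illustrative assumptions rather than the exact base classes the codebase requires:

```python
# Illustrative skeleton of a custom remote inference provider:
# a Pydantic config class plus an adapter that implements the Inference protocol.
from typing import Any, AsyncIterator

from pydantic import BaseModel


class MyProviderConfig(BaseModel):
    # Fields map 1:1 to the provider's `config:` block in run.yaml.
    base_url: str
    api_key: str


class MyInferenceAdapter:
    def __init__(self, config: MyProviderConfig) -> None:
        self.config = config

    async def initialize(self) -> None:
        # e.g. open an HTTP session, verify credentials
        ...

    async def post_chat_completion(self, model: str, **kwargs: Any) -> AsyncIterator[Any]:
        # Translate the unified request into this provider's wire format,
        # call the remote service, and yield chunks back in the unified shape.
        raise NotImplementedError
```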
### Adding a Custom Distribution

1. Create a subdirectory in `distributions/[name]/`
2. Implement the template in `[name].py`
3. Register it in distribution discovery

---
## Common Misconceptions Clarified

### "APIs are HTTP endpoints"

**Wrong** - APIs are Python protocols. HTTP comes later via FastAPI.

- The "Inference" API is just a Python Protocol
- Providers implement it
- The core wraps it with HTTP for server mode
- Library mode uses it directly

### "Providers are all external services"

**Wrong** - Providers can be:

- Inline (local execution): Meta Reference, FAISS, Llama Guard
- Remote (external services): OpenAI, Ollama, Qdrant

Inline providers have low latency and no dependency on external services.

### "You must run a server"

**Wrong** - Two modes:

- Server mode: `llama stack run starter` (HTTP)
- Library mode: import and use directly in Python

### "Distributions are just Docker images"

**Wrong** - Distributions are:

- Templates (what providers to use)
- Configs (how to configure them)
- Dependencies (what to install)
- Can be Docker OR local Python

---
## Performance Implications

### Inline Providers Are Fast

```
Inline (e.g., Meta Reference)
├─ 0ms network latency
├─ No HTTP serialization/deserialization
├─ Direct GPU access
└─ Fast (but high resource cost)

Remote (e.g., OpenAI)
├─ 100-500ms network latency
├─ HTTP serialization overhead
├─ Low resource cost
└─ Slower (but cheap)
```

### Streaming Is Native

```python
response = await inference.post_chat_completion(model=..., stream=True)
async for chunk in response:
    print(chunk.delta)  # Process token by token
```

Tokens arrive as they're generated; there is no waiting for the full response.

---
## Security Considerations

### API Keys Are Config

```yaml
inference:
  - provider_id: openai
    config:
      api_key: ${env.OPENAI_API_KEY}  # From environment
```

Keys are never hardcoded; they always come from environment variables.

### Access Control

**File**: `core/access_control/`

Providers can implement access rules (a toy example follows):

- Per-user restrictions
- Per-model restrictions
- Rate limiting
- Audit logging
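A toy sketch of a per-user, per-model rule check; the `AccessRule` shape and `is_allowed` helper are entirely illustrative and not taken from `core/access_control/`:

```python
# Toy access-control check: allow a request only if some rule matches.
from dataclasses import dataclass


@dataclass
class AccessRule:
    user: str
    allowed_models: set[str]


def is_allowed(rules: list[AccessRule], user: str, model: str) -> bool:
    """Return True if any rule grants `user` access to `model`."""
    return any(rule.user == user and model in rule.allowed_models for rule in rules)


rules = [AccessRule(user="alice", allowed_models={"llama-2-7b"})]
assert is_allowed(rules, "alice", "llama-2-7b")
assert not is_allowed(rules, "bob", "llama-2-7b")
```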
### Sensitive Field Redaction

Config logging automatically redacts (see the sketch below):

- API keys
- Passwords
- Tokens
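A minimal sketch of how such redaction can work; the `SENSITIVE_KEYS` set and `redact_config` helper are illustrative, not the actual redaction code:

```python
# Illustrative config redaction before logging.
SENSITIVE_KEYS = {"api_key", "password", "token"}


def redact_config(config: dict) -> dict:
    """Return a copy of `config` with sensitive values masked, recursively."""
    redacted = {}
    for key, value in config.items():
        if isinstance(value, dict):
            redacted[key] = redact_config(value)
        elif key.lower() in SENSITIVE_KEYS:
            redacted[key] = "********"
        else:
            redacted[key] = value
    return redacted


print(redact_config({"provider": "openai", "config": {"api_key": "sk-secret"}}))
# {'provider': 'openai', 'config': {'api_key': '********'}}
```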
---
## Maturity Indicators

### Signs of Production-Ready Design

1. **Separated Concerns** - APIs, Providers, Distributions
2. **Plugin Architecture** - Easy to extend
3. **Configuration Over Code** - Deploy without recompiling
4. **Comprehensive Testing** - Unit + integration with record-replay
5. **Multiple Client Options** - Library + server modes
6. **Storage Abstraction** - Multiple backends
7. **Dependency Management** - Automatic resolution
8. **Error Handling** - Structured, informative errors
9. **Observability** - Built-in telemetry
10. **Documentation** - Distributions + CLI introspection

Llama Stack has all 10!

---
## Key Architectural Decisions

### Why Async/Await Throughout?

- Modern Python standard
- Works well with streaming
- Natural for I/O-heavy operations (API calls, GPU operations)

### Why Pydantic for Config?

- Type validation
- Auto-documentation
- JSON schema generation
- Easy serialization (all sketched below)
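A small illustration of those properties with Pydantic v2; the `OpenAIConfig` model here is a made-up example, not a real provider config class:

```python
# Illustrative Pydantic v2 config model: validation, schema, serialization.
from pydantic import BaseModel, ValidationError


class OpenAIConfig(BaseModel):
    api_key: str
    base_url: str = "https://api.openai.com/v1"
    timeout_seconds: int = 30


cfg = OpenAIConfig(api_key="sk-...")              # defaults filled in, types validated
print(cfg.model_dump())                            # easy serialization to a plain dict
print(OpenAIConfig.model_json_schema()["title"])   # JSON schema for documentation

try:
    OpenAIConfig(api_key="sk-...", timeout_seconds="not a number")
except ValidationError as exc:
    print("rejected bad config:", exc.error_count(), "error(s)")
```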
### Why Protocol Classes for APIs?

- Define interface without implementation
- Multiple implementations possible
- Type hints work with duck typing
- Minimal magic

### Why YAML for Config?

- Human readable
- Environment variable support
- Comments allowed
- Wide tool support

### Why Record-Replay for Tests?

- Cost efficient
- Deterministic
- Real behavior captured
- Provider-agnostic

---
## The Learning Path for Contributors

### Understanding Order

1. **Start**: `pyproject.toml` - Entry point
2. **Learn**: `core/datatypes.py` - Data structures
3. **Understand**: `apis/inference/inference.py` - Example API
4. **See**: `providers/registry/inference.py` - Provider registry
5. **Read**: `providers/inline/inference/meta_reference/` - Inline provider
6. **Read**: `providers/remote/inference/openai/` - Remote provider
7. **Study**: `core/resolver.py` - How it all connects
8. **Understand**: `core/stack.py` - Main orchestrator
9. **See**: `distributions/starter/` - How to use it
10. **Run**: `tests/integration/` - How to test

Each step builds on previous understanding.

---
## The Elegant Parts

### Most Elegant: The Router

The router system is beautiful:

- Transparent to users
- Automatic provider selection
- Works with 1 or 100 providers
- No hardcoding needed

### Most Flexible: YAML Config

Configuration as a first-class citizen:

- Switch providers without code
- Override at runtime
- Version-control friendly
- Documentation via config

### Most Useful: Record-Replay Tests

The testing pattern solves real problems:

- Cost
- Speed
- Reliability
- Coverage

### Most Scalable: Distribution Templates

Pre-configured bundles:

- One command to start
- Verified combinations
- Easy to document
- Simple to teach

---
## The Future

### What's Being Built

- More providers (Nvidia, SambaNova, etc.)
- More APIs (more task types)
- On-device execution (ExecuTorch)
- Better observability (more telemetry)
- Easier extensions (simpler API for custom providers)

### How It Stays Maintainable

- Protocol-based design limits coupling
- Clear separation of concerns
- Comprehensive testing
- Configuration over code
- Plugin architecture

The architecture is **future-proof** by design.