fix: UI bug fixes and comprehensive architecture documentation
- Fixed Agent Instructions overflow by adding vertical scrolling
- Fixed duplicate chat content by skipping turn_complete events
- Added comprehensive architecture documentation (4 files)
- Added UI bug fixes documentation
- Added Notion API upgrade analysis
- Created documentation registry for Notion pages

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
This commit is contained in:
parent
3059423cd7
commit
5ef6ccf90e
8 changed files with 2603 additions and 1 deletion
620 ARCHITECTURE_INSIGHTS.md Normal file
@@ -0,0 +1,620 @@
# Llama Stack - Architecture Insights for Developers

## Why This Architecture Works

### Problem It Solves

Without Llama Stack, building AI applications requires:

- Learning different APIs for each provider (OpenAI, Anthropic, Groq, Ollama, etc.)
- Rewriting code to switch providers
- Duplicating logic for common patterns (safety checks, vector search, etc.)
- Managing complex dependencies manually

### Solution: The Three Pillars

```
Single, Unified API Interface
            ↓
Multiple Provider Implementations
            ↓
Pre-configured Distributions
```

**Result**: Write once, run anywhere (locally, cloud, on-device)

---
## The Genius of the Plugin Architecture

### How It Works

1. **Define Abstract Interface** (Protocol in `apis/`)

```python
class Inference(Protocol):
    async def post_chat_completion(...) -> AsyncIterator[...]: ...
```

2. **Multiple Implementations** (in `providers/`)
   - Local: Meta Reference, vLLM, Ollama
   - Cloud: OpenAI, Anthropic, Groq, Bedrock
   - Each implements the same interface

3. **Runtime Selection** (via YAML config)

```yaml
providers:
  inference:
    - provider_type: remote::openai
```

4. **Zero Code Changes** to switch providers!

### Why This Beats Individual SDKs

- **Single SDK** vs. 30+ provider SDKs
- **Same API** vs. learning each provider's quirks
- **Easy migration** - change one config value
- **Testing** - the same tests work across all providers

---
## The Request Routing Intelligence

### Two Clever Routing Strategies

#### 1. Auto-Routed APIs (Smart Dispatch)

**APIs**: Inference, Safety, VectorIO, Eval, Scoring, DatasetIO, ToolRuntime

When you call:

```python
await inference.post_chat_completion(model="llama-2-7b")
```

the router automatically:

- figures out which provider serves `llama-2-7b`
- routes the request there
- streams the response back

**Implementation**: the `routers/` directory contains the auto-routers
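To make the dispatch concrete, here is a minimal sketch of the idea, not the actual code in `routers/`; the `InferenceRouter` class and its `model_to_provider` registry are illustrative assumptions:

```python
# Minimal sketch of auto-routing (illustrative; not the actual routers/ code).
from typing import Any, Protocol


class InferenceProvider(Protocol):
    async def post_chat_completion(self, model: str, **kwargs: Any) -> Any: ...


class InferenceRouter:
    """Dispatches each request to whichever provider registered the model."""

    def __init__(self, model_to_provider: dict[str, InferenceProvider]) -> None:
        # Hypothetical registry: model ID -> provider instance.
        self.model_to_provider = model_to_provider

    async def post_chat_completion(self, model: str, **kwargs: Any) -> Any:
        provider = self.model_to_provider.get(model)
        if provider is None:
            raise ValueError(f"No provider registered for model {model!r}")
        # Same signature as the provider, so callers never see the dispatch.
        return await provider.post_chat_completion(model=model, **kwargs)
```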
#### 2. Routing Table APIs (Registry Pattern)

**APIs**: Models, Shields, VectorStores, Datasets, Benchmarks, ToolGroups, ScoringFunctions

When you call:

```python
all_models = await models.list_models()  # Merged list from ALL providers
```

the router:

- Queries each provider
- Merges the results
- Returns one unified list

**Implementation**: the `routing_tables/` directory
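And a matching sketch of the registry pattern; the `ModelsRoutingTable` class and the `list_models` provider method below are assumed for illustration, not taken from `routing_tables/`:

```python
# Minimal sketch of a routing table that merges registry listings
# across providers (illustrative; not the actual routing_tables/ code).
import asyncio
from typing import Any, Protocol


class ModelsProvider(Protocol):
    async def list_models(self) -> list[Any]: ...


class ModelsRoutingTable:
    def __init__(self, providers: list[ModelsProvider]) -> None:
        self.providers = providers

    async def list_models(self) -> list[Any]:
        # Query every provider concurrently and return one merged list.
        results = await asyncio.gather(*(p.list_models() for p in self.providers))
        return [model for provider_models in results for model in provider_models]
```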
### Why This Matters

- **Users don't think about providers** - just use the API
- **Multiple implementations work** - router handles dispatch
- **Easy scaling** - add new providers without touching user code
- **Resource management** - router knows what's available

---
## Configuration as a Weapon

### The Power of YAML Over Code

Traditional approach:

```python
# Code changes needed for each provider!
if use_openai:
    from openai import OpenAI
    client = OpenAI(api_key=...)
elif use_ollama:
    from ollama import Client
    client = Client(url=...)
# etc.
```

Llama Stack approach:

```yaml
# Zero code changes!
providers:
  inference:
    - provider_type: remote::openai
      config:
        api_key: ${env.OPENAI_API_KEY}
```

Then later, change to:

```yaml
providers:
  inference:
    - provider_type: remote::ollama
      config:
        host: localhost
```

**Same application code** works with both!

### Environment Variable Magic

```bash
# Change provider at runtime
INFERENCE_MODEL=llama-2-70b llama stack run starter

# No redeployment needed!
```

---
## The Distributions Strategy

### Problem: "Works on My Machine"

- Different developers need different setups
- Production needs different providers than development
- CI/CD needs lightweight dependencies

### Solution: Pre-verified Distributions

```
starter             → Works on CPU with free APIs (Ollama + OpenAI)
starter-gpu         → Works on GPU machines
meta-reference-gpu  → Works with a full local setup
postgres-demo       → Production-grade with persistent storage
```

Each distribution:

- Pre-selects working providers
- Sets sensible defaults
- Bundles required dependencies
- Is tested end-to-end

**Result**: `llama stack run starter` just works for 80% of use cases.

### Why This Beats Documentation

- **No setup guides needed** - the distribution does the setup
- **No guessing** - curated, tested combinations
- **Reproducible** - the same distro always works the same way
- **Upgradeable** - updating the distro brings improvements

---
## The Testing Genius: Record-Replay

### Traditional Testing Hell for AI

Problem:

- API calls cost money
- API responses are non-deterministic
- Each provider has a different response format
- Tests become slow and flaky

### The Record-Replay Solution

First run (record):

```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/
# Makes real API calls, saves responses to YAML
```

All subsequent runs (replay):

```bash
pytest tests/integration/
# Returns cached responses, NO API calls, instant results
```

### Why This is Brilliant

- **Cost**: record once, replay thousands of times without paying for API calls again
- **Speed**: cached responses mean near-instant test execution
- **Reliability**: deterministic results (no API variability)
- **Coverage**: one test works with OpenAI, Ollama, Anthropic, etc.

**File location**: `tests/integration/[api]/cassettes/`
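A minimal sketch of the record-replay idea, assuming a request-hash keyed YAML cassette; the real test harness is more involved, and the `cached_call` helper, `mode` flag, and file layout here are illustrative only:

```python
# Minimal sketch of record-replay caching (illustrative, not the real harness).
import hashlib
import json
from pathlib import Path
from typing import Any, Awaitable, Callable

import yaml  # assumes PyYAML is installed


async def cached_call(
    request: dict[str, Any],
    make_real_call: Callable[[dict[str, Any]], Awaitable[dict[str, Any]]],
    cassette_dir: Path,
    mode: str = "replay",  # "record" or "replay"
) -> dict[str, Any]:
    # Key the cassette by a stable hash of the request payload.
    key = hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()
    cassette = cassette_dir / f"{key}.yaml"

    if mode == "replay" and cassette.exists():
        return yaml.safe_load(cassette.read_text())  # no network, deterministic

    response = await make_real_call(request)  # real API call, costs money
    cassette_dir.mkdir(parents=True, exist_ok=True)
    cassette.write_text(yaml.safe_dump(response))
    return response
```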
---
## Core Runtime: The Stack Class

### The Elegance of Inheritance

```python
class LlamaStack(
    Inference,   # Chat completion, embeddings
    Agents,      # Multi-turn orchestration
    Safety,      # Content filtering
    VectorIO,    # Vector operations
    Tools,       # Function execution
    Eval,        # Evaluation
    Scoring,     # Response scoring
    Models,      # Model registry
    # ... 19 more APIs
):
    pass
```

A single `LlamaStack` instance:

- Implements 27 different APIs
- Has 50+ providers backing it
- Routes requests intelligently
- Manages dependencies

All from a ~400-line file plus the protocol definitions!

---
## Dependency Injection Without the Complexity

### How Providers Depend on Each Other

Problem: Agents need Inference, and Inference needs the Models registry.

```python
class AgentProvider:
    def __init__(self,
                 inference: InferenceProvider,
                 safety: SafetyProvider,
                 tool_runtime: ToolRuntimeProvider):
        self.inference = inference
        self.safety = safety
        self.tool_runtime = tool_runtime
```

### How It Gets Resolved

**File**: `core/resolver.py`

1. Parse `run.yaml` - which providers are enabled?
2. Build the dependency graph - who depends on whom?
3. Topological sort - in what order to instantiate?
4. Instantiate in order - each provider receives its dependencies

**Result**: Complex dependency chains are handled automatically! A sketch of the idea follows.
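Here is a compact sketch of steps 2-4 above, using Python's standard `graphlib`; the `specs` shape and factory signatures are assumptions for illustration, not the actual structures in `core/resolver.py`:

```python
# Minimal sketch of dependency-ordered provider instantiation
# (illustrative; the real logic lives in core/resolver.py).
from graphlib import TopologicalSorter
from typing import Any, Callable

# Hypothetical provider specs: name -> (dependencies, factory taking resolved deps).
specs: dict[str, tuple[list[str], Callable[..., Any]]] = {
    "models": ([], lambda: "models-registry"),
    "inference": (["models"], lambda models: f"inference(using {models})"),
    "safety": (["inference"], lambda inference: f"safety(using {inference})"),
    "agents": (["inference", "safety"], lambda inference, safety: "agents"),
}

# Build the graph and topologically sort it so dependencies come first.
graph = {name: set(deps) for name, (deps, _) in specs.items()}
instances: dict[str, Any] = {}
for name in TopologicalSorter(graph).static_order():
    deps, factory = specs[name]
    instances[name] = factory(*(instances[d] for d in deps))

print(instances["agents"])  # every provider got its dependencies injected
```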
---
## The Client Duality

### Two Ways to Use Llama Stack

#### 1. Library Mode (In-Process)

```python
from llama_stack import AsyncLlamaStackAsLibraryClient

client = await AsyncLlamaStackAsLibraryClient.create(run_config)
response = await client.inference.post_chat_completion(...)
```

- No HTTP overhead
- Direct Python API
- Embedded in application
- **File**: `core/library_client.py`

#### 2. Server Mode (HTTP)

```bash
llama stack run starter  # Start server on port 8321
```

```python
from llama_stack_client import AsyncLlamaStackClient

client = AsyncLlamaStackClient(base_url="http://localhost:8321")
response = await client.inference.post_chat_completion(...)
```

- Distributed architecture
- Share a single server across apps
- Easy deployment
- Language-agnostic clients (Python, TypeScript, Swift, Kotlin)

**Result**: Same API, different deployment strategies!

---
## The Model System Insight

### Why It Exists

Problem: different model IDs across providers

- HuggingFace: `meta-llama/Llama-2-7b`
- Ollama: `llama2`
- OpenAI: `gpt-4`

### Solution: Universal Model Registry

**File**: `models/llama/sku_list.py`

```python
resolve_model("meta-llama/Llama-2-7b")
# Returns a Model object with:
# - Architecture info
# - Tokenizer
# - Quantization options
# - Resource requirements
```

This allows:

- Consistent model IDs across providers (illustrated below)
- Intelligent resource allocation
- Provider-agnostic inference
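To illustrate why one canonical registry helps, a hypothetical alias-normalization sketch follows; the `CANONICAL_ALIASES` table and `resolve_alias` helper are made up for this example and are not part of `sku_list.py`:

```python
# Hypothetical sketch: normalize provider-specific IDs to one canonical model ID.
CANONICAL_ALIASES: dict[str, str] = {
    "meta-llama/Llama-2-7b": "meta-llama/Llama-2-7b",  # HuggingFace-style ID
    "llama2": "meta-llama/Llama-2-7b",                 # Ollama-style ID
}


def resolve_alias(model_id: str) -> str:
    """Map any known provider-specific ID to the canonical registry ID."""
    try:
        return CANONICAL_ALIASES[model_id]
    except KeyError:
        raise ValueError(f"Unknown model ID: {model_id}") from None


assert resolve_alias("llama2") == "meta-llama/Llama-2-7b"
```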
---
## The CLI Is Smart

### It Does More Than You Think

```bash
llama stack run starter
```

This command:

1. Resolves the starter distribution template
2. Merges with environment variables
3. Creates/updates `~/.llama/distributions/starter/run.yaml`
4. Installs missing dependencies
5. Starts the HTTP server on port 8321
6. Initializes all providers
7. Registers available models
8. Ready for requests

**No separate build step needed!** (unless building Docker images)

### Introspection Commands

```bash
llama stack list-apis          # See all 27 APIs
llama stack list-providers     # See all 50+ providers
llama stack list               # See all distributions
llama stack list-deps starter  # See what to install
```

Used for documentation, debugging, and automation.

---
## Storage: The Oft-Overlooked Component

### Three Storage Types

1. **KV Store** - metadata (models, shields)
2. **SQL Store** - structured data (conversations, datasets)
3. **Inference Store** - caching (for testing)

### Why Multiple Backends Matter

- Development: SQLite (no dependencies)
- Production: PostgreSQL (scalable)
- Distributed: Redis (shared state)
- Testing: in-memory (fast)

**Files**:

- `core/storage/datatypes.py` - interfaces (sketched below)
- `providers/utils/kvstore/` - implementations
- `providers/utils/sqlstore/` - implementations
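A hedged sketch of the interface-plus-backends idea; the `KVStore` protocol and the two toy backends below are illustrative, not the actual definitions in `core/storage/datatypes.py`:

```python
# Illustrative sketch of a swappable KV-store interface with two backends.
import sqlite3
from typing import Protocol


class KVStore(Protocol):
    def get(self, key: str) -> str | None: ...
    def set(self, key: str, value: str) -> None: ...


class InMemoryKVStore:
    """Fast, ephemeral backend - handy for tests."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> str | None:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class SqliteKVStore:
    """Durable single-file backend - handy for local development."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

    def get(self, key: str) -> str | None:
        row = self._conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    def set(self, key: str, value: str) -> None:
        self._conn.execute("INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value))
        self._conn.commit()


def make_kvstore(backend: str) -> KVStore:
    # Backend selection mirrors how a distribution config picks a store.
    return SqliteKVStore("kv.db") if backend == "sqlite" else InMemoryKVStore()
```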
---
## Telemetry: Built-In Observability

### What Gets Traced

- Every API call
- Token usage (if the provider reports it)
- Latency
- Errors
- Custom metrics from providers

### Integration

- OpenTelemetry compatible (see the sketch below)
- Automatic context propagation
- Works across async boundaries
- **File**: `providers/utils/telemetry/`
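As a rough illustration, the sketch below wraps an inference call with the standard `opentelemetry-api` package; Llama Stack's own helpers in `providers/utils/telemetry/` are structured differently, and the span and attribute names here are assumptions:

```python
# Illustrative OpenTelemetry-style tracing around an inference call
# (uses opentelemetry-api; not Llama Stack's own telemetry helpers).
from opentelemetry import trace

tracer = trace.get_tracer("llama_stack.example")


async def traced_chat_completion(inference, model: str, messages: list) -> object:
    # start_as_current_span propagates context across awaits automatically.
    with tracer.start_as_current_span("inference.chat_completion") as span:
        span.set_attribute("model", model)
        try:
            return await inference.post_chat_completion(model=model, messages=messages)
        except Exception as exc:
            span.record_exception(exc)  # errors show up on the trace
            raise
```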
---
## Extension Strategy: How to Add Custom Functionality

### Adding a Custom API

1. Create a protocol in `apis/my_api/my_api.py`
2. Implement providers (inline and/or remote)
3. Register it in `core/resolver.py`
4. Add it to distributions

### Adding a Custom Provider

1. Create a module in `providers/[inline|remote]/[api]/[provider]/`
2. Implement the config and adapter classes (sketched below)
3. Register it in `providers/registry/[api].py`
4. Use it in the distribution YAML
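A hedged skeleton of step 2 for a remote inference provider; the class names, fields, and method set below are illustrative assumptions rather than the exact base classes the codebase requires:

```python
# Illustrative skeleton of a custom remote inference provider:
# a Pydantic config class plus an adapter that implements the Inference protocol.
from typing import Any, AsyncIterator

from pydantic import BaseModel


class MyProviderConfig(BaseModel):
    # Fields map 1:1 to the provider's `config:` block in run.yaml.
    base_url: str
    api_key: str


class MyInferenceAdapter:
    def __init__(self, config: MyProviderConfig) -> None:
        self.config = config

    async def initialize(self) -> None:
        # e.g. open an HTTP session, verify credentials
        ...

    async def post_chat_completion(self, model: str, **kwargs: Any) -> AsyncIterator[Any]:
        # Translate the unified request into this provider's wire format,
        # call the remote service, and yield chunks back in the unified shape.
        raise NotImplementedError
```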
### Adding a Custom Distribution

1. Create a subdirectory in `distributions/[name]/`
2. Implement the template in `[name].py`
3. Register it in distribution discovery

---
## Common Misconceptions Clarified

### "APIs are HTTP endpoints"

**Wrong** - APIs are Python protocols. HTTP comes later via FastAPI.

- The "Inference" API is just a Python Protocol
- Providers implement it
- The core wraps it with HTTP for server mode
- Library mode uses it directly

### "Providers are all external services"

**Wrong** - Providers can be:

- Inline (local execution): Meta Reference, FAISS, Llama Guard
- Remote (external services): OpenAI, Ollama, Qdrant

Inline providers have low latency and no dependency on external services.

### "You must run a server"

**Wrong** - Two modes:

- Server mode: `llama stack run starter` (HTTP)
- Library mode: import and use directly in Python

### "Distributions are just Docker images"

**Wrong** - Distributions are:

- Templates (what providers to use)
- Configs (how to configure them)
- Dependencies (what to install)
- Can be Docker OR local Python

---
## Performance Implications

### Inline Providers Are Fast

```
Inline (e.g., Meta Reference)
├─ 0ms network latency
├─ No HTTP serialization/deserialization
├─ Direct GPU access
└─ Fast (but high resource cost)

Remote (e.g., OpenAI)
├─ 100-500ms network latency
├─ HTTP serialization overhead
├─ Low resource cost
└─ Slower (but cheap)
```

### Streaming Is Native

```python
response = await inference.post_chat_completion(model=..., stream=True)
async for chunk in response:
    print(chunk.delta)  # Process token by token
```

Tokens arrive as they're generated; there is no waiting for the full response.

---
## Security Considerations

### API Keys Are Config

```yaml
inference:
  - provider_id: openai
    config:
      api_key: ${env.OPENAI_API_KEY}  # From environment
```

Keys are never hardcoded; they always come from environment variables.

### Access Control

**File**: `core/access_control/`

Providers can implement access rules (a toy example follows):

- Per-user restrictions
- Per-model restrictions
- Rate limiting
- Audit logging
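A toy sketch of a per-user, per-model rule check; the `AccessRule` shape and `is_allowed` helper are entirely illustrative and not taken from `core/access_control/`:

```python
# Toy access-control check: allow a request only if some rule matches.
from dataclasses import dataclass


@dataclass
class AccessRule:
    user: str
    allowed_models: set[str]


def is_allowed(rules: list[AccessRule], user: str, model: str) -> bool:
    """Return True if any rule grants `user` access to `model`."""
    return any(rule.user == user and model in rule.allowed_models for rule in rules)


rules = [AccessRule(user="alice", allowed_models={"llama-2-7b"})]
assert is_allowed(rules, "alice", "llama-2-7b")
assert not is_allowed(rules, "bob", "llama-2-7b")
```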
### Sensitive Field Redaction

Config logging automatically redacts (see the sketch below):

- API keys
- Passwords
- Tokens
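A minimal sketch of how such redaction can work; the `SENSITIVE_KEYS` set and `redact_config` helper are illustrative, not the actual redaction code:

```python
# Illustrative config redaction before logging.
SENSITIVE_KEYS = {"api_key", "password", "token"}


def redact_config(config: dict) -> dict:
    """Return a copy of `config` with sensitive values masked, recursively."""
    redacted = {}
    for key, value in config.items():
        if isinstance(value, dict):
            redacted[key] = redact_config(value)
        elif key.lower() in SENSITIVE_KEYS:
            redacted[key] = "********"
        else:
            redacted[key] = value
    return redacted


print(redact_config({"provider": "openai", "config": {"api_key": "sk-secret"}}))
# {'provider': 'openai', 'config': {'api_key': '********'}}
```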
---
## Maturity Indicators

### Signs of Production-Ready Design

1. **Separated Concerns** - APIs, Providers, Distributions
2. **Plugin Architecture** - Easy to extend
3. **Configuration Over Code** - Deploy without recompiling
4. **Comprehensive Testing** - Unit + integration with record-replay
5. **Multiple Client Options** - Library + server modes
6. **Storage Abstraction** - Multiple backends
7. **Dependency Management** - Automatic resolution
8. **Error Handling** - Structured, informative errors
9. **Observability** - Built-in telemetry
10. **Documentation** - Distributions + CLI introspection

Llama Stack has all 10!

---
## Key Architectural Decisions

### Why Async/Await Throughout?

- Modern Python standard
- Works well with streaming
- Natural for I/O-heavy operations (API calls, GPU operations)

### Why Pydantic for Config?

- Type validation
- Auto-documentation
- JSON schema generation
- Easy serialization (all sketched below)
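A small illustration of those properties with Pydantic v2; the `OpenAIConfig` model here is a made-up example, not a real provider config class:

```python
# Illustrative Pydantic v2 config model: validation, schema, serialization.
from pydantic import BaseModel, ValidationError


class OpenAIConfig(BaseModel):
    api_key: str
    base_url: str = "https://api.openai.com/v1"
    timeout_seconds: int = 30


cfg = OpenAIConfig(api_key="sk-...")              # defaults filled in, types validated
print(cfg.model_dump())                            # easy serialization to a plain dict
print(OpenAIConfig.model_json_schema()["title"])   # JSON schema for documentation

try:
    OpenAIConfig(api_key="sk-...", timeout_seconds="not a number")
except ValidationError as exc:
    print("rejected bad config:", exc.error_count(), "error(s)")
```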
### Why Protocol Classes for APIs?

- Define interface without implementation
- Multiple implementations possible
- Type hints work with duck typing
- Minimal magic

### Why YAML for Config?

- Human readable
- Environment variable support
- Comments allowed
- Wide tool support

### Why Record-Replay for Tests?

- Cost efficient
- Deterministic
- Real behavior captured
- Provider-agnostic

---
## The Learning Path for Contributors

### Understanding Order

1. **Start**: `pyproject.toml` - Entry point
2. **Learn**: `core/datatypes.py` - Data structures
3. **Understand**: `apis/inference/inference.py` - Example API
4. **See**: `providers/registry/inference.py` - Provider registry
5. **Read**: `providers/inline/inference/meta_reference/` - Inline provider
6. **Read**: `providers/remote/inference/openai/` - Remote provider
7. **Study**: `core/resolver.py` - How it all connects
8. **Understand**: `core/stack.py` - Main orchestrator
9. **See**: `distributions/starter/` - How to use it
10. **Run**: `tests/integration/` - How to test

Each step builds on previous understanding.

---
## The Elegant Parts

### Most Elegant: The Router

The router system is beautiful:

- Transparent to users
- Automatic provider selection
- Works with 1 or 100 providers
- No hardcoding needed

### Most Flexible: YAML Config

Configuration as a first-class citizen:

- Switch providers without code
- Override at runtime
- Version-control friendly
- Documentation via config

### Most Useful: Record-Replay Tests

The testing pattern solves real problems:

- Cost
- Speed
- Reliability
- Coverage

### Most Scalable: Distribution Templates

Pre-configured bundles:

- One command to start
- Verified combinations
- Easy to document
- Simple to teach

---
## The Future

### What's Being Built

- More providers (Nvidia, SambaNova, etc.)
- More APIs (more task types)
- On-device execution (ExecuTorch)
- Better observability (more telemetry)
- Easier extensions (simpler API for custom providers)

### How It Stays Maintainable

- Protocol-based design limits coupling
- Clear separation of concerns
- Comprehensive testing
- Configuration over code
- Plugin architecture

The architecture is **future-proof** by design.