# Llama Stack Architecture - Comprehensive Overview

## Executive Summary

Llama Stack is a framework for building AI applications with Llama models. It provides a **unified API layer** with a **plugin architecture for providers**, so developers can switch between local and cloud-hosted implementations without changing application code. The system is organized around three pillars: APIs (abstract interfaces), Providers (concrete implementations), and Distributions (pre-configured bundles).

---

## 1. Core Architecture Philosophy

### Separation of Concerns
- **APIs**: Define abstract interfaces for functionality (e.g., Inference, Safety, VectorIO)
- **Providers**: Implement those interfaces (inline for local, remote for external services)
- **Distributions**: Pre-configure and bundle providers for specific deployment scenarios

### Key Design Patterns
- **Plugin Architecture**: Dynamically load providers based on configuration
- **Dependency Injection**: Providers declare dependencies on other APIs/providers
- **Routing**: Smart routing directs requests to appropriate provider implementations
- **Configuration-Driven**: YAML-based configuration enables flexibility without code changes

---
## 2. Directory Structure (`llama_stack/`)

```
llama_stack/
├── apis/                     # Abstract API definitions (27 APIs total)
│   ├── inference/            # LLM inference interface
│   ├── agents/               # Agent orchestration
│   ├── safety/               # Content filtering & safety
│   ├── vector_io/            # Vector database operations
│   ├── tools/                # Tool/function calling runtime
│   ├── scoring/              # Response scoring
│   ├── eval/                 # Evaluation framework
│   ├── post_training/        # Fine-tuning & training
│   ├── datasetio/            # Dataset loading/management
│   ├── conversations/        # Conversation management
│   ├── common/               # Shared datatypes (SamplingParams, etc.)
│   └── [22 more...]          # Models, Shields, Benchmarks, etc.
│
├── providers/                # Provider implementations (inline & remote)
│   ├── inline/               # In-process implementations
│   │   ├── inference/        # Meta Reference, Sentence Transformers
│   │   ├── agents/           # Agent orchestration implementations
│   │   ├── safety/           # Llama Guard, Code Scanner
│   │   ├── vector_io/        # FAISS, SQLite-vec, Milvus
│   │   ├── post_training/    # TorchTune
│   │   ├── eval/             # Evaluation implementations
│   │   ├── tool_runtime/     # RAG runtime, MCP protocol
│   │   └── [more...]
│   │
│   ├── remote/               # External service adapters
│   │   ├── inference/        # OpenAI, Anthropic, Groq, Ollama, vLLM, TGI, etc.
│   │   ├── vector_io/        # ChromaDB, Qdrant, Weaviate, Postgres
│   │   ├── safety/           # Bedrock, SambaNova, Nvidia
│   │   ├── agents/           # Sample implementations
│   │   ├── tool_runtime/     # Brave Search, Tavily, Wolfram Alpha
│   │   └── [more...]
│   │
│   ├── registry/             # Provider discovery/registration (inference.py, agents.py, etc.)
│   │   └── [One file per API with all providers for that API]
│   │
│   ├── utils/                # Shared provider utilities
│   │   ├── inference/        # Embedding mixin, OpenAI compat
│   │   ├── kvstore/          # Key-value store abstractions
│   │   ├── sqlstore/         # SQL storage abstractions
│   │   ├── telemetry/        # Tracing, metrics
│   │   └── [more...]
│   │
│   └── datatypes.py          # ProviderSpec, InlineProviderSpec, RemoteProviderSpec
│
├── core/                     # Core runtime & orchestration
│   ├── stack.py              # Main LlamaStack class (implements all APIs)
│   ├── datatypes.py          # Config models (StackRunConfig, Provider, etc.)
│   ├── resolver.py           # Provider resolution & dependency injection
│   ├── library_client.py     # In-process client for library usage
│   ├── build.py              # Distribution building
│   ├── configure.py          # Configuration handling
│   ├── distribution.py       # Distribution management
│   ├── routers/              # Auto-routed API implementations (route inferred from routing key)
│   ├── routing_tables/       # Manual routing tables (e.g., Models, Shields, VectorStores)
│   ├── server/               # FastAPI HTTP server setup
│   ├── storage/              # Backend storage abstractions (KVStore, SqlStore)
│   ├── utils/                # Config resolution, dynamic imports
│   └── conversations/        # Conversation service implementation
│
├── cli/                      # Command-line interface
│   ├── llama.py              # Main entry point
│   └── stack/                # Stack management commands
│       ├── run.py            # Start a distribution
│       ├── list_apis.py      # List available APIs
│       ├── list_providers.py # List providers
│       ├── list_deps.py      # List dependencies
│       └── [more...]
│
├── distributions/            # Pre-configured distribution templates
│   ├── starter/              # CPU-friendly multi-provider starter
│   ├── starter-gpu/          # GPU-optimized starter
│   ├── meta-reference-gpu/   # Full-featured Meta reference
│   ├── postgres-demo/        # PostgreSQL-based demo
│   ├── template.py           # Distribution template base class
│   └── [more...]
│
├── models/                   # Llama model implementations
│   └── llama/
│       ├── llama3/           # Llama 3 implementation
│       ├── llama4/           # Llama 4 implementation
│       ├── sku_list.py       # Model registry (maps model IDs to implementations)
│       ├── checkpoint.py     # Model checkpoint handling
│       ├── datatypes.py      # ToolDefinition, StopReason, etc.
│       └── [more...]
│
├── testing/                  # Testing utilities
│   └── api_recorder.py       # Record/replay infrastructure for integration tests
│
└── ui/                       # Web UI
    ├── app/
    ├── components/
    ├── pages/
    └── [React/TypeScript frontend]
```

---
## 3. API Layer (27 APIs)

### What is an API?
Each API is an abstract **protocol** (a Python `Protocol` class) that defines an interface. APIs are located in `llama_stack/apis/` with a structure like:

```
apis/inference/
├── __init__.py       # Exports the Inference protocol
├── inference.py      # Full API definition (300+ lines)
└── event_logger.py   # Supporting types
```
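To make the protocol idea concrete, here is a minimal sketch of what an API definition of this shape can look like. The method name mirrors the Inference methods listed below, but the message and chunk types are invented for illustration; the real types and signatures in `inference.py` are richer.

```python
# Hypothetical, simplified sketch of a protocol-style API definition.
# The real Inference protocol in llama_stack/apis/inference/inference.py
# has more methods and richer request/response types.
from collections.abc import AsyncIterator
from typing import Protocol

from pydantic import BaseModel


class ChatMessage(BaseModel):
    role: str      # "user", "assistant", "system"
    content: str


class ChatCompletionChunk(BaseModel):
    delta: str     # incremental text for streaming responses


class Inference(Protocol):
    """Abstract interface; providers implement it inline or remotely."""

    def post_chat_completion(
        self,
        model: str,
        messages: list[ChatMessage],
        stream: bool = True,
    ) -> AsyncIterator[ChatCompletionChunk]:
        """Implementations are typically async generators yielding chunks."""
        ...
```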
### Key APIs

#### Core Inference API
- **Path**: `llama_stack/apis/inference/inference.py`
- **Methods**: `post_chat_completion()`, `post_completion()`, `post_embedding()`, `get_models()`
- **Types**: `SamplingParams`, `SamplingStrategy` (greedy/top-p/top-k), `OpenAIChatCompletion`
- **Providers**: 30+ (OpenAI, Claude, Ollama, vLLM, TGI, Fireworks, etc.)

#### Agents API
- **Path**: `llama_stack/apis/agents/agents.py`
- **Methods**: `create_agent()`, `update_agent()`, `create_session()`, `agentic_loop_turn()`
- **Features**: Multi-turn conversations, tool calling, streaming
- **Providers**: Meta Reference (inline), Fireworks, Together

#### Safety API
- **Path**: `llama_stack/apis/safety/safety.py`
- **Methods**: `run_shields()` - filter content before/after inference
- **Providers**: Llama Guard (inline), AWS Bedrock, SambaNova, Nvidia

#### Vector IO API
- **Path**: `llama_stack/apis/vector_io/vector_io.py`
- **Methods**: `insert()`, `query()`, `delete()` - vector database operations
- **Providers**: FAISS, SQLite-vec, Milvus (inline), ChromaDB, Qdrant, Weaviate, PG Vector (remote)

#### Tools / Tool Runtime API
- **Path**: `llama_stack/apis/tools/tool_runtime.py`
- **Methods**: `execute_tool()` - execute functions during agent loops
- **Providers**: RAG runtime (inline), Brave Search, Tavily, Wolfram Alpha, Model Context Protocol

#### Other Major APIs
- **Post Training**: Fine-tuning & model training (HuggingFace, TorchTune, Nvidia)
- **Eval**: Evaluation frameworks (Meta Reference with autoevals)
- **Scoring**: Response scoring (Basic, LLM-as-Judge, Braintrust)
- **Datasets**: Dataset management
- **DatasetIO**: Dataset loading from HuggingFace, Nvidia, local files
- **Conversations**: Multi-turn conversation state management
- **Vector Stores**: Vector store metadata & configuration
- **Shields**: Shield (safety filter) registry
- **Models**: Model registry management
- **Batches**: Batch processing
- **Prompts**: Prompt templates & management
- **Telemetry**: Tracing & metrics collection
- **Inspect**: Introspection & debugging

---
## 4. Provider System

### Provider Types

#### 1. **Inline Providers** (`InlineProviderSpec`)
- Run in-process (same Python process as the server)
- High performance, low latency
- No network overhead
- Heavier resource requirements
- Examples: Meta Reference (inference), Llama Guard (safety), FAISS (vector IO)

**Structure**:
```python
InlineProviderSpec(
    api=Api.inference,
    provider_type="inline::meta-reference",
    module="llama_stack.providers.inline.inference.meta_reference",
    config_class="...MetaReferenceInferenceConfig",
    pip_packages=[...],
    container_image="..."  # Optional for containerization
)
```

#### 2. **Remote Providers** (`RemoteProviderSpec`)
- Connect to external services via HTTP/API
- Lower resource requirements
- Network latency
- Cloud-based (OpenAI, Anthropic, Groq) or self-hosted (Ollama, vLLM, Qdrant)
- Examples: OpenAI, Anthropic, Groq, Ollama, Qdrant, ChromaDB

**Structure**:
```python
RemoteProviderSpec(
    api=Api.inference,
    adapter_type="openai",
    provider_type="remote::openai",
    module="llama_stack.providers.remote.inference.openai",
    config_class="...OpenAIInferenceConfig",
    pip_packages=[...]
)
```

### Provider Registration

Providers are registered in **registry files** (`llama_stack/providers/registry/`):
- `inference.py` - All inference providers (30+)
- `agents.py` - All agent providers
- `safety.py` - All safety providers
- `vector_io.py` - All vector IO providers
- `tool_runtime.py` - All tool runtime providers
- [etc.]

Each registry file has an `available_providers()` function returning a list of `ProviderSpec`.
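As an illustration of that registry shape, here is a hedged sketch of what an `available_providers()` function could look like. The spec fields mirror the `InlineProviderSpec`/`RemoteProviderSpec` examples above, but the import path, fully qualified config classes, and `pip_packages` values are assumptions; the real entries in `llama_stack/providers/registry/inference.py` are far more numerous and may use different arguments.

```python
# Hypothetical, trimmed-down registry file for the Inference API.
# Import path follows the providers/datatypes.py layout described above.
from llama_stack.providers.datatypes import (
    Api,
    InlineProviderSpec,
    ProviderSpec,
    RemoteProviderSpec,
)


def available_providers() -> list[ProviderSpec]:
    # One entry per provider; the resolver picks the ones named in run.yaml.
    return [
        InlineProviderSpec(
            api=Api.inference,
            provider_type="inline::meta-reference",
            module="llama_stack.providers.inline.inference.meta_reference",
            config_class="llama_stack.providers.inline.inference.meta_reference.MetaReferenceInferenceConfig",
            pip_packages=["torch"],          # illustrative dependency list
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="ollama",
            provider_type="remote::ollama",
            module="llama_stack.providers.remote.inference.ollama",
            config_class="llama_stack.providers.remote.inference.ollama.OllamaImplConfig",
            pip_packages=["ollama"],         # illustrative dependency list
        ),
    ]
```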
### Provider Config

Each provider has a config class (e.g., `MetaReferenceInferenceConfig`):
```python
class MetaReferenceInferenceConfig(BaseModel):
    max_batch_size: int = 1
    enable_pydantic_sampling: bool = True

    # sample_run_config() - provides default values for testing
    # pip_packages() - lists dependencies
```

### Provider Implementation

Inline providers look like:
```python
class MetaReferenceInferenceImpl(InferenceProvider):
    async def post_chat_completion(
        self,
        model: str,
        request: OpenAIChatCompletionRequestWithExtraBody,
    ) -> AsyncIterator[OpenAIChatCompletionChunk]:
        # Load the model, run inference, yield streaming results
        ...
```

Remote providers implement HTTP adapters:
```python
class OllamaInferenceImpl(InferenceProvider):
    async def post_chat_completion(self, model, request):
        # Make HTTP requests to the Ollama server
        ...
```

---
## 5. Core Runtime & Resolution

### Stack Resolution Process

**File**: `llama_stack/core/resolver.py`

The resolver proceeds in five steps (steps 2-3 are sketched below):
1. **Load Configuration** → Parse `run.yaml` with enabled providers
2. **Resolve Dependencies** → Build dependency graph (e.g., agents may depend on inference)
3. **Instantiate Providers** → Create provider instances with configs
4. **Create Router/Routed Impls** → Set up request routing
5. **Register Resources** → Register models, shields, datasets, etc.
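Dependency resolution amounts to instantiating providers so that each one's dependencies already exist when it is constructed. Below is a hedged sketch of that idea using a simple topological sort; it is not the code in `llama_stack/core/resolver.py`, and the `deps`/`factories` shapes are assumptions made for illustration.

```python
# Illustrative only: build providers in dependency order, roughly what
# a resolver must do before wiring up routers and registering resources.
from graphlib import TopologicalSorter
from typing import Any, Callable

# api name -> list of api names it depends on (assumed example data)
deps: dict[str, list[str]] = {
    "inference": [],
    "tool_runtime": [],
    "safety": ["inference"],
    "agents": ["inference", "safety", "tool_runtime"],
}

# api name -> factory that receives its already-built dependencies (assumed)
factories: dict[str, Callable[[dict[str, Any]], Any]] = {
    api: (lambda resolved, api=api: f"<{api} impl>") for api in deps
}

impls: dict[str, Any] = {}
for api in TopologicalSorter(deps).static_order():
    # Pass only the dependencies this provider declared.
    resolved = {dep: impls[dep] for dep in deps[api]}
    impls[api] = factories[api](resolved)

print(impls)  # inference and tool_runtime come first, agents last
```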
### The LlamaStack Class

**File**: `llama_stack/core/stack.py`

```python
class LlamaStack(
    Providers,   # Meta API for provider management
    Inference,   # LLM inference
    Agents,      # Agent orchestration
    Safety,      # Content safety
    VectorIO,    # Vector operations
    Tools,       # Tool runtime
    Eval,        # Evaluation
    # ... 15 more APIs ...
):
    pass
```

This class **inherits from all APIs**, so a single `LlamaStack` instance supports all functionality.

### Two Client Modes

#### 1. **Library Client** (In-Process)
```python
from llama_stack import AsyncLlamaStackAsLibraryClient

client = await AsyncLlamaStackAsLibraryClient.create(run_config)
response = await client.inference.post_chat_completion(...)
```
**File**: `llama_stack/core/library_client.py`

#### 2. **Server Client** (HTTP)
```python
from llama_stack_client import AsyncLlamaStackClient

client = AsyncLlamaStackClient(base_url="http://localhost:8321")
response = await client.inference.post_chat_completion(...)
```
Uses the separate `llama-stack-client` package.

---

## 6. Request Routing

### Two Routing Strategies

#### 1. **Auto-Routed APIs** (e.g., Inference, Safety, VectorIO)
- The routing key (a resource ID such as a model name) maps to a provider instance
- The router automatically selects a provider based on that resource ID (see the sketch after this list)
- **Implementation**: `AutoRoutedProviderSpec` → `routers/` directory

```python
# inference.post_chat_completion(model_id="meta-llama/Llama-2-7b")
# Router selects provider based on which provider has that model
```

**Routed APIs**:
- Inference, Safety, VectorIO, DatasetIO, Scoring, Eval, ToolRuntime
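To make auto-routing concrete, here is a minimal sketch of a router that keeps a routing table from model ID to provider and forwards calls accordingly. This is an illustration of the pattern, not the implementation in `core/routers/`; the class name and `register_model` helper are assumptions, and only `post_chat_completion` comes from the API described above.

```python
# Illustrative auto-router: delegate to the provider that owns the routing key.
from typing import Any


class InferenceRouter:
    def __init__(self) -> None:
        # routing key (model ID) -> provider implementation
        self._routing_table: dict[str, Any] = {}

    def register_model(self, model_id: str, provider: Any) -> None:
        self._routing_table[model_id] = provider

    async def post_chat_completion(self, model_id: str, request: Any) -> Any:
        try:
            provider = self._routing_table[model_id]
        except KeyError:
            raise ValueError(f"No provider registered for model {model_id!r}")
        # Forward the call to whichever provider registered this model.
        return await provider.post_chat_completion(model_id, request)
```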
#### 2. **Routing Table APIs** (e.g., Models, Shields, VectorStores)
- Registry APIs that list/register resources
- **Implementation**: `RoutingTableProviderSpec` → `routing_tables/` directory

```python
# models.list_models() → merged list from all providers
# models.register_model(...) → router selects provider
```

**Registry APIs**:
- Models, Shields, VectorStores, Datasets, ScoringFunctions, Benchmarks, ToolGroups

---
## 7. Distributions

### What is a Distribution?

A **Distribution** is a pre-configured, verified bundle of providers for a specific deployment scenario.

**File**: `llama_stack/distributions/template.py` (base) → specific distros in subdirectories

### Example: Starter Distribution

**File**: `llama_stack/distributions/starter/starter.py`

```python
def get_distribution_template(name: str = "starter"):
    providers = {
        "inference": [
            "remote::ollama",
            "remote::vllm",
            "remote::openai",
            # ... others ...
        ],
        "vector_io": [
            "inline::faiss",
            "inline::sqlite-vec",
            "remote::qdrant",
            # ... others ...
        ],
        "safety": [
            "inline::llama-guard",
            "inline::code-scanner",
        ],
        # ... other APIs ...
    }
    return DistributionTemplate(
        name="starter",
        providers=providers,
        run_configs={
            "run.yaml": RunConfigSettings(...),
        },
    )
```

### Built-in Distributions

1. **starter**: CPU-only, multi-provider (Ollama, OpenAI, etc.)
2. **starter-gpu**: GPU-optimized version
3. **meta-reference-gpu**: Full Meta reference implementation
4. **postgres-demo**: PostgreSQL-backed version
5. **watsonx**: IBM watsonx integration
6. **nvidia**: NVIDIA-specific optimizations
7. **open-benchmark**: For benchmarking

### Distribution Lifecycle

```
llama stack run starter
        ↓
Resolve starter distribution template
        ↓
Merge with run.yaml config & environment variables
        ↓
Build/install dependencies (if needed)
        ↓
Start HTTP server (Uvicorn)
        ↓
Initialize all providers
        ↓
Register resources (models, shields, etc.)
        ↓
Ready for requests
```

---
## 8. CLI Architecture

**File**: `llama_stack/cli/`

### Entry Point

```bash
$ llama [subcommand] [args]
```

Maps to **pyproject.toml**:
```toml
[project.scripts]
llama = "llama_stack.cli.llama:main"
```

### Subcommands

```
llama stack [command]
├── run [distro|config] [--port PORT]   # Start a distribution
├── list-deps [distro]                  # Show dependencies to install
├── list-apis                           # Show all APIs
├── list-providers                      # Show all providers
└── list [NAME]                         # Show distributions
```

**Architecture**:
- `llama.py` - Main parser with subcommands
- `stack/stack.py` - Stack subcommand router
- `stack/run.py` - Implementation of `llama stack run`
- `stack/list_deps.py` - Dependency resolution & display

---
## 9. Testing Architecture

**Location**: `tests/` directory

### Test Types

#### 1. **Unit Tests** (`tests/unit/`)
- Fast, isolated component testing
- Mock external dependencies
- **Run with**: `uv run --group unit pytest tests/unit/`
- **Examples**:
  - `core/test_stack_validation.py` - Config validation
  - `distribution/test_distribution.py` - Distribution loading
  - `core/routers/test_vector_io.py` - Routing logic

#### 2. **Integration Tests** (`tests/integration/`)
- End-to-end workflows
- **Record-replay pattern**: Record real API responses once, replay for fast/cheap testing
- **Run with**: `uv run --group test pytest tests/integration/ --stack-config=starter`
- **Structure**:
  ```
  tests/integration/
  ├── agents/
  │   ├── test_agents.py
  │   ├── test_persistence.py
  │   └── cassettes/        # Recorded API responses (YAML)
  ├── inference/
  ├── safety/
  ├── vector_io/
  └── [more...]
  ```

### Record-Replay System

**File**: `llama_stack/testing/api_recorder.py`

**Benefits**:
- **Cost control**: Record real API calls once, replay thousands of times
- **Speed**: Cached responses = instant test execution
- **Reliability**: Deterministic results (no API variability)
- **Provider coverage**: The same test works with OpenAI, Anthropic, Ollama, etc.

**How it works** (sketched below):
1. First run (with `LLAMA_STACK_TEST_INFERENCE_MODE=record`): real API calls are saved to YAML
2. Subsequent runs: load the YAML and return matching responses
3. CI automatically re-records when needed
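As a rough illustration of the record-replay idea (not the actual `api_recorder.py` implementation), the sketch below caches responses keyed by a hash of the request and switches behavior on the environment variable mentioned above. The file layout, key format, function names, and use of JSON (the real cassettes are YAML per the structure shown earlier) are all simplifying assumptions.

```python
# Illustrative record/replay cache for API calls, keyed by request hash.
import hashlib
import json
import os
from pathlib import Path
from typing import Any, Callable

CASSETTE_DIR = Path("cassettes")  # assumed location for recorded responses


def _key(request: dict[str, Any]) -> str:
    canonical = json.dumps(request, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def call_with_recording(
    request: dict[str, Any],
    do_real_call: Callable[[dict[str, Any]], Any],
) -> Any:
    mode = os.environ.get("LLAMA_STACK_TEST_INFERENCE_MODE", "replay")
    path = CASSETTE_DIR / f"{_key(request)}.json"

    if mode == "record":
        response = do_real_call(request)           # hit the real provider once
        CASSETTE_DIR.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(response))
        return response

    # replay: return the previously recorded response, deterministically
    return json.loads(path.read_text())
```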
### Test Organization

- **Common utilities**: `tests/common/`
- **External provider tests**: `tests/external/` (test external APIs)
- **Container tests**: `tests/containers/` (test Docker integration)
- **Conftest**: pytest fixtures in each directory

---
## 10. Key Design Patterns

### Pattern 1: Protocol-Based Abstraction
```python
# API definition (protocol)
class Inference(Protocol):
    async def post_chat_completion(self, request) -> AsyncIterator[...]: ...

# Provider implementation
class InferenceProvider:
    async def post_chat_completion(self, request): ...
```

### Pattern 2: Dependency Injection
```python
class AgentProvider:
    def __init__(self, inference: InferenceProvider, safety: SafetyProvider):
        self.inference = inference
        self.safety = safety
```

### Pattern 3: Configuration-Driven Instantiation
```yaml
# run.yaml
agents:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      max_depth: 5
```

### Pattern 4: Routing by Resource
```python
# Request: inference.post_chat_completion(model="llama-2-7b")
# Router finds which provider has "llama-2-7b" and routes there
```

### Pattern 5: Registry Pattern for Resources
```python
# Register at startup
await models.register_model(Model(
    identifier="llama-2-7b",
    provider_id="inference::meta-reference",
    # ...
))

# Later, query or filter
models_list = await models.list_models()
```

---
## 11. Configuration Management

### Config Files

#### 1. **run.yaml** - Runtime Configuration
Location: `~/.llama/distributions/{name}/run.yaml`

```yaml
version: 2
providers:
  inference:
    - provider_id: ollama
      provider_type: remote::ollama
      config:
        host: localhost
        port: 11434
  safety:
    - provider_id: llama-guard
      provider_type: inline::llama-guard
      config: {}
default_models:
  - identifier: llama-2-7b
    provider_id: ollama
vector_stores_config:
  default_provider_id: faiss
```

#### 2. **build.yaml** - Build Configuration
Specifies which providers to install.

#### 3. Environment Variables
Override config values at runtime:
```bash
INFERENCE_MODEL=llama-2-70b SAFETY_MODEL=llama-guard llama stack run starter
```

### Config Resolution

**File**: `llama_stack/core/utils/config_resolution.py`

Order of precedence (the merge is sketched below):
1. Environment variables (highest)
2. Runtime config (run.yaml)
3. Distribution template defaults (lowest)
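Here is a hedged sketch of that precedence expressed as a dictionary merge; it only illustrates the ordering described above, not the real logic in `config_resolution.py`, and the key names and the `LLAMA_STACK_PORT` variable are assumptions.

```python
# Illustrative precedence merge: later sources override earlier ones.
import os

template_defaults = {"inference_model": "llama-2-7b", "port": 8321}  # lowest
run_yaml_config = {"inference_model": "llama-2-13b"}                 # middle

# Highest: environment variables (only the keys that are actually set).
env_overrides = {
    key: os.environ[var]
    for key, var in {
        "inference_model": "INFERENCE_MODEL",
        "port": "LLAMA_STACK_PORT",
    }.items()
    if var in os.environ
}

effective = {**template_defaults, **run_yaml_config, **env_overrides}
print(effective)  # env beats run.yaml, which beats the template defaults
```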
---

## 12. Extension Points for Developers

### Adding a Custom Provider

1. **Create provider module**:
   ```
   llama_stack/providers/remote/inference/my_provider/
   ├── __init__.py
   ├── config.py         # MyProviderConfig
   └── my_provider.py    # MyProviderImpl(InferenceProvider)
   ```

2. **Register in registry**:
   ```python
   # llama_stack/providers/registry/inference.py
   RemoteProviderSpec(
       api=Api.inference,
       adapter_type="my_provider",
       provider_type="remote::my_provider",
       config_class="...MyProviderConfig",
       module="llama_stack.providers.remote.inference.my_provider",
   )
   ```

3. **Use in a distribution**:
   ```yaml
   providers:
     inference:
       - provider_id: my_provider
         provider_type: remote::my_provider
         config: {...}
   ```

### Adding a Custom API

1. Define the protocol in `llama_stack/apis/my_api/my_api.py`
2. Implement providers
3. Register in the resolver and distributions
4. Add CLI support if needed

---
## 13. Storage & Persistence

### Storage Backends

**File**: `llama_stack/core/storage/datatypes.py`

#### KV Store (Key-Value)
- Stores metadata: models, shields, vector stores (interface sketched below)
- Backends: SQLite (inline), Redis, Postgres

#### SQL Store
- Stores structured data: conversations, datasets
- Backends: SQLite (inline), Postgres

#### Inference Store
- Caches inference results for recording/replay
- Used in testing
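To illustrate what a key-value storage abstraction of this kind typically looks like, here is a minimal sketch with one in-memory backend. It is not the actual interface in `llama_stack/core/storage/` or `providers/utils/kvstore/`; the method names and types are assumptions.

```python
# Illustrative key-value store abstraction with a trivial in-memory backend.
from typing import Optional, Protocol


class KVStore(Protocol):
    async def get(self, key: str) -> Optional[str]: ...
    async def set(self, key: str, value: str) -> None: ...
    async def delete(self, key: str) -> None: ...


class InMemoryKVStore:
    """Backend used only for illustration; real backends include SQLite, Redis, Postgres."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    async def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    async def set(self, key: str, value: str) -> None:
        self._data[key] = value

    async def delete(self, key: str) -> None:
        self._data.pop(key, None)
```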
### Storage Configuration

```yaml
storage:
  type: sqlite
  config:
    dir: ~/.llama/distributions/starter
```

---

## 14. Telemetry & Tracing

### Tracing System

**File**: `llama_stack/providers/utils/telemetry/`

- Automatic request tracing with OpenTelemetry
- Trace context propagation across async calls
- Integration with OpenTelemetry collectors

### Telemetry API

Providers can implement the Telemetry API to collect metrics (a tracing sketch follows below):
- Token usage
- Latency
- Error rates
- Custom metrics
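As a small illustration of the OpenTelemetry-style tracing mentioned above (not Llama Stack's own telemetry utilities), a provider call can be wrapped in a span and annotated with attributes; the span name, attribute keys, and the wrapper function are assumptions.

```python
# Illustrative tracing of an inference call with the OpenTelemetry API.
from opentelemetry import trace

tracer = trace.get_tracer("llama_stack.example")


async def traced_chat_completion(provider, model_id: str, request):
    # The span context propagates across awaits, so nested provider calls
    # show up as children of this span when exported to a collector.
    with tracer.start_as_current_span("inference.post_chat_completion") as span:
        span.set_attribute("model_id", model_id)
        response = await provider.post_chat_completion(model_id, request)
        span.set_attribute("completion.received", True)  # illustrative attribute
        return response
```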
---

## 15. Model System

### Model Registry

**File**: `llama_stack/models/llama/sku_list.py`

```python
resolve_model("meta-llama/Llama-2-7b")
# → Llama2Model(...)
```

Maps model IDs (sketched below) to their:
- Architecture
- Tokenizer
- Quantization options
- Required resources
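A hedged sketch of what such a registry lookup can look like follows; the dataclass fields and the example values are invented for illustration and do not mirror the real entries in `sku_list.py`.

```python
# Illustrative model registry: map a model ID to its metadata.
from dataclasses import dataclass


@dataclass
class ModelSku:
    model_id: str
    architecture: str
    tokenizer: str
    quantization_options: tuple[str, ...]
    min_gpu_memory_gb: int  # stand-in for "required resources"


_SKU_LIST = [
    ModelSku("meta-llama/Llama-2-7b", "llama2", "sentencepiece", ("int8", "int4"), 16),
    ModelSku("meta-llama/Llama-3.1-8B", "llama3", "tiktoken", ("int8",), 20),
]


def resolve_model(model_id: str) -> ModelSku | None:
    """Return the registry entry for a model ID, or None if unknown."""
    return next((sku for sku in _SKU_LIST if sku.model_id == model_id), None)
```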
### Supported Models

- **Llama 3** - Full architecture support
- **Llama 3.1** - Extended context
- **Llama 3.2** - Multimodal support
- **Llama 4** - Latest generation
- **Custom models** - Via provider registration

### Model Quantization

- int8, int4
- GPTQ
- Hadamard transform
- Custom quantizers

---
## 16. Key Files to Understand

### For Understanding Core Concepts
1. `llama_stack/core/datatypes.py` - Configuration data types
2. `llama_stack/providers/datatypes.py` - Provider specs
3. `llama_stack/apis/inference/inference.py` - Example API

### For Understanding Runtime
1. `llama_stack/core/stack.py` - Main runtime class
2. `llama_stack/core/resolver.py` - Dependency resolution
3. `llama_stack/core/library_client.py` - In-process client

### For Understanding Providers
1. `llama_stack/providers/registry/inference.py` - Inference provider registry
2. `llama_stack/providers/inline/inference/meta_reference/inference.py` - Example inline provider
3. `llama_stack/providers/remote/inference/openai/openai.py` - Example remote provider

### For Understanding Distributions
1. `llama_stack/distributions/template.py` - Distribution template
2. `llama_stack/distributions/starter/starter.py` - Starter distro
3. `llama_stack/cli/stack/run.py` - Distribution startup

---
## 17. Development Workflow

### Running Locally

```bash
# Install dependencies
uv sync --all-groups

# Run a distribution (auto-starts server)
llama stack run starter

# In another terminal, interact with it
curl http://localhost:8321/health
```

### Testing

```bash
# Unit tests (fast, no external dependencies)
uv run --group unit pytest tests/unit/

# Integration tests (with record-replay)
uv run --group test pytest tests/integration/ --stack-config=starter

# Re-record integration tests (record real API calls)
LLAMA_STACK_TEST_INFERENCE_MODE=record \
  uv run --group test pytest tests/integration/ --stack-config=starter
```

### Building Distributions

```bash
# Build the starter distribution
llama stack build starter --name my-starter

# Run it
llama stack run my-starter
```

---
## 18. Notable Implementation Details

### Async-First Architecture
- All I/O is async (using `asyncio`)
- Streaming responses with `AsyncIterator`
- FastAPI for the HTTP server (built on Starlette)

### Streaming Support
- Inference responses stream tokens
- Agents stream turn-by-turn updates
- Async context is preserved across the stream (see the sketch below)
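As a brief illustration of consuming such an `AsyncIterator` of streamed chunks, here is a hedged sketch. The `post_chat_completion` call mirrors the client examples earlier in this document, but the `stream` parameter and the chunk's `delta` attribute are assumptions.

```python
# Illustrative consumption of a streaming chat completion.
async def print_stream(client, model_id: str, messages) -> str:
    """Accumulate streamed deltas into the full response text."""
    full_text = []
    # The call returns an async iterator that yields chunks as tokens arrive.
    async for chunk in await client.inference.post_chat_completion(
        model=model_id, messages=messages, stream=True
    ):
        full_text.append(chunk.delta)          # attribute name is illustrative
        print(chunk.delta, end="", flush=True)
    return "".join(full_text)
```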
### Error Handling
- Structured errors with detailed messages
- Graceful degradation when dependencies are unavailable
- Provider health checks

### Extensibility
- External providers via module import
- Custom APIs via `ExternalApiSpec`
- Plugin discovery via the provider registry

---
## 19. Typical Request Flow

```
User Request (e.g., chat completion)
        ↓
CLI or SDK Client
        ↓
HTTP Request → FastAPI Server (port 8321)
        ↓
Route Handler (e.g., /inference/chat-completion)
        ↓
Router (Auto-Routed API)
  → Determine which provider has the model
        ↓
Provider Implementation (e.g., OpenAI, Ollama, Meta Reference)
        ↓
External Service or Local Execution
        ↓
Response (streaming or complete)
        ↓
Send back to Client
```
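Tracing that flow from the client side, a hedged end-to-end sketch using the HTTP client from section 5 might look like the following; the message format is an assumption and the exact method signature of the installed `llama-stack-client` package may differ from the `post_chat_completion` name used throughout this document.

```python
# Illustrative end-to-end request against a running distribution on port 8321.
import asyncio

from llama_stack_client import AsyncLlamaStackClient


async def main() -> None:
    client = AsyncLlamaStackClient(base_url="http://localhost:8321")

    # The server routes this to whichever provider serves the requested model.
    response = await client.inference.post_chat_completion(
        model="llama-2-7b",
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(response)


if __name__ == "__main__":
    asyncio.run(main())
```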
---

## 20. Key Takeaways

1. **Unified APIs**: Single abstraction for 27+ AI capabilities
2. **Pluggable Providers**: 50+ implementations (inline & remote)
3. **Configuration-Driven**: Switch providers via YAML, not code
4. **Distributions**: Pre-verified bundles for common scenarios
5. **Record-Replay Testing**: Cost-effective integration tests
6. **Two Client Modes**: Library (in-process) or HTTP (distributed)
7. **Smart Routing**: Automatic request routing to appropriate providers
8. **Async-First**: Native streaming and concurrent request handling
9. **Extensible**: Custom APIs and providers easily added
10. **Production-Ready**: Health checks, telemetry, access control, storage

---

## Architecture Diagram

```
┌───────────────────────────────────────────────┐
│              Client Applications              │
│        (CLI, SDK, Web UI, Custom Apps)        │
└───────────────────────┬───────────────────────┘
                        │
            ┌───────────┴────────────┐
            │                        │
    ┌───────▼───────┐        ┌───────▼───────┐
    │    Library    │        │  HTTP Server  │
    │    Client     │        │   (FastAPI)   │
    └───────┬───────┘        └───────┬───────┘
            │                        │
            └───────────┬────────────┘
                        │
             ┌──────────▼──────────┐
             │  LlamaStack Class   │
             │  (implements all    │
             │      27 APIs)       │
             └──────────┬──────────┘
                        │
         ┌──────────────┼──────────────┐
         │              │              │
      Routers      Routing Tables    Resource
   (auto-routed      (Models,       Registries
       APIs)         Shields)     (Models, etc.)
         │              │              │
         └──────────────┼──────────────┘
                        │
            ┌───────────┴────────────┐
            │                        │
    ┌───────▼───────┐        ┌───────▼───────┐
    │    Inline     │        │    Remote     │
    │   Providers   │        │   Providers   │
    │               │        │               │
    │ • Meta Ref    │        │ • OpenAI      │
    │ • FAISS       │        │ • Ollama      │
    │ • Llama Guard │        │ • Qdrant      │
    │ • etc.        │        │ • etc.        │
    └───────┬───────┘        └───────┬───────┘
            │                        │
     Local Execution         External Services
       (GPUs/CPUs)            (APIs/Servers)
```