docs: static docs

This commit is contained in:
Alexey Rybak 2025-09-22 09:54:15 -07:00
parent ee67381009
commit 2666122ff6
137 changed files with 14389 additions and 0 deletions

View file

@ -0,0 +1,118 @@
# Evaluation
The Evaluation API in Llama Stack allows you to run evaluation tasks on your GenAI applications and datasets. This section covers all available evaluation providers and their configuration.
## Overview
Llama Stack provides multiple evaluation providers:
- **Meta Reference** (`inline::meta-reference`) - Meta's reference implementation with multi-language support
- **NVIDIA** (`remote::nvidia`) - NVIDIA's evaluation platform integration
The Evaluation API works with several related APIs to provide comprehensive evaluation capabilities:
- `/datasetio` + `/datasets` API - Interface with datasets and data loaders
- `/scoring` + `/scoring_functions` API - Evaluate outputs of the system
- `/eval` + `/benchmarks` API - Generate outputs and perform scoring
:::tip
For conceptual information about evaluations, see our [Evaluation Concepts](../concepts/evaluation-concepts.mdx) guide.
:::
## Meta Reference
Meta's reference implementation of evaluation tasks with support for multiple languages and evaluation metrics.
### Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `kvstore` | `RedisKVStoreConfig \| SqliteKVStoreConfig \| PostgresKVStoreConfig \| MongoDBKVStoreConfig` | No | sqlite | Key-value store configuration |
### Sample Configuration
```yaml
kvstore:
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/meta_reference_eval.db
```
### Features
- Multi-language evaluation support
- Comprehensive evaluation metrics
- Integration with various key-value stores (SQLite, Redis, PostgreSQL, MongoDB)
- Built-in support for popular benchmarks
## NVIDIA
NVIDIA's evaluation provider for running evaluation tasks on NVIDIA's platform.
### Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `evaluator_url` | `str` | No | http://0.0.0.0:7331 | The url for accessing the evaluator service |
### Sample Configuration
```yaml
evaluator_url: ${env.NVIDIA_EVALUATOR_URL:=http://localhost:7331}
```
### Features
- Integration with NVIDIA's evaluation platform
- Remote evaluation capabilities
- Scalable evaluation processing
## Usage Example
Here's a basic example of using the evaluation API:
```python
from llama_stack_client import LlamaStackClient
client = LlamaStackClient(base_url="http://localhost:8321")
# Register a dataset for evaluation
client.datasets.register(
purpose="evaluation",
source={
"type": "uri",
"uri": "huggingface://datasets/llamastack/evaluation_dataset"
},
dataset_id="my_eval_dataset"
)
# Run evaluation
eval_result = client.eval.run_evaluation(
dataset_id="my_eval_dataset",
scoring_functions=["accuracy", "bleu"],
model_id="my_model"
)
print(f"Evaluation completed: {eval_result}")
```
## Supported Benchmarks
Llama Stack pre-registers several popular open-benchmarks for easy model evaluation:
- **MMLU-COT** - Measuring Massive Multitask Language Understanding with Chain of Thought
- **GPQA-COT** - Graduate-level Google-Proof Q&A with Chain of Thought
- **SimpleQA** - Short fact-seeking question benchmark
- **MMMU** - Multimodal understanding and reasoning benchmark
## Best Practices
- **Choose appropriate providers**: Use Meta Reference for comprehensive evaluation, NVIDIA for platform-specific needs
- **Configure storage properly**: Ensure your key-value store configuration matches your performance requirements
- **Monitor evaluation progress**: Large evaluations can take time - implement proper monitoring
- **Use appropriate scoring functions**: Select scoring metrics that align with your evaluation goals
## Next Steps
- Check out the [Evaluation Concepts](../concepts/evaluation-concepts.mdx) guide for detailed conceptual information
- See the [Building Applications - Evaluation](../building-applications/evals.mdx) guide for application examples
- Review the [Evaluation Reference](../references/evals-reference.mdx) for comprehensive CLI and API usage
- Explore the [Scoring](./scoring.mdx) documentation for available scoring functions

View file

@ -0,0 +1,305 @@
# Post-Training
Post-training in Llama Stack allows you to fine-tune models using various providers and frameworks. This section covers all available post-training providers and how to use them effectively.
## Overview
Llama Stack provides multiple post-training providers:
- **HuggingFace SFTTrainer** (`inline::huggingface`) - Fine-tuning using HuggingFace ecosystem
- **TorchTune** (`inline::torchtune`) - Fine-tuning using Meta's TorchTune framework
- **NVIDIA** (`remote::nvidia`) - Fine-tuning using NVIDIA's platform
## HuggingFace SFTTrainer
[HuggingFace SFTTrainer](https://huggingface.co/docs/trl/en/sft_trainer) is an inline post training provider for Llama Stack. It allows you to run supervised fine tuning on a variety of models using many datasets.
### Features
- Simple access through the post_training API
- Fully integrated with Llama Stack
- GPU support, CPU support, and MPS support (MacOS Metal Performance Shaders)
### Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `device` | `str` | No | cuda | |
| `distributed_backend` | `Literal['fsdp', 'deepspeed']` | No | | |
| `checkpoint_format` | `Literal['full_state', 'huggingface']` | No | huggingface | |
| `chat_template` | `str` | No | |
| `model_specific_config` | `dict` | No | `{'trust_remote_code': True, 'attn_implementation': 'sdpa'}` | |
| `max_seq_length` | `int` | No | 2048 | |
| `gradient_checkpointing` | `bool` | No | False | |
| `save_total_limit` | `int` | No | 3 | |
| `logging_steps` | `int` | No | 10 | |
| `warmup_ratio` | `float` | No | 0.1 | |
| `weight_decay` | `float` | No | 0.01 | |
| `dataloader_num_workers` | `int` | No | 4 | |
| `dataloader_pin_memory` | `bool` | No | True | |
### Sample Configuration
```yaml
checkpoint_format: huggingface
distributed_backend: null
device: cpu
```
### Setup
You can access the HuggingFace trainer via the `starter` distribution:
```bash
llama stack build --distro starter --image-type venv
llama stack run --image-type venv ~/.llama/distributions/starter/starter-run.yaml
```
### Usage Example
```python
import time
import uuid
from llama_stack_client.types import (
post_training_supervised_fine_tune_params,
algorithm_config_param,
)
def create_http_client():
from llama_stack_client import LlamaStackClient
return LlamaStackClient(base_url="http://localhost:8321")
client = create_http_client()
# Example Dataset
client.datasets.register(
purpose="post-training/messages",
source={
"type": "uri",
"uri": "huggingface://datasets/llamastack/simpleqa?split=train",
},
dataset_id="simpleqa",
)
training_config = post_training_supervised_fine_tune_params.TrainingConfig(
data_config=post_training_supervised_fine_tune_params.TrainingConfigDataConfig(
batch_size=32,
data_format="instruct",
dataset_id="simpleqa",
shuffle=True,
),
gradient_accumulation_steps=1,
max_steps_per_epoch=0,
max_validation_steps=1,
n_epochs=4,
)
algorithm_config = algorithm_config_param.LoraFinetuningConfig(
alpha=1,
apply_lora_to_mlp=True,
apply_lora_to_output=False,
lora_attn_modules=["q_proj"],
rank=1,
type="LoRA",
)
job_uuid = f"test-job{uuid.uuid4()}"
# Example Model
training_model = "ibm-granite/granite-3.3-8b-instruct"
start_time = time.time()
response = client.post_training.supervised_fine_tune(
job_uuid=job_uuid,
logger_config={},
model=training_model,
hyperparam_search_config={},
training_config=training_config,
algorithm_config=algorithm_config,
checkpoint_dir="output",
)
print("Job: ", job_uuid)
# Wait for the job to complete!
while True:
status = client.post_training.job.status(job_uuid=job_uuid)
if not status:
print("Job not found")
break
print(status)
if status.status == "completed":
break
print("Waiting for job to complete...")
time.sleep(5)
end_time = time.time()
print("Job completed in", end_time - start_time, "seconds!")
print("Artifacts:")
print(client.post_training.job.artifacts(job_uuid=job_uuid))
```
## TorchTune
[TorchTune](https://github.com/pytorch/torchtune) is an inline post training provider for Llama Stack. It provides a simple and efficient way to fine-tune language models using PyTorch.
### Features
- Simple access through the post_training API
- Fully integrated with Llama Stack
- GPU support and single device capabilities
- Support for LoRA
### Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `torch_seed` | `int \| None` | No | | |
| `checkpoint_format` | `Literal['meta', 'huggingface']` | No | meta | |
### Sample Configuration
```yaml
checkpoint_format: meta
```
### Setup
You can access the TorchTune trainer by writing your own yaml pointing to the provider:
```yaml
post_training:
- provider_id: torchtune
provider_type: inline::torchtune
config: {}
```
You can then build and run your own stack with this provider.
### Usage Example
```python
import time
import uuid
from llama_stack_client.types import (
post_training_supervised_fine_tune_params,
algorithm_config_param,
)
def create_http_client():
from llama_stack_client import LlamaStackClient
return LlamaStackClient(base_url="http://localhost:8321")
client = create_http_client()
# Example Dataset
client.datasets.register(
purpose="post-training/messages",
source={
"type": "uri",
"uri": "huggingface://datasets/llamastack/simpleqa?split=train",
},
dataset_id="simpleqa",
)
training_config = post_training_supervised_fine_tune_params.TrainingConfig(
data_config=post_training_supervised_fine_tune_params.TrainingConfigDataConfig(
batch_size=32,
data_format="instruct",
dataset_id="simpleqa",
shuffle=True,
),
gradient_accumulation_steps=1,
max_steps_per_epoch=0,
max_validation_steps=1,
n_epochs=4,
)
algorithm_config = algorithm_config_param.LoraFinetuningConfig(
alpha=1,
apply_lora_to_mlp=True,
apply_lora_to_output=False,
lora_attn_modules=["q_proj"],
rank=1,
type="LoRA",
)
job_uuid = f"test-job{uuid.uuid4()}"
# Example Model
training_model = "meta-llama/Llama-2-7b-hf"
start_time = time.time()
response = client.post_training.supervised_fine_tune(
job_uuid=job_uuid,
logger_config={},
model=training_model,
hyperparam_search_config={},
training_config=training_config,
algorithm_config=algorithm_config,
checkpoint_dir="output",
)
print("Job: ", job_uuid)
# Wait for the job to complete!
while True:
status = client.post_training.job.status(job_uuid=job_uuid)
if not status:
print("Job not found")
break
print(status)
if status.status == "completed":
break
print("Waiting for job to complete...")
time.sleep(5)
end_time = time.time()
print("Job completed in", end_time - start_time, "seconds!")
print("Artifacts:")
print(client.post_training.job.artifacts(job_uuid=job_uuid))
```
## NVIDIA
NVIDIA's post-training provider for fine-tuning models on NVIDIA's platform.
### Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | The NVIDIA API key. |
| `dataset_namespace` | `str \| None` | No | default | The NVIDIA dataset namespace. |
| `project_id` | `str \| None` | No | test-example-model@v1 | The NVIDIA project ID. |
| `customizer_url` | `str \| None` | No | | Base URL for the NeMo Customizer API |
| `timeout` | `int` | No | 300 | Timeout for the NVIDIA Post Training API |
| `max_retries` | `int` | No | 3 | Maximum number of retries for the NVIDIA Post Training API |
| `output_model_dir` | `str` | No | test-example-model@v1 | Directory to save the output model |
### Sample Configuration
```yaml
api_key: ${env.NVIDIA_API_KEY:=}
dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default}
project_id: ${env.NVIDIA_PROJECT_ID:=test-project}
customizer_url: ${env.NVIDIA_CUSTOMIZER_URL:=http://nemo.test}
```
## Best Practices
- **Choose the right provider**: Use HuggingFace for broader compatibility, TorchTune for Meta models, or NVIDIA for their ecosystem
- **Configure hardware appropriately**: Ensure your configuration matches your available hardware (CPU, GPU, MPS)
- **Monitor jobs**: Always monitor job status and handle completion appropriately
- **Use appropriate datasets**: Ensure your dataset format matches the expected input format for your chosen provider
## Next Steps
- Check out the [Building Applications - Fine-tuning](../building-applications/index.mdx) guide for application-level examples
- See the [Providers](../providers/post_training/index.mdx) section for detailed provider documentation
- Review the [API Reference](../api-reference/post-training.mdx) for complete API documentation

View file

@ -0,0 +1,193 @@
# Scoring
The Scoring API in Llama Stack allows you to evaluate outputs of your GenAI system using various scoring functions and metrics. This section covers all available scoring providers and their configuration.
## Overview
Llama Stack provides multiple scoring providers:
- **Basic** (`inline::basic`) - Simple evaluation metrics and scoring functions
- **Braintrust** (`inline::braintrust`) - Advanced evaluation using the Braintrust platform
- **LLM-as-Judge** (`inline::llm-as-judge`) - Uses language models to evaluate responses
The Scoring API is associated with `ScoringFunction` resources and provides a suite of out-of-the-box scoring functions. You can also add custom evaluators to meet specific evaluation needs.
## Basic Scoring
Basic scoring provider for simple evaluation metrics and scoring functions. This provider offers fundamental scoring capabilities without external dependencies.
### Configuration
No configuration required - this provider works out of the box.
```yaml
{}
```
### Features
- Simple evaluation metrics (accuracy, precision, recall, F1-score)
- String matching and similarity metrics
- Basic statistical scoring functions
- No external dependencies required
- Fast execution for standard metrics
### Use Cases
- Quick evaluation of basic accuracy metrics
- String similarity comparisons
- Statistical analysis of model outputs
- Development and testing scenarios
## Braintrust
Braintrust scoring provider for evaluation and scoring using the [Braintrust platform](https://braintrustdata.com/). Braintrust provides advanced evaluation capabilities and experiment tracking.
### Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `openai_api_key` | `str \| None` | No | | The OpenAI API Key for LLM-powered evaluations |
### Sample Configuration
```yaml
openai_api_key: ${env.OPENAI_API_KEY:=}
```
### Features
- Advanced evaluation metrics
- Experiment tracking and comparison
- LLM-powered evaluation functions
- Integration with Braintrust's evaluation suite
- Detailed scoring analytics and insights
### Use Cases
- Production evaluation pipelines
- A/B testing of model versions
- Advanced scoring with custom metrics
- Detailed evaluation reporting and analysis
## LLM-as-Judge
LLM-as-judge scoring provider that uses language models to evaluate and score responses. This approach leverages the reasoning capabilities of large language models to assess quality, relevance, and other subjective metrics.
### Configuration
No configuration required - this provider works out of the box.
```yaml
{}
```
### Features
- Subjective quality evaluation using LLMs
- Flexible evaluation criteria definition
- Natural language evaluation explanations
- Support for complex evaluation scenarios
- Contextual understanding of responses
### Use Cases
- Evaluating response quality and relevance
- Assessing creativity and coherence
- Subjective metric evaluation
- Human-like judgment for complex tasks
## Usage Examples
### Basic Scoring Example
```python
from llama_stack_client import LlamaStackClient
client = LlamaStackClient(base_url="http://localhost:8321")
# Register a basic accuracy scoring function
client.scoring_functions.register(
scoring_function_id="basic_accuracy",
provider_id="basic",
provider_scoring_function_id="accuracy"
)
# Use the scoring function
result = client.scoring.score(
input_rows=[
{"expected": "Paris", "actual": "Paris"},
{"expected": "London", "actual": "Paris"}
],
scoring_function_id="basic_accuracy"
)
print(f"Accuracy: {result.results[0].score}")
```
### LLM-as-Judge Example
```python
# Register an LLM-as-judge scoring function
client.scoring_functions.register(
scoring_function_id="quality_judge",
provider_id="llm_judge",
provider_scoring_function_id="response_quality",
params={
"criteria": "Evaluate response quality, relevance, and helpfulness",
"scale": "1-10"
}
)
# Score responses using LLM judgment
result = client.scoring.score(
input_rows=[{
"query": "What is machine learning?",
"response": "Machine learning is a subset of AI that enables computers to learn patterns from data..."
}],
scoring_function_id="quality_judge"
)
```
### Braintrust Integration Example
```python
# Register a Braintrust scoring function
client.scoring_functions.register(
scoring_function_id="braintrust_eval",
provider_id="braintrust",
provider_scoring_function_id="semantic_similarity"
)
# Run evaluation with Braintrust
result = client.scoring.score(
input_rows=[{
"reference": "The capital of France is Paris",
"candidate": "Paris is the capital city of France"
}],
scoring_function_id="braintrust_eval"
)
```
## Best Practices
- **Choose appropriate providers**: Use Basic for simple metrics, Braintrust for advanced analytics, LLM-as-Judge for subjective evaluation
- **Define clear criteria**: When using LLM-as-Judge, provide specific evaluation criteria and scales
- **Validate scoring functions**: Test your scoring functions with known examples before production use
- **Monitor performance**: Track scoring performance and adjust thresholds based on results
- **Combine multiple metrics**: Use different scoring providers together for comprehensive evaluation
## Integration with Evaluation
The Scoring API works closely with the [Evaluation](./evaluation.mdx) API to provide comprehensive evaluation workflows:
1. **Datasets** are loaded via the DatasetIO API
2. **Evaluation** generates model outputs using the Eval API
3. **Scoring** evaluates the quality of outputs using various scoring functions
4. **Results** are aggregated and reported for analysis
## Next Steps
- Check out the [Evaluation](./evaluation.mdx) guide for running complete evaluations
- See the [Building Applications - Evaluation](../building-applications/evals.mdx) guide for application examples
- Review the [Evaluation Reference](../references/evals-reference.mdx) for comprehensive scoring function usage
- Explore the [Evaluation Concepts](../concepts/evaluation-concepts.mdx) for detailed conceptual information

View file

@ -0,0 +1,185 @@
---
title: Agent Execution Loop
description: Understanding the internal processing flow of Llama Stack agents
sidebar_label: Agent Execution Loop
sidebar_position: 4
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Agent Execution Loop
Agents are the heart of Llama Stack applications. They combine inference, memory, safety, and tool usage into coherent workflows. At its core, an agent follows a sophisticated execution loop that enables multi-step reasoning, tool usage, and safety checks.
## Steps in the Agent Workflow
Each agent turn follows these key steps:
1. **Initial Safety Check**: The user's input is first screened through configured safety shields
2. **Context Retrieval**:
- If RAG is enabled, the agent can choose to query relevant documents from memory banks. You can use the `instructions` field to steer the agent.
- For new documents, they are first inserted into the memory bank.
- Retrieved context is provided to the LLM as a tool response in the message history.
3. **Inference Loop**: The agent enters its main execution loop:
- The LLM receives a user prompt (with previous tool outputs)
- The LLM generates a response, potentially with [tool calls](./tools)
- If tool calls are present:
- Tool inputs are safety-checked
- Tools are executed (e.g., web search, code execution)
- Tool responses are fed back to the LLM for synthesis
- The loop continues until:
- The LLM provides a final response without tool calls
- Maximum iterations are reached
- Token limit is exceeded
4. **Final Safety Check**: The agent's final response is screened through safety shields
## Execution Flow Diagram
```mermaid
sequenceDiagram
participant U as User
participant E as Executor
participant M as Memory Bank
participant L as LLM
participant T as Tools
participant S as Safety Shield
Note over U,S: Agent Turn Start
U->>S: 1. Submit Prompt
activate S
S->>E: Input Safety Check
deactivate S
loop Inference Loop
E->>L: 2.1 Augment with Context
L-->>E: 2.2 Response (with/without tool calls)
alt Has Tool Calls
E->>S: Check Tool Input
S->>T: 3.1 Execute Tool
T-->>E: 3.2 Tool Response
E->>L: 4.1 Tool Response
L-->>E: 4.2 Synthesized Response
end
opt Stop Conditions
Note over E: Break if:
Note over E: - No tool calls
Note over E: - Max iterations reached
Note over E: - Token limit exceeded
end
end
E->>S: Output Safety Check
S->>U: 5. Final Response
```
Each step in this process can be monitored and controlled through configurations.
## Agent Execution Example
Here's an example that demonstrates monitoring the agent's execution:
<Tabs>
<TabItem value="streaming" label="Streaming Execution">
```python
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger
# Replace host and port
client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")
agent = Agent(
client,
# Check with `llama-stack-client models list`
model="Llama3.2-3B-Instruct",
instructions="You are a helpful assistant",
# Enable both RAG and tool usage
tools=[
{
"name": "builtin::rag/knowledge_search",
"args": {"vector_db_ids": ["my_docs"]},
},
"builtin::code_interpreter",
],
# Configure safety (optional)
input_shields=["llama_guard"],
output_shields=["llama_guard"],
# Control the inference loop
max_infer_iters=5,
sampling_params={
"strategy": {"type": "top_p", "temperature": 0.7, "top_p": 0.95},
"max_tokens": 2048,
},
)
session_id = agent.create_session("monitored_session")
# Stream the agent's execution steps
response = agent.create_turn(
messages=[{"role": "user", "content": "Analyze this code and run it"}],
documents=[
{
"content": "https://raw.githubusercontent.com/example/code.py",
"mime_type": "text/plain",
}
],
session_id=session_id,
)
# Monitor each step of execution
for log in AgentEventLogger().log(response):
log.print()
```
</TabItem>
<TabItem value="non-streaming" label="Non-Streaming Execution">
```python
from rich.pretty import pprint
# Using non-streaming API, the response contains input, steps, and output.
response = agent.create_turn(
messages=[{"role": "user", "content": "Analyze this code and run it"}],
documents=[
{
"content": "https://raw.githubusercontent.com/example/code.py",
"mime_type": "text/plain",
}
],
session_id=session_id,
stream=False,
)
pprint(f"Input: {response.input_messages}")
pprint(f"Output: {response.output_message.content}")
pprint(f"Steps: {response.steps}")
```
</TabItem>
</Tabs>
## Key Configuration Options
### Loop Control
- **max_infer_iters**: Maximum number of inference iterations (default: 5)
- **max_tokens**: Token limit for responses
- **temperature**: Controls response randomness
### Safety Configuration
- **input_shields**: Safety checks for user input
- **output_shields**: Safety checks for agent responses
### Tool Integration
- **tools**: List of available tools for the agent
- **tool_choice**: Control over when tools are used
## Related Resources
- **[Agents](./agent)** - Understanding agent fundamentals
- **[Tools Integration](./tools)** - Adding capabilities to agents
- **[Safety Guardrails](./safety)** - Implementing safety measures
- **[RAG (Retrieval Augmented Generation)](./rag)** - Building knowledge-enhanced workflows

View file

@ -0,0 +1,112 @@
---
title: Agents
description: Build powerful AI applications with the Llama Stack agent framework
sidebar_label: Agents
sidebar_position: 3
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Agents
An Agent in Llama Stack is a powerful abstraction that allows you to build complex AI applications.
The Llama Stack agent framework is built on a modular architecture that allows for flexible and powerful AI applications. This document explains the key components and how they work together.
## Core Concepts
### 1. Agent Configuration
Agents are configured using the `AgentConfig` class, which includes:
- **Model**: The underlying LLM to power the agent
- **Instructions**: System prompt that defines the agent's behavior
- **Tools**: Capabilities the agent can use to interact with external systems
- **Safety Shields**: Guardrails to ensure responsible AI behavior
```python
from llama_stack_client import Agent
# Create the agent
agent = Agent(
llama_stack_client,
model="meta-llama/Llama-3-70b-chat",
instructions="You are a helpful assistant that can use tools to answer questions.",
tools=["builtin::code_interpreter", "builtin::rag/knowledge_search"],
)
```
### 2. Sessions
Agents maintain state through sessions, which represent a conversation thread:
```python
# Create a session
session_id = agent.create_session(session_name="My conversation")
```
### 3. Turns
Each interaction with an agent is called a "turn" and consists of:
- **Input Messages**: What the user sends to the agent
- **Steps**: The agent's internal processing (inference, tool execution, etc.)
- **Output Message**: The agent's response
<Tabs>
<TabItem value="streaming" label="Streaming Response">
```python
from llama_stack_client import AgentEventLogger
# Create a turn with streaming response
turn_response = agent.create_turn(
session_id=session_id,
messages=[{"role": "user", "content": "Tell me about Llama models"}],
)
for log in AgentEventLogger().log(turn_response):
log.print()
```
</TabItem>
<TabItem value="non-streaming" label="Non-Streaming Response">
```python
from rich.pretty import pprint
# Non-streaming API
response = agent.create_turn(
session_id=session_id,
messages=[{"role": "user", "content": "Tell me about Llama models"}],
stream=False,
)
print("Inputs:")
pprint(response.input_messages)
print("Output:")
pprint(response.output_message.content)
print("Steps:")
pprint(response.steps)
```
</TabItem>
</Tabs>
### 4. Steps
Each turn consists of multiple steps that represent the agent's thought process:
- **Inference Steps**: The agent generating text responses
- **Tool Execution Steps**: The agent using tools to gather information
- **Shield Call Steps**: Safety checks being performed
## Agent Execution Loop
Refer to the [Agent Execution Loop](./agent-execution-loop) for more details on what happens within an agent turn.
## Related Resources
- **[Agent Execution Loop](./agent-execution-loop)** - Understanding the internal processing flow
- **[RAG (Retrieval Augmented Generation)](./rag)** - Building knowledge-enhanced agents
- **[Tools Integration](./tools)** - Extending agent capabilities with external tools
- **[Safety Guardrails](./safety)** - Implementing responsible AI practices

View file

@ -0,0 +1,264 @@
---
title: Evaluations
description: Evaluate LLM applications with Llama Stack's comprehensive evaluation framework
sidebar_label: Evaluations
sidebar_position: 7
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Evaluations
The Llama Stack provides a comprehensive set of APIs for supporting evaluations of LLM applications:
- **`/datasetio` + `/datasets` API**: Manage evaluation datasets and data input/output
- **`/scoring` + `/scoring_functions` API**: Apply scoring functions to evaluate responses
- **`/eval` + `/benchmarks` API**: Run benchmarks and structured evaluations
This guide walks you through the process of evaluating an LLM application built using Llama Stack. For detailed API reference, check out the [Evaluation Reference](/docs/references/evals-reference) guide that covers the complete set of APIs and developer experience flow.
:::tip[Interactive Examples]
Check out our [Colab notebook](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing) for working examples with evaluations, or try the [Getting Started notebook](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb).
:::
## Application Evaluation Example
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)
Llama Stack offers a library of scoring functions and the `/scoring` API, allowing you to run evaluations on your pre-annotated AI application datasets.
In this example, we will show you how to:
1. **Build an Agent** with Llama Stack
2. **Query the agent's sessions, turns, and steps** to analyze execution
3. **Evaluate the results** using scoring functions
## Step-by-Step Evaluation Process
### 1. Building a Search Agent
First, let's create an agent that can search the web to answer questions:
```python
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger
client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")
agent = Agent(
client,
model="meta-llama/Llama-3.3-70B-Instruct",
instructions="You are a helpful assistant. Use search tool to answer the questions.",
tools=["builtin::websearch"],
)
# Test prompts for evaluation
user_prompts = [
"Which teams played in the NBA Western Conference Finals of 2024. Search the web for the answer.",
"In which episode and season of South Park does Bill Cosby (BSM-471) first appear? Give me the number and title. Search the web for the answer.",
"What is the British-American kickboxer Andrew Tate's kickboxing name? Search the web for the answer.",
]
session_id = agent.create_session("test-session")
# Execute all prompts in the session
for prompt in user_prompts:
response = agent.create_turn(
messages=[
{
"role": "user",
"content": prompt,
}
],
session_id=session_id,
)
for log in AgentEventLogger().log(response):
log.print()
```
### 2. Query Agent Execution Steps
Now, let's analyze the agent's execution steps to understand its performance:
<Tabs>
<TabItem value="session-analysis" label="Session Analysis">
```python
from rich.pretty import pprint
# Query the agent's session to get detailed execution data
session_response = client.agents.session.retrieve(
session_id=session_id,
agent_id=agent.agent_id,
)
pprint(session_response)
```
</TabItem>
<TabItem value="tool-validation" label="Tool Usage Validation">
```python
# Sanity check: Verify that all user prompts are followed by tool calls
num_tool_call = 0
for turn in session_response.turns:
for step in turn.steps:
if (
step.step_type == "tool_execution"
and step.tool_calls[0].tool_name == "brave_search"
):
num_tool_call += 1
print(
f"{num_tool_call}/{len(session_response.turns)} user prompts are followed by a tool call to `brave_search`"
)
```
</TabItem>
</Tabs>
### 3. Evaluate Agent Responses
Now we'll evaluate the agent's responses using Llama Stack's scoring API:
<Tabs>
<TabItem value="data-preparation" label="Data Preparation">
```python
# Process agent execution history into evaluation rows
eval_rows = []
# Define expected answers for our test prompts
expected_answers = [
"Dallas Mavericks and the Minnesota Timberwolves",
"Season 4, Episode 12",
"King Cobra",
]
# Create evaluation dataset from agent responses
for i, turn in enumerate(session_response.turns):
eval_rows.append(
{
"input_query": turn.input_messages[0].content,
"generated_answer": turn.output_message.content,
"expected_answer": expected_answers[i],
}
)
pprint(eval_rows)
```
</TabItem>
<TabItem value="scoring" label="Scoring & Evaluation">
```python
# Configure scoring parameters
scoring_params = {
"basic::subset_of": None, # Check if generated answer contains expected answer
}
# Run evaluation using Llama Stack's scoring API
scoring_response = client.scoring.score(
input_rows=eval_rows,
scoring_functions=scoring_params
)
pprint(scoring_response)
# Analyze results
for i, result in enumerate(scoring_response.results):
print(f"Query {i+1}: {result.score}")
print(f" Generated: {eval_rows[i]['generated_answer'][:100]}...")
print(f" Expected: {expected_answers[i]}")
print(f" Score: {result.score}")
print()
```
</TabItem>
</Tabs>
## Available Scoring Functions
Llama Stack provides several built-in scoring functions:
### Basic Scoring Functions
- **`basic::subset_of`**: Checks if the expected answer is contained in the generated response
- **`basic::exact_match`**: Performs exact string matching between expected and generated answers
- **`basic::regex_match`**: Uses regular expressions to match patterns in responses
### Advanced Scoring Functions
- **`llm_as_judge::accuracy`**: Uses an LLM to judge response accuracy
- **`llm_as_judge::helpfulness`**: Evaluates how helpful the response is
- **`llm_as_judge::safety`**: Assesses response safety and appropriateness
### Custom Scoring Functions
You can also create custom scoring functions for domain-specific evaluation needs.
## Evaluation Workflow Best Practices
### 🎯 **Dataset Preparation**
- Use diverse test cases that cover edge cases and common scenarios
- Include clear expected answers or success criteria
- Balance your dataset across different difficulty levels
### 📊 **Metrics Selection**
- Choose appropriate scoring functions for your use case
- Combine multiple metrics for comprehensive evaluation
- Consider both automated and human evaluation metrics
### 🔄 **Iterative Improvement**
- Run evaluations regularly during development
- Use evaluation results to identify areas for improvement
- Track performance changes over time
### 📈 **Analysis & Reporting**
- Analyze failures to understand model limitations
- Generate comprehensive evaluation reports
- Share results with stakeholders for informed decision-making
## Advanced Evaluation Scenarios
### Batch Evaluation
For evaluating large datasets efficiently:
```python
# Prepare large evaluation dataset
large_eval_dataset = [
{"input_query": query, "expected_answer": answer}
for query, answer in zip(queries, expected_answers)
]
# Run batch evaluation
batch_results = client.scoring.score(
input_rows=large_eval_dataset,
scoring_functions={
"basic::subset_of": None,
"llm_as_judge::accuracy": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
}
)
```
### Multi-Metric Evaluation
Combining different scoring approaches:
```python
comprehensive_scoring = {
"exact_match": "basic::exact_match",
"subset_match": "basic::subset_of",
"llm_judge": "llm_as_judge::accuracy",
"safety_check": "llm_as_judge::safety",
}
results = client.scoring.score(
input_rows=eval_rows,
scoring_functions=comprehensive_scoring
)
```
## Related Resources
- **[Agents](./agent)** - Building agents for evaluation
- **[Tools Integration](./tools)** - Using tools in evaluated agents
- **[Evaluation Reference](/docs/references/evals-reference)** - Complete API reference for evaluations
- **[Getting Started Notebook](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)** - Interactive examples
- **[Evaluation Examples](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing)** - Additional evaluation scenarios

View file

@ -0,0 +1,83 @@
---
title: Building Applications
description: Comprehensive guides for building AI applications with Llama Stack
sidebar_label: Overview
sidebar_position: 5
---
# AI Application Examples
Llama Stack provides all the building blocks needed to create sophisticated AI applications.
## Getting Started
The best way to get started is to look at this comprehensive notebook which walks through the various APIs (from basic inference, to RAG agents) and how to use them.
**📓 [Building AI Applications Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)**
## Core Topics
Here are the key topics that will help you build effective AI applications:
### 🤖 **Agent Development**
- **[Agent Framework](./agent)** - Understand the components and design patterns of the Llama Stack agent framework
- **[Agent Execution Loop](./agent-execution-loop)** - How agents process information, make decisions, and execute actions
- **[Agents vs Responses API](./responses-vs-agents)** - Learn when to use each API for different use cases
### 📚 **Knowledge Integration**
- **[RAG (Retrieval-Augmented Generation)](./rag)** - Enhance your agents with external knowledge through retrieval mechanisms
### 🛠️ **Capabilities & Extensions**
- **[Tools](./tools)** - Extend your agents' capabilities by integrating with external tools and APIs
### 📊 **Quality & Monitoring**
- **[Evaluations](./evals)** - Evaluate your agents' effectiveness and identify areas for improvement
- **[Telemetry](./telemetry)** - Monitor and analyze your agents' performance and behavior
- **[Safety](./safety)** - Implement guardrails and safety measures to ensure responsible AI behavior
### 🎮 **Interactive Development**
- **[Playground](./playground)** - Interactive environment for testing and developing applications
## Application Patterns
### 🤖 **Conversational Agents**
Build intelligent chatbots and assistants that can:
- Maintain context across conversations
- Access external knowledge bases
- Execute actions through tool integrations
- Apply safety filters and guardrails
### 📖 **RAG Applications**
Create knowledge-augmented applications that:
- Retrieve relevant information from documents
- Generate contextually accurate responses
- Handle large knowledge bases efficiently
- Provide source attribution
### 🔧 **Tool-Enhanced Systems**
Develop applications that can:
- Search the web for real-time information
- Interact with databases and APIs
- Perform calculations and analysis
- Execute complex multi-step workflows
### 🛡️ **Enterprise Applications**
Build production-ready systems with:
- Comprehensive safety measures
- Performance monitoring and analytics
- Scalable deployment configurations
- Evaluation and quality assurance
## Next Steps
1. **📖 Start with the Notebook** - Work through the complete tutorial
2. **🎯 Choose Your Pattern** - Pick the application type that matches your needs
3. **🏗️ Build Your Foundation** - Set up your [providers](/docs/providers/) and [distributions](/docs/distributions/)
4. **🚀 Deploy & Monitor** - Use our [deployment guides](/docs/deploying/) for production
## Related Resources
- **[Getting Started](/docs/getting-started/)** - Basic setup and concepts
- **[Providers](/docs/providers/)** - Available AI service providers
- **[Distributions](/docs/distributions/)** - Pre-configured deployment packages
- **[API Reference](/docs/api/)** - Complete API documentation

View file

@ -0,0 +1,299 @@
---
title: Llama Stack Playground
description: Interactive interface to explore and experiment with Llama Stack capabilities
sidebar_label: Playground
sidebar_position: 10
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Llama Stack Playground
:::note[Experimental Feature]
The Llama Stack Playground is currently experimental and subject to change. We welcome feedback and contributions to help improve it.
:::
The Llama Stack Playground is a simple interface that aims to:
- **Showcase capabilities and concepts** of Llama Stack in an interactive environment
- **Demo end-to-end application code** to help users get started building their own applications
- **Provide a UI** to help users inspect and understand Llama Stack API providers and resources
## Key Features
### Interactive Playground Pages
The playground provides interactive pages for users to explore Llama Stack API capabilities:
#### Chatbot Interface
<video
controls
autoPlay
playsInline
muted
loop
style={{width: '100%'}}
>
<source src="https://github.com/user-attachments/assets/8d2ef802-5812-4a28-96e1-316038c84cbf" type="video/mp4" />
Your browser does not support the video tag.
</video>
<Tabs>
<TabItem value="chat" label="Chat">
**Simple Chat Interface**
- Chat directly with Llama models through an intuitive interface
- Uses the `/inference/chat-completion` streaming API under the hood
- Real-time message streaming for responsive interactions
- Perfect for testing model capabilities and prompt engineering
</TabItem>
<TabItem value="rag" label="RAG Chat">
**Document-Aware Conversations**
- Upload documents to create memory banks
- Chat with a RAG-enabled agent that can query your documents
- Uses Llama Stack's `/agents` API to create and manage RAG sessions
- Ideal for exploring knowledge-enhanced AI applications
</TabItem>
</Tabs>
#### Evaluation Interface
<video
controls
autoPlay
playsInline
muted
loop
style={{width: '100%'}}
>
<source src="https://github.com/user-attachments/assets/6cc1659f-eba4-49ca-a0a5-7c243557b4f5" type="video/mp4" />
Your browser does not support the video tag.
</video>
<Tabs>
<TabItem value="scoring" label="Scoring Evaluations">
**Custom Dataset Evaluation**
- Upload your own evaluation datasets
- Run evaluations using available scoring functions
- Uses Llama Stack's `/scoring` API for flexible evaluation workflows
- Great for testing application performance on custom metrics
</TabItem>
<TabItem value="benchmarks" label="Benchmark Evaluations">
<video
controls
autoPlay
playsInline
muted
loop
style={{width: '100%', marginBottom: '1rem'}}
>
<source src="https://github.com/user-attachments/assets/345845c7-2a2b-4095-960a-9ae40f6a93cf" type="video/mp4" />
Your browser does not support the video tag.
</video>
**Pre-registered Evaluation Tasks**
- Evaluate models or agents on pre-defined tasks
- Uses Llama Stack's `/eval` API for comprehensive evaluation
- Combines datasets and scoring functions for standardized testing
**Setup Requirements:**
Register evaluation datasets and benchmarks first:
```bash
# Register evaluation dataset
llama-stack-client datasets register \
--dataset-id "mmlu" \
--provider-id "huggingface" \
--url "https://huggingface.co/datasets/llamastack/evals" \
--metadata '{"path": "llamastack/evals", "name": "evals__mmlu__details", "split": "train"}' \
--schema '{"input_query": {"type": "string"}, "expected_answer": {"type": "string"}, "chat_completion_input": {"type": "string"}}'
# Register benchmark task
llama-stack-client benchmarks register \
--eval-task-id meta-reference-mmlu \
--provider-id meta-reference \
--dataset-id mmlu \
--scoring-functions basic::regex_parser_multiple_choice_answer
```
</TabItem>
</Tabs>
#### Inspection Interface
<video
controls
autoPlay
playsInline
muted
loop
style={{width: '100%'}}
>
<source src="https://github.com/user-attachments/assets/01d52b2d-92af-4e3a-b623-a9b8ba22ba99" type="video/mp4" />
Your browser does not support the video tag.
</video>
<Tabs>
<TabItem value="providers" label="API Providers">
**Provider Management**
- Inspect available Llama Stack API providers
- View provider configurations and capabilities
- Uses the `/providers` API for real-time provider information
- Essential for understanding your deployment's capabilities
</TabItem>
<TabItem value="resources" label="API Resources">
**Resource Exploration**
- Inspect Llama Stack API resources including:
- **Models**: Available language models
- **Datasets**: Registered evaluation datasets
- **Memory Banks**: Vector databases and knowledge stores
- **Benchmarks**: Evaluation tasks and scoring functions
- **Shields**: Safety and content moderation tools
- Uses `/<resources>/list` APIs for comprehensive resource visibility
- For detailed information about resources, see [Core Concepts](/docs/concepts)
</TabItem>
</Tabs>
## Getting Started
### Quick Start Guide
<Tabs>
<TabItem value="setup" label="Setup">
**1. Start the Llama Stack API Server**
```bash
# Build and run a distribution (example: together)
llama stack build --distro together --image-type venv
llama stack run together
```
**2. Start the Streamlit UI**
```bash
# Launch the playground interface
uv run --with ".[ui]" streamlit run llama_stack.core/ui/app.py
```
</TabItem>
<TabItem value="usage" label="Usage Tips">
**Making the Most of the Playground:**
- **Start with Chat**: Test basic model interactions and prompt engineering
- **Explore RAG**: Upload sample documents to see knowledge-enhanced responses
- **Try Evaluations**: Use the scoring interface to understand evaluation metrics
- **Inspect Resources**: Check what providers and resources are available
- **Experiment with Settings**: Adjust parameters to see how they affect results
</TabItem>
</Tabs>
### Available Distributions
The playground works with any Llama Stack distribution. Popular options include:
<Tabs>
<TabItem value="together" label="Together AI">
```bash
llama stack build --distro together --image-type venv
llama stack run together
```
**Features:**
- Cloud-hosted models
- Fast inference
- Multiple model options
</TabItem>
<TabItem value="ollama" label="Ollama (Local)">
```bash
llama stack build --distro ollama --image-type venv
llama stack run ollama
```
**Features:**
- Local model execution
- Privacy-focused
- No internet required
</TabItem>
<TabItem value="meta-reference" label="Meta Reference">
```bash
llama stack build --distro meta-reference --image-type venv
llama stack run meta-reference
```
**Features:**
- Reference implementation
- All API features available
- Best for development
</TabItem>
</Tabs>
## Use Cases & Examples
### Educational Use Cases
- **Learning Llama Stack**: Hands-on exploration of API capabilities
- **Prompt Engineering**: Interactive testing of different prompting strategies
- **RAG Experimentation**: Understanding how document retrieval affects responses
- **Evaluation Understanding**: See how different metrics evaluate model performance
### Development Use Cases
- **Prototype Testing**: Quick validation of application concepts
- **API Exploration**: Understanding available endpoints and parameters
- **Integration Planning**: Seeing how different components work together
- **Demo Creation**: Showcasing Llama Stack capabilities to stakeholders
### Research Use Cases
- **Model Comparison**: Side-by-side testing of different models
- **Evaluation Design**: Understanding how scoring functions work
- **Safety Testing**: Exploring shield effectiveness with different inputs
- **Performance Analysis**: Measuring model behavior across different scenarios
## Best Practices
### 🚀 **Getting Started**
- Begin with simple chat interactions to understand basic functionality
- Gradually explore more advanced features like RAG and evaluations
- Use the inspection tools to understand your deployment's capabilities
### 🔧 **Development Workflow**
- Use the playground to prototype before writing application code
- Test different parameter settings interactively
- Validate evaluation approaches before implementing them programmatically
### 📊 **Evaluation & Testing**
- Start with simple scoring functions before trying complex evaluations
- Use the playground to understand evaluation results before automation
- Test safety features with various input types
### 🎯 **Production Preparation**
- Use playground insights to inform your production API usage
- Test edge cases and error conditions interactively
- Validate resource configurations before deployment
## Related Resources
- **[Getting Started Guide](/docs/getting-started)** - Complete setup and introduction
- **[Core Concepts](/docs/concepts)** - Understanding Llama Stack fundamentals
- **[Agents](./agent)** - Building intelligent agents
- **[RAG (Retrieval Augmented Generation)](./rag)** - Knowledge-enhanced applications
- **[Evaluations](./evals)** - Comprehensive evaluation framework
- **[API Reference](/docs/api-reference)** - Complete API documentation

View file

@ -0,0 +1,375 @@
---
title: Retrieval Augmented Generation (RAG)
description: Build knowledge-enhanced AI applications with external document retrieval
sidebar_label: RAG (Retrieval Augmented Generation)
sidebar_position: 2
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Retrieval Augmented Generation (RAG)
RAG enables your applications to reference and recall information from previous interactions or external documents.
## Architecture Overview
Llama Stack organizes the APIs that enable RAG into three layers:
1. **Lower-Level APIs**: Deal with raw storage and retrieval. These include Vector IO, KeyValue IO (coming soon) and Relational IO (also coming soon)
2. **RAG Tool**: A first-class tool as part of the [Tools API](./tools) that allows you to ingest documents (from URLs, files, etc) with various chunking strategies and query them smartly
3. **Agents API**: The top-level [Agents API](./agent) that allows you to create agents that can use the tools to answer questions, perform tasks, and more
![RAG System Architecture](/img/rag.png)
The RAG system uses lower-level storage for different types of data:
- **Vector IO**: For semantic search and retrieval
- **Key-Value and Relational IO**: For structured data storage
:::info[Future Storage Types]
We may add more storage types like Graph IO in the future.
:::
## Setting up Vector Databases
For this guide, we will use [Ollama](https://ollama.com/) as the inference provider. Ollama is an LLM runtime that allows you to run Llama models locally.
Here's how to set up a vector database for RAG:
```python
# Create HTTP client
import os
from llama_stack_client import LlamaStackClient
client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")
# Register a vector database
vector_db_id = "my_documents"
response = client.vector_dbs.register(
vector_db_id=vector_db_id,
embedding_model="all-MiniLM-L6-v2",
embedding_dimension=384,
provider_id="faiss",
)
```
## Document Ingestion
You can ingest documents into the vector database using two methods: directly inserting pre-chunked documents or using the RAG Tool.
### Direct Document Insertion
<Tabs>
<TabItem value="basic" label="Basic Insertion">
```python
# You can insert a pre-chunked document directly into the vector db
chunks = [
{
"content": "Your document text here",
"mime_type": "text/plain",
"metadata": {
"document_id": "doc1",
"author": "Jane Doe",
},
},
]
client.vector_io.insert(vector_db_id=vector_db_id, chunks=chunks)
```
</TabItem>
<TabItem value="embeddings" label="With Precomputed Embeddings">
If you decide to precompute embeddings for your documents, you can insert them directly into the vector database by including the embedding vectors in the chunk data. This is useful if you have a separate embedding service or if you want to customize the ingestion process.
```python
chunks_with_embeddings = [
{
"content": "First chunk of text",
"mime_type": "text/plain",
"embedding": [0.1, 0.2, 0.3, ...], # Your precomputed embedding vector
"metadata": {"document_id": "doc1", "section": "introduction"},
},
{
"content": "Second chunk of text",
"mime_type": "text/plain",
"embedding": [0.2, 0.3, 0.4, ...], # Your precomputed embedding vector
"metadata": {"document_id": "doc1", "section": "methodology"},
},
]
client.vector_io.insert(vector_db_id=vector_db_id, chunks=chunks_with_embeddings)
```
:::warning[Embedding Dimensions]
When providing precomputed embeddings, ensure the embedding dimension matches the `embedding_dimension` specified when registering the vector database.
:::
</TabItem>
</Tabs>
### Document Retrieval
You can query the vector database to retrieve documents based on their embeddings.
```python
# You can then query for these chunks
chunks_response = client.vector_io.query(
vector_db_id=vector_db_id,
query="What do you know about..."
)
```
## Using the RAG Tool
:::danger[Deprecation Notice]
The RAG Tool is being deprecated in favor of directly using the OpenAI-compatible Search API. We recommend migrating to the OpenAI APIs for better compatibility and future support.
:::
A better way to ingest documents is to use the RAG Tool. This tool allows you to ingest documents from URLs, files, etc. and automatically chunks them into smaller pieces. More examples for how to format a RAGDocument can be found in the [appendix](#more-ragdocument-examples).
### OpenAI API Integration & Migration
The RAG tool has been updated to use OpenAI-compatible APIs. This provides several benefits:
- **Files API Integration**: Documents are now uploaded using OpenAI's file upload endpoints
- **Vector Stores API**: Vector storage operations use OpenAI's vector store format with configurable chunking strategies
- **Error Resilience**: When processing multiple documents, individual failures are logged but don't crash the operation. Failed documents are skipped while successful ones continue processing.
### Migration Path
We recommend migrating to the OpenAI-compatible Search API for:
1. **Better OpenAI Ecosystem Integration**: Direct compatibility with OpenAI tools and workflows including the Responses API
2. **Future-Proof**: Continued support and feature development
3. **Full OpenAI Compatibility**: Vector Stores, Files, and Search APIs are fully compatible with OpenAI's Responses API
The OpenAI APIs are used under the hood, so you can continue to use your existing RAG Tool code with minimal changes. However, we recommend updating your code to use the new OpenAI-compatible APIs for better long-term support. If any documents fail to process, they will be logged in the response but will not cause the entire operation to fail.
### RAG Tool Example
```python
from llama_stack_client import RAGDocument
urls = ["memory_optimizations.rst", "chat.rst", "llama3.rst"]
documents = [
RAGDocument(
document_id=f"num-{i}",
content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
mime_type="text/plain",
metadata={},
)
for i, url in enumerate(urls)
]
client.tool_runtime.rag_tool.insert(
documents=documents,
vector_db_id=vector_db_id,
chunk_size_in_tokens=512,
)
# Query documents
results = client.tool_runtime.rag_tool.query(
vector_db_ids=[vector_db_id],
content="What do you know about...",
)
```
### Custom Context Configuration
You can configure how the RAG tool adds metadata to the context if you find it useful for your application:
```python
# Query documents with custom template
results = client.tool_runtime.rag_tool.query(
vector_db_ids=[vector_db_id],
content="What do you know about...",
query_config={
"chunk_template": "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n",
},
)
```
## Building RAG-Enhanced Agents
One of the most powerful patterns is combining agents with RAG capabilities. Here's a complete example:
### Agent with Knowledge Search
```python
from llama_stack_client import Agent
# Create agent with memory
agent = Agent(
client,
model="meta-llama/Llama-3.3-70B-Instruct",
instructions="You are a helpful assistant",
tools=[
{
"name": "builtin::rag/knowledge_search",
"args": {
"vector_db_ids": [vector_db_id],
# Defaults
"query_config": {
"chunk_size_in_tokens": 512,
"chunk_overlap_in_tokens": 0,
"chunk_template": "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n",
},
},
}
],
)
session_id = agent.create_session("rag_session")
# Ask questions about documents in the vector db, and the agent will query the db to answer the question.
response = agent.create_turn(
messages=[{"role": "user", "content": "How to optimize memory in PyTorch?"}],
session_id=session_id,
)
```
:::tip[Agent Instructions]
The `instructions` field in the `AgentConfig` can be used to guide the agent's behavior. It is important to experiment with different instructions to see what works best for your use case.
:::
### Document-Aware Conversations
You can also pass documents along with the user's message and ask questions about them:
```python
# Initial document ingestion
response = agent.create_turn(
messages=[
{"role": "user", "content": "I am providing some documents for reference."}
],
documents=[
{
"content": "https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/memory_optimizations.rst",
"mime_type": "text/plain",
}
],
session_id=session_id,
)
# Query with RAG
response = agent.create_turn(
messages=[{"role": "user", "content": "What are the key topics in the documents?"}],
session_id=session_id,
)
```
### Viewing Agent Responses
You can print the response with the following:
```python
from llama_stack_client import AgentEventLogger
for log in AgentEventLogger().log(response):
log.print()
```
## Vector Database Management
### Unregistering Vector DBs
If you need to clean up and unregister vector databases, you can do so as follows:
<Tabs>
<TabItem value="single" label="Single Database">
```python
# Unregister a specified vector database
vector_db_id = "my_vector_db_id"
print(f"Unregistering vector database: {vector_db_id}")
client.vector_dbs.unregister(vector_db_id=vector_db_id)
```
</TabItem>
<TabItem value="all" label="All Databases">
```python
# Unregister all vector databases
for vector_db_id in client.vector_dbs.list():
print(f"Unregistering vector database: {vector_db_id.identifier}")
client.vector_dbs.unregister(vector_db_id=vector_db_id.identifier)
```
</TabItem>
</Tabs>
## Best Practices
### 🎯 **Document Chunking**
- Use appropriate chunk sizes (512 tokens is often a good starting point)
- Consider overlap between chunks for better context preservation
- Experiment with different chunking strategies for your content type
### 🔍 **Embedding Strategy**
- Choose embedding models that match your domain
- Consider the trade-off between embedding dimension and performance
- Test different embedding models for your specific use case
### 📊 **Query Optimization**
- Use specific, well-formed queries for better retrieval
- Experiment with different search strategies
- Consider hybrid approaches (keyword + semantic search)
### 🛡️ **Error Handling**
- Implement proper error handling for failed document processing
- Monitor ingestion success rates
- Have fallback strategies for retrieval failures
## Appendix
### More RAGDocument Examples
Here are various ways to create RAGDocument objects for different content types:
```python
from llama_stack_client import RAGDocument
import base64
# File URI
RAGDocument(document_id="num-0", content={"uri": "file://path/to/file"})
# Plain text
RAGDocument(document_id="num-1", content="plain text")
# Explicit text input
RAGDocument(
document_id="num-2",
content={
"type": "text",
"text": "plain text input",
}, # for inputs that should be treated as text explicitly
)
# Image from URL
RAGDocument(
document_id="num-3",
content={
"type": "image",
"image": {"url": {"uri": "https://mywebsite.com/image.jpg"}},
},
)
# Base64 encoded image
B64_ENCODED_IMAGE = base64.b64encode(
requests.get(
"https://raw.githubusercontent.com/meta-llama/llama-stack/refs/heads/main/docs/_static/llama-stack.png"
).content
)
RAGDocument(
document_id="num-4",
content={"type": "image", "image": {"data": B64_ENCODED_IMAGE}},
)
```
For more strongly typed interaction, use the typed dicts found [here](https://github.com/meta-llama/llama-stack-client-python/blob/38cd91c9e396f2be0bec1ee96a19771582ba6f17/src/llama_stack_client/types/shared_params/document.py).
## Related Resources
- **[Agent Framework](./agent)** - Building intelligent agents
- **[Tools Integration](./tools)** - Extending agent capabilities
- **[Vector IO Providers](/docs/providers/vector_io/)** - Available vector database options
- **[OpenAI Compatibility](/docs/providers/openai-compatibility)** - Using OpenAI APIs

View file

@ -0,0 +1,221 @@
---
title: Agents vs OpenAI Responses API
description: Compare the Agents API and OpenAI Responses API for building AI applications with tool calling capabilities
sidebar_label: Agents vs Responses API
sidebar_position: 5
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Agents vs OpenAI Responses API
Llama Stack (LLS) provides two different APIs for building AI applications with tool calling capabilities: the **Agents API** and the **OpenAI Responses API**. While both enable AI systems to use tools, and maintain full conversation history, they serve different use cases and have distinct characteristics.
:::note
**Note:** For simple and basic inferencing, you may want to use the [Chat Completions API](/docs/providers/openai-compatibility#chat-completions) directly, before progressing to Agents or Responses API.
:::
## Overview
### LLS Agents API
The Agents API is a full-featured, stateful system designed for complex, multi-turn conversations. It maintains conversation state through persistent sessions identified by a unique session ID. The API supports comprehensive agent lifecycle management, detailed execution tracking, and rich metadata about each interaction through a structured session/turn/step hierarchy. The API can orchestrate multiple tool calls within a single turn.
### OpenAI Responses API
The OpenAI Responses API is a full-featured, stateful system designed for complex, multi-turn conversations, with direct compatibility with OpenAI's conversational patterns enhanced by LLama Stack's tool calling capabilities. It maintains conversation state by chaining responses through a `previous_response_id`, allowing interactions to branch or continue from any prior point. Each response can perform multiple tool calls within a single turn.
### Key Differences
The LLS Agents API uses the Chat Completions API on the backend for inference as it's the industry standard for building AI applications and most LLM providers are compatible with this API. For a detailed comparison between Responses and Chat Completions, see [OpenAI's documentation](https://platform.openai.com/docs/guides/responses-vs-chat-completions).
Additionally, Agents let you specify input/output shields whereas Responses do not (though support is planned). Agents use a linear conversation model referenced by a single session ID. Responses, on the other hand, support branching, where each response can serve as a fork point, and conversations are tracked by the latest response ID. Responses also lets you dynamically choose the model, vector store, files, MCP servers, and more on each inference call, enabling more complex workflows. Agents require a static configuration for these components at the start of the session.
Today the Agents and Responses APIs can be used independently depending on the use case. But, it is also productive to treat the APIs as complementary. It is not currently supported, but it is planned for the LLS Agents API to alternatively use the Responses API as its backend instead of the default Chat Completions API, i.e., enabling a combination of the safety features of Agents with the dynamic configuration and branching capabilities of Responses.
## Feature Comparison
| Feature | LLS Agents API | OpenAI Responses API |
|---------|------------|---------------------|
| **Conversation Management** | Linear persistent sessions | Can branch from any previous response ID |
| **Input/Output Safety Shields** | Supported | Not yet supported |
| **Per-call Flexibility** | Static per-session configuration | Dynamic per-call configuration |
## Use Case Example: Research with Multiple Search Methods
Let's compare how both APIs handle a research task where we need to:
1. Search for current information and examples
2. Access different information sources dynamically
3. Continue the conversation based on search results
<Tabs>
<TabItem value="agents" label="Agents API">
### Session-based Configuration with Safety Shields
```python
# Create agent with static session configuration
agent = Agent(
client,
model="Llama3.2-3B-Instruct",
instructions="You are a helpful coding assistant",
tools=[
{
"name": "builtin::rag/knowledge_search",
"args": {"vector_db_ids": ["code_docs"]},
},
"builtin::code_interpreter",
],
input_shields=["llama_guard"],
output_shields=["llama_guard"],
)
session_id = agent.create_session("code_session")
# First turn: Search and execute
response1 = agent.create_turn(
messages=[
{
"role": "user",
"content": "Find examples of sorting algorithms and run a bubble sort on [3,1,4,1,5]",
},
],
session_id=session_id,
)
# Continue conversation in same session
response2 = agent.create_turn(
messages=[
{
"role": "user",
"content": "Now optimize that code and test it with a larger dataset",
},
],
session_id=session_id, # Same session, maintains full context
)
# Agents API benefits:
# ✅ Safety shields protect against malicious code execution
# ✅ Session maintains context between code executions
# ✅ Consistent tool configuration throughout conversation
print(f"First result: {response1.output_message.content}")
print(f"Optimization: {response2.output_message.content}")
```
</TabItem>
<TabItem value="responses" label="Responses API">
### Dynamic Per-call Configuration with Branching
```python
# First response: Use web search for latest algorithms
response1 = client.responses.create(
model="Llama3.2-3B-Instruct",
input="Search for the latest efficient sorting algorithms and their performance comparisons",
tools=[
{
"type": "web_search",
},
], # Web search for current information
)
# Continue conversation: Switch to file search for local docs
response2 = client.responses.create(
model="Llama3.2-1B-Instruct", # Switch to faster model
input="Now search my uploaded files for existing sorting implementations",
tools=[
{ # Using Responses API built-in tools
"type": "file_search",
"vector_store_ids": ["vs_abc123"], # Vector store containing uploaded files
},
],
previous_response_id=response1.id,
)
# Branch from first response: Try different search approach
response3 = client.responses.create(
model="Llama3.2-3B-Instruct",
input="Instead, search the web for Python-specific sorting best practices",
tools=[{"type": "web_search"}], # Different web search query
previous_response_id=response1.id, # Branch from response1
)
# Responses API benefits:
# ✅ Dynamic tool switching (web search ↔ file search per call)
# ✅ OpenAI-compatible tool patterns (web_search, file_search)
# ✅ Branch conversations to explore different information sources
# ✅ Model flexibility per search type
print(f"Web search results: {response1.output_message.content}")
print(f"File search results: {response2.output_message.content}")
print(f"Alternative web search: {response3.output_message.content}")
```
</TabItem>
</Tabs>
Both APIs demonstrate distinct strengths that make them valuable on their own for different scenarios. The Agents API excels in providing structured, safety-conscious workflows with persistent session management, while the Responses API offers flexibility through dynamic configuration and OpenAI compatible tool patterns.
## Use Case Examples
### 1. Research and Analysis with Safety Controls
**Best Choice: Agents API**
**Scenario:** You're building a research assistant for a financial institution that needs to analyze market data, execute code to process financial models, and search through internal compliance documents. The system must ensure all interactions are logged for regulatory compliance and protected by safety shields to prevent malicious code execution or data leaks.
**Why Agents API?** The Agents API provides persistent session management for iterative research workflows, built-in safety shields to protect against malicious code in financial models, and structured execution logs (session/turn/step) required for regulatory compliance. The static tool configuration ensures consistent access to your knowledge base and code interpreter throughout the entire research session.
### 2. Dynamic Information Gathering with Branching Exploration
**Best Choice: Responses API**
**Scenario:** You're building a competitive intelligence tool that helps businesses research market trends. Users need to dynamically switch between web search for current market data and file search through uploaded industry reports. They also want to branch conversations to explore different market segments simultaneously and experiment with different models for various analysis types.
**Why Responses API?** The Responses API's branching capability lets users explore multiple market segments from any research point. Dynamic per-call configuration allows switching between web search and file search as needed, while experimenting with different models (faster models for quick searches, more powerful models for deep analysis). The OpenAI-compatible tool patterns make integration straightforward.
### 3. OpenAI Migration with Advanced Tool Capabilities
**Best Choice: Responses API**
**Scenario:** You have an existing application built with OpenAI's Assistants API that uses file search and web search capabilities. You want to migrate to Llama Stack for better performance and cost control while maintaining the same tool calling patterns and adding new capabilities like dynamic vector store selection.
**Why Responses API?** The Responses API provides full OpenAI tool compatibility (`web_search`, `file_search`) with identical syntax, making migration seamless. The dynamic per-call configuration enables advanced features like switching vector stores per query or changing models based on query complexity - capabilities that extend beyond basic OpenAI functionality while maintaining compatibility.
### 4. Educational Programming Tutor
**Best Choice: Agents API**
**Scenario:** You're building a programming tutor that maintains student context across multiple sessions, safely executes code exercises, and tracks learning progress with audit trails for educators.
**Why Agents API?** Persistent sessions remember student progress across multiple interactions, safety shields prevent malicious code execution while allowing legitimate programming exercises, and structured execution logs help educators track learning patterns.
### 5. Advanced Software Debugging Assistant
**Best Choice: Agents API with Responses Backend**
**Scenario:** You're building a debugging assistant that helps developers troubleshoot complex issues. It needs to maintain context throughout a debugging session, safely execute diagnostic code, switch between different analysis tools dynamically, and branch conversations to explore multiple potential causes simultaneously.
**Why Agents + Responses?** The Agent provides safety shields for code execution and session management for the overall debugging workflow. The underlying Responses API enables dynamic model selection and flexible tool configuration per query, while branching lets you explore different theories (memory leak vs. concurrency issue) from the same debugging point and compare results.
:::info[Future Enhancement]
The ability to use Responses API as the backend for Agents is not yet implemented but is planned for a future release. Currently, Agents use Chat Completions API as their backend by default.
:::
## Decision Framework
Use this framework to choose the right API for your use case:
### Choose Agents API when:
- ✅ You need **safety shields** for input/output validation
- ✅ Your application requires **linear conversation flow** with persistent context
- ✅ You need **audit trails** and structured execution logs
- ✅ Your tool configuration is **static** throughout the session
- ✅ You're building **educational, financial, or enterprise** applications with compliance requirements
### Choose Responses API when:
- ✅ You need **conversation branching** to explore multiple paths
- ✅ You want **dynamic per-call configuration** (models, tools, vector stores)
- ✅ You're **migrating from OpenAI** and want familiar tool patterns
- ✅ You need **OpenAI compatibility** for existing workflows
- ✅ Your application benefits from **flexible, experimental** interactions
## Related Resources
- **[Agents](./agent)** - Understanding the Agents API fundamentals
- **[Agent Execution Loop](./agent-execution-loop)** - How agents process turns and steps
- **[Tools Integration](./tools)** - Adding capabilities to both APIs
- **[OpenAI Compatibility](/docs/providers/openai-compatibility)** - Using OpenAI-compatible endpoints
- **[Safety Guardrails](./safety)** - Implementing safety measures in agents

View file

@ -0,0 +1,395 @@
---
title: Safety Guardrails
description: Implement safety measures and content moderation in Llama Stack applications
sidebar_label: Safety
sidebar_position: 9
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Safety Guardrails
Safety is a critical component of any AI application. Llama Stack provides a comprehensive Shield system that can be applied at multiple touchpoints to ensure responsible AI behavior and content moderation.
## Shield System Overview
The Shield system in Llama Stack provides:
- **Content filtering** for both input and output messages
- **Multi-touchpoint protection** across your application flow
- **Configurable safety policies** tailored to your use case
- **Integration with agents** for automated safety enforcement
## Basic Shield Usage
### Registering a Safety Shield
<Tabs>
<TabItem value="registration" label="Shield Registration">
```python
# Register a safety shield
shield_id = "content_safety"
client.shields.register(
shield_id=shield_id,
provider_shield_id="llama-guard-basic"
)
```
</TabItem>
<TabItem value="manual-check" label="Manual Safety Check">
```python
# Run content through shield manually
response = client.safety.run_shield(
shield_id=shield_id,
messages=[{"role": "user", "content": "User message here"}]
)
if response.violation:
print(f"Safety violation detected: {response.violation.user_message}")
# Handle violation appropriately
else:
print("Content passed safety checks")
```
</TabItem>
</Tabs>
## Agent Integration
Shields can be automatically applied to agent interactions for seamless safety enforcement:
<Tabs>
<TabItem value="input-shields" label="Input Shields">
```python
from llama_stack_client import Agent
# Create agent with input safety shields
agent = Agent(
client,
model="meta-llama/Llama-3.2-3B-Instruct",
instructions="You are a helpful assistant",
input_shields=["content_safety"], # Shield user inputs
tools=["builtin::websearch"],
)
session_id = agent.create_session("safe_session")
# All user inputs will be automatically screened
response = agent.create_turn(
messages=[{"role": "user", "content": "Tell me about AI safety"}],
session_id=session_id,
)
```
</TabItem>
<TabItem value="output-shields" label="Output Shields">
```python
# Create agent with output safety shields
agent = Agent(
client,
model="meta-llama/Llama-3.2-3B-Instruct",
instructions="You are a helpful assistant",
output_shields=["content_safety"], # Shield agent outputs
tools=["builtin::websearch"],
)
session_id = agent.create_session("safe_session")
# All agent responses will be automatically screened
response = agent.create_turn(
messages=[{"role": "user", "content": "Help me with my research"}],
session_id=session_id,
)
```
</TabItem>
<TabItem value="both-shields" label="Input & Output Shields">
```python
# Create agent with comprehensive safety coverage
agent = Agent(
client,
model="meta-llama/Llama-3.2-3B-Instruct",
instructions="You are a helpful assistant",
input_shields=["content_safety"], # Screen user inputs
output_shields=["content_safety"], # Screen agent outputs
tools=["builtin::websearch"],
)
session_id = agent.create_session("fully_protected_session")
# Both input and output are automatically protected
response = agent.create_turn(
messages=[{"role": "user", "content": "Research question here"}],
session_id=session_id,
)
```
</TabItem>
</Tabs>
## Available Shield Types
### Llama Guard Shields
Llama Guard provides state-of-the-art content safety classification:
<Tabs>
<TabItem value="basic" label="Basic Llama Guard">
```python
# Basic Llama Guard for general content safety
client.shields.register(
shield_id="llama_guard_basic",
provider_shield_id="llama-guard-basic"
)
```
**Use Cases:**
- General content moderation
- Harmful content detection
- Basic safety compliance
</TabItem>
<TabItem value="advanced" label="Advanced Llama Guard">
```python
# Advanced Llama Guard with custom categories
client.shields.register(
shield_id="llama_guard_advanced",
provider_shield_id="llama-guard-advanced",
config={
"categories": [
"violence", "hate_speech", "sexual_content",
"self_harm", "illegal_activity"
],
"threshold": 0.8
}
)
```
**Use Cases:**
- Fine-tuned safety policies
- Domain-specific content filtering
- Enterprise compliance requirements
</TabItem>
</Tabs>
### Custom Safety Shields
Create domain-specific safety shields for specialized use cases:
```python
# Register custom safety shield
client.shields.register(
shield_id="financial_compliance",
provider_shield_id="custom-financial-shield",
config={
"detect_pii": True,
"financial_advice_warning": True,
"regulatory_compliance": "FINRA"
}
)
```
## Safety Response Handling
When safety violations are detected, handle them appropriately:
<Tabs>
<TabItem value="basic-handling" label="Basic Handling">
```python
response = client.safety.run_shield(
shield_id="content_safety",
messages=[{"role": "user", "content": "Potentially harmful content"}]
)
if response.violation:
violation = response.violation
print(f"Violation Type: {violation.violation_type}")
print(f"User Message: {violation.user_message}")
print(f"Metadata: {violation.metadata}")
# Log the violation for audit purposes
logger.warning(f"Safety violation detected: {violation.violation_type}")
# Provide appropriate user feedback
return "I can't help with that request. Please try asking something else."
```
</TabItem>
<TabItem value="advanced-handling" label="Advanced Handling">
```python
def handle_safety_response(safety_response, user_message):
"""Advanced safety response handling with logging and user feedback"""
if not safety_response.violation:
return {"safe": True, "message": "Content passed safety checks"}
violation = safety_response.violation
# Log violation details
audit_log = {
"timestamp": datetime.now().isoformat(),
"violation_type": violation.violation_type,
"original_message": user_message,
"shield_response": violation.user_message,
"metadata": violation.metadata
}
logger.warning(f"Safety violation: {audit_log}")
# Determine appropriate response based on violation type
if violation.violation_type == "hate_speech":
user_feedback = "I can't engage with content that contains hate speech. Let's keep our conversation respectful."
elif violation.violation_type == "violence":
user_feedback = "I can't provide information that could promote violence. How else can I help you today?"
else:
user_feedback = "I can't help with that request. Please try asking something else."
return {
"safe": False,
"user_feedback": user_feedback,
"violation_details": audit_log
}
# Usage
safety_result = handle_safety_response(response, user_input)
if not safety_result["safe"]:
return safety_result["user_feedback"]
```
</TabItem>
</Tabs>
## Safety Configuration Best Practices
### 🛡️ **Multi-Layer Protection**
- Use both input and output shields for comprehensive coverage
- Combine multiple shield types for different threat categories
- Implement fallback mechanisms when shields fail
### 📊 **Monitoring & Auditing**
- Log all safety violations for compliance and analysis
- Monitor false positive rates to tune shield sensitivity
- Track safety metrics across different use cases
### ⚙️ **Configuration Management**
- Use environment-specific safety configurations
- Implement A/B testing for shield effectiveness
- Regularly update shield models and policies
### 🔧 **Integration Patterns**
- Integrate shields early in the development process
- Test safety measures with adversarial inputs
- Provide clear user feedback for violations
## Advanced Safety Scenarios
### Context-Aware Safety
```python
# Safety shields that consider conversation context
agent = Agent(
client,
model="meta-llama/Llama-3.2-3B-Instruct",
instructions="You are a healthcare assistant",
input_shields=["medical_safety"],
output_shields=["medical_safety"],
# Context helps shields make better decisions
safety_context={
"domain": "healthcare",
"user_type": "patient",
"compliance_level": "HIPAA"
}
)
```
### Dynamic Shield Selection
```python
def select_shield_for_user(user_profile):
"""Select appropriate safety shield based on user context"""
if user_profile.age < 18:
return "child_safety_shield"
elif user_profile.context == "enterprise":
return "enterprise_compliance_shield"
else:
return "general_safety_shield"
# Use dynamic shield selection
shield_id = select_shield_for_user(current_user)
response = client.safety.run_shield(
shield_id=shield_id,
messages=messages
)
```
## Compliance and Regulations
### Industry-Specific Safety
<Tabs>
<TabItem value="healthcare" label="Healthcare (HIPAA)">
```python
# Healthcare-specific safety configuration
client.shields.register(
shield_id="hipaa_compliance",
provider_shield_id="healthcare-safety-shield",
config={
"detect_phi": True, # Protected Health Information
"medical_advice_warning": True,
"regulatory_framework": "HIPAA"
}
)
```
</TabItem>
<TabItem value="financial" label="Financial (FINRA)">
```python
# Financial services safety configuration
client.shields.register(
shield_id="finra_compliance",
provider_shield_id="financial-safety-shield",
config={
"detect_financial_advice": True,
"investment_disclaimers": True,
"regulatory_framework": "FINRA"
}
)
```
</TabItem>
<TabItem value="education" label="Education (COPPA)">
```python
# Educational platform safety for minors
client.shields.register(
shield_id="coppa_compliance",
provider_shield_id="educational-safety-shield",
config={
"child_protection": True,
"educational_content_only": True,
"regulatory_framework": "COPPA"
}
)
```
</TabItem>
</Tabs>
## Related Resources
- **[Agents](./agent)** - Integrating safety shields with intelligent agents
- **[Agent Execution Loop](./agent-execution-loop)** - Understanding safety in the execution flow
- **[Evaluations](./evals)** - Evaluating safety shield effectiveness
- **[Telemetry](./telemetry)** - Monitoring safety violations and metrics
- **[Llama Guard Documentation](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard3)** - Advanced safety model details

View file

@ -0,0 +1,342 @@
---
title: Telemetry
description: Monitor and observe Llama Stack applications with comprehensive telemetry capabilities
sidebar_label: Telemetry
sidebar_position: 8
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Telemetry
The Llama Stack telemetry system provides comprehensive tracing, metrics, and logging capabilities. It supports multiple sink types including OpenTelemetry, SQLite, and Console output for complete observability of your AI applications.
## Event Types
The telemetry system supports three main types of events:
<Tabs>
<TabItem value="unstructured" label="Unstructured Logs">
Free-form log messages with severity levels for general application logging:
```python
unstructured_log_event = UnstructuredLogEvent(
message="This is a log message",
severity=LogSeverity.INFO
)
```
</TabItem>
<TabItem value="metrics" label="Metric Events">
Numerical measurements with units for tracking performance and usage:
```python
metric_event = MetricEvent(
metric="my_metric",
value=10,
unit="count"
)
```
</TabItem>
<TabItem value="structured" label="Structured Logs">
System events like span start/end that provide structured operation tracking:
```python
structured_log_event = SpanStartPayload(
name="my_span",
parent_span_id="parent_span_id"
)
```
</TabItem>
</Tabs>
## Spans and Traces
- **Spans**: Represent individual operations with timing information and hierarchical relationships
- **Traces**: Collections of related spans that form a complete request flow across your application
This hierarchical structure allows you to understand the complete execution path of requests through your Llama Stack application.
## Automatic Metrics Generation
Llama Stack automatically generates metrics during inference operations. These metrics are aggregated at the **inference request level** and provide insights into token usage and model performance.
### Available Metrics
The following metrics are automatically generated for each inference request:
| Metric Name | Type | Unit | Description | Labels |
|-------------|------|------|-------------|--------|
| `llama_stack_prompt_tokens_total` | Counter | `tokens` | Number of tokens in the input prompt | `model_id`, `provider_id` |
| `llama_stack_completion_tokens_total` | Counter | `tokens` | Number of tokens in the generated response | `model_id`, `provider_id` |
| `llama_stack_tokens_total` | Counter | `tokens` | Total tokens used (prompt + completion) | `model_id`, `provider_id` |
### Metric Generation Flow
1. **Token Counting**: During inference operations (chat completion, completion, etc.), the system counts tokens in both input prompts and generated responses
2. **Metric Construction**: For each request, `MetricEvent` objects are created with the token counts
3. **Telemetry Logging**: Metrics are sent to the configured telemetry sinks
4. **OpenTelemetry Export**: When OpenTelemetry is enabled, metrics are exposed as standard OpenTelemetry counters
### Metric Aggregation Level
All metrics are generated and aggregated at the **inference request level**. This means:
- Each individual inference request generates its own set of metrics
- Metrics are not pre-aggregated across multiple requests
- Aggregation (sums, averages, etc.) can be performed by your observability tools (Prometheus, Grafana, etc.)
- Each metric includes labels for `model_id` and `provider_id` to enable filtering and grouping
### Example Metric Event
```python
MetricEvent(
trace_id="1234567890abcdef",
span_id="abcdef1234567890",
metric="total_tokens",
value=150,
timestamp=1703123456.789,
unit="tokens",
attributes={
"model_id": "meta-llama/Llama-3.2-3B-Instruct",
"provider_id": "tgi"
},
)
```
## Telemetry Sinks
Choose from multiple sink types based on your observability needs:
<Tabs>
<TabItem value="opentelemetry" label="OpenTelemetry">
Send events to an OpenTelemetry Collector for integration with observability platforms:
**Use Cases:**
- Visualizing traces in tools like Jaeger
- Collecting metrics for Prometheus
- Integration with enterprise observability stacks
**Features:**
- Standard OpenTelemetry format
- Compatible with all OpenTelemetry collectors
- Supports both traces and metrics
</TabItem>
<TabItem value="sqlite" label="SQLite">
Store events in a local SQLite database for direct querying:
**Use Cases:**
- Local development and debugging
- Custom analytics and reporting
- Offline analysis of application behavior
**Features:**
- Direct SQL querying capabilities
- Persistent local storage
- No external dependencies
</TabItem>
<TabItem value="console" label="Console">
Print events to the console for immediate debugging:
**Use Cases:**
- Development and testing
- Quick debugging sessions
- Simple logging without external tools
**Features:**
- Immediate output visibility
- No setup required
- Human-readable format
</TabItem>
</Tabs>
## Configuration
### Meta-Reference Provider
Currently, only the meta-reference provider is implemented. It can be configured to send events to multiple sink types:
```yaml
telemetry:
- provider_id: meta-reference
provider_type: inline::meta-reference
config:
service_name: "llama-stack-service"
sinks: ['console', 'sqlite', 'otel_trace', 'otel_metric']
otel_exporter_otlp_endpoint: "http://localhost:4318"
sqlite_db_path: "/path/to/telemetry.db"
```
### Environment Variables
Configure telemetry behavior using environment variables:
- **`OTEL_EXPORTER_OTLP_ENDPOINT`**: OpenTelemetry Collector endpoint (default: `http://localhost:4318`)
- **`OTEL_SERVICE_NAME`**: Service name for telemetry (default: empty string)
- **`TELEMETRY_SINKS`**: Comma-separated list of sinks (default: `console,sqlite`)
## Visualization with Jaeger
The `otel_trace` sink works with any service compatible with the OpenTelemetry collector. Traces and metrics use separate endpoints but can share the same collector.
### Starting Jaeger
Start a Jaeger instance with OTLP HTTP endpoint at 4318 and the Jaeger UI at 16686:
```bash
docker run --pull always --rm --name jaeger \
-p 16686:16686 -p 4318:4318 \
jaegertracing/jaeger:2.1.0
```
Once running, you can visualize traces by navigating to [http://localhost:16686/](http://localhost:16686/).
## Querying Metrics
When using the OpenTelemetry sink, metrics are exposed in standard format and can be queried through various tools:
<Tabs>
<TabItem value="prometheus" label="Prometheus Queries">
Example Prometheus queries for analyzing token usage:
```promql
# Total tokens used across all models
sum(llama_stack_tokens_total)
# Tokens per model
sum by (model_id) (llama_stack_tokens_total)
# Average tokens per request over 5 minutes
rate(llama_stack_tokens_total[5m])
# Token usage by provider
sum by (provider_id) (llama_stack_tokens_total)
```
</TabItem>
<TabItem value="grafana" label="Grafana Dashboards">
Create dashboards using Prometheus as a data source:
- **Token Usage Over Time**: Line charts showing token consumption trends
- **Model Performance**: Comparison of different models by token efficiency
- **Provider Analysis**: Breakdown of usage across different providers
- **Request Patterns**: Understanding peak usage times and patterns
</TabItem>
<TabItem value="otlp" label="OpenTelemetry Collector">
Forward metrics to other observability systems:
- Export to multiple backends simultaneously
- Apply transformations and filtering
- Integrate with existing monitoring infrastructure
</TabItem>
</Tabs>
## SQLite Querying
The `sqlite` sink allows you to query traces without an external system. This is particularly useful for development and custom analytics.
### Example Queries
```sql
-- Query recent traces
SELECT * FROM traces WHERE timestamp > datetime('now', '-1 hour');
-- Analyze span durations
SELECT name, AVG(duration_ms) as avg_duration
FROM spans
GROUP BY name
ORDER BY avg_duration DESC;
-- Find slow operations
SELECT * FROM spans
WHERE duration_ms > 1000
ORDER BY duration_ms DESC;
```
:::tip[Advanced Analytics]
Refer to the [Getting Started notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) for more examples on querying traces and spans programmatically.
:::
## Best Practices
### 🔍 **Monitoring Strategy**
- Use OpenTelemetry for production environments
- Combine multiple sinks for development (console + SQLite)
- Set up alerts on key metrics like token usage and error rates
### 📊 **Metrics Analysis**
- Track token usage trends to optimize costs
- Monitor response times across different models
- Analyze usage patterns to improve resource allocation
### 🚨 **Alerting & Debugging**
- Set up alerts for unusual token consumption spikes
- Use trace data to debug performance issues
- Monitor error rates and failure patterns
### 🔧 **Configuration Management**
- Use environment variables for flexible deployment
- Configure appropriate retention policies for SQLite
- Ensure proper network access to OpenTelemetry collectors
## Integration Examples
### Basic Telemetry Setup
```python
from llama_stack_client import LlamaStackClient
# Client with telemetry headers
client = LlamaStackClient(
base_url="http://localhost:8000",
extra_headers={
"X-Telemetry-Service": "my-ai-app",
"X-Telemetry-Version": "1.0.0"
}
)
# All API calls will be automatically traced
response = client.inference.chat_completion(
model="meta-llama/Llama-3.2-3B-Instruct",
messages=[{"role": "user", "content": "Hello!"}]
)
```
### Custom Telemetry Context
```python
# Add custom span attributes for better tracking
with tracer.start_as_current_span("custom_operation") as span:
span.set_attribute("user_id", "user123")
span.set_attribute("operation_type", "chat_completion")
response = client.inference.chat_completion(
model="meta-llama/Llama-3.2-3B-Instruct",
messages=[{"role": "user", "content": "Hello!"}]
)
```
## Related Resources
- **[Agents](./agent)** - Monitoring agent execution with telemetry
- **[Evaluations](./evals)** - Using telemetry data for performance evaluation
- **[Getting Started Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)** - Telemetry examples and queries
- **[OpenTelemetry Documentation](https://opentelemetry.io/)** - Comprehensive observability framework
- **[Jaeger Documentation](https://www.jaegertracing.io/)** - Distributed tracing visualization

View file

@ -0,0 +1,340 @@
---
title: Tools
description: Extend agent capabilities with external tools and function calling
sidebar_label: Tools
sidebar_position: 6
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Tools
Tools are functions that can be invoked by an agent to perform tasks. They are organized into tool groups and registered with specific providers. Each tool group represents a collection of related tools from a single provider. They are organized into groups so that state can be externalized: the collection operates on the same state typically.
An example of this would be a "db_access" tool group that contains tools for interacting with a database. "list_tables", "query_table", "insert_row" could be examples of tools in this group.
Tools are treated as any other resource in llama stack like models. You can register them, have providers for them etc.
When instantiating an agent, you can provide it a list of tool groups that it has access to. Agent gets the corresponding tool definitions for the specified tool groups and passes them along to the model.
Refer to the [Building AI Applications](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) notebook for more examples on how to use tools.
## Server-side vs. Client-side Tool Execution
Llama Stack allows you to use both server-side and client-side tools. With server-side tools, `agent.create_turn` can perform execution of the tool calls emitted by the model transparently giving the user the final answer desired. If client-side tools are provided, the tool call is sent back to the user for execution and optional continuation using the `agent.resume_turn` method.
## Server-side Tools
Llama Stack provides built-in providers for some common tools. These include web search, math, and RAG capabilities.
### Web Search
You have three providers to execute the web search tool calls generated by a model: Brave Search, Bing Search, and Tavily Search.
To indicate that the web search tool calls should be executed by brave-search, you can point the "builtin::websearch" toolgroup to the "brave-search" provider.
```python
client.toolgroups.register(
toolgroup_id="builtin::websearch",
provider_id="brave-search",
args={"max_results": 5},
)
```
The tool requires an API key which can be provided either in the configuration or through the request header `X-LlamaStack-Provider-Data`. The format of the header is:
```
{"<provider_name>_api_key": <your api key>}
```
### Math
The WolframAlpha tool provides access to computational knowledge through the WolframAlpha API.
```python
client.toolgroups.register(
toolgroup_id="builtin::wolfram_alpha",
provider_id="wolfram-alpha"
)
```
Example usage:
```python
result = client.tool_runtime.invoke_tool(
tool_name="wolfram_alpha",
args={"query": "solve x^2 + 2x + 1 = 0"}
)
```
### RAG
The RAG tool enables retrieval of context from various types of memory banks (vector, key-value, keyword, and graph).
```python
# Register Memory tool group
client.toolgroups.register(
toolgroup_id="builtin::rag",
provider_id="faiss",
args={"max_chunks": 5, "max_tokens_in_context": 4096},
)
```
Features:
- Support for multiple memory bank types
- Configurable query generation
- Context retrieval with token limits
:::note[Default Configuration]
By default, llama stack run.yaml defines toolgroups for web search, wolfram alpha and rag, that are provided by tavily-search, wolfram-alpha and rag providers.
:::
## Model Context Protocol (MCP)
[MCP](https://github.com/modelcontextprotocol) is an upcoming, popular standard for tool discovery and execution. It is a protocol that allows tools to be dynamically discovered from an MCP endpoint and can be used to extend the agent's capabilities.
### Using Remote MCP Servers
You can find some popular remote MCP servers [here](https://github.com/jaw9c/awesome-remote-mcp-servers). You can register them as toolgroups in the same way as local providers.
```python
client.toolgroups.register(
toolgroup_id="mcp::deepwiki",
provider_id="model-context-protocol",
mcp_endpoint=URL(uri="https://mcp.deepwiki.com/sse"),
)
```
Note that most of the more useful MCP servers need you to authenticate with them. Many of them use OAuth2.0 for authentication. You can provide authorization headers to send to the MCP server using the "Provider Data" abstraction provided by Llama Stack. When making an agent call,
```python
agent = Agent(
...,
tools=["mcp::deepwiki"],
extra_headers={
"X-LlamaStack-Provider-Data": json.dumps(
{
"mcp_headers": {
"http://mcp.deepwiki.com/sse": {
"Authorization": "Bearer <your_access_token>",
},
},
}
),
},
)
agent.create_turn(...)
```
### Running Your Own MCP Server
Here's an example of how to run a simple MCP server that exposes a File System as a set of tools to the Llama Stack agent.
<Tabs>
<TabItem value="setup" label="Server Setup">
```shell
# Start your MCP server
mkdir /tmp/content
touch /tmp/content/foo
touch /tmp/content/bar
npx -y supergateway --port 8000 --stdio 'npx -y @modelcontextprotocol/server-filesystem /tmp/content'
```
</TabItem>
<TabItem value="register" label="Registration">
```python
# Register the MCP server as a tool group
client.toolgroups.register(
toolgroup_id="mcp::filesystem",
provider_id="model-context-protocol",
mcp_endpoint=URL(uri="http://localhost:8000/sse"),
)
```
</TabItem>
</Tabs>
## Adding Custom (Client-side) Tools
When you want to use tools other than the built-in tools, you just need to implement a python function with a docstring. The content of the docstring will be used to describe the tool and the parameters and passed along to the generative model.
```python
# Example tool definition
def my_tool(input: int) -> int:
"""
Runs my awesome tool.
:param input: some int parameter
"""
return input * 2
```
:::tip[Documentation Best Practices]
We employ python docstrings to describe the tool and the parameters. It is important to document the tool and the parameters so that the model can use the tool correctly. It is recommended to experiment with different docstrings to see how they affect the model's behavior.
:::
Once defined, simply pass the tool to the agent config. `Agent` will take care of the rest (calling the model with the tool definition, executing the tool, and returning the result to the model for the next iteration).
```python
# Example agent config with client provided tools
agent = Agent(client, ..., tools=[my_tool])
```
Refer to [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/blob/main/examples/agents/e2e_loop_with_client_tools.py) for an example of how to use client provided tools.
## Tool Invocation
Tools can be invoked using the `invoke_tool` method:
```python
result = client.tool_runtime.invoke_tool(
tool_name="web_search",
kwargs={"query": "What is the capital of France?"}
)
```
The result contains:
- `content`: The tool's output
- `error_message`: Optional error message if the tool failed
- `error_code`: Optional error code if the tool failed
## Listing Available Tools
You can list all available tools or filter by tool group:
```python
# List all tools
all_tools = client.tools.list_tools()
# List tools in a specific group
group_tools = client.tools.list_tools(toolgroup_id="search_tools")
```
## Complete Examples
### Web Search Agent
<Tabs>
<TabItem value="setup" label="Setup & Configuration">
1. Start by registering a Tavily API key at [Tavily](https://tavily.com/).
2. [Optional] Provide the API key directly to the Llama Stack server
```bash
export TAVILY_SEARCH_API_KEY="your key"
```
```bash
--env TAVILY_SEARCH_API_KEY=${TAVILY_SEARCH_API_KEY}
```
</TabItem>
<TabItem value="implementation" label="Implementation">
```python
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.types.agent_create_params import AgentConfig
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client import LlamaStackClient
client = LlamaStackClient(
base_url=f"http://localhost:8321",
provider_data={
"tavily_search_api_key": "your_TAVILY_SEARCH_API_KEY"
}, # Set this from the client side. No need to provide it if it has already been configured on the Llama Stack server.
)
agent = Agent(
client,
model="meta-llama/Llama-3.2-3B-Instruct",
instructions=(
"You are a web search assistant, must use websearch tool to look up the most current and precise information available. "
),
tools=["builtin::websearch"],
)
session_id = agent.create_session("websearch-session")
response = agent.create_turn(
messages=[
{"role": "user", "content": "How did the USA perform in the last Olympics?"}
],
session_id=session_id,
)
for log in EventLogger().log(response):
log.print()
```
</TabItem>
</Tabs>
### WolframAlpha Math Agent
<Tabs>
<TabItem value="setup" label="Setup & Configuration">
1. Start by registering for a WolframAlpha API key at [WolframAlpha Developer Portal](https://developer.wolframalpha.com/access).
2. Provide the API key either when starting the Llama Stack server:
```bash
--env WOLFRAM_ALPHA_API_KEY=${WOLFRAM_ALPHA_API_KEY}
```
or from the client side:
```python
client = LlamaStackClient(
base_url="http://localhost:8321",
provider_data={"wolfram_alpha_api_key": wolfram_api_key},
)
```
</TabItem>
<TabItem value="implementation" label="Implementation">
```python
# Configure the tools in the Agent by setting tools=["builtin::wolfram_alpha"]
agent = Agent(
client,
model="meta-llama/Llama-3.2-3B-Instruct",
instructions="You are a mathematical assistant that can solve complex equations.",
tools=["builtin::wolfram_alpha"],
)
session_id = agent.create_session("math-session")
# Example user query
response = agent.create_turn(
messages=[{"role": "user", "content": "Solve x^2 + 2x + 1 = 0 using WolframAlpha"}],
session_id=session_id,
)
```
</TabItem>
</Tabs>
## Best Practices
### 🛠️ **Tool Selection**
- Use **server-side tools** for production applications requiring reliability and security
- Use **client-side tools** for development, prototyping, or specialized integrations
- Combine multiple tool types for comprehensive functionality
### 📝 **Documentation**
- Write clear, detailed docstrings for custom tools
- Include parameter descriptions and expected return types
- Test tool descriptions with the model to ensure proper usage
### 🔐 **Security**
- Store API keys securely using environment variables or secure configuration
- Use the `X-LlamaStack-Provider-Data` header for dynamic authentication
- Validate tool inputs and outputs for security
### 🔄 **Error Handling**
- Implement proper error handling in custom tools
- Use structured error responses with meaningful messages
- Monitor tool performance and reliability
## Related Resources
- **[Agents](./agent)** - Building intelligent agents with tools
- **[RAG (Retrieval Augmented Generation)](./rag)** - Using knowledge retrieval tools
- **[Agent Execution Loop](./agent-execution-loop)** - Understanding tool execution flow
- **[Building AI Applications Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)** - Comprehensive examples
- **[Llama Stack Apps Examples](https://github.com/meta-llama/llama-stack-apps)** - Real-world tool implementations

View file

@ -0,0 +1,19 @@
---
title: API Providers
description: Understanding remote vs inline provider implementations
sidebar_label: API Providers
sidebar_position: 4
---
# API Providers
The goal of Llama Stack is to build an ecosystem where users can easily swap out different implementations for the same API. Examples for these include:
- LLM inference providers (e.g., Fireworks, Together, AWS Bedrock, Groq, Cerebras, SambaNova, vLLM, etc.),
- Vector databases (e.g., ChromaDB, Weaviate, Qdrant, Milvus, FAISS, PGVector, etc.),
- Safety providers (e.g., Meta's Llama Guard, AWS Bedrock Guardrails, etc.)
Providers come in two flavors:
- **Remote**: the provider runs as a separate service external to the Llama Stack codebase. Llama Stack contains a small amount of adapter code.
- **Inline**: the provider is fully specified and implemented within the Llama Stack codebase. It may be a simple wrapper around an existing library, or a full fledged implementation within Llama Stack.
Most importantly, Llama Stack always strives to provide at least one fully inline provider for each API so you can iterate on a fully featured environment locally.

View file

@ -0,0 +1,28 @@
---
title: APIs
description: Available REST APIs and planned capabilities in Llama Stack
sidebar_label: APIs
sidebar_position: 3
---
# APIs
A Llama Stack API is described as a collection of REST endpoints. We currently support the following APIs:
- **Inference**: run inference with a LLM
- **Safety**: apply safety policies to the output at a Systems (not only model) level
- **Agents**: run multi-step agentic workflows with LLMs with tool usage, memory (RAG), etc.
- **DatasetIO**: interface with datasets and data loaders
- **Scoring**: evaluate outputs of the system
- **Eval**: generate outputs (via Inference or Agents) and perform scoring
- **VectorIO**: perform operations on vector stores, such as adding documents, searching, and deleting documents
- **Telemetry**: collect telemetry data from the system
- **Post Training**: fine-tune a model
- **Tool Runtime**: interact with various tools and protocols
- **Responses**: generate responses from an LLM using this OpenAI compatible API.
We are working on adding a few more APIs to complete the application lifecycle. These will include:
- **Batch Inference**: run inference on a dataset of inputs
- **Batch Agents**: run agents on a dataset of inputs
- **Synthetic Data Generation**: generate synthetic data for model development
- **Batches**: OpenAI-compatible batch management for inference

View file

@ -0,0 +1,74 @@
---
title: Llama Stack Architecture
description: Understanding Llama Stack's service-oriented design and benefits
sidebar_label: Architecture
sidebar_position: 2
---
# Llama Stack architecture
Llama Stack allows you to build different layers of distributions for your AI workloads using various SDKs and API providers.
<img src="/img/llama-stack.png" alt="Llama Stack" width="400" />
## Benefits of Llama stack
### Current challenges in custom AI applications
Building production AI applications today requires solving multiple challenges:
**Infrastructure Complexity**
- Running large language models efficiently requires specialized infrastructure.
- Different deployment scenarios (local development, cloud, edge) need different solutions.
- Moving from development to production often requires significant rework.
**Essential Capabilities**
- Safety guardrails and content filtering are necessary in an enterprise setting.
- Just model inference is not enough - Knowledge retrieval and RAG capabilities are required.
- Nearly any application needs composable multi-step workflows.
- Without monitoring, observability and evaluation, you end up operating in the dark.
**Lack of Flexibility and Choice**
- Directly integrating with multiple providers creates tight coupling.
- Different providers have different APIs and abstractions.
- Changing providers requires significant code changes.
### Our Solution: A Universal Stack
Llama Stack addresses these challenges through a service-oriented, API-first approach:
**Develop Anywhere, Deploy Everywhere**
- Start locally with CPU-only setups
- Move to GPU acceleration when needed
- Deploy to cloud or edge without code changes
- Same APIs and developer experience everywhere
**Production-Ready Building Blocks**
- Pre-built safety guardrails and content filtering
- Built-in RAG and agent capabilities
- Comprehensive evaluation toolkit
- Full observability and monitoring
**True Provider Independence**
- Swap providers without application changes
- Mix and match best-in-class implementations
- Federation and fallback support
- No vendor lock-in
**Robust Ecosystem**
- Llama Stack is already integrated with distribution partners (cloud providers, hardware vendors, and AI-focused companies).
- Ecosystem offers tailored infrastructure, software, and services for deploying a variety of models.
## Our Philosophy
- **Service-Oriented**: REST APIs enforce clean interfaces and enable seamless transitions across different environments.
- **Composability**: Every component is independent but works together seamlessly
- **Production Ready**: Built for real-world applications, not just demos
- **Turnkey Solutions**: Easy to deploy built in solutions for popular deployment scenarios
With Llama Stack, you can focus on building your application while we handle the infrastructure complexity, essential capabilities, and provider integrations.

View file

@ -0,0 +1,16 @@
---
title: Distributions
description: Pre-packaged provider configurations for different deployment scenarios
sidebar_label: Distributions
sidebar_position: 5
---
# Distributions
While there is a lot of flexibility to mix-and-match providers, often users will work with a specific set of providers (hardware support, contractual obligations, etc.) We therefore need to provide a _convenient shorthand_ for such collections. We call this shorthand a **Llama Stack Distribution** or a **Distro**. One can think of it as specific pre-packaged versions of the Llama Stack. Here are some examples:
**Remotely Hosted Distro**: These are the simplest to consume from a user perspective. You can simply obtain the API key for these providers, point to a URL and have _all_ Llama Stack APIs working out of the box. Currently, [Fireworks](https://fireworks.ai/) and [Together](https://together.xyz/) provide such easy-to-consume Llama Stack distributions.
**Locally Hosted Distro**: You may want to run Llama Stack on your own hardware. Typically though, you still need to use Inference via an external service. You can use providers like HuggingFace TGI, Fireworks, Together, etc. for this purpose. Or you may have access to GPUs and can run a [vLLM](https://github.com/vllm-project/vllm) or [NVIDIA NIM](https://build.nvidia.com/nim?filters=nimType%3Anim_type_run_anywhere&q=llama) instance. If you "just" have a regular desktop machine, you can use [Ollama](https://ollama.com/) for inference. To provide convenient quick access to these options, we provide a number of such pre-configured locally-hosted Distros.
**On-device Distro**: To run Llama Stack directly on an edge device (mobile phone or a tablet), we provide Distros for [iOS](/docs/distributions/ondevice_distro/ios_sdk) and [Android](/docs/distributions/ondevice_distro/android_sdk)

View file

@ -0,0 +1,71 @@
# Evaluation Concepts
The Llama Stack Evaluation flow allows you to run evaluations on your GenAI application datasets or pre-registered benchmarks.
We introduce a set of APIs in Llama Stack for supporting running evaluations of LLM applications:
- `/datasetio` + `/datasets` API
- `/scoring` + `/scoring_functions` API
- `/eval` + `/benchmarks` API
This guide goes over the sets of APIs and developer experience flow of using Llama Stack to run evaluations for different use cases. Checkout our Colab notebook on working examples with evaluations [here](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing).
The Evaluation APIs are associated with a set of Resources. Please visit the Resources section in our [Core Concepts](./index.mdx) guide for better high-level understanding.
- **DatasetIO**: defines interface with datasets and data loaders.
- Associated with `Dataset` resource.
- **Scoring**: evaluate outputs of the system.
- Associated with `ScoringFunction` resource. We provide a suite of out-of-the box scoring functions and also the ability for you to add custom evaluators. These scoring functions are the core part of defining an evaluation task to output evaluation metrics.
- **Eval**: generate outputs (via Inference or Agents) and perform scoring.
- Associated with `Benchmark` resource.
## Open-benchmark Eval
### List of open-benchmarks Llama Stack support
Llama stack pre-registers several popular open-benchmarks to easily evaluate model perfomance via CLI.
The list of open-benchmarks we currently support:
- [MMLU-COT](https://arxiv.org/abs/2009.03300) (Measuring Massive Multitask Language Understanding): Benchmark designed to comprehensively evaluate the breadth and depth of a model's academic and professional understanding
- [GPQA-COT](https://arxiv.org/abs/2311.12022) (A Graduate-Level Google-Proof Q&A Benchmark): A challenging benchmark of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.
- [SimpleQA](https://openai.com/index/introducing-simpleqa/): Benchmark designed to access models to answer short, fact-seeking questions.
- [MMMU](https://arxiv.org/abs/2311.16502) (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI)]: Benchmark designed to evaluate multimodal models.
You can follow this [contributing guide](../references/evals-reference.mdx#open-benchmark-contributing-guide) to add more open-benchmarks to Llama Stack
### Run evaluation on open-benchmarks via CLI
We have built-in functionality to run the supported open-benckmarks using llama-stack-client CLI
#### Spin up Llama Stack server
Spin up llama stack server with 'open-benchmark' template
```bash
llama stack run llama_stack/distributions/open-benchmark/run.yaml
```
#### Run eval CLI
There are 3 necessary inputs to run a benchmark eval
- `list of benchmark_ids`: The list of benchmark ids to run evaluation on
- `model-id`: The model id to evaluate on
- `output_dir`: Path to store the evaluate results
```bash
llama-stack-client eval run-benchmark <benchmark_id_1> <benchmark_id_2> ... \
--model_id <model id to evaluate on> \
--output_dir <directory to store the evaluate results>
```
You can run
```bash
llama-stack-client eval run-benchmark help
```
to see the description of all the flags that eval run-benchmark has
In the output log, you can find the file path that has your evaluation results. Open that file and you can see you aggregate
evaluation results over there.
## What's Next?
- Check out our Colab notebook on working examples with running benchmark evaluations [here](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb#scrollTo=mxLCsP4MvFqP).
- Check out our [Building Applications - Evaluation](../building-applications/evals.mdx) guide for more details on how to use the Evaluation APIs to evaluate your applications.
- Check out our [Evaluation Reference](../references/evals-reference.mdx) for more details on the APIs.

View file

@ -0,0 +1,18 @@
---
title: Core Concepts
description: Understanding Llama Stack's service-oriented philosophy and key concepts
sidebar_label: Overview
sidebar_position: 1
---
# Core Concepts
Given Llama Stack's service-oriented philosophy, a few concepts and workflows arise which may not feel completely natural in the LLM landscape, especially if you are coming with a background in other frameworks.
This section covers the key concepts you need to understand to work effectively with Llama Stack:
- **[Architecture](./architecture)** - Llama Stack's service-oriented design and benefits
- **[APIs](./apis)** - Available REST APIs and planned capabilities
- **[API Providers](./api-providers)** - Remote vs inline provider implementations
- **[Distributions](./distributions)** - Pre-packaged provider configurations
- **[Resources](./resources)** - Resource federation and registration

View file

@ -0,0 +1,26 @@
---
title: Resources
description: Resource federation and registration in Llama Stack
sidebar_label: Resources
sidebar_position: 6
---
# Resources
Some of these APIs are associated with a set of **Resources**. Here is the mapping of APIs to resources:
- **Inference**, **Eval** and **Post Training** are associated with `Model` resources.
- **Safety** is associated with `Shield` resources.
- **Tool Runtime** is associated with `ToolGroup` resources.
- **DatasetIO** is associated with `Dataset` resources.
- **VectorIO** is associated with `VectorDB` resources.
- **Scoring** is associated with `ScoringFunction` resources.
- **Eval** is associated with `Model` and `Benchmark` resources.
Furthermore, we allow these resources to be **federated** across multiple providers. For example, you may have some Llama models served by Fireworks while others are served by AWS Bedrock. Regardless, they will all work seamlessly with the same uniform Inference API provided by Llama Stack.
:::tip Registering Resources
Given this architecture, it is necessary for the Stack to know which provider to use for a given resource. This means you need to explicitly _register_ resources (including models) before you can use them with the associated APIs.
:::

View file

@ -0,0 +1,315 @@
---
title: Contributing to Llama Stack
description: Guide for contributing to the Llama Stack project
sidebar_label: Overview
sidebar_position: 1
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
We want to make contributing to this project as easy and transparent as possible.
## Quick Start
<Tabs>
<TabItem value="setup" label="Development Setup">
### Set up your development environment
We use [uv](https://github.com/astral-sh/uv) to manage python dependencies and virtual environments.
Install `uv` by following this [guide](https://docs.astral.sh/uv/getting-started/installation/).
Install dependencies:
```bash
cd llama-stack
uv sync --group dev
uv pip install -e .
source .venv/bin/activate
```
:::note[Python Version]
You can use a specific version of Python with `uv` by adding the `--python <version>` flag (e.g. `--python 3.12`).
Otherwise, `uv` will automatically select a Python version according to the `requires-python` section of the `pyproject.toml`.
For more info, see the [uv docs around Python versions](https://docs.astral.sh/uv/concepts/python-versions/).
:::
### Environment Configuration
Create a `.env` file with necessary environment variables:
```bash
LLAMA_STACK_BASE_URL=http://localhost:8321
LLAMA_STACK_CLIENT_LOG=debug
LLAMA_STACK_PORT=8321
LLAMA_STACK_CONFIG=<provider-name>
TAVILY_SEARCH_API_KEY=
BRAVE_SEARCH_API_KEY=
```
Use with integration tests:
```bash
uv run --env-file .env -- pytest -v tests/integration/inference/test_text_inference.py --text-model=meta-llama/Llama-3.1-8B-Instruct
```
</TabItem>
<TabItem value="pre-commit" label="Pre-commit Hooks">
### Pre-commit Hooks
We use [pre-commit](https://pre-commit.com/) to run linting and formatting checks:
```bash
# Install pre-commit hooks (runs automatically before each commit)
uv run pre-commit install
# Run checks manually
uv run pre-commit run --all-files
```
:::caution
Before pushing your changes, make sure that the pre-commit hooks have passed successfully.
:::
</TabItem>
</Tabs>
## Contributing Workflow
### Discussions → Issues → Pull Requests
We actively welcome your pull requests. If in doubt, please open a [discussion](https://github.com/meta-llama/llama-stack/discussions); we can always convert that to an issue later.
<Tabs>
<TabItem value="questions" label="I have a question!">
1. Open a discussion or use [Discord](https://discord.gg/llama-stack)
</TabItem>
<TabItem value="bugs" label="I have a bug!">
1. Search the issue tracker and discussions for similar issues
2. If you don't have steps to reproduce, open a discussion
3. If you have steps to reproduce, open an issue
</TabItem>
<TabItem value="features" label="I have an idea for a feature!">
1. Open a discussion
</TabItem>
<TabItem value="contributions" label="I'd like to contribute!">
If you are new to the project, start by looking at issues tagged with "good first issue". Leave a comment and a triager will assign it to you.
**Guidelines:**
- Work on only 12 issues at a time
- Check if issues are already assigned
- If blocked, unassign yourself or leave a comment
</TabItem>
</Tabs>
### Opening a Pull Request
1. Fork the repo and create your branch from `main`
2. If you've changed APIs, update the documentation
3. Ensure the test suite passes
4. Make sure your code lints using `pre-commit`
5. Complete the Contributor License Agreement ("CLA")
6. Follow [conventional commits format](https://www.conventionalcommits.org/en/v1.0.0/)
7. Follow the [coding style guidelines](#coding-style)
Keep PRs small and focused. Split large changes into logically grouped, smaller PRs.
:::tip[PR Guidelines]
- **Experienced contributors**: No more than 5 open PRs at a time
- **New contributors**: One open PR at a time until familiar with the process
:::
## Adding New Providers
Learn how to extend Llama Stack with new capabilities:
- **[Adding a New API Provider](./new-api-provider)** - Add new API providers to the Stack
- **[Adding a Vector Database](./new-vector-database)** - Add new vector databases
- **[External Providers](/docs/providers/external)** - Add external providers to the Stack
## Testing
Llama Stack uses two types of tests:
| Type | Location | Purpose |
|------|----------|---------|
| **Unit** | `tests/unit/` | Fast, isolated component testing |
| **Integration** | `tests/integration/` | End-to-end workflows with record-replay |
### Testing Philosophy
For unit tests, create minimal mocks and rely more on "fakes". Mocks are too brittle. Tests must be very fast and reliable.
### Record-Replay for Integration Tests
Testing AI applications end-to-end creates challenges:
- **API costs** accumulate quickly during development and CI
- **Non-deterministic responses** make tests unreliable
- **Multiple providers** require testing the same logic across different APIs
Our solution: **Record real API responses once, replay them for fast, deterministic tests.**
Benefits:
- **Cost control** - No repeated API calls during development
- **Speed** - Instant test execution with cached responses
- **Reliability** - Consistent results regardless of external service state
- **Provider coverage** - Same tests work across OpenAI, Anthropic, local models, etc.
### Running Tests
<Tabs>
<TabItem value="unit" label="Unit Tests">
```bash
uv run --group unit pytest -sv tests/unit/
```
</TabItem>
<TabItem value="integration" label="Integration Tests">
**Prerequisites:**
- A stack config (various formats supported):
- `server:<config>` - auto-start server (e.g., `server:starter`)
- `server:<config>:<port>` - custom port (e.g., `server:starter:8322`)
- URL pointing to a Llama Stack server
- Distribution name or path to `run.yaml`
- Comma-separated API=provider pairs
**Run tests in replay mode:**
```bash
uv run --group test \
pytest -sv tests/integration/ --stack-config=starter
```
</TabItem>
<TabItem value="recording" label="Re-recording Tests">
**Local Re-recording (Manual Setup Required):**
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record \
uv run --group test \
pytest -sv tests/integration/ --stack-config=starter -k "<test name>"
```
:::warning[CI Compatibility]
When re-recording locally, you must match the CI setup:
- Ollama running with specific models
- Using the `starter` distribution
:::
**Remote Re-recording (Recommended):**
```bash
# Record tests for specific subdirectories
./scripts/github/schedule-record-workflow.sh --test-subdirs "agents,inference"
# Record with vision tests
./scripts/github/schedule-record-workflow.sh --test-suite vision
# Record with specific provider
./scripts/github/schedule-record-workflow.sh --test-subdirs "agents" --test-provider vllm
```
**Prerequisites:**
- GitHub CLI: `brew install gh && gh auth login`
- jq: `brew install jq`
- Your branch pushed to a remote
</TabItem>
</Tabs>
## Common Development Tasks
### Using `llama stack build`
Building a stack image uses production packages. For development, set environment variables:
```bash
cd work/
git clone https://github.com/meta-llama/llama-stack.git
git clone https://github.com/meta-llama/llama-stack-client-python.git
cd llama-stack
LLAMA_STACK_DIR=$(pwd) LLAMA_STACK_CLIENT_DIR=../llama-stack-client-python llama stack build --distro <...>
```
### Updating Configurations
**Distribution configurations:**
```bash
./scripts/distro_codegen.py
```
Don't manually edit `docs/source/.../distributions/` files - they're auto-generated.
**Provider documentation:**
```bash
./scripts/provider_codegen.py
```
Don't manually edit `docs/source/.../providers/` files - they're auto-generated.
### Building Documentation
```bash
# Rebuild documentation pages
uv run --group docs make -C docs/ html
# Start local server with auto-rebuild (usually at http://127.0.0.1:8000)
uv run --group docs sphinx-autobuild docs/source docs/build/html --write-all
```
### Update API Documentation
If you modify API endpoints:
```bash
uv run ./docs/openapi_generator/run_openapi_generator.sh
```
Generated documentation will be in `docs/_static/`. Review changes before committing.
## Coding Style
- **Comments**: Provide meaningful insights, avoid filler comments
- **Exceptions**: Use specific exception types, not broad catch-alls
- **Error messages**: Prefix with "Failed to ..."
- **Indentation**: 4 spaces, not tabs
- **`# noqa` usage**: Include comment explaining justification
- **`# type: ignore` usage**: Include comment explaining justification
- **Character encoding**: ASCII-only preferred
- **Provider config**: Use Pydantic Field class with `description` field
- **Function calls**: Use keyword arguments when possible
- **Custom exceptions**: Use [custom Exception classes](llama_stack/apis/common/errors.py) where applicable
## Legal
### Issues and Security
We use GitHub issues to track public bugs. For security bugs, use Meta's [bounty program](http://facebook.com/whitehat/info).
### Contributor License Agreement
Complete your CLA here: [https://code.facebook.com/cla](https://code.facebook.com/cla)
### License
By contributing to Llama Stack, you agree that your contributions will be licensed under the LICENSE file in the root directory.
## Advanced Topics
- **[Testing Record-Replay System](./testing-record-replay)** - Deep dive into testing internals
## Related Resources
- **[Adding API Providers](./new-api-provider)** - Extend Llama Stack with new providers
- **[Vector Database Integration](./new-vector-database)** - Add vector database support
- **[External Providers](/docs/providers/external)** - External provider development
- **[GitHub Discussions](https://github.com/meta-llama/llama-stack/discussions)** - Community discussion
- **[Discord](https://discord.gg/llama-stack)** - Real-time community chat

View file

@ -0,0 +1,283 @@
---
title: Adding a New API Provider
description: Guide for adding new API providers to Llama Stack
sidebar_label: New API Provider
sidebar_position: 2
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
This guide will walk you through the process of adding a new API provider to Llama Stack.
## Getting Started
Before implementing your provider, complete these preparatory steps:
<Tabs>
<TabItem value="planning" label="Planning">
**1. Review Core Concepts**
- Study the [core concepts](/docs/concepts) of Llama Stack
- Choose the API your provider belongs to (Inference, Safety, VectorIO, etc.)
**2. Determine Provider Type**
- **Remote providers**: Make requests to external services
- **Inline providers**: Execute implementation locally
</TabItem>
<TabItem value="implementation" label="Implementation Steps">
**3. Add Provider to Registry**
- Add your provider to the appropriate registry in `llama_stack/providers/registry/`
- Specify pip dependencies necessary for your provider
**4. Update Distribution Templates**
- Update `build.yaml` and `run.yaml` files in `llama_stack/distributions/` if they should include your provider by default
- Run `./scripts/distro_codegen.py` if necessary
:::note[Distribution Compatibility]
`distro_codegen.py` will fail if the new provider causes any distribution template to attempt to import provider-specific dependencies. The distribution's `get_distribution_template()` code path should only import Config or model alias definitions from each provider, not the provider's actual implementation.
:::
</TabItem>
</Tabs>
## Example Implementations
Study these example PRs to understand the implementation patterns:
- **[Grok Inference Implementation](https://github.com/meta-llama/llama-stack/pull/609)** - OpenAI-compatible inference provider
- **[Nvidia Inference Implementation](https://github.com/meta-llama/llama-stack/pull/355)** - Remote inference service integration
- **[Model Context Protocol Tool Runtime](https://github.com/meta-llama/llama-stack/pull/665)** - Tool runtime provider
## Provider Types: Internal vs External
| **Type** | **Internal (In-tree)** | **External (Out-of-tree)** |
|----------|------------------------|---------------------------|
| **Description** | Provider directly in Llama Stack code | Provider outside core codebase but accessible by Llama Stack |
| **Benefits** | Minimal configuration, direct integration | Separate provider code, no core changes needed |
| **Use Cases** | Core functionality, widely-used services | Specialized services, experimental providers |
## Inference Provider Patterns
When implementing Inference providers for OpenAI-compatible APIs, Llama Stack provides mixin classes to simplify development.
### OpenAIMixin
The `OpenAIMixin` class provides OpenAI API functionality for providers that work with OpenAI-compatible endpoints.
<Tabs>
<TabItem value="features" label="Features">
**Direct API Methods:**
- `openai_completion()`: Legacy text completion API with full parameter support
- `openai_chat_completion()`: Chat completion API supporting streaming, tools, and function calling
- `openai_embeddings()`: Text embeddings generation with customizable encoding and dimensions
**Model Management:**
- `check_model_availability()`: Queries the API endpoint to verify model existence and accessibility
**Client Management:**
- `client` property: Automatically creates and configures AsyncOpenAI client instances using your provider's credentials
</TabItem>
<TabItem value="implementation" label="Implementation">
To use `OpenAIMixin`, your provider must implement these abstract methods:
```python
from abc import abstractmethod
class YourProvider(OpenAIMixin):
@abstractmethod
def get_api_key(self) -> str:
"""Return the API key for authentication"""
pass
@abstractmethod
def get_base_url(self) -> str:
"""Return the OpenAI-compatible API base URL"""
pass
# Your provider-specific implementation
async def completion(self, request):
return await self.openai_completion(request)
async def chat_completion(self, request):
return await self.openai_chat_completion(request)
```
</TabItem>
</Tabs>
## Testing Your Provider
Comprehensive testing ensures your provider works correctly across different scenarios.
### Prerequisites
Install required dependencies for your provider:
```bash
llama stack build --distro <your-distribution>
```
### Testing Levels
<Tabs>
<TabItem value="integration" label="Integration Testing">
**Location:** `tests/integration/`
**Purpose:** Test functionality using python client-SDK APIs from the `llama_stack_client` package.
**Configuration:** Each provider's `sample_run_config()` method references environment variables for API keys. Set these in the environment or pass via the `--env` flag.
**Running Tests:**
```bash
# Set environment variables
export YOUR_PROVIDER_API_KEY=your-key-here
# Run integration tests
uv run --group test pytest tests/integration/ --stack-config=<your-config>
```
For details, see `tests/integration/README.md`.
</TabItem>
<TabItem value="unit" label="Unit Testing">
**Location:** `tests/unit/providers/`
**Purpose:** Fast, isolated testing of provider components.
**Running Tests:**
```bash
uv run --group unit pytest tests/unit/providers/
```
These tests run automatically as part of the CI process. For details, see `tests/unit/README.md`.
</TabItem>
<TabItem value="e2e" label="End-to-End Testing">
**Manual Validation:**
1. **Start Llama Stack Server**
```bash
llama stack run <your-distribution>
```
2. **Test with Client Scripts**
- Use existing client scripts in [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main)
- Verify compatibility with your provider
- Document which scripts work with your provider
3. **Validate Core Functionality**
- Test all supported API methods
- Verify error handling and edge cases
- Confirm streaming behavior (if applicable)
</TabItem>
</Tabs>
## Implementation Best Practices
### Configuration Management
```python
from pydantic import BaseModel, Field
class YourProviderConfig(BaseModel):
api_key: str = Field(
description="API key for authentication with your service"
)
base_url: str = Field(
default="https://api.yourservice.com/v1",
description="Base URL for the API endpoint"
)
timeout: int = Field(
default=30,
description="Request timeout in seconds"
)
```
### Error Handling
```python
from llama_stack.apis.common.errors import ProviderError
async def your_api_method(self, request):
try:
response = await self.client.your_api_call(request)
return response
except Exception as e:
raise ProviderError(f"Failed to call your API: {str(e)}")
```
### Logging
```python
import logging
logger = logging.getLogger(__name__)
class YourProvider:
async def your_method(self, request):
logger.debug(f"Processing request: {request}")
# Implementation
logger.info("Request processed successfully")
```
## Submitting Your Pull Request
### Pre-submission Checklist
- [ ] **All tests pass** - Both unit and integration tests
- [ ] **Code follows style guidelines** - Pre-commit hooks pass
- [ ] **Documentation updated** - Provider configuration documented
- [ ] **Distribution templates updated** - If provider should be included by default
- [ ] **Example usage provided** - Working example in PR description
### PR Content
**Include in your PR:**
1. **Comprehensive test plan** describing:
- Test scenarios covered
- Any manual testing performed
- Compatibility with existing client scripts
2. **Documentation updates** including:
- Provider configuration options
- Environment variables required
- Known limitations or considerations
3. **Example configuration** showing:
- How to configure your provider
- Sample API keys or endpoints (redacted)
- Integration with distributions
## Troubleshooting
### Common Issues
**Import Errors:**
- Ensure dependencies are properly listed in registry
- Check that distribution templates don't import implementation code directly
**Test Failures:**
- Verify API keys and environment variables are set correctly
- Check that your provider's `sample_run_config()` method is properly implemented
**Configuration Issues:**
- Ensure Pydantic models have proper `description` fields
- Verify that configuration validation works as expected
## Related Resources
- **[Core Concepts](/docs/concepts)** - Understanding Llama Stack architecture
- **[External Providers](/docs/providers/external)** - Alternative implementation approach
- **[Vector Database Guide](./new-vector-database)** - Specialized provider implementation
- **[Testing Record-Replay](./testing-record-replay)** - Advanced testing techniques

View file

@ -0,0 +1,493 @@
---
title: Adding a New Vector Database
description: Guide for adding new vector database providers to Llama Stack
sidebar_label: New Vector Database
sidebar_position: 3
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
This guide will walk you through the process of adding a new vector database provider to Llama Stack.
:::note[Example Implementation]
See the [Milvus Vector Database Provider](https://github.com/meta-llama/llama-stack/pull/1467) Pull Request for a complete implementation example.
:::
## Overview
Vector Database providers are used to store and retrieve vector embeddings. They're not limited to vector search but can support:
- **Vector search** - Semantic similarity search using embeddings
- **Keyword search** - Traditional text-based search
- **Hybrid search** - Combining vector and keyword search
- **Operations** - Filtering, sorting, and aggregating vectors
## Implementation Steps
### Step 1: Choose Database Type
Determine your vector database deployment model:
<Tabs>
<TabItem value="remote" label="Remote Provider">
**Remote databases** make requests to external services:
- External hosted services (Pinecone, Weaviate Cloud)
- Self-hosted services running on different infrastructure
- Requires network communication and authentication
</TabItem>
<TabItem value="inline" label="Inline Provider">
**Inline databases** execute locally:
- Embedded databases (SQLite, DuckDB)
- Local instances (ChromaDB, FAISS)
- Direct library integration
</TabItem>
<TabItem value="both" label="Hybrid Provider">
**Both remote and inline** support:
- Services that can run locally or remotely
- Different connection modes for the same technology
- Example: ChromaDB can run inline or as a remote service
</TabItem>
</Tabs>
### Step 2: Implement the Provider
Create a new provider class with two main components:
#### Vector Index Implementation
Implement `YourVectorIndex` with these required methods:
<Tabs>
<TabItem value="index-core" label="Core Methods">
```python
class YourVectorIndex:
async def create(self, vector_db_id: str, embedding_dimension: int, **kwargs):
"""Create a new vector index"""
pass
async def initialize(self):
"""Initialize the vector index connection"""
pass
async def add_chunks(self, chunks: List[Chunk]) -> List[str]:
"""Add vector chunks to the index"""
pass
async def delete_chunk(self, chunk_ids: List[str]):
"""Delete chunks by their IDs"""
pass
```
</TabItem>
<TabItem value="index-search" label="Search Methods">
```python
async def query_vector(self, embedding: List[float], k: int = 10, **kwargs):
"""Perform vector similarity search"""
pass
async def query_keyword(self, query: str, k: int = 10, **kwargs):
"""Perform keyword-based search"""
pass
async def query_hybrid(self,
embedding: List[float],
query: str,
k: int = 10,
**kwargs):
"""Perform hybrid vector + keyword search"""
pass
```
</TabItem>
</Tabs>
#### Vector IO Adapter Implementation
Implement `YourVectorIOAdapter` with these required methods:
<Tabs>
<TabItem value="adapter-lifecycle" label="Lifecycle Methods">
```python
class YourVectorIOAdapter:
async def initialize(self):
"""Initialize the adapter and establish connections"""
pass
async def shutdown(self):
"""Clean up resources and close connections"""
pass
```
</TabItem>
<TabItem value="adapter-management" label="Database Management">
```python
async def list_vector_dbs(self) -> List[VectorDB]:
"""List all available vector databases"""
pass
async def register_vector_db(self, vector_db: VectorDB):
"""Register a new vector database"""
pass
async def unregister_vector_db(self, vector_db_id: str):
"""Unregister a vector database"""
pass
```
</TabItem>
<TabItem value="adapter-operations" label="Data Operations">
```python
async def insert_chunks(self, vector_db_id: str, chunks: List[Chunk]):
"""Insert chunks into the specified vector database"""
pass
async def query_chunks(self, vector_db_id: str, query: VectorQuery):
"""Query chunks from the specified vector database"""
pass
async def delete_chunks(self, vector_db_id: str, chunk_ids: List[str]):
"""Delete chunks from the specified vector database"""
pass
```
</TabItem>
</Tabs>
### Step 3: Add to Registry
Register your provider in `llama_stack/providers/registry/vector_io.py`:
<Tabs>
<TabItem value="inline-registration" label="Inline Provider">
```python
from llama_stack.providers.registry.specs import InlineProviderSpec
from llama_stack.providers.registry.api import Api
InlineProviderSpec(
api=Api.vector_io,
provider_type="inline::milvus",
pip_packages=["pymilvus>=2.4.10"],
module="llama_stack.providers.inline.vector_io.milvus",
config_class="llama_stack.providers.inline.vector_io.milvus.MilvusVectorIOConfig",
api_dependencies=[Api.inference],
optional_api_dependencies=[Api.files],
description="Milvus vector database for high-performance similarity search",
)
```
</TabItem>
<TabItem value="remote-registration" label="Remote Provider">
```python
from llama_stack.providers.registry.specs import RemoteProviderSpec
RemoteProviderSpec(
api=Api.vector_io,
provider_type="remote::pinecone",
pip_packages=["pinecone-client>=2.0.0"],
module="llama_stack.providers.remote.vector_io.pinecone",
config_class="llama_stack.providers.remote.vector_io.pinecone.PineconeConfig",
api_dependencies=[Api.inference],
description="Pinecone cloud vector database service",
)
```
</TabItem>
</Tabs>
### Step 4: Add Tests
Comprehensive testing ensures your provider works correctly:
#### Unit Tests Configuration
<Tabs>
<TabItem value="conftest" label="Test Configuration">
Update `/tests/unit/providers/vector_io/conftest.py`:
```python
# 1. Add your provider to the vector_provider fixture
@pytest.fixture
def vector_provider():
return {
# ... existing providers
"your_vectorprovider": "inline::your_vectorprovider",
}
# 2. Create your vector index fixture
@pytest.fixture
async def your_vectorprovider_index():
config = YourVectorProviderConfig(
# Your test configuration
)
index = YourVectorIndex(config)
await index.initialize()
yield index
# Cleanup if needed
# 3. Create your adapter fixture
@pytest.fixture
async def your_vectorprovider_adapter():
config = YourVectorProviderConfig(
# Your test configuration
)
adapter = YourVectorIOAdapter(config)
await adapter.initialize()
yield adapter
await adapter.shutdown()
# 4. Add to vector_io_providers fixture
@pytest.fixture
def vector_io_providers():
return {
# ... existing providers
"your_vectorprovider": {
"index": "your_vectorprovider_index",
"adapter": "your_vectorprovider_adapter",
}
}
```
</TabItem>
<TabItem value="naming" label="Naming Convention">
Follow the naming convention for fixtures:
- Index fixture: `{provider_name}_index`
- Adapter fixture: `{provider_name}_adapter`
This naming is required for the automated tests to execute properly.
</TabItem>
</Tabs>
#### Integration Tests
<Tabs>
<TabItem value="vector-io-tests" label="Core Vector IO Tests">
**Location:** `tests/integration/vector_io/test_vector_io.py`
**Tests:** Registration, insertion, and retrieval functionality
**No changes needed** - tests run automatically for all registered providers
</TabItem>
<TabItem value="openai-tests" label="OpenAI Compatibility Tests">
**Location:** `tests/integration/vector_io/test_openai_vector_stores.py`
Update skip conditions if your provider supports OpenAI compatibility:
```python
def skip_if_provider_doesnt_support_openai_vector_stores(provider_id):
unsupported = [
# Remove your provider from this list if it supports OpenAI vector stores
"your_vectorprovider", # Remove this line if supported
]
# ... rest of function
def skip_if_provider_doesnt_support_openai_vector_stores_search(provider_id):
unsupported = [
# Remove your provider from this list if it supports search
"your_vectorprovider", # Remove this line if supported
]
# ... rest of function
```
</TabItem>
<TabItem value="ci-tests" label="CI Configuration">
Update `.github/workflows/integration-vector-io-tests.yml`:
```yaml
# Add your provider to the test matrix
strategy:
matrix:
provider:
- chroma
- faiss
- your_vectorprovider # Add your provider here
# If remote provider, add container setup
services:
your_vectorprovider:
image: your-provider/image:latest
ports:
- 8080:8080
env:
YOUR_ENV_VAR: value
```
</TabItem>
</Tabs>
### Step 5: Update Dependencies
<Tabs>
<TabItem value="inline-deps" label="Inline Provider Dependencies">
For inline providers, update the `unit` group:
```bash
uv add your_pip_package --group unit
```
</TabItem>
<TabItem value="remote-deps" label="Remote Provider Dependencies">
For remote providers, update the `test` group (used in CI):
```bash
uv add your_pip_package --group test
```
</TabItem>
</Tabs>
### Step 6: Update Documentation
Generate and update provider documentation:
<Tabs>
<TabItem value="generate-docs" label="Generate Documentation">
```bash
# Generate provider documentation
./scripts/provider_codegen.py
```
</TabItem>
<TabItem value="update-registry" label="Update Registry Description">
Update the description in your registry entry:
```python
InlineProviderSpec(
# ... other fields
description="Your vector database provider description. Explain key features, use cases, and any special capabilities.",
)
```
</TabItem>
</Tabs>
## Configuration Best Practices
### Provider Configuration Class
```python
from pydantic import BaseModel, Field
from typing import Optional
class YourVectorProviderConfig(BaseModel):
host: str = Field(
default="localhost",
description="Host address for the vector database"
)
port: int = Field(
default=19530,
description="Port number for database connection"
)
api_key: Optional[str] = Field(
default=None,
description="API key for authentication (if required)"
)
collection_prefix: str = Field(
default="llama_stack_",
description="Prefix for collection names"
)
```
### Error Handling
```python
from llama_stack.apis.common.errors import ProviderError
class YourVectorIndex:
async def add_chunks(self, chunks):
try:
# Your implementation
return chunk_ids
except YourDatabaseException as e:
raise ProviderError(f"Failed to add chunks to vector database: {str(e)}")
```
## Testing Your Implementation
### Local Testing
```bash
# Run unit tests
uv run --group unit pytest tests/unit/providers/vector_io/
# Run integration tests
uv run --group test pytest tests/integration/vector_io/ --stack-config=starter
```
### Manual Validation
```python
# Test your provider manually
from your_provider import YourVectorIOAdapter
config = YourVectorProviderConfig(host="localhost", port=8080)
adapter = YourVectorIOAdapter(config)
await adapter.initialize()
# Test your methods...
await adapter.shutdown()
```
## Common Implementation Patterns
### Connection Management
```python
class YourVectorIndex:
def __init__(self, config):
self.config = config
self._client = None
async def initialize(self):
self._client = await create_client(self.config)
async def _ensure_connected(self):
if not self._client:
await self.initialize()
```
### Batch Operations
```python
async def add_chunks(self, chunks: List[Chunk]) -> List[str]:
batch_size = 100
all_ids = []
for i in range(0, len(chunks), batch_size):
batch = chunks[i:i+batch_size]
batch_ids = await self._add_batch(batch)
all_ids.extend(batch_ids)
return all_ids
```
## Related Resources
- **[Vector IO Providers](/docs/providers/vector_io)** - Existing provider implementations
- **[Core Concepts](/docs/concepts)** - Understanding Llama Stack architecture
- **[New API Provider Guide](./new-api-provider)** - General provider development
- **[Testing Guide](./testing-record-replay)** - Advanced testing techniques

View file

@ -0,0 +1,432 @@
---
title: Record-Replay Testing System
description: Understanding how Llama Stack captures and replays API interactions for testing
sidebar_label: Testing Record-Replay
sidebar_position: 4
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
Understanding how Llama Stack captures and replays API interactions for reliable, cost-effective testing.
## Overview
The record-replay system solves a fundamental challenge in AI testing: **How do you test against expensive, non-deterministic APIs without breaking the bank or dealing with flaky tests?**
**The solution:** Intercept API calls, store real responses, and replay them later. This gives you real API behavior without the cost or variability.
## System Architecture
### Request Hashing
Every API request gets converted to a deterministic hash for lookup:
```python
def normalize_request(method: str, url: str, headers: dict, body: dict) -> str:
normalized = {
"method": method.upper(),
"endpoint": urlparse(url).path, # Just the path, not full URL
"body": body, # Request parameters
}
return hashlib.sha256(json.dumps(normalized, sort_keys=True).encode()).hexdigest()
```
:::warning[Precise Hashing]
The hashing is intentionally precise. Different whitespace, float precision, or parameter order produces different hashes. This prevents subtle bugs from false cache hits.
```python
# These produce DIFFERENT hashes:
{"content": "Hello world"}
{"content": "Hello world\n"}
{"temperature": 0.7}
{"temperature": 0.7000001}
```
:::
### Client Interception
The system patches OpenAI and Ollama client methods to intercept calls before they leave your application. This happens transparently - your test code doesn't change.
### Storage Architecture
Recordings are stored as JSON files in the recording directory, looked up by their request hash:
```
recordings/
└── responses/
├── abc123def456.json # Individual response files
└── def789ghi012.json
```
**JSON files** store complete request/response pairs in human-readable format for debugging.
## Recording Modes
The system supports three distinct modes for different testing scenarios:
<Tabs>
<TabItem value="live" label="LIVE Mode">
**Direct API calls** with no recording or replay:
```python
with inference_recording(mode=InferenceMode.LIVE):
response = await client.chat.completions.create(...)
```
**Use for:**
- Initial development and debugging
- Testing against real APIs
- Validating new functionality
</TabItem>
<TabItem value="record" label="RECORD Mode">
**Captures API interactions** while passing through real responses:
```python
with inference_recording(mode=InferenceMode.RECORD, storage_dir="./recordings"):
response = await client.chat.completions.create(...)
# Real API call made, response captured AND returned
```
**The recording process:**
1. Request intercepted and hashed
2. Real API call executed
3. Response captured and serialized
4. Recording stored to disk
5. Original response returned to caller
</TabItem>
<TabItem value="replay" label="REPLAY Mode">
**Returns stored responses** instead of making API calls:
```python
with inference_recording(mode=InferenceMode.REPLAY, storage_dir="./recordings"):
response = await client.chat.completions.create(...)
# No API call made, cached response returned instantly
```
**The replay process:**
1. Request intercepted and hashed
2. Hash looked up in SQLite index
3. Response loaded from JSON file
4. Response deserialized and returned
5. Error if no recording found
</TabItem>
</Tabs>
## Streaming Support
Streaming APIs present a unique challenge: how do you capture an async generator?
### The Challenge
```python
# How do you record this?
async for chunk in client.chat.completions.create(stream=True):
process(chunk)
```
### The Solution
The system captures all chunks immediately before yielding any:
<Tabs>
<TabItem value="capture" label="Stream Capture">
```python
async def handle_streaming_record(response):
# Capture complete stream first
chunks = []
async for chunk in response:
chunks.append(chunk)
# Store complete recording
storage.store_recording(
request_hash,
request_data,
{"body": chunks, "is_streaming": True}
)
# Return generator that replays captured chunks
async def replay_stream():
for chunk in chunks:
yield chunk
return replay_stream()
```
</TabItem>
<TabItem value="benefits" label="Benefits">
This approach ensures:
- **Complete capture** - The entire stream is saved atomically
- **Interface preservation** - The returned object behaves like the original API
- **Deterministic replay** - Same chunks in the same order every time
- **No API changes** - Your streaming code works unchanged
</TabItem>
</Tabs>
## Serialization
API responses contain complex Pydantic objects that need careful serialization:
```python
def _serialize_response(response):
if hasattr(response, "model_dump"):
# Preserve type information for proper deserialization
return {
"__type__": f"{response.__class__.__module__}.{response.__class__.__qualname__}",
"__data__": response.model_dump(mode="json"),
}
return response
```
This preserves **type safety** - when replayed, you get the same Pydantic objects with all their validation and methods.
## Usage in Testing
### Environment Variables
Control recording behavior globally:
<Tabs>
<TabItem value="env-vars" label="Environment Variables">
```bash
# Set recording mode (default: replay)
export LLAMA_STACK_TEST_INFERENCE_MODE=replay
# Set recording directory (default: tests/integration/recordings)
export LLAMA_STACK_TEST_RECORDING_DIR=/path/to/recordings
# Run tests
pytest tests/integration/
```
</TabItem>
<TabItem value="pytest" label="Pytest Integration">
The system integrates automatically based on environment variables, requiring **no changes** to test code.
```python
# Your test code remains unchanged
async def test_chat_completion():
response = await client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": "Hello"}]
)
assert response.choices[0].message.content
```
</TabItem>
</Tabs>
### Recording New Tests
<Tabs>
<TabItem value="local-record" label="Local Recording">
```bash
# Record new interactions locally
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_new_feature.py
# Record specific test
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_file.py::test_function
```
</TabItem>
<TabItem value="remote-record" label="Remote Recording">
Use the automated GitHub workflow for easier recording:
```bash
# Record tests for specific subdirectories
./scripts/github/schedule-record-workflow.sh --test-subdirs "agents,inference"
# Record with specific provider
./scripts/github/schedule-record-workflow.sh --test-subdirs "agents" --test-provider vllm
```
</TabItem>
</Tabs>
## Debugging Recordings
### Inspecting Storage
<Tabs>
<TabItem value="sqlite" label="SQLite Queries">
```bash
# See what's recorded
sqlite3 recordings/index.sqlite "SELECT endpoint, model, timestamp FROM recordings LIMIT 10;"
# Find recordings by endpoint
sqlite3 recordings/index.sqlite "SELECT * FROM recordings WHERE endpoint='/v1/chat/completions';"
# Check for specific model
sqlite3 recordings/index.sqlite "SELECT * FROM recordings WHERE model='gpt-3.5-turbo';"
```
</TabItem>
<TabItem value="json" label="JSON Inspection">
```bash
# View specific response
cat recordings/responses/abc123def456.json | jq '.response.body'
# Compare request details
cat recordings/responses/abc123.json | jq '.request'
# Pretty print entire recording
cat recordings/responses/abc123.json | jq '.'
```
</TabItem>
</Tabs>
### Common Issues
<Tabs>
<TabItem value="hash-mismatch" label="Hash Mismatches">
**Problem:** Request parameters changed slightly between record and replay
**Solution:**
```bash
# Compare request details
cat recordings/responses/abc123.json | jq '.request'
# Re-record with updated parameters
rm recordings/responses/failing_hash.json
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_failing.py
```
</TabItem>
<TabItem value="serialization" label="Serialization Errors">
**Problem:** Response types changed between versions
**Solution:**
```bash
# Re-record with updated types
rm recordings/responses/failing_hash.json
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_failing.py
```
</TabItem>
<TabItem value="missing" label="Missing Recordings">
**Problem:** New test or changed parameters
**Solution:**
```bash
# Record the missing interaction
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_new.py
```
</TabItem>
</Tabs>
## Design Decisions
### Why Not Mocks?
Traditional mocking breaks down with AI APIs because:
- **Complex structures** - Response structures are complex and evolve frequently
- **Streaming behavior** - Hard to mock streaming responses correctly
- **Edge cases** - Real API edge cases get missed in mocks
- **Maintenance burden** - Mocks become brittle and hard to maintain
### Why Precise Hashing?
Loose hashing (normalizing whitespace, rounding floats) seems convenient but **hides bugs**. If a test changes slightly, you want to know about it rather than accidentally getting the wrong cached response.
### Why JSON + SQLite?
The hybrid storage approach provides the best of both worlds:
- **JSON** - Human readable, diff-friendly, easy to inspect and modify
- **SQLite** - Fast indexed lookups without loading response bodies
- **Combined** - Optimal for both performance and debugging
## Advanced Usage
### Custom Recording Contexts
```python
# Record specific API calls only
with inference_recording(
mode=InferenceMode.RECORD,
storage_dir="./custom_recordings",
filter_endpoints=["/v1/chat/completions"]
):
# Only chat completions will be recorded
response = await client.chat.completions.create(...)
```
### Conditional Recording
```python
# Record only if not exists
mode = InferenceMode.REPLAY
if not recording_exists(request_hash):
mode = InferenceMode.RECORD
with inference_recording(mode=mode):
response = await client.chat.completions.create(...)
```
### Recording Validation
```python
# Validate recordings during CI
def validate_recordings():
for recording_file in glob("recordings/responses/*.json"):
with open(recording_file) as f:
data = json.load(f)
assert "request" in data
assert "response" in data
# Additional validation...
```
## Best Practices
### 🎯 **Recording Strategy**
- Record comprehensive test scenarios once
- Use replay mode for regular development
- Re-record when API contracts change
- Keep recordings in version control
### 🔧 **Development Workflow**
- Start with LIVE mode for new features
- Switch to RECORD when tests are stable
- Use REPLAY for fast iteration
- Re-record when responses change
### 🚨 **Debugging Tips**
- Inspect JSON files for response details
- Use SQLite queries to find specific recordings
- Compare request hashes when tests fail
- Clear recordings to force re-recording
### 📊 **Maintenance**
- Regularly review and clean old recordings
- Update recordings when API versions change
- Document which recordings are critical
- Monitor recording file sizes
## Related Resources
- **[Contributing Overview](./index)** - General contribution guidelines
- **[Testing Guide](/docs/testing)** - Complete testing documentation
- **[Integration Tests](https://github.com/meta-llama/llama-stack/tree/main/tests/integration)** - Example test implementations
- **[GitHub Workflows](https://github.com/meta-llama/llama-stack/tree/main/.github/workflows)** - Automated recording workflows

View file

@ -0,0 +1,189 @@
---
title: Deploying Llama Stack
description: Production deployment guides for Llama Stack in various environments
sidebar_label: Overview
sidebar_position: 1
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Deploying Llama Stack
This section provides comprehensive guides for deploying Llama Stack in production environments, focusing on scalable, reliable deployments suitable for enterprise use.
## Deployment Options
Llama Stack can be deployed in various environments depending on your requirements:
<Tabs>
<TabItem value="kubernetes" label="Kubernetes">
**Container Orchestration**
- Deploy on Kubernetes clusters (local, cloud, or on-premises)
- Supports both local development (Kind) and production (EKS, GKE, AKS)
- Includes horizontal scaling and high availability
- Integrated with vLLM for efficient model serving
**Use Cases:**
- Production environments requiring scalability
- Multi-tenant deployments
- CI/CD integration
- Auto-scaling based on demand
[**→ Kubernetes Deployment Guide**](./kubernetes)
</TabItem>
<TabItem value="cloud" label="Cloud Native">
**Managed Cloud Services**
- AWS EKS with automated setup scripts
- Integration with cloud storage and networking
- OAuth authentication support
- Load balancing and auto-scaling
**Use Cases:**
- Enterprise cloud deployments
- Teams requiring managed infrastructure
- Applications with variable traffic patterns
- Integration with existing cloud services
[**→ Cloud Deployment Details**](./kubernetes#deploying-llama-stack-server-in-aws-eks)
</TabItem>
<TabItem value="local" label="Local Development">
**Development & Testing**
- Local Kind clusters for testing
- Docker Compose setups
- Single-node deployments
- Quick prototyping environments
**Use Cases:**
- Development and testing
- Learning and experimentation
- CI/CD testing pipelines
- Resource-constrained environments
[**→ Local Setup Guide**](./kubernetes#prerequisites)
</TabItem>
</Tabs>
## Architecture Considerations
### High Availability
- **Load Balancing**: Distribute traffic across multiple Llama Stack instances
- **Failover**: Automatic failover for model serving components
- **Health Checks**: Kubernetes liveness and readiness probes
- **Data Persistence**: Persistent volumes for model storage and application data
### Scalability
- **Horizontal Scaling**: Scale Llama Stack servers based on demand
- **Model Serving**: Separate model inference from application logic
- **Resource Management**: CPU and memory limits for predictable performance
- **Auto-scaling**: Kubernetes Horizontal Pod Autoscaler (HPA) support
### Security
- **Network Policies**: Secure inter-service communication
- **Secrets Management**: Secure handling of API keys and tokens
- **RBAC**: Role-based access control for Kubernetes resources
- **Authentication**: OAuth integration for user management
## Getting Started
### Quick Start Checklist
1. **✅ Choose Your Environment**
- Local development: Kind cluster
- Production: AWS EKS, GKE, or AKS
- Hybrid: On-premises Kubernetes
2. **✅ Prepare Prerequisites**
- Kubernetes cluster access
- Container registry access
- Model access tokens (Hugging Face)
- Storage provisioning
3. **✅ Configure Resources**
- Persistent volumes for model storage
- Secrets for API tokens
- Network policies and services
- Ingress controllers (if needed)
4. **✅ Deploy Components**
- Model serving infrastructure (vLLM)
- Llama Stack server
- Supporting services
- Monitoring and logging
### Recommended Flow
```mermaid
graph TD
A[Choose Deployment Target] --> B{Environment Type}
B -->|Local/Dev| C[Kind Cluster Setup]
B -->|Production| D[Cloud K8s Setup]
C --> E[Deploy vLLM Service]
D --> E
E --> F[Configure Llama Stack]
F --> G[Deploy Llama Stack Server]
G --> H[Verify Deployment]
H --> I[Configure Monitoring]
I --> J[Production Ready]
```
## Best Practices
### 🏗️ **Infrastructure**
- Use persistent volumes for model storage to avoid re-downloading
- Separate model serving from application logic for better scaling
- Implement proper resource limits and requests
- Use namespaces to organize related resources
### 🔒 **Security**
- Store sensitive data (API keys, tokens) in Kubernetes secrets
- Use network policies to restrict inter-pod communication
- Implement proper RBAC for cluster access
- Regular security updates for base images
### 📊 **Monitoring**
- Deploy logging aggregation (ELK stack, Fluentd)
- Set up metrics collection (Prometheus, Grafana)
- Configure alerts for system health
- Monitor resource utilization and performance
### 🚀 **Performance**
- Use appropriate node types for GPU/CPU workloads
- Configure resource requests and limits
- Implement caching strategies
- Monitor and optimize model loading times
## Troubleshooting
### Common Issues
**Pod Startup Failures**
- Check resource limits and node capacity
- Verify secret and configmap references
- Review persistent volume claims
- Check image pull policies and registry access
**Model Loading Issues**
- Verify Hugging Face token permissions
- Check persistent volume size and availability
- Monitor download progress in pod logs
- Ensure network connectivity to model repositories
**Service Connectivity**
- Verify service selectors and port configurations
- Check network policies and ingress rules
- Test internal DNS resolution
- Validate load balancer configurations
## Related Resources
- **[Kubernetes Deployment Guide](./kubernetes)** - Detailed deployment instructions
- **[Distributions](/docs/distributions)** - Understanding Llama Stack distributions
- **[Configuration](/docs/distributions/configuration)** - Server configuration options
- **[Building Applications](/docs/building-applications)** - Application development guides

View file

@ -0,0 +1,248 @@
---
title: Kubernetes Deployment Guide
description: Deploy Llama Stack on Kubernetes clusters with vLLM inference service
sidebar_label: Kubernetes
sidebar_position: 2
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Kubernetes Deployment Guide
Deploy Llama Stack and vLLM servers in a Kubernetes cluster instead of running them locally. This guide covers both local development with Kind and production deployment on AWS EKS.
## Prerequisites
### Local Kubernetes Setup
Create a local Kubernetes cluster via Kind:
```bash
kind create cluster --image kindest/node:v1.32.0 --name llama-stack-test
```
Set your Hugging Face token:
```bash
export HF_TOKEN=$(echo -n "your-hf-token" | base64)
```
## Quick Deployment
### Step 1: Create Storage and Secrets
```yaml
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: vllm-models
spec:
accessModes:
- ReadWriteOnce
volumeMode: Filesystem
resources:
requests:
storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
name: hf-token-secret
type: Opaque
data:
token: $HF_TOKEN
EOF
```
### Step 2: Deploy vLLM Server
```yaml
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-server
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: vllm
template:
metadata:
labels:
app.kubernetes.io/name: vllm
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
command: ["/bin/sh", "-c"]
args: ["vllm serve meta-llama/Llama-3.2-1B-Instruct"]
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
ports:
- containerPort: 8000
volumeMounts:
- name: llama-storage
mountPath: /root/.cache/huggingface
volumes:
- name: llama-storage
persistentVolumeClaim:
claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
name: vllm-server
spec:
selector:
app.kubernetes.io/name: vllm
ports:
- protocol: TCP
port: 8000
targetPort: 8000
type: ClusterIP
EOF
```
### Step 3: Configure Llama Stack
Update your run configuration:
```yaml
providers:
inference:
- provider_id: vllm
provider_type: remote::vllm
config:
url: http://vllm-server.default.svc.cluster.local:8000/v1
max_tokens: 4096
api_token: fake
```
Build container image:
```bash
tmp_dir=$(mktemp -d) && cat >$tmp_dir/Containerfile.llama-stack-run-k8s <<EOF
FROM distribution-myenv:dev
RUN apt-get update && apt-get install -y git
RUN git clone https://github.com/meta-llama/llama-stack.git /app/llama-stack-source
ADD ./vllm-llama-stack-run-k8s.yaml /app/config.yaml
EOF
podman build -f $tmp_dir/Containerfile.llama-stack-run-k8s -t llama-stack-run-k8s $tmp_dir
```
### Step 4: Deploy Llama Stack Server
```yaml
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: llama-pvc
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: llama-stack-server
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: llama-stack
template:
metadata:
labels:
app.kubernetes.io/name: llama-stack
spec:
containers:
- name: llama-stack
image: localhost/llama-stack-run-k8s:latest
imagePullPolicy: IfNotPresent
command: ["python", "-m", "llama_stack.core.server.server", "--config", "/app/config.yaml"]
ports:
- containerPort: 5000
volumeMounts:
- name: llama-storage
mountPath: /root/.llama
volumes:
- name: llama-storage
persistentVolumeClaim:
claimName: llama-pvc
---
apiVersion: v1
kind: Service
metadata:
name: llama-stack-service
spec:
selector:
app.kubernetes.io/name: llama-stack
ports:
- protocol: TCP
port: 5000
targetPort: 5000
type: ClusterIP
EOF
```
### Step 5: Test Deployment
```bash
# Port forward and test
kubectl port-forward service/llama-stack-service 5000:5000
llama-stack-client --endpoint http://localhost:5000 inference chat-completion --message "hello, what model are you?"
```
## AWS EKS Deployment
### Prerequisites
- Set up an [EKS cluster](https://docs.aws.amazon.com/eks/latest/userguide/getting-started.html)
- Create a [GitHub OAuth app](https://docs.github.com/en/apps/oauth-apps/building-oauth-apps/creating-an-oauth-app)
- Set authorization callback URL to `http://<your-llama-stack-ui-url>/api/auth/callback/`
### Automated Deployment
```bash
export HF_TOKEN=<your-huggingface-token>
export GITHUB_CLIENT_ID=<your-github-client-id>
export GITHUB_CLIENT_SECRET=<your-github-client-secret>
export LLAMA_STACK_UI_URL=<your-llama-stack-ui-url>
cd docs/source/distributions/eks
./apply.sh
```
This script will:
- Set up default storage class for AWS EKS
- Deploy Llama Stack server in Kubernetes pods and services
## Troubleshooting
**Check pod status:**
```bash
kubectl get pods -l app.kubernetes.io/name=vllm
kubectl logs -l app.kubernetes.io/name=vllm
```
**Test service connectivity:**
```bash
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl http://vllm-server:8000/v1/models
```
## Related Resources
- **[Deployment Overview](./index)** - Overview of deployment options
- **[Distributions](/docs/distributions)** - Understanding Llama Stack distributions
- **[Configuration](/docs/distributions/configuration)** - Detailed configuration options

View file

@ -0,0 +1,455 @@
---
title: Build Your Own Distribution
description: Step-by-step guide to create custom Llama Stack distributions with your choice of API providers
sidebar_label: Building Custom Distributions
sidebar_position: 3
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Build Your Own Distribution
This guide will walk you through the steps to get started with building a Llama Stack distribution from scratch with your choice of API providers.
## Setting Your Log Level
In order to specify the proper logging level users can apply the following environment variable `LLAMA_STACK_LOGGING` with the following format:
`LLAMA_STACK_LOGGING=server=debug;core=info`
Where each category in the following list:
- all
- core
- server
- router
- inference
- agents
- safety
- eval
- tools
- client
Can be set to any of the following log levels:
- debug
- info
- warning
- error
- critical
The default global log level is `info`. `all` sets the log level for all components.
A user can also set `LLAMA_STACK_LOG_FILE` which will pipe the logs to the specified path as well as to the terminal. An example would be: `export LLAMA_STACK_LOG_FILE=server.log`
## Llama Stack Build
In order to build your own distribution, we recommend you clone the `llama-stack` repository.
```bash
git clone git@github.com:meta-llama/llama-stack.git
cd llama-stack
pip install -e .
```
Use the CLI to build your distribution. The main points to consider are:
1. **Image Type** - Do you want a venv environment or a Container (eg. Docker)
2. **Template** - Do you want to use a template to build your distribution? or start from scratch ?
3. **Config** - Do you want to use a pre-existing config file to build your distribution?
```bash
llama stack build -h
usage: llama stack build [-h] [--config CONFIG] [--template TEMPLATE] [--distro DISTRIBUTION] [--list-distros] [--image-type {container,venv}] [--image-name IMAGE_NAME] [--print-deps-only]
[--run] [--providers PROVIDERS]
Build a Llama stack container
options:
-h, --help show this help message and exit
--config CONFIG Path to a config file to use for the build. You can find example configs in llama_stack.cores/**/build.yaml. If this argument is not provided, you will be prompted to
enter information interactively (default: None)
--template TEMPLATE (deprecated) Name of the example template config to use for build. You may use `llama stack build --list-distros` to check out the available distributions (default:
None)
--distro DISTRIBUTION, --distribution DISTRIBUTION
Name of the distribution to use for build. You may use `llama stack build --list-distros` to check out the available distributions (default: None)
--list-distros, --list-distributions
Show the available distributions for building a Llama Stack distribution (default: False)
--image-type {container,venv}
Image Type to use for the build. If not specified, will use the image type from the template config. (default: None)
--image-name IMAGE_NAME
[for image-type=container|venv] Name of the virtual environment to use for the build. If not specified, currently active environment will be used if found. (default:
None)
--print-deps-only Print the dependencies for the stack only, without building the stack (default: False)
--run Run the stack after building using the same image type, name, and other applicable arguments (default: False)
--providers PROVIDERS
Build a config for a list of providers and only those providers. This list is formatted like: api1=provider1,api2=provider2. Where there can be multiple providers per
API. (default: None)
```
After this step is complete, a file named `<name>-build.yaml` and template file `<name>-run.yaml` will be generated and saved at the output file path specified at the end of the command.
## Build Methods
<Tabs>
<TabItem value="template" label="Building from a template">
To build from alternative API providers, we provide distribution templates for users to get started building a distribution backed by different providers.
The following command will allow you to see the available templates and their corresponding providers.
```bash
llama stack build --list-templates
```
```
------------------------------+-----------------------------------------------------------------------------+
| Template Name | Description |
+------------------------------+-----------------------------------------------------------------------------+
| watsonx | Use watsonx for running LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
| vllm-gpu | Use a built-in vLLM engine for running LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
| together | Use Together.AI for running LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
| tgi | Use (an external) TGI server for running LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
| starter | Quick start template for running Llama Stack with several popular providers |
+------------------------------+-----------------------------------------------------------------------------+
| sambanova | Use SambaNova for running LLM inference and safety |
+------------------------------+-----------------------------------------------------------------------------+
| remote-vllm | Use (an external) vLLM server for running LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
| postgres-demo | Quick start template for running Llama Stack with several popular providers |
+------------------------------+-----------------------------------------------------------------------------+
| passthrough | Use Passthrough hosted llama-stack endpoint for LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
| open-benchmark | Distribution for running open benchmarks |
+------------------------------+-----------------------------------------------------------------------------+
| ollama | Use (an external) Ollama server for running LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
| nvidia | Use NVIDIA NIM for running LLM inference, evaluation and safety |
+------------------------------+-----------------------------------------------------------------------------+
| meta-reference-gpu | Use Meta Reference for running LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
| llama_api | Distribution for running e2e tests in CI |
+------------------------------+-----------------------------------------------------------------------------+
| hf-serverless | Use (an external) Hugging Face Inference Endpoint for running LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
| hf-endpoint | Use (an external) Hugging Face Inference Endpoint for running LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
| groq | Use Groq for running LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
| fireworks | Use Fireworks.AI for running LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
| experimental-post-training | Experimental template for post training |
+------------------------------+-----------------------------------------------------------------------------+
| dell | Dell's distribution of Llama Stack. TGI inference via Dell's custom |
| | container |
+------------------------------+-----------------------------------------------------------------------------+
| ci-tests | Distribution for running e2e tests in CI |
+------------------------------+-----------------------------------------------------------------------------+
| cerebras | Use Cerebras for running LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
| bedrock | Use AWS Bedrock for running LLM inference and safety |
+------------------------------+-----------------------------------------------------------------------------+
```
You may then pick a template to build your distribution with providers fitted to your liking.
For example, to build a distribution with TGI as the inference provider, you can run:
```bash
$ llama stack build --distro starter
...
You can now edit ~/.llama/distributions/llamastack-starter/starter-run.yaml and run `llama stack run ~/.llama/distributions/llamastack-starter/starter-run.yaml`
```
:::tip
The generated `run.yaml` file is a starting point for your configuration. For comprehensive guidance on customizing it for your specific needs, infrastructure, and deployment scenarios, see [Customizing Your run.yaml Configuration](./customizing-run-yaml).
:::
</TabItem>
<TabItem value="scratch" label="Building from Scratch">
If the provided templates do not fit your use case, you could start off with running `llama stack build` which will allow you to a interactively enter wizard where you will be prompted to enter build configurations.
It would be best to start with a template and understand the structure of the config file and the various concepts ( APIS, providers, resources, etc.) before starting from scratch.
```bash
llama stack build
> Enter a name for your Llama Stack (e.g. my-local-stack): my-stack
> Enter the image type you want your Llama Stack to be built as (container or venv): venv
Llama Stack is composed of several APIs working together. Let's select
the provider types (implementations) you want to use for these APIs.
Tip: use <TAB> to see options for the providers.
> Enter provider for API inference: inline::meta-reference
> Enter provider for API safety: inline::llama-guard
> Enter provider for API agents: inline::meta-reference
> Enter provider for API memory: inline::faiss
> Enter provider for API datasetio: inline::meta-reference
> Enter provider for API scoring: inline::meta-reference
> Enter provider for API eval: inline::meta-reference
> Enter provider for API telemetry: inline::meta-reference
> (Optional) Enter a short description for your Llama Stack:
You can now edit ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml and run `llama stack run ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml`
```
</TabItem>
<TabItem value="config" label="Building from a pre-existing build config file">
In addition to templates, you may customize the build to your liking through editing config files and build from config files with the following command.
The config file will be of contents like the ones in `llama_stack/distributions/*build.yaml`.
```bash
llama stack build --config llama_stack/distributions/starter/build.yaml
```
</TabItem>
<TabItem value="external" label="Building with External Providers">
Llama Stack supports external providers that live outside of the main codebase. This allows you to create and maintain your own providers independently or use community-provided providers.
To build a distribution with external providers, you need to:
1. Configure the `external_providers_dir` in your build configuration file:
```yaml
# Example my-external-stack.yaml with external providers
version: '2'
distribution_spec:
description: Custom distro for CI tests
providers:
inference:
- remote::custom_ollama
# Add more providers as needed
image_type: container
image_name: ci-test
# Path to external provider implementations
external_providers_dir: ~/.llama/providers.d
```
Here's an example for a custom Ollama provider:
```yaml
adapter:
adapter_type: custom_ollama
pip_packages:
- ollama
- aiohttp
- llama-stack-provider-ollama # This is the provider package
config_class: llama_stack_ollama_provider.config.OllamaImplConfig
module: llama_stack_ollama_provider
api_dependencies: []
optional_api_dependencies: []
```
The `pip_packages` section lists the Python packages required by the provider, as well as the
provider package itself. The package must be available on PyPI or can be provided from a local
directory or a git repository (git must be installed on the build environment).
2. Build your distribution using the config file:
```bash
llama stack build --config my-external-stack.yaml
```
For more information on external providers, including directory structure, provider types, and implementation requirements, see the [External Providers documentation](/docs/providers/external/external-providers-guide).
</TabItem>
<TabItem value="container" label="Building Container">
:::tip[Podman Alternative]
Podman is supported as an alternative to Docker. Set `CONTAINER_BINARY` to `podman` in your environment to use Podman.
:::
To build a container image, you may start off from a template and use the `--image-type container` flag to specify `container` as the build image type.
```bash
llama stack build --distro starter --image-type container
```
```bash
$ llama stack build --distro starter --image-type container
...
Containerfile created successfully in /tmp/tmp.viA3a3Rdsg/ContainerfileFROM python:3.10-slim
...
```
You can now edit ~/meta-llama/llama-stack/tmp/configs/ollama-run.yaml and run `llama stack run ~/meta-llama/llama-stack/tmp/configs/ollama-run.yaml`
Now set some environment variables for the inference model ID and Llama Stack Port and create a local directory to mount into the container's file system.
```bash
export INFERENCE_MODEL="llama3.2:3b"
export LLAMA_STACK_PORT=8321
mkdir -p ~/.llama
```
After this step is successful, you should be able to find the built container image and test it with the below Docker command:
```bash
docker run -d \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
localhost/distribution-ollama:dev \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://host.docker.internal:11434
```
Here are the docker flags and their uses:
* `-d`: Runs the container in the detached mode as a background process
* `-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT`: Maps the container port to the host port for accessing the server
* `-v ~/.llama:/root/.llama`: Mounts the local .llama directory to persist configurations and data
* `localhost/distribution-ollama:dev`: The name and tag of the container image to run
* `--port $LLAMA_STACK_PORT`: Port number for the server to listen on
* `--env INFERENCE_MODEL=$INFERENCE_MODEL`: Sets the model to use for inference
* `--env OLLAMA_URL=http://host.docker.internal:11434`: Configures the URL for the Ollama service
</TabItem>
</Tabs>
## Running Your Stack Server
Now, let's start the Llama Stack Distribution Server. You will need the YAML configuration file which was written out at the end by the `llama stack build` step.
```bash
llama stack run -h
usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--env KEY=VALUE]
[--image-type {venv}] [--enable-ui]
[config | template]
Start the server for a Llama Stack Distribution. You should have already built (or downloaded) and configured the distribution.
positional arguments:
config | template Path to config file to use for the run or name of known template (`llama stack list` for a list). (default: None)
options:
-h, --help show this help message and exit
--port PORT Port to run the server on. It can also be passed via the env var LLAMA_STACK_PORT. (default: 8321)
--image-name IMAGE_NAME
Name of the image to run. Defaults to the current environment (default: None)
--env KEY=VALUE Environment variables to pass to the server in KEY=VALUE format. Can be specified multiple times. (default: None)
--image-type {venv}
Image Type used during the build. This should be venv. (default: None)
--enable-ui Start the UI server (default: False)
```
**Note:** Container images built with `llama stack build --image-type container` cannot be run using `llama stack run`. Instead, they must be run directly using Docker or Podman commands as shown in the container building section above.
```bash
# Start using template name
llama stack run tgi
# Start using config file
llama stack run ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml
# Start using a venv
llama stack run --image-type venv ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml
```
```bash
$ llama stack run ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml
Serving API inspect
GET /health
GET /providers/list
GET /routes/list
Serving API inference
POST /inference/chat_completion
POST /inference/completion
POST /inference/embeddings
...
Serving API agents
POST /agents/create
POST /agents/session/create
POST /agents/turn/create
POST /agents/delete
POST /agents/session/delete
POST /agents/session/get
POST /agents/step/get
POST /agents/turn/get
Listening on ['::', '0.0.0.0']:8321
INFO: Started server process [2935911]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
INFO: 2401:db00:35c:2d2b:face:0:c9:0:54678 - "GET /models/list HTTP/1.1" 200 OK
```
## Managing Distributions
### Listing Distributions
Using the list command, you can view all existing Llama Stack distributions, including stacks built from templates, from scratch, or using custom configuration files.
```bash
llama stack list -h
usage: llama stack list [-h]
list the build stacks
options:
-h, --help show this help message and exit
```
Example Usage:
```bash
llama stack list
```
```
------------------------------+-----------------------------------------------------------------+--------------+------------+
| Stack Name | Path | Build Config | Run Config |
+------------------------------+-----------------------------------------------------------------------------+--------------+
| together | ~/.llama/distributions/together | Yes | No |
+------------------------------+-----------------------------------------------------------------------------+--------------+
| bedrock | ~/.llama/distributions/bedrock | Yes | No |
+------------------------------+-----------------------------------------------------------------------------+--------------+
| starter | ~/.llama/distributions/starter | Yes | Yes |
+------------------------------+-----------------------------------------------------------------------------+--------------+
| remote-vllm | ~/.llama/distributions/remote-vllm | Yes | Yes |
+------------------------------+-----------------------------------------------------------------------------+--------------+
```
### Removing a Distribution
Use the remove command to delete a distribution you've previously built.
```bash
llama stack rm -h
usage: llama stack rm [-h] [--all] [name]
Remove the build stack
positional arguments:
name Name of the stack to delete (default: None)
options:
-h, --help show this help message and exit
--all, -a Delete all stacks (use with caution) (default: False)
```
Example:
```bash
llama stack rm llamastack-test
```
To keep your environment organized and avoid clutter, consider using `llama stack list` to review old or unused distributions and `llama stack rm <name>` to delete them when they're no longer needed.
## Troubleshooting
If you encounter any issues, ask questions in our discord or search through our [GitHub Issues](https://github.com/meta-llama/llama-stack/issues), or file an new issue.

View file

@ -0,0 +1,841 @@
---
title: Configuration Reference
description: Complete guide to configuring Llama Stack runtime with YAML files
sidebar_label: Configuration Reference
sidebar_position: 6
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Configuring a "Stack"
The Llama Stack runtime configuration is specified as a YAML file. Here is a simplified version of an example configuration file for the Ollama distribution:
:::note
The default `run.yaml` files generated by templates are starting points for your configuration. For guidance on customizing these files for your specific needs, see [Customizing Your run.yaml Configuration](./customizing-run-yaml).
:::
<details>
<summary>👋 Click here for a Sample Configuration File</summary>
```yaml
version: 2
apis:
- agents
- inference
- vector_io
- safety
- telemetry
providers:
inference:
- provider_id: ollama
provider_type: remote::ollama
config:
url: ${env.OLLAMA_URL:=http://localhost:11434}
vector_io:
- provider_id: faiss
provider_type: inline::faiss
config:
kvstore:
type: sqlite
namespace: null
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/ollama}/faiss_store.db
safety:
- provider_id: llama-guard
provider_type: inline::llama-guard
config: {}
agents:
- provider_id: meta-reference
provider_type: inline::meta-reference
config:
persistence_store:
type: sqlite
namespace: null
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/ollama}/agents_store.db
telemetry:
- provider_id: meta-reference
provider_type: inline::meta-reference
config: {}
metadata_store:
namespace: null
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/ollama}/registry.db
models:
- metadata: {}
model_id: ${env.INFERENCE_MODEL}
provider_id: ollama
provider_model_id: null
shields: []
server:
port: 8321
auth:
provider_config:
type: "oauth2_token"
jwks:
uri: "https://my-token-issuing-svc.com/jwks"
```
</details>
Let's break this down into the different sections. The first section specifies the set of APIs that the stack server will serve:
```yaml
apis:
- agents
- inference
- vector_io
- safety
- telemetry
```
## Providers
Next up is the most critical part: the set of providers that the stack will use to serve the above APIs. Consider the `inference` API:
```yaml
providers:
inference:
# provider_id is a string you can choose freely
- provider_id: ollama
# provider_type is a string that specifies the type of provider.
# in this case, the provider for inference is ollama and it runs remotely (outside of the distribution)
provider_type: remote::ollama
# config is a dictionary that contains the configuration for the provider.
# in this case, the configuration is the url of the ollama server
config:
url: ${env.OLLAMA_URL:=http://localhost:11434}
```
A few things to note:
- A _provider instance_ is identified with an (id, type, config) triplet.
- The id is a string you can choose freely.
- You can instantiate any number of provider instances of the same type.
- The configuration dictionary is provider-specific.
- Notice that configuration can reference environment variables (with default values), which are expanded at runtime. When you run a stack server (via docker or via `llama stack run`), you can specify `--env OLLAMA_URL=http://my-server:11434` to override the default value.
### Environment Variable Substitution
Llama Stack supports environment variable substitution in configuration values using the
`${env.VARIABLE_NAME}` syntax. This allows you to externalize configuration values and provide
different settings for different environments. The syntax is inspired by [bash parameter expansion](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html)
and follows similar patterns.
#### Basic Syntax
The basic syntax for environment variable substitution is:
```yaml
config:
api_key: ${env.API_KEY}
url: ${env.SERVICE_URL}
```
If the environment variable is not set, the server will raise an error during startup.
#### Default Values
You can provide default values using the `:=` operator:
```yaml
config:
url: ${env.OLLAMA_URL:=http://localhost:11434}
port: ${env.PORT:=8321}
timeout: ${env.TIMEOUT:=60}
```
If the environment variable is not set, the default value `http://localhost:11434` will be used.
Empty defaults are allowed so `url: ${env.OLLAMA_URL:=}` will be set to `None` if the environment variable is not set.
#### Conditional Values
You can use the `:+` operator to provide a value only when the environment variable is set:
```yaml
config:
# Only include this field if ENVIRONMENT is set
environment: ${env.ENVIRONMENT:+production}
```
If the environment variable is set, the value after `:+` will be used. If it's not set, the field
will be omitted with a `None` value.
Do not use conditional values (`${env.OLLAMA_URL:+}`) for empty defaults (`${env.OLLAMA_URL:=}`).
This will be set to `None` if the environment variable is not set.
Conditional must only be used when the environment variable is set.
#### Examples
Here are some common patterns:
```yaml
# Required environment variable (will error if not set)
api_key: ${env.OPENAI_API_KEY}
# Optional with default
base_url: ${env.API_BASE_URL:=https://api.openai.com/v1}
# Conditional field
debug_mode: ${env.DEBUG:+true}
# Optional field that becomes None if not set
optional_token: ${env.OPTIONAL_TOKEN:+}
```
#### Runtime Override
You can override environment variables at runtime when starting the server:
```bash
# Override specific environment variables
llama stack run --config run.yaml --env API_KEY=sk-123 --env BASE_URL=https://custom-api.com
# Or set them in your shell
export API_KEY=sk-123
export BASE_URL=https://custom-api.com
llama stack run --config run.yaml
```
#### Type Safety
The environment variable substitution system is type-safe:
- String values remain strings
- Empty defaults (`${env.VAR:+}`) are converted to `None` for fields that accept `str | None`
- Numeric defaults are properly typed (e.g., `${env.PORT:=8321}` becomes an integer)
- Boolean defaults work correctly (e.g., `${env.DEBUG:=false}` becomes a boolean)
## Resources
Let's look at the `models` section:
```yaml
models:
- metadata: {}
model_id: ${env.INFERENCE_MODEL}
provider_id: ollama
provider_model_id: null
model_type: llm
```
A Model is an instance of a "Resource" (see [Concepts](/docs/concepts/)) and is associated with a specific inference provider (in this case, the provider with identifier `ollama`). This is an instance of a "pre-registered" model. While we always encourage the clients to register models before using them, some Stack servers may come up a list of "already known and available" models.
What's with the `provider_model_id` field? This is an identifier for the model inside the provider's model catalog. Contrast it with `model_id` which is the identifier for the same model for Llama Stack's purposes. For example, you may want to name "llama3.2:vision-11b" as "image_captioning_model" when you use it in your Stack interactions. When omitted, the server will set `provider_model_id` to be the same as `model_id`.
If you need to conditionally register a model in the configuration, such as only when specific environment variable(s) are set, this can be accomplished by utilizing a special `__disabled__` string as the default value of an environment variable substitution, as shown below:
```yaml
models:
- metadata: {}
model_id: ${env.INFERENCE_MODEL:__disabled__}
provider_id: ollama
provider_model_id: ${env.INFERENCE_MODEL:__disabled__}
```
The snippet above will only register this model if the environment variable `INFERENCE_MODEL` is set and non-empty. If the environment variable is not set, the model will not get registered at all.
## Server Configuration
The `server` section configures the HTTP server that serves the Llama Stack APIs:
```yaml
server:
port: 8321 # Port to listen on (default: 8321)
tls_certfile: "/path/to/cert.pem" # Optional: Path to TLS certificate for HTTPS
tls_keyfile: "/path/to/key.pem" # Optional: Path to TLS key for HTTPS
cors: true # Optional: Enable CORS (dev mode) or full config object
```
### CORS Configuration
CORS (Cross-Origin Resource Sharing) can be configured in two ways:
**Local development** (allows localhost origins only):
```yaml
server:
cors: true
```
**Explicit configuration** (custom origins and settings):
```yaml
server:
cors:
allow_origins: ["https://myapp.com", "https://app.example.com"]
allow_methods: ["GET", "POST", "PUT", "DELETE"]
allow_headers: ["Content-Type", "Authorization"]
allow_credentials: true
max_age: 3600
```
When `cors: true`, the server enables secure localhost-only access for local development. For production, specify exact origins to maintain security.
### Authentication Configuration
:::info[Breaking Change (v0.2.14)]
The authentication configuration structure has changed. The previous format with `provider_type` and `config` fields has been replaced with a unified `provider_config` field that includes the `type` field. Update your configuration files accordingly.
:::
The `auth` section configures authentication for the server. When configured, all API requests must include a valid Bearer token in the Authorization header:
```
Authorization: Bearer <token>
```
The server supports multiple authentication providers:
#### OAuth 2.0/OpenID Connect Provider with Kubernetes
The server can be configured to use service account tokens for authorization, validating these against the Kubernetes API server, e.g.:
```yaml
server:
auth:
provider_config:
type: "oauth2_token"
jwks:
uri: "https://kubernetes.default.svc:8443/openid/v1/jwks"
token: "${env.TOKEN:+}"
key_recheck_period: 3600
tls_cafile: "/path/to/ca.crt"
issuer: "https://kubernetes.default.svc"
audience: "https://kubernetes.default.svc"
```
To find your cluster's jwks uri (from which the public key(s) to verify the token signature are obtained), run:
```bash
kubectl get --raw /.well-known/openid-configuration| jq -r .jwks_uri
```
For the tls_cafile, you can use the CA certificate of the OIDC provider:
```bash
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.certificate-authority}'
```
For the issuer, you can use the OIDC provider's URL:
```bash
kubectl get --raw /.well-known/openid-configuration| jq .issuer
```
The audience can be obtained from a token, e.g. run:
```bash
kubectl create token default --duration=1h | cut -d. -f2 | base64 -d | jq .aud
```
The jwks token is used to authorize access to the jwks endpoint. You can obtain a token by running:
```bash
kubectl create namespace llama-stack
kubectl create serviceaccount llama-stack-auth -n llama-stack
kubectl create token llama-stack-auth -n llama-stack > llama-stack-auth-token
export TOKEN=$(cat llama-stack-auth-token)
```
Alternatively, you can configure the jwks endpoint to allow anonymous access. To do this, make sure
the `kube-apiserver` runs with `--anonymous-auth=true` to allow unauthenticated requests
and that the correct RoleBinding is created to allow the service account to access the necessary
resources. If that is not the case, you can create a RoleBinding for the service account to access
the necessary resources:
```yaml
# allow-anonymous-openid.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: allow-anonymous-openid
rules:
- nonResourceURLs: ["/openid/v1/jwks"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: allow-anonymous-openid
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: allow-anonymous-openid
subjects:
- kind: User
name: system:anonymous
apiGroup: rbac.authorization.k8s.io
```
And then apply the configuration:
```bash
kubectl apply -f allow-anonymous-openid.yaml
```
The provider extracts user information from the JWT token:
- Username from the `sub` claim becomes a role
- Kubernetes groups become teams
You can easily validate a request by running:
```bash
curl -s -L -H "Authorization: Bearer $(cat llama-stack-auth-token)" http://127.0.0.1:8321/v1/providers
```
#### Kubernetes Authentication Provider
The server can be configured to use Kubernetes SelfSubjectReview API to validate tokens directly against the Kubernetes API server:
```yaml
server:
auth:
provider_config:
type: "kubernetes"
api_server_url: "https://kubernetes.default.svc"
claims_mapping:
username: "roles"
groups: "roles"
uid: "uid_attr"
verify_tls: true
tls_cafile: "/path/to/ca.crt"
```
Configuration options:
- `api_server_url`: The Kubernetes API server URL (e.g., https://kubernetes.default.svc:6443)
- `verify_tls`: Whether to verify TLS certificates (default: true)
- `tls_cafile`: Path to CA certificate file for TLS verification
- `claims_mapping`: Mapping of Kubernetes user claims to access attributes
The provider validates tokens by sending a SelfSubjectReview request to the Kubernetes API server at `/apis/authentication.k8s.io/v1/selfsubjectreviews`. The provider extracts user information from the response:
- Username from the `userInfo.username` field
- Groups from the `userInfo.groups` field
- UID from the `userInfo.uid` field
To obtain a token for testing:
```bash
kubectl create namespace llama-stack
kubectl create serviceaccount llama-stack-auth -n llama-stack
kubectl create token llama-stack-auth -n llama-stack > llama-stack-auth-token
```
You can validate a request by running:
```bash
curl -s -L -H "Authorization: Bearer $(cat llama-stack-auth-token)" http://127.0.0.1:8321/v1/providers
```
#### GitHub Token Provider
Validates GitHub personal access tokens or OAuth tokens directly:
```yaml
server:
auth:
provider_config:
type: "github_token"
github_api_base_url: "https://api.github.com" # Or GitHub Enterprise URL
```
The provider fetches user information from GitHub and maps it to access attributes based on the `claims_mapping` configuration.
#### Custom Provider
Validates tokens against a custom authentication endpoint:
```yaml
server:
auth:
provider_config:
type: "custom"
endpoint: "https://auth.example.com/validate" # URL of the auth endpoint
```
The custom endpoint receives a POST request with:
```json
{
"api_key": "<token>",
"request": {
"path": "/api/v1/endpoint",
"headers": {
"content-type": "application/json",
"user-agent": "curl/7.64.1"
},
"params": {
"key": ["value"]
}
}
}
```
And must respond with:
```json
{
"access_attributes": {
"roles": ["admin", "user"],
"teams": ["ml-team", "nlp-team"],
"projects": ["llama-3", "project-x"],
"namespaces": ["research"]
},
"message": "Authentication successful"
}
```
If no access attributes are returned, the token is used as a namespace.
### Access Control
When authentication is enabled, access to resources is controlled
through the `access_policy` attribute of the auth config section under
server. The value for this is a list of access rules.
Each access rule defines a list of actions either to permit or to
forbid. It may specify a principal or a resource that must match for
the rule to take effect.
Valid actions are create, read, update, and delete. The resource to
match should be specified in the form of a type qualified identifier,
e.g. model::my-model or vector_db::some-db, or a wildcard for all
resources of a type, e.g. model::*. If the principal or resource are
not specified, they will match all requests.
The valid resource types are model, shield, vector_db, dataset,
scoring_function, benchmark, tool, tool_group and session.
A rule may also specify a condition, either a 'when' or an 'unless',
with additional constraints as to where the rule applies. The
constraints supported at present are:
- 'user with `<attr-value>` in `<attr-name>`'
- 'user with `<attr-value>` not in `<attr-name>`'
- 'user is owner'
- 'user is not owner'
- 'user in owners `<attr-name>`'
- 'user not in owners `<attr-name>`'
The attributes defined for a user will depend on how the auth
configuration is defined.
When checking whether a particular action is allowed by the current
user for a resource, all the defined rules are tested in order to find
a match. If a match is found, the request is permitted or forbidden
depending on the type of rule. If no match is found, the request is
denied.
If no explicit rules are specified, a default policy is defined with
which all users can access all resources defined in config but
resources created dynamically can only be accessed by the user that
created them.
#### Examples
The following restricts access to particular github users:
```yaml
server:
auth:
provider_config:
type: "github_token"
github_api_base_url: "https://api.github.com"
access_policy:
- permit:
principal: user-1
actions: [create, read, delete]
description: user-1 has full access to all resources
- permit:
principal: user-2
actions: [read]
resource: model::model-1
description: user-2 has read access to model-1 only
```
Similarly, the following restricts access to particular kubernetes
service accounts:
```yaml
server:
auth:
provider_config:
type: "oauth2_token"
audience: https://kubernetes.default.svc.cluster.local
issuer: https://kubernetes.default.svc.cluster.local
tls_cafile: /home/gsim/.minikube/ca.crt
jwks:
uri: https://kubernetes.default.svc.cluster.local:8443/openid/v1/jwks
token: ${env.TOKEN}
access_policy:
- permit:
principal: system:serviceaccount:my-namespace:my-serviceaccount
actions: [create, read, delete]
description: specific serviceaccount has full access to all resources
- permit:
principal: system:serviceaccount:default:default
actions: [read]
resource: model::model-1
description: default account has read access to model-1 only
```
The following policy, which assumes that users are defined with roles
and teams by whichever authentication system is in use, allows any
user with a valid token to use models, create resources other than
models, read and delete resources they created and read resources
created by users sharing a team with them:
```yaml
access_policy:
- permit:
actions: [read]
resource: model::*
description: all users have read access to models
- forbid:
actions: [create, delete]
resource: model::*
unless: user with admin in roles
description: only user with admin role can create or delete models
- permit:
actions: [create, read, delete]
when: user is owner
description: users can create resources other than models and read and delete those they own
- permit:
actions: [read]
when: user in owner teams
description: any user has read access to any resource created by a user with the same team
```
#### API Endpoint Authorization with Scopes
In addition to resource-based access control, Llama Stack supports endpoint-level authorization using OAuth 2.0 style scopes. When authentication is enabled, specific API endpoints require users to have particular scopes in their authentication token.
**Scope-Gated APIs:**
The following APIs are currently gated by scopes:
- **Telemetry API** (scope: `telemetry.read`):
- `POST /telemetry/traces` - Query traces
- `GET /telemetry/traces/{trace_id}` - Get trace by ID
- `GET /telemetry/traces/{trace_id}/spans/{span_id}` - Get span by ID
- `POST /telemetry/spans/{span_id}/tree` - Get span tree
- `POST /telemetry/spans` - Query spans
- `POST /telemetry/metrics/{metric_name}` - Query metrics
**Authentication Configuration:**
For **JWT/OAuth2 providers**, scopes should be included in the JWT's claims:
```json
{
"sub": "user123",
"scope": "telemetry.read",
"aud": "llama-stack"
}
```
For **custom authentication providers**, the endpoint must return user attributes including the `scopes` array:
```json
{
"principal": "user123",
"attributes": {
"scopes": ["telemetry.read"]
}
}
```
**Behavior:**
- Users without the required scope receive a 403 Forbidden response
- When authentication is disabled, scope checks are bypassed
- Endpoints without `required_scope` work normally for all authenticated users
### Quota Configuration
The `quota` section allows you to enable server-side request throttling for both
authenticated and anonymous clients. This is useful for preventing abuse, enforcing
fairness across tenants, and controlling infrastructure costs without requiring
client-side rate limiting or external proxies.
Quotas are disabled by default. When enabled, each client is tracked using either:
* Their authenticated `client_id` (derived from the Bearer token), or
* Their IP address (fallback for anonymous requests)
Quota state is stored in a SQLite-backed key-value store, and rate limits are applied
within a configurable time window (currently only `day` is supported).
#### Example
```yaml
server:
quota:
kvstore:
type: sqlite
db_path: ./quotas.db
anonymous_max_requests: 100
authenticated_max_requests: 1000
period: day
```
#### Configuration Options
| Field | Description |
| ---------------------------- | -------------------------------------------------------------------------- |
| `kvstore` | Required. Backend storage config for tracking request counts. |
| `kvstore.type` | Must be `"sqlite"` for now. Other backends may be supported in the future. |
| `kvstore.db_path` | File path to the SQLite database. |
| `anonymous_max_requests` | Max requests per period for unauthenticated clients. |
| `authenticated_max_requests` | Max requests per period for authenticated clients. |
| `period` | Time window for quota enforcement. Only `"day"` is supported. |
:::note
If `authenticated_max_requests` is set but no authentication provider is
configured, the server will fall back to applying `anonymous_max_requests` to all
clients.
:::
#### Example with Authentication Enabled
```yaml
server:
port: 8321
auth:
provider_config:
type: custom
endpoint: https://auth.example.com/validate
quota:
kvstore:
type: sqlite
db_path: ./quotas.db
anonymous_max_requests: 100
authenticated_max_requests: 1000
period: day
```
If a client exceeds their limit, the server responds with:
```http
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
{
"error": {
"message": "Quota exceeded"
}
}
```
### CORS Configuration
Configure CORS to allow web browsers to make requests from different domains. Disabled by default.
#### Quick Setup
For development, use the simple boolean flag:
```yaml
server:
cors: true # Auto-enables localhost with any port
```
This automatically allows `http://localhost:*` and `https://localhost:*` with secure defaults.
#### Custom Configuration
For specific origins and full control:
```yaml
server:
cors:
allow_origins: ["https://myapp.com", "https://staging.myapp.com"]
allow_credentials: true
allow_methods: ["GET", "POST", "PUT", "DELETE"]
allow_headers: ["Content-Type", "Authorization"]
allow_origin_regex: "https://.*\\.example\\.com" # Optional regex pattern
expose_headers: ["X-Total-Count"]
max_age: 86400
```
#### Configuration Options
| Field | Description | Default |
| -------------------- | ---------------------------------------------- | ------- |
| `allow_origins` | List of allowed origins. Use `["*"]` for any. | `["*"]` |
| `allow_origin_regex` | Regex pattern for allowed origins (optional). | `None` |
| `allow_methods` | Allowed HTTP methods. | `["*"]` |
| `allow_headers` | Allowed headers. | `["*"]` |
| `allow_credentials` | Allow credentials (cookies, auth headers). | `false` |
| `expose_headers` | Headers exposed to browser. | `[]` |
| `max_age` | Preflight cache time (seconds). | `600` |
**Security Notes**:
- `allow_credentials: true` requires explicit origins (no wildcards)
- `cors: true` enables localhost access only (secure for development)
- For public APIs, always specify exact allowed origins
## Extending to handle Safety
Configuring Safety can be a little involved so it is instructive to go through an example.
The Safety API works with the associated Resource called a `Shield`. Providers can support various kinds of Shields. Good examples include the [Llama Guard](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/) system-safety models, or [Bedrock Guardrails](https://aws.amazon.com/bedrock/guardrails/).
To configure a Bedrock Shield, you would need to add:
- A Safety API provider instance with type `remote::bedrock`
- A Shield resource served by this provider.
```yaml
...
providers:
safety:
- provider_id: bedrock
provider_type: remote::bedrock
config:
aws_access_key_id: ${env.AWS_ACCESS_KEY_ID}
aws_secret_access_key: ${env.AWS_SECRET_ACCESS_KEY}
...
shields:
- provider_id: bedrock
params:
guardrailVersion: ${env.GUARDRAIL_VERSION}
provider_shield_id: ${env.GUARDRAIL_ID}
...
```
The situation is more involved if the Shield needs _Inference_ of an associated model. This is the case with Llama Guard. In that case, you would need to add:
- A Safety API provider instance with type `inline::llama-guard`
- An Inference API provider instance for serving the model.
- A Model resource associated with this provider.
- A Shield resource served by the Safety provider.
The yaml configuration for this setup, assuming you were using vLLM as your inference server, would look like:
```yaml
...
providers:
safety:
- provider_id: llama-guard
provider_type: inline::llama-guard
config: {}
inference:
# this vLLM server serves the "normal" inference model (e.g., llama3.2:3b)
- provider_id: vllm-0
provider_type: remote::vllm
config:
url: ${env.VLLM_URL:=http://localhost:8000}
# this vLLM server serves the llama-guard model (e.g., llama-guard:3b)
- provider_id: vllm-1
provider_type: remote::vllm
config:
url: ${env.SAFETY_VLLM_URL:=http://localhost:8001}
...
models:
- metadata: {}
model_id: ${env.INFERENCE_MODEL}
provider_id: vllm-0
provider_model_id: null
- metadata: {}
model_id: ${env.SAFETY_MODEL}
provider_id: vllm-1
provider_model_id: null
shields:
- provider_id: llama-guard
shield_id: ${env.SAFETY_MODEL} # Llama Guard shields are identified by the corresponding LlamaGuard model
provider_shield_id: null
...
```

View file

@ -0,0 +1,55 @@
---
title: Customizing run.yaml Files
description: Guide to customizing Llama Stack run.yaml configuration files for your deployment environment
sidebar_label: Customizing run.yaml
sidebar_position: 4
---
# Customizing run.yaml Files
The `run.yaml` files generated by Llama Stack templates are **starting points** designed to be customized for your specific needs. They are not meant to be used as-is in production environments.
## Key Points
- **Templates are starting points**: Generated `run.yaml` files contain defaults for development/testing
- **Customization expected**: Update URLs, credentials, models, and settings for your environment
- **Version control separately**: Keep customized configs in your own repository
- **Environment-specific**: Create different configurations for dev, staging, production
## What You Can Customize
You can customize:
- **Provider endpoints**: Change `http://localhost:8000` to your actual servers
- **Swap providers**: Replace default providers (e.g., swap Tavily with Brave for search)
- **Storage paths**: Move from `/tmp/` to production directories
- **Authentication**: Add API keys, SSL, timeouts
- **Models**: Different model sizes for dev vs prod
- **Database settings**: Switch from SQLite to PostgreSQL
- **Tool configurations**: Add custom tools and integrations
## Best Practices
- Use environment variables for secrets and environment-specific values
- Create separate `run.yaml` files for different environments (dev, staging, prod)
- Document your changes with comments
- Test configurations before deployment
- Keep your customized configs in version control
Example structure:
```
your-project/
├── configs/
│ ├── dev-run.yaml
│ ├── prod-run.yaml
└── README.md
```
The goal is to take the generated template and adapt it to your specific infrastructure and operational needs.
## Related Guides
- **[Configuration Reference](./configuration)** - Detailed configuration file format and options
- **[Starting Llama Stack Server](./starting-llama-stack-server)** - How to run with your custom configuration
- **[Building Custom Distributions](./building-distro)** - Create distributions with your preferred providers

View file

@ -0,0 +1,65 @@
---
title: Using Llama Stack as a Library
description: How to use Llama Stack as a Python library instead of running a server
sidebar_label: Importing as Library
sidebar_position: 5
---
# Using Llama Stack as a Library
## Setup Llama Stack without a Server
If you are planning to use an external service for Inference (even Ollama or TGI counts as external), it is often easier to use Llama Stack as a library.
This avoids the overhead of setting up a server.
```bash
# setup
uv pip install llama-stack
llama stack build --distro starter --image-type venv
```
```python
from llama_stack.core.library_client import LlamaStackAsLibraryClient
client = LlamaStackAsLibraryClient(
"starter",
# provider_data is optional, but if you need to pass in any provider specific data, you can do so here.
provider_data={"tavily_search_api_key": os.environ["TAVILY_SEARCH_API_KEY"]},
)
```
This will parse your config and set up any inline implementations and remote clients needed for your implementation.
Then, you can access the APIs like `models` and `inference` on the client and call their methods directly:
```python
response = client.models.list()
```
If you've created a [custom distribution](./building-distro), you can also use the run.yaml configuration file directly:
```python
client = LlamaStackAsLibraryClient(config_path)
```
## Benefits of Library Mode
- **No server overhead**: Direct Python API calls without HTTP requests
- **Simplified deployment**: No need to manage server processes
- **Better integration**: Seamlessly embed in existing Python applications
- **Reduced latency**: Eliminate network round-trips for inline providers
## Use Cases
Library mode is ideal when:
- Using external services for most APIs (Ollama, remote inference providers, etc.)
- Building Python applications that need Llama Stack functionality
- Prototyping and development workflows
- Serverless or container environments where you want minimal overhead
## Related Guides
- **[Building Custom Distributions](./building-distro)** - Create your own distribution for library use
- **[Configuration Reference](./configuration)** - Understanding the configuration format
- **[Starting Llama Stack Server](./starting-llama-stack-server)** - Alternative server-based deployment

View file

@ -0,0 +1,21 @@
---
title: Distributions Overview
description: Pre-packaged sets of Llama Stack components for different deployment scenarios
sidebar_label: Overview
sidebar_position: 1
---
# Distributions Overview
A distribution is a pre-packaged set of Llama Stack components that can be deployed together.
This section provides an overview of the distributions available in Llama Stack.
## Distribution Guides
- **[Available Distributions](./list-of-distributions)** - Complete list and comparison of all distributions
- **[Building Custom Distributions](./building-distro)** - Create your own distribution from scratch
- **[Customizing Configuration](./customizing-run-yaml)** - Customize run.yaml for your needs
- **[Starting Llama Stack Server](./starting-llama-stack-server)** - How to run distributions
- **[Importing as Library](./importing-as-library)** - Use distributions in your code
- **[Configuration Reference](./configuration)** - Configuration file format details

View file

@ -0,0 +1,125 @@
---
title: Available Distributions
description: Complete overview of Llama Stack distributions for different use cases and hardware
sidebar_label: Available Distributions
sidebar_position: 2
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Available Distributions
Llama Stack provides several pre-configured distributions to help you get started quickly. Choose the distribution that best fits your hardware and use case.
## Quick Reference
| Distribution | Use Case | Hardware Requirements | Provider |
|--------------|----------|----------------------|----------|
| `distribution-starter` | General purpose, prototyping | Any (CPU/GPU) | Ollama, Remote APIs |
| `distribution-meta-reference-gpu` | High-performance inference | GPU required | Local GPU inference |
| Remote-hosted | Production, managed service | None | Partner providers |
| iOS/Android SDK | Mobile applications | Mobile device | On-device inference |
## Choose Your Distribution
### 🚀 Getting Started (Recommended for Beginners)
**Use `distribution-starter` if you want to:**
- Prototype quickly without GPU requirements
- Use remote inference providers (Fireworks, Together, vLLM etc.)
- Run locally with Ollama for development
```bash
docker pull llama-stack/distribution-starter
```
**Guides:** [Starter Distribution Guide](./self-hosted-distro/starter)
### 🖥️ Self-Hosted with GPU
**Use `distribution-meta-reference-gpu` if you:**
- Have access to GPU hardware
- Want maximum performance and control
- Need to run inference locally
```bash
docker pull llama-stack/distribution-meta-reference-gpu
```
**Guides:** [Meta Reference GPU Guide](./self-hosted-distro/meta-reference-gpu)
### 🖥️ Self-Hosted with NVIDA NeMo Microservices
**Use `nvidia` if you:**
- Want to use Llama Stack with NVIDIA NeMo Microservices
**Guides:** [NVIDIA Distribution Guide](./self-hosted-distro/nvidia)
### ☁️ Managed Hosting
**Use remote-hosted endpoints if you:**
- Don't want to manage infrastructure
- Need production-ready reliability
- Prefer managed services
**Partners:** [Fireworks.ai](https://fireworks.ai) and [Together.xyz](https://together.xyz)
**Guides:** [Remote-Hosted Endpoints](./remote-hosted-distro/)
### 📱 Mobile Development
**Use mobile SDKs if you:**
- Are building iOS or Android applications
- Need on-device inference capabilities
- Want offline functionality
- [iOS SDK](./ondevice-distro/ios-sdk)
- [Android SDK](./ondevice-distro/android-sdk)
### 🔧 Custom Solutions
**Build your own distribution if:**
- None of the above fit your specific needs
- You need custom configurations
- You want to optimize for your specific use case
**Guides:** [Building Custom Distributions](./building-distro)
## Detailed Documentation
### Self-Hosted Distributions
- **[Starter Distribution](./self-hosted-distro/starter)** - General purpose template
- **[Meta Reference GPU](./self-hosted-distro/meta-reference-gpu)** - High-performance GPU inference
### Remote-Hosted Solutions
- **[Remote-Hosted Overview](./remote-hosted-distro/)** - Managed hosting options
### Mobile SDKs
- **[iOS SDK](./ondevice-distro/ios-sdk)** - Native iOS development
- **[Android SDK](./ondevice-distro/android-sdk)** - Native Android development
## Decision Flow
```mermaid
graph TD
A[What's your use case?] --> B{Need mobile app?}
B -->|Yes| C[Use Mobile SDKs]
B -->|No| D{Have GPU hardware?}
D -->|Yes| E[Use Meta Reference GPU]
D -->|No| F{Want managed hosting?}
F -->|Yes| G[Use Remote-Hosted]
F -->|No| H[Use Starter Distribution]
```
## Next Steps
1. **Choose your distribution** from the options above
2. **Follow the setup guide** for your selected distribution
3. **Configure your providers** with API keys or local models
4. **Start building** with Llama Stack!
For help choosing or troubleshooting, check our [Getting Started Guide](/docs/getting-started/) or [Community Support](https://github.com/llama-stack/llama-stack/discussions).

View file

@ -0,0 +1,309 @@
---
title: Android SDK
description: Llama Stack Client Kotlin API Library for native Android development with local and remote inference
sidebar_label: Android SDK
sidebar_position: 2
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Llama Stack Client Kotlin API Library
We are excited to share a guide for a Kotlin Library that brings the benefits of Llama Stack to your Android device. This library is a set of SDKs that provide a simple and effective way to integrate AI capabilities into your Android app whether it is local (on-device) or remote inference.
## Features
- **Local Inferencing**: Run Llama models purely on-device with real-time processing. We currently utilize ExecuTorch as the local inference distributor and may support others in the future.
- [ExecuTorch](https://github.com/pytorch/executorch/tree/main) is a complete end-to-end solution within the PyTorch framework for inferencing capabilities on-device with high portability and seamless performance.
- **Remote Inferencing**: Perform inferencing tasks remotely with Llama models hosted on a remote connection (or serverless localhost).
- **Simple Integration**: With easy-to-use APIs, a developer can quickly integrate Llama Stack in their Android app. The difference with local vs remote inferencing is also minimal.
**Latest Release Notes**: [GitHub Release](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release)
:::info[Stability]
Tagged releases are stable versions of the project. While we strive to maintain a stable main branch, it's not guaranteed to be free of bugs or issues.
:::
## Android Demo App
Check out our demo app to see how to integrate Llama Stack into your Android app: [Android Demo App](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release/examples/android_app)
The key files in the app are `ExampleLlamaStackLocalInference.kt`, `ExampleLlamaStackRemoteInference.kts`, and `MainActivity.java`. With encompassed business logic, the app shows how to use Llama Stack for both environments.
## Quick Start
### Add Dependencies
<Tabs>
<TabItem value="remote" label="Remote Only">
Add the following dependency in your `build.gradle.kts` file:
```kotlin
dependencies {
implementation("com.llama.llamastack:llama-stack-client-kotlin:0.2.2")
}
```
This will download jar files in your gradle cache in a directory like `~/.gradle/caches/modules-2/files-2.1/com.llama.llamastack/`
If you plan on doing remote inferencing only, this is sufficient to get started.
</TabItem>
<TabItem value="local" label="With Local Inference">
For local inferencing, it is required to include the ExecuTorch library into your app.
Include the ExecuTorch library by:
1. Download the `download-prebuilt-et-lib.sh` script file from the [llama-stack-client-kotlin-client-local](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release/llama-stack-client-kotlin-client-local/download-prebuilt-et-lib.sh) directory to your local machine.
2. Move the script to the top level of your Android app where the `app` directory resides.
3. Run `sh download-prebuilt-et-lib.sh` to create an `app/libs` directory and download the `executorch.aar` in that path. This generates an ExecuTorch library for the XNNPACK delegate.
4. Add the `executorch.aar` dependency in your `build.gradle.kts` file:
```kotlin
dependencies {
...
implementation(files("libs/executorch.aar"))
...
}
```
See other dependencies for the local RAG in Android app [README](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release/examples/android_app#quick-start).
</TabItem>
</Tabs>
## Llama Stack APIs in Your Android App
Breaking down the demo app, this section will show the core pieces that are used to initialize and run inference with Llama Stack using the Kotlin library.
### Setup Remote Inferencing
Start a Llama Stack server on localhost. Here is an example of how you can do this using the firework.ai distribution:
```bash
uv venv starter --python 3.12
source starter/bin/activate # On Windows: starter\Scripts\activate
pip install --no-cache llama-stack==0.2.2
llama stack build --distro starter --image-type venv
export FIREWORKS_API_KEY=<SOME_KEY>
llama stack run starter --port 5050
```
:::warning[Version Compatibility]
Ensure the Llama Stack server version is the same as the Kotlin SDK Library for maximum compatibility.
:::
Other inference providers: [Supported Implementations](/docs/#supported-llama-stack-implementations)
How to set remote localhost in Demo App: [Settings](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release/examples/android_app#settings)
### Initialize the Client
A client serves as the primary interface for interacting with a specific inference type and its associated parameters. Only after client is initialized then you can configure and start inferences.
<Tabs>
<TabItem value="local" label="Local Inference">
```kotlin
client = LlamaStackClientLocalClient
.builder()
.modelPath(modelPath)
.tokenizerPath(tokenizerPath)
.temperature(temperature)
.build()
```
</TabItem>
<TabItem value="remote" label="Remote Inference">
```kotlin
// remoteURL is a string like "http://localhost:5050"
client = LlamaStackClientOkHttpClient
.builder()
.baseUrl(remoteURL)
.build()
```
</TabItem>
</Tabs>
### Run Inference
With the Kotlin Library managing all the major operational logic, there are minimal to no changes when running simple chat inference for local or remote:
```kotlin
val result = client!!.inference().chatCompletion(
InferenceChatCompletionParams.builder()
.modelId(modelName)
.messages(listOfMessages)
.build()
)
// response contains string with response from model
var response = result.asChatCompletionResponse().completionMessage().content().string();
```
**[Remote only]** For inference with a streaming response:
```kotlin
val result = client!!.inference().chatCompletionStreaming(
InferenceChatCompletionParams.builder()
.modelId(modelName)
.messages(listOfMessages)
.build()
)
// Response can be received as a asChatCompletionResponseStreamChunk as part of a callback.
// See Android demo app for a detailed implementation example.
```
### Setup Custom Tool Calling
Android demo app for more details: [Custom Tool Calling](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release/examples/android_app#tool-calling)
## Advanced Users
The purpose of this section is to share more details with users that would like to dive deeper into the Llama Stack Kotlin Library. Whether you're interested in contributing to the open source library, debugging or just want to learn more, this section is for you!
### Prerequisites
You must complete the following steps:
1. Clone the repo (`git clone https://github.com/meta-llama/llama-stack-client-kotlin.git -b latest-release`)
2. Port the appropriate ExecuTorch libraries over into your Llama Stack Kotlin library environment.
```bash
cd llama-stack-client-kotlin-client-local
sh download-prebuilt-et-lib.sh --unzip
```
Now you will notice that the `jni/`, `libs/`, and `AndroidManifest.xml` files from the `executorch.aar` file are present in the local module. This way the local client module will be able to realize the ExecuTorch SDK.
### Building for Development/Debugging
If you'd like to contribute to the Kotlin library via development, debug, or add play around with the library with various print statements, run the following command in your terminal under the llama-stack-client-kotlin directory.
```bash
sh build-libs.sh
```
**Output**: .jar files located in the build-jars directory
Copy the .jar files over to the lib directory in your Android app. At the same time make sure to remove the llama-stack-client-kotlin dependency within your build.gradle.kts file in your app (or if you are using the demo app) to avoid having multiple llama stack client dependencies.
### Additional Options for Local Inferencing
Currently we provide additional properties support with local inferencing. In order to get the tokens/sec metric for each inference call, add the following code in your Android app after you run your chatCompletion inference function. The Reference app has this implementation as well:
```kotlin
var tps = (result.asChatCompletionResponse()._additionalProperties()["tps"] as JsonNumber).value as Float
```
We will be adding more properties in the future.
### Additional Options for Remote Inferencing
#### Network Options
##### Retries
Requests that experience certain errors are automatically retried 2 times by default, with a short exponential backoff. Connection errors (for example, due to a network connectivity problem), 408 Request Timeout, 409 Conflict, 429 Rate Limit, and >=500 Internal errors will all be retried by default.
You can provide a `maxRetries` on the client builder to configure this:
```kotlin
val client = LlamaStackClientOkHttpClient.builder()
.fromEnv()
.maxRetries(4)
.build()
```
##### Timeouts
Requests time out after 1 minute by default. You can configure this on the client builder:
```kotlin
val client = LlamaStackClientOkHttpClient.builder()
.fromEnv()
.timeout(Duration.ofSeconds(30))
.build()
```
##### Proxies
Requests can be routed through a proxy. You can configure this on the client builder:
```kotlin
val client = LlamaStackClientOkHttpClient.builder()
.fromEnv()
.proxy(new Proxy(
Type.HTTP,
new InetSocketAddress("proxy.com", 8080)
))
.build()
```
##### Environments
Requests are made to the production environment by default. You can connect to other environments, like `sandbox`, via the client builder:
```kotlin
val client = LlamaStackClientOkHttpClient.builder()
.fromEnv()
.sandbox()
.build()
```
### Error Handling
This library throws exceptions in a single hierarchy for easy handling:
- **`LlamaStackClientException`** - Base exception for all exceptions
- **`LlamaStackClientServiceException`** - HTTP errors with a well-formed response body we were able to parse. The exception message and the `.debuggingRequestId()` will be set by the server.
| Status | Exception |
| ------ | ----------------------------- |
| 400 | BadRequestException |
| 401 | AuthenticationException |
| 403 | PermissionDeniedException |
| 404 | NotFoundException |
| 422 | UnprocessableEntityException |
| 429 | RateLimitException |
| 5xx | InternalServerException |
| others | UnexpectedStatusCodeException |
- **`LlamaStackClientIoException`** - I/O networking errors
- **`LlamaStackClientInvalidDataException`** - any other exceptions on the client side, e.g.:
- We failed to serialize the request body
- We failed to parse the response body (has access to response code and body)
## Known Issues
We're aware of the following issues and are working to resolve them:
1. Streaming response is a work-in-progress for local and remote inference
2. Due to #1, agents are not supported at the time. LS agents only work in streaming mode
3. Changing to another model is a work in progress for local and remote platforms
## Reporting Issues
If you encountered any bugs or issues following this guide please file a bug/issue on our [GitHub issue tracker](https://github.com/meta-llama/llama-stack-client-kotlin/issues).
## Thanks
We'd like to extend our thanks to the ExecuTorch team for providing their support as we integrated ExecuTorch as one of the local inference distributors for Llama Stack. Checkout [ExecuTorch GitHub repo](https://github.com/pytorch/executorch/tree/main) for more information.
---
The API interface is generated using the OpenAPI standard with [Stainless](https://www.stainlessapi.com/).
## Related Resources
- **[llama-stack-client-kotlin](https://github.com/meta-llama/llama-stack-client-kotlin)** - Official Kotlin SDK repository
- **[Android Demo App](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release/examples/android_app)** - Complete example app
- **[ExecuTorch](https://github.com/pytorch/executorch/)** - PyTorch on-device inference library
- **[iOS SDK](./ios-sdk)** - iOS development guide

View file

@ -0,0 +1,179 @@
---
title: iOS SDK
description: Native iOS development with Llama Stack using Swift SDK for remote and on-device inference
sidebar_label: iOS SDK
sidebar_position: 1
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# iOS SDK
We offer both remote and on-device use of Llama Stack in Swift via a single SDK [llama-stack-client-swift](https://github.com/meta-llama/llama-stack-client-swift/) that contains two components:
1. **LlamaStackClient** for remote inference
2. **Local Inference** for on-device inference
![Seamlessly switching between local, on-device inference and remote hosted inference](/img/remote_or_local.gif)
## Remote Only
If you don't want to run inference on-device, then you can connect to any hosted Llama Stack distribution with the remote client.
### Setup
1. Add `https://github.com/meta-llama/llama-stack-client-swift/` as a Package Dependency in Xcode
2. Add `LlamaStackClient` as a framework to your app target
3. Call an API:
```swift
import LlamaStackClient
let agents = RemoteAgents(url: URL(string: "http://localhost:8321")!)
let request = Components.Schemas.CreateAgentTurnRequest(
agent_id: agentId,
messages: [
.UserMessage(Components.Schemas.UserMessage(
content: .case1("Hello Llama!"),
role: .user
))
],
session_id: self.agenticSystemSessionId,
stream: true
)
for try await chunk in try await agents.createTurn(request: request) {
let payload = chunk.event.payload
// ...
```
Check out [iOSCalendarAssistant](https://github.com/meta-llama/llama-stack-client-swift/tree/main/examples/ios_calendar_assistant) for a complete app demo.
## LocalInference
LocalInference provides a local inference implementation powered by [executorch](https://github.com/pytorch/executorch/).
Llama Stack currently supports on-device inference for iOS with Android coming soon. You can run on-device inference on Android today using [executorch](https://github.com/pytorch/executorch/tree/main/examples/demo-apps/android/LlamaDemo), PyTorch's on-device inference library.
The APIs *work the same as remote* the only difference is you'll instead use the `LocalAgents` / `LocalInference` classes and pass in a `DispatchQueue`:
```swift
private let runnerQueue = DispatchQueue(label: "org.llamastack.stacksummary")
let inference = LocalInference(queue: runnerQueue)
let agents = LocalAgents(inference: self.inference)
```
Check out [iOSCalendarAssistantWithLocalInf](https://github.com/meta-llama/llama-stack-client-swift/tree/main/examples/ios_calendar_assistant) for a complete app demo.
### Installation
:::info[Development Status]
We're working on making LocalInference easier to set up. For now, you'll need to import it via `.xcframework`.
:::
1. Clone the executorch submodule in this repo and its dependencies: `git submodule update --init --recursive`
2. Install [Cmake](https://cmake.org/) for the executorch build
3. Drag `LocalInference.xcodeproj` into your project
4. Add `LocalInference` as a framework in your app target
### Preparing a Model
1. Prepare a `.pte` file [following the executorch docs](https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#step-2-prepare-model)
2. Bundle the `.pte` and `tokenizer.model` file into your app
We now support models quantized using SpinQuant and QAT-LoRA which offer a significant performance boost (demo app on iPhone 13 Pro):
| Llama 3.2 1B | Tokens / Second (total) | | Time-to-First-Token (sec) | |
| :---- | :---- | :---- | :---- | :---- |
| | Haiku | Paragraph | Haiku | Paragraph |
| BF16 | 2.2 | 2.5 | 2.3 | 1.9 |
| QAT+LoRA | 7.1 | 3.3 | 0.37 | 0.24 |
| SpinQuant | 10.1 | 5.2 | 0.2 | 0.2 |
### Using LocalInference
<Tabs>
<TabItem value="init" label="1. Initialize">
Instantiate LocalInference with a DispatchQueue. Optionally, pass it into your agents service:
```swift
init () {
runnerQueue = DispatchQueue(label: "org.meta.llamastack")
inferenceService = LocalInferenceService(queue: runnerQueue)
agentsService = LocalAgentsService(inference: inferenceService)
}
```
</TabItem>
<TabItem value="load" label="2. Load Model">
Before making any inference calls, load your model from your bundle:
```swift
let mainBundle = Bundle.main
inferenceService.loadModel(
modelPath: mainBundle.url(forResource: "llama32_1b_spinquant", withExtension: "pte"),
tokenizerPath: mainBundle.url(forResource: "tokenizer", withExtension: "model"),
completion: {_ in } // use to handle load failures
)
```
</TabItem>
<TabItem value="inference" label="3. Make Inference Calls">
Make inference calls (or agents calls) as you normally would with LlamaStack:
```swift
for await chunk in try await agentsService.initAndCreateTurn(
messages: [
.UserMessage(Components.Schemas.UserMessage(
content: .case1("Call functions as needed to handle any actions in the following text:\n\n" + text),
role: .user))
]
) {
```
</TabItem>
</Tabs>
### Troubleshooting
If you receive errors like "missing package product" or "invalid checksum", try cleaning the build folder and resetting the Swift package cache:
**(Opt+Click) Product > Clean Build Folder Immediately**
```bash
rm -rf \
~/Library/org.swift.swiftpm \
~/Library/Caches/org.swift.swiftpm \
~/Library/Caches/com.apple.dt.Xcode \
~/Library/Developer/Xcode/DerivedData
```
## Performance Considerations
- **Model Size**: Smaller models (1B-3B parameters) work best on mobile devices
- **Quantization**: Use SpinQuant or QAT-LoRA for optimal performance
- **Memory Usage**: Monitor app memory usage with larger models
- **Battery Life**: On-device inference can impact battery performance
## Use Cases
The iOS SDK is ideal for:
- **Native iOS applications** requiring AI capabilities
- **Offline functionality** without internet dependency
- **Privacy-focused** applications processing sensitive data locally
- **Real-time inference** with low latency requirements
- **Hybrid applications** switching between local and remote inference
## Related Resources
- **[llama-stack-client-swift](https://github.com/meta-llama/llama-stack-client-swift/)** - Official Swift SDK repository
- **[iOS Calendar Assistant](https://github.com/meta-llama/llama-stack-client-swift/tree/main/examples/ios_calendar_assistant)** - Complete example app
- **[executorch](https://github.com/pytorch/executorch/)** - PyTorch on-device inference library
- **[Android SDK](./android-sdk)** - Android development guide

View file

@ -0,0 +1,53 @@
---
title: Remote-Hosted Distributions
description: Available endpoints serving Llama Stack API that you can directly connect to
sidebar_label: Overview
sidebar_position: 1
---
# Remote-Hosted Distributions
Remote-Hosted distributions are available endpoints serving Llama Stack API that you can directly connect to.
## Available Endpoints
| Distribution | Endpoint | Inference | Agents | Memory | Safety | Telemetry |
|-------------|----------|-----------|---------|---------|---------|------------|
| Together | [https://llama-stack.together.ai](https://llama-stack.together.ai) | remote::together | meta-reference | remote::weaviate | meta-reference | meta-reference |
| Fireworks | [https://llamastack-preview.fireworks.ai](https://llamastack-preview.fireworks.ai) | remote::fireworks | meta-reference | remote::weaviate | meta-reference | meta-reference |
## Connecting to Remote-Hosted Distributions
You can use `llama-stack-client` to interact with these endpoints. For example, to list the available models served by the Fireworks endpoint:
```bash
$ pip install llama-stack-client
$ llama-stack-client configure --endpoint https://llamastack-preview.fireworks.ai
$ llama-stack-client models list
```
## Benefits of Remote-Hosted Distributions
- **Zero Setup**: No local installation or configuration required
- **Scalable Infrastructure**: Managed by the provider for high availability
- **Always Updated**: Latest features and models available automatically
- **Cost Effective**: Pay-per-use pricing without infrastructure overhead
## Getting Started
1. **Choose an endpoint** from the table above
2. **Install the client**: `pip install llama-stack-client`
3. **Configure the endpoint**: `llama-stack-client configure --endpoint <URL>`
4. **Start building**: Use the client to interact with Llama Stack APIs
## Additional Resources
- **[llama-stack-client-python](https://github.com/meta-llama/llama-stack-client-python/blob/main/docs/cli_reference.md)** - CLI reference and documentation
- **[llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main)** - Example applications built on top of Llama Stack
- **[WatsonX Distribution](./watsonx)** - IBM WatsonX integration guide
## Related Guides
- **[Available Distributions](../list-of-distributions)** - Compare with other distribution types
- **[Configuration Reference](../configuration)** - Understanding configuration options
- **[Using as Library](../importing-as-library)** - Alternative deployment approach

View file

@ -0,0 +1,110 @@
---
title: watsonx Distribution
description: Use watsonx for running LLM inference in Llama Stack
sidebar_label: watsonx
sidebar_position: 2
---
# watsonx Distribution
The `llamastack/distribution-watsonx` distribution consists of the following provider configurations.
## Provider Configuration
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `remote::watsonx`, `inline::sentence-transformers` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
| vector_io | `inline::faiss` |
## Environment Variables
The following environment variables can be configured:
- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
- `WATSONX_API_KEY`: watsonx API Key (default: ``)
- `WATSONX_PROJECT_ID`: watsonx Project ID (default: ``)
## Available Models
The following models are available by default:
- `meta-llama/llama-3-3-70b-instruct` (aliases: `meta-llama/Llama-3.3-70B-Instruct`)
- `meta-llama/llama-2-13b-chat` (aliases: `meta-llama/Llama-2-13b`)
- `meta-llama/llama-3-1-70b-instruct` (aliases: `meta-llama/Llama-3.1-70B-Instruct`)
- `meta-llama/llama-3-1-8b-instruct` (aliases: `meta-llama/Llama-3.1-8B-Instruct`)
- `meta-llama/llama-3-2-11b-vision-instruct` (aliases: `meta-llama/Llama-3.2-11B-Vision-Instruct`)
- `meta-llama/llama-3-2-1b-instruct` (aliases: `meta-llama/Llama-3.2-1B-Instruct`)
- `meta-llama/llama-3-2-3b-instruct` (aliases: `meta-llama/Llama-3.2-3B-Instruct`)
- `meta-llama/llama-3-2-90b-vision-instruct` (aliases: `meta-llama/Llama-3.2-90B-Vision-Instruct`)
- `meta-llama/llama-guard-3-11b-vision` (aliases: `meta-llama/Llama-Guard-3-11B-Vision`)
## Prerequisites
### API Keys
Make sure you have access to a watsonx API Key. You can get one by referring to the [watsonx.ai documentation](https://www.ibm.com/docs/en/masv-and-l/maximo-manage/continuous-delivery?topic=setup-create-watsonx-api-key).
## Running Llama Stack with watsonx
You can do this via venv or Docker which has a pre-built image.
### Via Docker
This method allows you to get started quickly without having to build the distribution code.
```bash
LLAMA_STACK_PORT=5001
docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ./run.yaml:/root/my-run.yaml \
llamastack/distribution-watsonx \
--config /root/my-run.yaml \
--port $LLAMA_STACK_PORT \
--env WATSONX_API_KEY=$WATSONX_API_KEY \
--env WATSONX_PROJECT_ID=$WATSONX_PROJECT_ID \
--env WATSONX_BASE_URL=$WATSONX_BASE_URL
```
### Via venv
If you've set up your local development environment, you can also build the image using your local virtual environment:
```bash
llama stack build --distro watsonx --image-type venv
llama stack run ./run.yaml \
--port 5001 \
--env WATSONX_API_KEY=$WATSONX_API_KEY \
--env WATSONX_PROJECT_ID=$WATSONX_PROJECT_ID \
--env WATSONX_BASE_URL=$WATSONX_BASE_URL
```
## Use Cases
The watsonx distribution is ideal for:
- **Enterprise deployments** with IBM infrastructure
- **Production workloads** requiring enterprise support
- **Regulated industries** needing compliance and governance features
- **Hybrid cloud** deployments with IBM watsonx integration
## Getting Started
1. **Set up watsonx credentials** - Obtain API keys and project IDs from IBM watsonx
2. **Configure environment variables** - Set WATSONX_API_KEY and WATSONX_PROJECT_ID
3. **Run the distribution** - Use Docker or venv to start the Llama Stack server
4. **Test the connection** - Use llama-stack-client to verify the setup
## Related Guides
- **[Remote-Hosted Overview](./index)** - Overview of remote-hosted distributions
- **[Available Distributions](../list-of-distributions)** - Compare with other distributions
- **[Configuration Reference](../configuration)** - Understanding configuration options
- **[Building Custom Distributions](../building-distro)** - Create your own distribution

View file

@ -0,0 +1,101 @@
---
title: Dell-TGI Distribution
description: Dell's custom TGI inference container distribution for self-hosted Llama Stack deployment
sidebar_label: Dell-TGI
sidebar_position: 1
---
# Dell-TGI Distribution
The `llamastack/distribution-tgi` distribution consists of the following provider configurations.
## Provider Configuration
| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|-----------------|---------------|----------------|--------------------------------------------------|----------------|----------------|
| **Provider(s)** | remote::tgi | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |
The only difference vs. the `tgi` distribution is that it runs the Dell-TGI server for inference.
## Getting Started
### Start the Distribution (Single Node GPU)
:::note
This assumes you have access to GPU to start a TGI server with access to your GPU.
:::
```bash
$ cd distributions/dell-tgi/
$ ls
compose.yaml README.md run.yaml
$ docker compose up
```
The script will first start up TGI server, then start up Llama Stack distribution server hooking up to the remote TGI provider for inference. You should be able to see the following outputs:
```
[text-generation-inference] | 2024-10-15T18:56:33.810397Z INFO text_generation_router::server: router/src/server.rs:1813: Using config Some(Llama)
[text-generation-inference] | 2024-10-15T18:56:33.810448Z WARN text_generation_router::server: router/src/server.rs:1960: Invalid hostname, defaulting to 0.0.0.0
[text-generation-inference] | 2024-10-15T18:56:33.864143Z INFO text_generation_router::server: router/src/server.rs:2353: Connected
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:8321 (Press CTRL+C to quit)
```
To kill the server:
```bash
docker compose down
```
### Alternative: Dell-TGI server + llama stack run (Single Node GPU)
#### Start Dell-TGI server locally
```bash
docker run -it --pull always --shm-size 1g -p 80:80 --gpus 4 \
-e NUM_SHARD=4
-e MAX_BATCH_PREFILL_TOKENS=32768 \
-e MAX_INPUT_TOKENS=8000 \
-e MAX_TOTAL_TOKENS=8192 \
registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct
```
#### Start Llama Stack server pointing to TGI server
```bash
docker run --pull always --network host -it -p 8321:8321 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-tgi --yaml_config /root/my-run.yaml
```
Make sure in you `run.yaml` file, you inference provider is pointing to the correct TGI server endpoint. E.g.
```yaml
inference:
- provider_id: tgi0
provider_type: remote::tgi
config:
url: http://127.0.0.1:5009
```
## Prerequisites
- **Hardware**: GPU access required
- **Software**: Docker with GPU support
- **Models**: Dell enterprise TGI container images
## Use Cases
The Dell-TGI distribution is ideal for:
- Enterprise deployments with Dell infrastructure
- High-performance inference with custom TGI optimizations
- Self-hosted environments requiring enterprise support
- Production workloads with specific performance requirements
## Related Guides
- **[Dell Distribution](./dell)** - Dell's standard distribution
- **[Configuration Reference](../configuration)** - Understanding configuration options
- **[Building Custom Distributions](../building-distro)** - Create your own distribution

View file

@ -0,0 +1,222 @@
---
title: Dell Distribution
description: Dell's distribution of Llama Stack using custom TGI containers via Dell Enterprise Hub
sidebar_label: Dell
sidebar_position: 2
---
# Dell Distribution of Llama Stack
The `llamastack/distribution-dell` distribution consists of the following provider configurations.
## Provider Configuration
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `remote::tgi`, `inline::sentence-transformers` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime` |
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
You can use this distribution if you have GPUs and want to run an independent TGI or Dell Enterprise Hub container for running inference.
## Environment Variables
The following environment variables can be configured:
- `DEH_URL`: URL for the Dell inference server (default: `http://0.0.0.0:8181`)
- `DEH_SAFETY_URL`: URL for the Dell safety inference server (default: `http://0.0.0.0:8282`)
- `CHROMA_URL`: URL for the Chroma server (default: `http://localhost:6601`)
- `INFERENCE_MODEL`: Inference model loaded into the TGI server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
## Setting up Inference Server
### Dell Enterprise Hub's Custom TGI Container
:::note[Development Status]
This is a placeholder to run inference with TGI. This will be updated to use [Dell Enterprise Hub's containers](https://dell.huggingface.co/authenticated/models) once verified.
:::
```bash
export INFERENCE_PORT=8181
export DEH_URL=http://0.0.0.0:$INFERENCE_PORT
export INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
export CHROMADB_HOST=localhost
export CHROMADB_PORT=6601
export CHROMA_URL=http://$CHROMADB_HOST:$CHROMADB_PORT
export CUDA_VISIBLE_DEVICES=0
export LLAMA_STACK_PORT=8321
docker run --rm -it \
--pull always \
--network host \
-v $HOME/.cache/huggingface:/data \
-e HF_TOKEN=$HF_TOKEN \
-p $INFERENCE_PORT:$INFERENCE_PORT \
--gpus $CUDA_VISIBLE_DEVICES \
ghcr.io/huggingface/text-generation-inference \
--dtype bfloat16 \
--usage-stats off \
--sharded false \
--cuda-memory-fraction 0.7 \
--model-id $INFERENCE_MODEL \
--port $INFERENCE_PORT --hostname 0.0.0.0
```
### Safety Model Setup (Optional)
If you are using Llama Stack Safety / Shield APIs, then you will need to also run another instance of a TGI with a corresponding safety model like `meta-llama/Llama-Guard-3-1B`:
```bash
export SAFETY_INFERENCE_PORT=8282
export DEH_SAFETY_URL=http://0.0.0.0:$SAFETY_INFERENCE_PORT
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
export CUDA_VISIBLE_DEVICES=1
docker run --rm -it \
--pull always \
--network host \
-v $HOME/.cache/huggingface:/data \
-e HF_TOKEN=$HF_TOKEN \
-p $SAFETY_INFERENCE_PORT:$SAFETY_INFERENCE_PORT \
--gpus $CUDA_VISIBLE_DEVICES \
ghcr.io/huggingface/text-generation-inference \
--dtype bfloat16 \
--usage-stats off \
--sharded false \
--cuda-memory-fraction 0.7 \
--model-id $SAFETY_MODEL \
--hostname 0.0.0.0 \
--port $SAFETY_INFERENCE_PORT
```
### ChromaDB Setup
Dell distribution relies on ChromaDB for vector database usage. You can start a ChromaDB easily using Docker:
```bash
# This is where the indices are persisted
mkdir -p $HOME/chromadb
podman run --rm -it \
--network host \
--name chromadb \
-v $HOME/chromadb:/chroma/chroma \
-e IS_PERSISTENT=TRUE \
chromadb/chroma:latest \
--port $CHROMADB_PORT \
--host $CHROMADB_HOST
```
## Running Llama Stack
Now you are ready to run Llama Stack with TGI as the inference provider. You can do this via venv or Docker which has a pre-built image.
### Via Docker
This method allows you to get started quickly without having to build the distribution code.
#### Basic Setup
```bash
docker run -it \
--pull always \
--network host \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v $HOME/.llama:/root/.llama \
# NOTE: mount the llama-stack / llama-model directories if testing local changes else not needed
-v /home/hjshah/git/llama-stack:/app/llama-stack-source -v /home/hjshah/git/llama-models:/app/llama-models-source \
# localhost/distribution-dell:dev if building / testing locally
llamastack/distribution-dell\
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env DEH_URL=$DEH_URL \
--env CHROMA_URL=$CHROMA_URL
```
#### With Safety/Shield APIs
If you are using Llama Stack Safety / Shield APIs:
```bash
# You need a local checkout of llama-stack to run this, get it using
# git clone https://github.com/meta-llama/llama-stack.git
cd /path/to/llama-stack
export SAFETY_INFERENCE_PORT=8282
export DEH_SAFETY_URL=http://0.0.0.0:$SAFETY_INFERENCE_PORT
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
docker run \
-it \
--pull always \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v $HOME/.llama:/root/.llama \
-v ./llama_stack/distributions/tgi/run-with-safety.yaml:/root/my-run.yaml \
llamastack/distribution-dell \
--config /root/my-run.yaml \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env DEH_URL=$DEH_URL \
--env SAFETY_MODEL=$SAFETY_MODEL \
--env DEH_SAFETY_URL=$DEH_SAFETY_URL \
--env CHROMA_URL=$CHROMA_URL
```
### Via venv
Make sure you have done `pip install llama-stack` and have the Llama Stack CLI available.
#### Basic Setup
```bash
llama stack build --distro dell --image-type venv
llama stack run dell
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env DEH_URL=$DEH_URL \
--env CHROMA_URL=$CHROMA_URL
```
#### With Safety/Shield APIs
If you are using Llama Stack Safety / Shield APIs:
```bash
llama stack run ./run-with-safety.yaml \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env DEH_URL=$DEH_URL \
--env SAFETY_MODEL=$SAFETY_MODEL \
--env DEH_SAFETY_URL=$DEH_SAFETY_URL \
--env CHROMA_URL=$CHROMA_URL
```
## Prerequisites
- **Hardware**: GPU access required for TGI containers
- **Software**: Docker with GPU support, Podman (optional)
- **Models**: Access to Hugging Face models and Dell Enterprise Hub containers
- **Storage**: Sufficient disk space for model caches and ChromaDB indices
## Use Cases
The Dell distribution is ideal for:
- **Enterprise deployments** with Dell infrastructure
- **Custom TGI containers** via Dell Enterprise Hub
- **Local GPU inference** with high performance requirements
- **Vector database integration** with ChromaDB
- **Safety-enabled** applications with Llama Guard
## Related Guides
- **[Dell-TGI Distribution](./dell-tgi)** - Dell's TGI-specific distribution
- **[Configuration Reference](../configuration)** - Understanding configuration options
- **[Building Custom Distributions](../building-distro)** - Create your own distribution

View file

@ -0,0 +1,154 @@
---
title: Meta Reference GPU Distribution
description: High-performance GPU inference distribution using Meta's reference implementation
sidebar_label: Meta Reference GPU
sidebar_position: 3
---
# Meta Reference GPU Distribution
The `llamastack/distribution-meta-reference-gpu` distribution consists of the following provider configurations:
## Provider Configuration
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `inline::meta-reference` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
:::warning[GPU Requirements]
You need access to NVIDIA GPUs to run this distribution. This distribution is not compatible with CPU-only machines or machines with AMD GPUs.
:::
## Environment Variables
The following environment variables can be configured:
- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
- `INFERENCE_MODEL`: Inference model loaded into the Meta Reference server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `INFERENCE_CHECKPOINT_DIR`: Directory containing the Meta Reference model checkpoint (default: `null`)
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
- `SAFETY_CHECKPOINT_DIR`: Directory containing the Llama-Guard model checkpoint (default: `null`)
## Prerequisites
### Downloading Models
Please use `llama model list --downloaded` to check that you have llama model checkpoints downloaded in `~/.llama` before proceeding. See [installation guide](/docs/references/llama-cli-reference/download-models) to download the models. Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
```bash
$ llama model list --downloaded
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ Model ┃ Size ┃ Modified Time ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ Llama3.2-1B-Instruct:int4-qlora-eo8 │ 1.53 GB │ 2025-02-26 11:22:28 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.2-1B │ 2.31 GB │ 2025-02-18 21:48:52 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Prompt-Guard-86M │ 0.02 GB │ 2025-02-26 11:29:28 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.2-3B-Instruct:int4-spinquant-eo8 │ 3.69 GB │ 2025-02-26 11:37:41 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.2-3B │ 5.99 GB │ 2025-02-18 21:51:26 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.1-8B │ 14.97 GB │ 2025-02-16 10:36:37 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.2-1B-Instruct:int4-spinquant-eo8 │ 1.51 GB │ 2025-02-26 11:35:02 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama-Guard-3-1B │ 2.80 GB │ 2025-02-26 11:20:46 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama-Guard-3-1B:int4 │ 0.43 GB │ 2025-02-26 11:33:33 │
└─────────────────────────────────────────┴──────────┴─────────────────────┘
```
## Running the Distribution
You can do this via venv or Docker which has a pre-built image.
### Via Docker
This method allows you to get started quickly without having to build the distribution code.
#### Basic Setup
```bash
LLAMA_STACK_PORT=8321
docker run \
-it \
--pull always \
--gpu all \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
llamastack/distribution-meta-reference-gpu \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
```
#### With Safety/Shield APIs
If you are using Llama Stack Safety / Shield APIs, use:
```bash
docker run \
-it \
--pull always \
--gpu all \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
llamastack/distribution-meta-reference-gpu \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
--env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
```
### Via venv
Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.
#### Basic Setup
```bash
llama stack build --distro meta-reference-gpu --image-type venv
llama stack run distributions/meta-reference-gpu/run.yaml \
--port 8321 \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
```
#### With Safety/Shield APIs
If you are using Llama Stack Safety / Shield APIs, use:
```bash
llama stack run distributions/meta-reference-gpu/run-with-safety.yaml \
--port 8321 \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
--env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
```
## Use Cases
The Meta Reference GPU distribution is ideal for:
- **High-performance inference**: Maximum performance with GPU acceleration
- **Local development**: Full control over models and configurations
- **Research and experimentation**: Access to Meta's reference implementations
- **Production deployments**: When you need local GPU inference without external dependencies
## Performance Considerations
- **Memory Requirements**: Ensure sufficient GPU memory for your chosen models
- **Model Size**: Larger models require more GPU memory and provide better performance
- **Batch Processing**: Optimize batch sizes for your specific GPU configuration
## Related Guides
- **[Available Distributions](../list-of-distributions)** - Compare with other distributions
- **[Configuration Reference](../configuration)** - Understanding configuration options
- **[Building Custom Distributions](../building-distro)** - Create your own distribution

View file

@ -0,0 +1,197 @@
---
title: NVIDIA Distribution
description: Use NVIDIA NIM for running LLM inference, evaluation and safety with NeMo Microservices
sidebar_label: NVIDIA
sidebar_position: 4
---
# NVIDIA Distribution
The `llamastack/distribution-nvidia` distribution consists of the following provider configurations.
## Provider Configuration
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `inline::localfs`, `remote::nvidia` |
| eval | `remote::nvidia` |
| inference | `remote::nvidia` |
| post_training | `remote::nvidia` |
| safety | `remote::nvidia` |
| scoring | `inline::basic` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `inline::rag-runtime` |
| vector_io | `inline::faiss` |
## Environment Variables
The following environment variables can be configured:
- `NVIDIA_API_KEY`: NVIDIA API Key (default: ``)
- `NVIDIA_APPEND_API_VERSION`: Whether to append the API version to the base_url (default: `True`)
- `NVIDIA_DATASET_NAMESPACE`: NVIDIA Dataset Namespace (default: `default`)
- `NVIDIA_PROJECT_ID`: NVIDIA Project ID (default: `test-project`)
- `NVIDIA_CUSTOMIZER_URL`: NVIDIA Customizer URL (default: `https://customizer.api.nvidia.com`)
- `NVIDIA_OUTPUT_MODEL_DIR`: NVIDIA Output Model Directory (default: `test-example-model@v1`)
- `GUARDRAILS_SERVICE_URL`: URL for the NeMo Guardrails Service (default: `http://0.0.0.0:7331`)
- `NVIDIA_GUARDRAILS_CONFIG_ID`: NVIDIA Guardrail Configuration ID (default: `self-check`)
- `NVIDIA_EVALUATOR_URL`: URL for the NeMo Evaluator Service (default: `http://0.0.0.0:7331`)
- `INFERENCE_MODEL`: Inference model (default: `Llama3.1-8B-Instruct`)
- `SAFETY_MODEL`: Name of the model to use for safety (default: `meta/llama-3.1-8b-instruct`)
## Available Models
The following models are available by default:
- `meta/llama3-8b-instruct`
- `meta/llama3-70b-instruct`
- `meta/llama-3.1-8b-instruct`
- `meta/llama-3.1-70b-instruct`
- `meta/llama-3.1-405b-instruct`
- `meta/llama-3.2-1b-instruct`
- `meta/llama-3.2-3b-instruct`
- `meta/llama-3.2-11b-vision-instruct`
- `meta/llama-3.2-90b-vision-instruct`
- `meta/llama-3.3-70b-instruct`
- `nvidia/vila`
- `nvidia/llama-3.2-nv-embedqa-1b-v2`
- `nvidia/nv-embedqa-e5-v5`
- `nvidia/nv-embedqa-mistral-7b-v2`
- `snowflake/arctic-embed-l`
## Prerequisites
### NVIDIA API Keys
Make sure you have access to a NVIDIA API Key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). Use this key for the `NVIDIA_API_KEY` environment variable.
### Deploy NeMo Microservices Platform
The NVIDIA NeMo microservices platform supports end-to-end microservice deployment of a complete AI flywheel on your Kubernetes cluster through the NeMo Microservices Helm Chart. Please reference the [NVIDIA NeMo Microservices documentation](https://docs.nvidia.com/nemo/microservices/latest/about/index.html) for platform prerequisites and instructions to install and deploy the platform.
## Supported Services
Each Llama Stack API corresponds to a specific NeMo microservice. The core microservices (Customizer, Evaluator, Guardrails) are exposed by the same endpoint. The platform components (Data Store) are each exposed by separate endpoints.
### Inference: NVIDIA NIM
NVIDIA NIM is used for running inference with registered models. There are two ways to access NVIDIA NIMs:
1. **Hosted (default)**: Preview APIs hosted at https://integrate.api.nvidia.com (Requires an API key)
2. **Self-hosted**: NVIDIA NIMs that run on your own infrastructure.
The deployed platform includes the NIM Proxy microservice, which is the service that provides to access your NIMs (for example, to run inference on a model). Set the `NVIDIA_BASE_URL` environment variable to use your NVIDIA NIM Proxy deployment.
### Datasetio API: NeMo Data Store
The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposes APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use the client to interact with Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint.
### Eval API: NeMo Evaluator
The NeMo Evaluator microservice supports evaluation of LLMs. Launching an Evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint.
### Post-Training API: NeMo Customizer
The NeMo Customizer microservice supports fine-tuning models. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint.
### Safety API: NeMo Guardrails
The NeMo Guardrails microservice sits between your application and the LLM, and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint.
## Deploying Models
In order to use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`.
:::note[Improved Performance]
For improved inference speeds, we need to use NIM with `fast_outlines` guided decoding system (specified in the request body). This is the default if you deployed the platform with the NeMo Microservices Helm Chart.
:::
```bash
# URL to NeMo NIM Proxy service
export NEMO_URL="http://nemo.test"
curl --location "$NEMO_URL/v1/deployment/model-deployments" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"name": "llama-3.2-1b-instruct",
"namespace": "meta",
"config": {
"model": "meta/llama-3.2-1b-instruct",
"nim_deployment": {
"image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct",
"image_tag": "1.8.3",
"pvc_size": "25Gi",
"gpu": 1,
"additional_envs": {
"NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
}
}
}
}'
```
This NIM deployment should take approximately 10 minutes to go live. [See the docs](https://docs.nvidia.com/nemo/microservices/latest/get-started/tutorials/deploy-nims.html) for more information on how to deploy a NIM and verify it's available for inference.
You can also remove a deployed NIM to free up GPU resources, if needed:
```bash
export NEMO_URL="http://nemo.test"
curl -X DELETE "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct"
```
## Running Llama Stack with NVIDIA
You can do this via venv (build code), or Docker which has a pre-built image.
### Via Docker
This method allows you to get started quickly without having to build the distribution code.
```bash
LLAMA_STACK_PORT=8321
docker run \
-it \
--pull always \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ./run.yaml:/root/my-run.yaml \
llamastack/distribution-nvidia \
--config /root/my-run.yaml \
--port $LLAMA_STACK_PORT \
--env NVIDIA_API_KEY=$NVIDIA_API_KEY
```
### Via venv
If you've set up your local development environment, you can also build the image using your local virtual environment.
```bash
INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
llama stack build --distro nvidia --image-type venv
llama stack run ./run.yaml \
--port 8321 \
--env NVIDIA_API_KEY=$NVIDIA_API_KEY \
--env INFERENCE_MODEL=$INFERENCE_MODEL
```
## Example Notebooks
For examples of how to use the NVIDIA Distribution to run inference, fine-tune, evaluate, and run safety checks on your LLMs, you can reference the example notebooks in the [NVIDIA notebooks directory](/docs/notebooks/nvidia/).
## Use Cases
The NVIDIA distribution is ideal for:
- **Enterprise deployments** with NVIDIA infrastructure
- **End-to-end ML workflows** from training to deployment
- **High-performance inference** with NVIDIA NIMs
- **Advanced safety** with NeMo Guardrails
- **Model customization** and fine-tuning
## Related Guides
- **[Available Distributions](../list-of-distributions)** - Compare with other distributions
- **[Configuration Reference](../configuration)** - Understanding configuration options
- **[Building Custom Distributions](../building-distro)** - Create your own distribution

View file

@ -0,0 +1,62 @@
---
title: Passthrough Distribution
description: Self-hosted distribution using Passthrough hosted llama-stack endpoint for LLM inference
sidebar_label: Passthrough
sidebar_position: 5
---
# Passthrough Distribution
The `llamastack/distribution-passthrough` distribution consists of the following provider configurations.
## Provider Configuration
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `remote::passthrough`, `inline::sentence-transformers` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `remote::wolfram-alpha`, `inline::rag-runtime`, `remote::model-context-protocol` |
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
## Getting Started
### Installation
```bash
docker pull llamastack/distribution-passthrough
```
### Environment Variables
The following environment variables can be configured:
- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
- `PASSTHROUGH_API_KEY`: Passthrough API Key (default: ``)
- `PASSTHROUGH_URL`: Passthrough URL (default: ``)
### Available Models
The following models are available by default:
- `llama3.1-8b-instruct`
- `llama3.2-11b-vision-instruct`
## Use Cases
The Passthrough distribution is ideal for:
- Using hosted Llama Stack endpoints
- Reducing local infrastructure requirements
- Quick prototyping and development
- Accessing pre-configured model endpoints
## Related Guides
- **[Building Custom Distributions](../building-distro)** - Create your own distribution
- **[Configuration Reference](../configuration)** - Understanding configuration options
- **[Starting Llama Stack Server](../starting-llama-stack-server)** - How to run distributions

View file

@ -0,0 +1,236 @@
---
title: Starter Distribution
description: Comprehensive multi-provider distribution for quick prototyping and experimentation
sidebar_label: Starter
sidebar_position: 6
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Starter Distribution
The `llamastack/distribution-starter` distribution is a comprehensive, multi-provider distribution that includes most of the available inference providers in Llama Stack. It's designed to be a one-stop solution for developers who want to experiment with different AI providers without having to configure each one individually.
## Provider Composition
The starter distribution consists of the following provider configurations:
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| files | `inline::localfs` |
| inference | `remote::openai`, `remote::fireworks`, `remote::together`, `remote::ollama`, `remote::anthropic`, `remote::gemini`, `remote::groq`, `remote::sambanova`, `remote::vllm`, `remote::tgi`, `remote::cerebras`, `remote::llama-openai-compat`, `remote::nvidia`, `remote::hf::serverless`, `remote::hf::endpoint`, `inline::sentence-transformers` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
| vector_io | `inline::faiss`, `inline::sqlite-vec`, `inline::milvus`, `remote::chromadb`, `remote::pgvector` |
## Inference Providers
The starter distribution includes a comprehensive set of inference providers:
### Hosted Providers
- **[OpenAI](https://openai.com/api/)**: GPT-4, GPT-3.5, O1, O3, O4 models and text embeddings (provider ID: `openai`)
- **[Fireworks](https://fireworks.ai/)**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models and embeddings (provider ID: `fireworks`)
- **[Together](https://together.ai/)**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models and embeddings (provider ID: `together`)
- **[Anthropic](https://www.anthropic.com/)**: Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude 3.5 Haiku, and Voyage embeddings (provider ID: `anthropic`)
- **[Gemini](https://gemini.google.com/)**: Gemini 1.5, 2.0, 2.5 models and text embeddings (provider ID: `gemini`)
- **[Groq](https://groq.com/)**: Fast Llama models (3.1, 3.2, 3.3, 4 Scout, 4 Maverick) (provider ID: `groq`)
- **[SambaNova](https://www.sambanova.ai/)**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models (provider ID: `sambanova`)
- **[Cerebras](https://www.cerebras.ai/)**: Cerebras AI models (provider ID: `cerebras`)
- **[NVIDIA](https://www.nvidia.com/)**: NVIDIA NIM (provider ID: `nvidia`)
- **[HuggingFace](https://huggingface.co/)**: Serverless and endpoint models (provider ID: `hf::serverless` and `hf::endpoint`)
- **[Bedrock](https://aws.amazon.com/bedrock/)**: AWS Bedrock models (provider ID: `bedrock`)
### Local/Remote Providers
- **[Ollama](https://ollama.ai/)**: Local Ollama models (provider ID: `ollama`)
- **[vLLM](https://docs.vllm.ai/en/latest/)**: Local or remote vLLM server (provider ID: `vllm`)
- **[TGI](https://github.com/huggingface/text-generation-inference)**: Text Generation Inference server (provider ID: `tgi`)
- **[Sentence Transformers](https://www.sbert.net/)**: Local embedding models (provider ID: `sentence-transformers`)
:::info
All providers are disabled by default. You need to enable them by setting the appropriate environment variables.
:::
## Vector IO
The starter distribution includes a comprehensive set of vector IO providers:
- **[FAISS](https://github.com/facebookresearch/faiss)**: Local FAISS vector store - enabled by default (provider ID: `faiss`)
- **[SQLite](https://www.sqlite.org/index.html)**: Local SQLite vector store - disabled by default (provider ID: `sqlite-vec`)
- **[ChromaDB](https://www.trychroma.com/)**: Remote ChromaDB vector store - disabled by default (provider ID: `chromadb`)
- **[PGVector](https://github.com/pgvector/pgvector)**: PostgreSQL vector store - disabled by default (provider ID: `pgvector`)
- **[Milvus](https://milvus.io/)**: Milvus vector store - disabled by default (provider ID: `milvus`)
## Environment Variables
### Server Configuration
- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
### API Keys for Hosted Providers
- `OPENAI_API_KEY`: OpenAI API key
- `FIREWORKS_API_KEY`: Fireworks API key
- `TOGETHER_API_KEY`: Together API key
- `ANTHROPIC_API_KEY`: Anthropic API key
- `GEMINI_API_KEY`: Google Gemini API key
- `GROQ_API_KEY`: Groq API key
- `SAMBANOVA_API_KEY`: SambaNova API key
- `CEREBRAS_API_KEY`: Cerebras API key
- `LLAMA_API_KEY`: Llama API key
- `NVIDIA_API_KEY`: NVIDIA API key
- `HF_API_TOKEN`: HuggingFace API token
### Local Provider Configuration
- `OLLAMA_URL`: Ollama server URL (default: `http://localhost:11434`)
- `VLLM_URL`: vLLM server URL (default: `http://localhost:8000/v1`)
- `VLLM_MAX_TOKENS`: vLLM max tokens (default: `4096`)
- `VLLM_API_TOKEN`: vLLM API token (default: `fake`)
- `VLLM_TLS_VERIFY`: vLLM TLS verification (default: `true`)
- `TGI_URL`: TGI server URL
### Model Configuration
- `INFERENCE_MODEL`: HuggingFace model for serverless inference
- `INFERENCE_ENDPOINT_NAME`: HuggingFace endpoint name
### Vector Database Configuration
- `SQLITE_STORE_DIR`: SQLite store directory (default: `~/.llama/distributions/starter`)
- `ENABLE_SQLITE_VEC`: Enable SQLite vector provider
- `ENABLE_CHROMADB`: Enable ChromaDB provider
- `ENABLE_PGVECTOR`: Enable PGVector provider
- `CHROMADB_URL`: ChromaDB server URL
- `PGVECTOR_HOST`: PGVector host (default: `localhost`)
- `PGVECTOR_PORT`: PGVector port (default: `5432`)
- `PGVECTOR_DB`: PGVector database name
- `PGVECTOR_USER`: PGVector username
- `PGVECTOR_PASSWORD`: PGVector password
### Tool Configuration
- `BRAVE_SEARCH_API_KEY`: Brave Search API key
- `TAVILY_SEARCH_API_KEY`: Tavily Search API key
### Telemetry Configuration
- `OTEL_SERVICE_NAME`: OpenTelemetry service name
- `TELEMETRY_SINKS`: Telemetry sinks (default: `console,sqlite`)
## Enabling Providers
You can enable specific providers by setting appropriate environment variables. For example:
```bash
# self-hosted
export OLLAMA_URL=http://localhost:11434 # enables the Ollama inference provider
export VLLM_URL=http://localhost:8000/v1 # enables the vLLM inference provider
export TGI_URL=http://localhost:8000/v1 # enables the TGI inference provider
# cloud-hosted requiring API key configuration on the server
export CEREBRAS_API_KEY=your_cerebras_api_key # enables the Cerebras inference provider
export NVIDIA_API_KEY=your_nvidia_api_key # enables the NVIDIA inference provider
# vector providers
export MILVUS_URL=http://localhost:19530 # enables the Milvus vector provider
export CHROMADB_URL=http://localhost:8000/v1 # enables the ChromaDB vector provider
export PGVECTOR_DB=llama_stack_db # enables the PGVector vector provider
```
This distribution comes with a default "llama-guard" shield that can be enabled by setting the `SAFETY_MODEL` environment variable to point to an appropriate Llama Guard model id. Use `llama-stack-client models list` to see the list of available models.
## Running the Distribution
<Tabs>
<TabItem value="docker" label="Via Docker">
This method allows you to get started quickly without having to build the distribution code.
```bash
LLAMA_STACK_PORT=8321
docker run \
-it \
--pull always \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-e OPENAI_API_KEY=your_openai_key \
-e FIREWORKS_API_KEY=your_fireworks_key \
-e TOGETHER_API_KEY=your_together_key \
llamastack/distribution-starter \
--port $LLAMA_STACK_PORT
```
</TabItem>
<TabItem value="venv" label="Via venv">
Ensure you have configured the starter distribution using the environment variables explained above.
```bash
uv run --with llama-stack llama stack build --distro starter --image-type venv --run
```
</TabItem>
</Tabs>
## Example Usage
Once the distribution is running, you can use any of the available models. Here are some examples:
### Using OpenAI Models
```bash
llama-stack-client --endpoint http://localhost:8321 \
inference chat-completion \
--model-id openai/gpt-4o \
--message "Hello, how are you?"
```
### Using Fireworks Models
```bash
llama-stack-client --endpoint http://localhost:8321 \
inference chat-completion \
--model-id fireworks/meta-llama/Llama-3.2-3B-Instruct \
--message "Write a short story about a robot."
```
### Using Local Ollama Models
```bash
# First, make sure Ollama is running and you have a model
ollama run llama3.2:3b
# Then use it through Llama Stack
export OLLAMA_INFERENCE_MODEL=llama3.2:3b
llama-stack-client --endpoint http://localhost:8321 \
inference chat-completion \
--model-id ollama/llama3.2:3b \
--message "Explain quantum computing in simple terms."
```
## Storage
The starter distribution uses SQLite for local storage of various components:
- **Metadata store**: `~/.llama/distributions/starter/registry.db`
- **Inference store**: `~/.llama/distributions/starter/inference_store.db`
- **FAISS store**: `~/.llama/distributions/starter/faiss_store.db`
- **SQLite vector store**: `~/.llama/distributions/starter/sqlite_vec.db`
- **Files metadata**: `~/.llama/distributions/starter/files_metadata.db`
- **Agents store**: `~/.llama/distributions/starter/agents_store.db`
- **Responses store**: `~/.llama/distributions/starter/responses_store.db`
- **Trace store**: `~/.llama/distributions/starter/trace_store.db`
- **Evaluation store**: `~/.llama/distributions/starter/meta_reference_eval.db`
- **Dataset I/O stores**: Various HuggingFace and local filesystem stores
## Benefits of the Starter Distribution
1. **Comprehensive Coverage**: Includes most popular AI providers in one distribution
2. **Flexible Configuration**: Easy to enable/disable providers based on your needs
3. **No Local GPU Required**: Most providers are cloud-based, making it accessible to developers without high-end hardware
4. **Easy Migration**: Start with hosted providers and gradually move to local ones as needed
5. **Production Ready**: Includes safety, evaluation, and telemetry components
6. **Tool Integration**: Comes with web search, RAG, and model context protocol tools
The starter distribution is ideal for developers who want to experiment with different AI providers, build prototypes quickly, or create applications that can work with multiple AI backends.
## Related Guides
- **[Available Distributions](../list-of-distributions)** - Compare with other distributions
- **[Configuration Reference](../configuration)** - Understanding configuration options
- **[Building Custom Distributions](../building-distro)** - Create your own distribution

View file

@ -0,0 +1,75 @@
---
title: Starting a Llama Stack Server
description: Different ways to run Llama Stack servers - as library, container, or Kubernetes deployment
sidebar_label: Starting Llama Stack Server
sidebar_position: 7
---
# Starting a Llama Stack Server
You can run a Llama Stack server in one of the following ways:
## As a Library
This is the simplest way to get started. Using Llama Stack as a library means you do not need to start a server. This is especially useful when you are not running inference locally and relying on an external inference service (e.g. fireworks, together, groq, etc.)
**See:** [Using Llama Stack as a Library](./importing-as-library)
## Container
Another simple way to start interacting with Llama Stack is to just spin up a container (via Docker or Podman) which is pre-built with all the providers you need. We provide a number of pre-built images so you can start a Llama Stack server instantly. You can also build your own custom container. Which distribution to choose depends on the hardware you have.
**See:** [Available Distributions](./list-of-distributions) for more details on selecting the right distribution.
## Kubernetes
If you have built a container image and want to deploy it in a Kubernetes cluster instead of starting the Llama Stack server locally.
**See:** [Kubernetes Deployment Guide](/docs/deploying/kubernetes-deployment) for more details.
## Which Method to Choose?
<table>
<thead>
<tr>
<th>Method</th>
<th>Best For</th>
<th>Complexity</th>
<th>Use Cases</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Library</strong></td>
<td>Development & External Services</td>
<td>Low</td>
<td>Prototyping, using remote inference providers</td>
</tr>
<tr>
<td><strong>Container</strong></td>
<td>Local Development & Production</td>
<td>Medium</td>
<td>Consistent environments, local inference</td>
</tr>
<tr>
<td><strong>Kubernetes</strong></td>
<td>Production & Scale</td>
<td>High</td>
<td>Production deployments, high availability</td>
</tr>
</tbody>
</table>
## Getting Started
1. **Choose your deployment method** based on your requirements
2. **Select a distribution** that matches your hardware and needs
3. **Configure your environment** with the appropriate settings
4. **Start your stack** and begin building with Llama Stack APIs
## Related Guides
- **[Available Distributions](./list-of-distributions)** - Choose the right distribution
- **[Building Custom Distributions](./building-distro)** - Create your own distribution
- **[Configuration Reference](./configuration)** - Understanding configuration options
- **[Customizing run.yaml](./customizing-run-yaml)** - Adapt configurations to your environment

View file

@ -0,0 +1,547 @@
---
title: Detailed Tutorial
description: Complete guide to using Llama Stack server and client SDK to build AI agents
sidebar_label: Detailed Tutorial
sidebar_position: 3
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
In this guide, we'll walk through how you can use the Llama Stack (server and client SDK) to test a simple agent.
A Llama Stack agent is a simple integrated system that can perform tasks by combining a Llama model for reasoning with
tools (e.g., RAG, web search, code execution, etc.) for taking actions.
In Llama Stack, we provide a server exposing multiple APIs. These APIs are backed by implementations from different providers.
Llama Stack is a stateful service with REST APIs to support seamless transition of AI applications across different environments. The server can be run in a variety of ways, including as a standalone binary, Docker container, or hosted service. You can build and test using a local server first and deploy to a hosted endpoint for production.
In this guide, we'll walk through how to build a RAG agent locally using Llama Stack with [Ollama](https://ollama.com/)
as the inference [provider](/docs/providers#inference) for a Llama Model.
### Step 1: Installation and Setup
Install Ollama by following the instructions on the [Ollama website](https://ollama.com/download), then
download Llama 3.2 3B model, and then start the Ollama service.
```bash
ollama pull llama3.2:3b
ollama run llama3.2:3b --keepalive 60m
```
Install [uv](https://docs.astral.sh/uv/) to setup your virtual environment
<Tabs>
<TabItem value="macos" label="macOS and Linux">
Use `curl` to download the script and execute it with `sh`:
```console
curl -LsSf https://astral.sh/uv/install.sh | sh
```
</TabItem>
<TabItem value="windows" label="Windows">
Use `irm` to download the script and execute it with `iex`:
```console
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```
</TabItem>
</Tabs>
Setup your virtual environment.
```bash
uv sync --python 3.12
source .venv/bin/activate
```
### Step 2: Run Llama Stack
Llama Stack is a server that exposes multiple APIs, you connect with it using the Llama Stack client SDK.
<Tabs>
<TabItem value="venv1" label="Using venv">
You can use Python to build and run the Llama Stack server, which is useful for testing and development.
Llama Stack uses a [YAML configuration file](/docs/distributions/configuration) to specify the stack setup,
which defines the providers and their settings. The generated configuration serves as a starting point that you can [customize for your specific needs](/docs/distributions/customizing-run-yaml).
Now let's build and run the Llama Stack config for Ollama.
We use `starter` as template. By default all providers are disabled, this requires enable ollama by passing environment variables.
```bash
llama stack build --distro starter --image-type venv --run
```
</TabItem>
<TabItem value="container" label="Using a Container">
You can use a container image to run the Llama Stack server. We provide several container images for the server
component that works with different inference providers out of the box. For this guide, we will use
`llamastack/distribution-starter` as the container image. If you'd like to build your own image or customize the
configurations, please check out [this guide](/docs/distributions/building-distro).
First lets setup some environment variables and create a local directory to mount into the container's file system.
```bash
export LLAMA_STACK_PORT=8321
mkdir -p ~/.llama
```
Then start the server using the container tool of your choice. For example, if you are running Docker you can use the
following command:
```bash
docker run -it \
--pull always \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
llamastack/distribution-starter \
--port $LLAMA_STACK_PORT \
--env OLLAMA_URL=http://host.docker.internal:11434
```
Note to start the container with Podman, you can do the same but replace `docker` at the start of the command with
`podman`. If you are using `podman` older than `4.7.0`, please also replace `host.docker.internal` in the `OLLAMA_URL`
with `host.containers.internal`.
The configuration YAML for the Ollama distribution is available at `distributions/ollama/run.yaml`.
:::tip
Docker containers run in their own isolated network namespaces on Linux. To allow the container to communicate with services running on the host via `localhost`, you need `--network=host`. This makes the container use the host's network directly so it can connect to Ollama running on `localhost:11434`.
Linux users having issues running the above command should instead try the following:
```bash
docker run -it \
--pull always \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
--network=host \
llamastack/distribution-starter \
--port $LLAMA_STACK_PORT \
--env OLLAMA_URL=http://localhost:11434
```
:::
</TabItem>
</Tabs>
You will see output like below:
```
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
```
Now you can use the Llama Stack client to run inference and build agents!
You can reuse the server setup or use the [Llama Stack Client](https://github.com/meta-llama/llama-stack-client-python/).
Note that the client package is already included in the `llama-stack` package.
### Step 3: Run Client CLI
Open a new terminal and navigate to the same directory you started the server from. Then set up a new or activate your
existing server virtual environment.
<Tabs>
<TabItem value="reuse" label="Reuse Server venv">
```bash
# The client is included in the llama-stack package so we just activate the server venv
source .venv/bin/activate
```
</TabItem>
<TabItem value="install" label="Install with venv">
```bash
uv venv client --python 3.12
source client/bin/activate
pip install llama-stack-client
```
</TabItem>
</Tabs>
Now let's use the `llama-stack-client` [CLI](/docs/references/llama-stack-client-cli-reference) to check the
connectivity to the server.
```bash
llama-stack-client configure --endpoint http://localhost:8321 --api-key none
```
You will see the below:
```
Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:8321
```
List the models
```bash
llama-stack-client models list
Available Models
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃ model_type ┃ identifier ┃ provider_resource_id ┃ metadata ┃ provider_id ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ embedding │ ollama/all-minilm:l6-v2 │ all-minilm:l6-v2 │ {'embedding_dimension': 384.0} │ ollama │
├─────────────────┼─────────────────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────────┼───────────────────────┤
│ ... │ ... │ ... │ │ ... │
├─────────────────┼─────────────────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────────┼───────────────────────┤
│ llm │ ollama/Llama-3.2:3b │ llama3.2:3b │ │ ollama │
└─────────────────┴─────────────────────────────────────┴─────────────────────────────────────┴───────────────────────────────────────────┴───────────────────────┘
```
You can test basic Llama inference completion using the CLI.
```bash
llama-stack-client inference chat-completion --model-id "ollama/llama3.2:3b" --message "tell me a joke"
```
Sample output:
```python
OpenAIChatCompletion(
id="chatcmpl-08d7b2be-40f3-47ed-8f16-a6f29f2436af",
choices=[
OpenAIChatCompletionChoice(
finish_reason="stop",
index=0,
message=OpenAIChatCompletionChoiceMessageOpenAIAssistantMessageParam(
role="assistant",
content="Why couldn't the bicycle stand up by itself?\n\nBecause it was two-tired.",
name=None,
tool_calls=None,
refusal=None,
annotations=None,
audio=None,
function_call=None,
),
logprobs=None,
)
],
created=1751725254,
model="llama3.2:3b",
object="chat.completion",
service_tier=None,
system_fingerprint="fp_ollama",
usage={
"completion_tokens": 18,
"prompt_tokens": 29,
"total_tokens": 47,
"completion_tokens_details": None,
"prompt_tokens_details": None,
},
)
```
### Step 4: Run the Demos
Note that these demos show the [Python Client SDK](/docs/references/python-sdk-reference).
Other SDKs are also available, please refer to the [Client SDK](/docs/getting-started#client-sdks) list for the complete options.
<Tabs>
<TabItem value="basic" label="Basic Inference">
Now you can run inference using the Llama Stack client SDK.
#### i. Create the Script
Create a file `inference.py` and add the following code:
```python
from llama_stack_client import LlamaStackClient
client = LlamaStackClient(base_url="http://localhost:8321")
# List available models
models = client.models.list()
# Select the first LLM
llm = next(m for m in models if m.model_type == "llm" and m.provider_id == "ollama")
model_id = llm.identifier
print("Model:", model_id)
response = client.chat.completions.create(
model=model_id,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a haiku about coding"},
],
)
print(response)
```
#### ii. Run the Script
Let's run the script using `uv`
```bash
uv run python inference.py
```
Which will output:
```
Model: ollama/llama3.2:3b
OpenAIChatCompletion(id='chatcmpl-30cd0f28-a2ad-4b6d-934b-13707fc60ebf', choices=[OpenAIChatCompletionChoice(finish_reason='stop', index=0, message=OpenAIChatCompletionChoiceMessageOpenAIAssistantMessageParam(role='assistant', content="Lines of code unfold\nAlgorithms dance with ease\nLogic's gentle kiss", name=None, tool_calls=None, refusal=None, annotations=None, audio=None, function_call=None), logprobs=None)], created=1751732480, model='llama3.2:3b', object='chat.completion', service_tier=None, system_fingerprint='fp_ollama', usage={'completion_tokens': 16, 'prompt_tokens': 37, 'total_tokens': 53, 'completion_tokens_details': None, 'prompt_tokens_details': None})
```
</TabItem>
<TabItem value="simple" label="Build a Simple Agent">
Next we can move beyond simple inference and build an agent that can perform tasks using the Llama Stack server.
#### i. Create the Script
Create a file `agent.py` and add the following code:
```python
from llama_stack_client import LlamaStackClient
from llama_stack_client import Agent, AgentEventLogger
from rich.pretty import pprint
import uuid
client = LlamaStackClient(base_url=f"http://localhost:8321")
models = client.models.list()
llm = next(m for m in models if m.model_type == "llm" and m.provider_id == "ollama")
model_id = llm.identifier
agent = Agent(client, model=model_id, instructions="You are a helpful assistant.")
s_id = agent.create_session(session_name=f"s{uuid.uuid4().hex}")
print("Non-streaming ...")
response = agent.create_turn(
messages=[{"role": "user", "content": "Who are you?"}],
session_id=s_id,
stream=False,
)
print("agent>", response.output_message.content)
print("Streaming ...")
stream = agent.create_turn(
messages=[{"role": "user", "content": "Who are you?"}], session_id=s_id, stream=True
)
for event in stream:
pprint(event)
print("Streaming with print helper...")
stream = agent.create_turn(
messages=[{"role": "user", "content": "Who are you?"}], session_id=s_id, stream=True
)
for event in AgentEventLogger().log(stream):
event.print()
```
### ii. Run the Script
Let's run the script using `uv`
```bash
uv run python agent.py
```
<details>
<summary>👋 Click here to see the sample output</summary>
Non-streaming ...
agent> I'm an artificial intelligence designed to assist and communicate with users like you. I don't have a personal identity, but I can provide information, answer questions, and help with tasks to the best of my abilities.
I'm a large language model, which means I've been trained on a massive dataset of text from various sources, allowing me to understand and respond to a wide range of topics and questions. My purpose is to provide helpful and accurate information, and I'm constantly learning and improving my responses based on the interactions I have with users like you.
I can help with:
* Answering questions on various subjects
* Providing definitions and explanations
* Offering suggestions and ideas
* Assisting with language-related tasks, such as proofreading and editing
* Generating text and content
* And more!
Feel free to ask me anything, and I'll do my best to help!
Streaming ...
AgentTurnResponseStreamChunk(
│ event=TurnResponseEvent(
│ │ payload=AgentTurnResponseStepStartPayload(
│ │ │ event_type='step_start',
│ │ │ step_id='69831607-fa75-424a-949b-e2049e3129d1',
│ │ │ step_type='inference',
│ │ │ metadata={}
│ │ )
│ )
)
AgentTurnResponseStreamChunk(
│ event=TurnResponseEvent(
│ │ payload=AgentTurnResponseStepProgressPayload(
│ │ │ delta=TextDelta(text='As', type='text'),
│ │ │ event_type='step_progress',
│ │ │ step_id='69831607-fa75-424a-949b-e2049e3129d1',
│ │ │ step_type='inference'
│ │ )
│ )
)
AgentTurnResponseStreamChunk(
│ event=TurnResponseEvent(
│ │ payload=AgentTurnResponseStepProgressPayload(
│ │ │ delta=TextDelta(text=' a', type='text'),
│ │ │ event_type='step_progress',
│ │ │ step_id='69831607-fa75-424a-949b-e2049e3129d1',
│ │ │ step_type='inference'
│ │ )
│ )
)
...
AgentTurnResponseStreamChunk(
│ event=TurnResponseEvent(
│ │ payload=AgentTurnResponseStepCompletePayload(
│ │ │ event_type='step_complete',
│ │ │ step_details=InferenceStep(
│ │ │ │ api_model_response=CompletionMessage(
│ │ │ │ │ content='As a conversational AI, I don\'t have a personal identity in the classical sense. I exist as a program running on computer servers, designed to process and respond to text-based inputs.\n\nI\'m an instance of a type of artificial intelligence called a "language model," which is trained on vast amounts of text data to generate human-like responses. My primary function is to understand and respond to natural language inputs, like our conversation right now.\n\nThink of me as a virtual assistant, a chatbot, or a conversational interface I\'m here to provide information, answer questions, and engage in conversation to the best of my abilities. I don\'t have feelings, emotions, or consciousness like humans do, but I\'m designed to simulate human-like interactions to make our conversations feel more natural and helpful.\n\nSo, that\'s me in a nutshell! What can I help you with today?',
│ │ │ │ │ role='assistant',
│ │ │ │ │ stop_reason='end_of_turn',
│ │ │ │ │ tool_calls=[]
│ │ │ │ ),
│ │ │ │ step_id='69831607-fa75-424a-949b-e2049e3129d1',
│ │ │ │ step_type='inference',
│ │ │ │ turn_id='8b360202-f7cb-4786-baa9-166a1b46e2ca',
│ │ │ │ completed_at=datetime.datetime(2025, 4, 3, 1, 15, 21, 716174, tzinfo=TzInfo(UTC)),
│ │ │ │ started_at=datetime.datetime(2025, 4, 3, 1, 15, 14, 28823, tzinfo=TzInfo(UTC))
│ │ │ ),
│ │ │ step_id='69831607-fa75-424a-949b-e2049e3129d1',
│ │ │ step_type='inference'
│ │ )
│ )
)
AgentTurnResponseStreamChunk(
│ event=TurnResponseEvent(
│ │ payload=AgentTurnResponseTurnCompletePayload(
│ │ │ event_type='turn_complete',
│ │ │ turn=Turn(
│ │ │ │ input_messages=[UserMessage(content='Who are you?', role='user', context=None)],
│ │ │ │ output_message=CompletionMessage(
│ │ │ │ │ content='As a conversational AI, I don\'t have a personal identity in the classical sense. I exist as a program running on computer servers, designed to process and respond to text-based inputs.\n\nI\'m an instance of a type of artificial intelligence called a "language model," which is trained on vast amounts of text data to generate human-like responses. My primary function is to understand and respond to natural language inputs, like our conversation right now.\n\nThink of me as a virtual assistant, a chatbot, or a conversational interface I\'m here to provide information, answer questions, and engage in conversation to the best of my abilities. I don\'t have feelings, emotions, or consciousness like humans do, but I\'m designed to simulate human-like interactions to make our conversations feel more natural and helpful.\n\nSo, that\'s me in a nutshell! What can I help you with today?',
│ │ │ │ │ role='assistant',
│ │ │ │ │ stop_reason='end_of_turn',
│ │ │ │ │ tool_calls=[]
│ │ │ │ ),
│ │ │ │ session_id='abd4afea-4324-43f4-9513-cfe3970d92e8',
│ │ │ │ started_at=datetime.datetime(2025, 4, 3, 1, 15, 14, 28722, tzinfo=TzInfo(UTC)),
│ │ │ │ steps=[
│ │ │ │ │ InferenceStep(
│ │ │ │ │ │ api_model_response=CompletionMessage(
│ │ │ │ │ │ │ content='As a conversational AI, I don\'t have a personal identity in the classical sense. I exist as a program running on computer servers, designed to process and respond to text-based inputs.\n\nI\'m an instance of a type of artificial intelligence called a "language model," which is trained on vast amounts of text data to generate human-like responses. My primary function is to understand and respond to natural language inputs, like our conversation right now.\n\nThink of me as a virtual assistant, a chatbot, or a conversational interface I\'m here to provide information, answer questions, and engage in conversation to the best of my abilities. I don\'t have feelings, emotions, or consciousness like humans do, but I\'m designed to simulate human-like interactions to make our conversations feel more natural and helpful.\n\nSo, that\'s me in a nutshell! What can I help you with today?',
│ │ │ │ │ │ │ role='assistant',
│ │ │ │ │ │ │ stop_reason='end_of_turn',
│ │ │ │ │ │ │ tool_calls=[]
│ │ │ │ │ │ ),
│ │ │ │ │ │ step_id='69831607-fa75-424a-949b-e2049e3129d1',
│ │ │ │ │ │ step_type='inference',
│ │ │ │ │ │ turn_id='8b360202-f7cb-4786-baa9-166a1b46e2ca',
│ │ │ │ │ │ completed_at=datetime.datetime(2025, 4, 3, 1, 15, 21, 716174, tzinfo=TzInfo(UTC)),
│ │ │ │ │ │ started_at=datetime.datetime(2025, 4, 3, 1, 15, 14, 28823, tzinfo=TzInfo(UTC))
│ │ │ │ │ )
│ │ │ │ ],
│ │ │ │ turn_id='8b360202-f7cb-4786-baa9-166a1b46e2ca',
│ │ │ │ completed_at=datetime.datetime(2025, 4, 3, 1, 15, 21, 727364, tzinfo=TzInfo(UTC)),
│ │ │ │ output_attachments=[]
│ │ │ )
│ │ )
│ )
)
Streaming with print helper...
inference> Déjà vu! You're asking me again!
As I mentioned earlier, I'm a computer program designed to simulate conversation and answer questions. I don't have a personal identity or consciousness like a human would. I exist solely as a digital entity, running on computer servers and responding to inputs from users like you.
I'm a type of artificial intelligence (AI) called a large language model, which means I've been trained on a massive dataset of text from various sources. This training allows me to understand and respond to a wide range of questions and topics.
My purpose is to provide helpful and accurate information, answer questions, and assist users like you with tasks and conversations. I don't have personal preferences, emotions, or opinions like humans do. My goal is to be informative, neutral, and respectful in my responses.
So, that's me in a nutshell!
</details>
</TabItem>
<TabItem value="rag" label="Build a RAG Agent">
For our last demo, we can build a RAG agent that can answer questions about the Torchtune project using the documents
in a vector database.
#### i. Create the Script
Create a file `rag_agent.py` and add the following code:
```python
from llama_stack_client import LlamaStackClient
from llama_stack_client import Agent, AgentEventLogger
from llama_stack_client.types import Document
import uuid
client = LlamaStackClient(base_url="http://localhost:8321")
# Create a vector database instance
embed_lm = next(m for m in client.models.list() if m.model_type == "embedding")
embedding_model = embed_lm.identifier
vector_db_id = f"v{uuid.uuid4().hex}"
client.vector_dbs.register(
vector_db_id=vector_db_id,
embedding_model=embedding_model,
)
# Create Documents
urls = [
"memory_optimizations.rst",
"chat.rst",
"llama3.rst",
"qat_finetune.rst",
"lora_finetune.rst",
]
documents = [
Document(
document_id=f"num-{i}",
content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
mime_type="text/plain",
metadata={},
)
for i, url in enumerate(urls)
]
# Insert documents
client.tool_runtime.rag_tool.insert(
documents=documents,
vector_db_id=vector_db_id,
chunk_size_in_tokens=512,
)
# Get the model being served
llm = next(
m
for m in client.models.list()
if m.model_type == "llm" and m.provider_id == "ollama"
)
model = llm.identifier
# Create the RAG agent
rag_agent = Agent(
client,
model=model,
instructions="You are a helpful assistant. Use the RAG tool to answer questions as needed.",
tools=[
{
"name": "builtin::rag/knowledge_search",
"args": {"vector_db_ids": [vector_db_id]},
}
],
)
session_id = rag_agent.create_session(session_name=f"s{uuid.uuid4().hex}")
turns = ["what is torchtune", "tell me about dora"]
for t in turns:
print("user>", t)
stream = rag_agent.create_turn(
messages=[{"role": "user", "content": t}], session_id=session_id, stream=True
)
for event in AgentEventLogger().log(stream):
event.print()
```
#### ii. Run the Script
Let's run the script using `uv`
```bash
uv run python rag_agent.py
```
<details>
<summary>👋 Click here to see the sample output</summary>
```
user> what is torchtune
inference> [knowledge_search(query='TorchTune')]
tool_execution> Tool:knowledge_search Args:{'query': 'TorchTune'}
tool_execution> Tool:knowledge_search Response:[TextContentItem(text='knowledge_search tool found 5 chunks:\nBEGIN of knowledge_search tool results.\n', type='text'), TextContentItem(text='Result 1:\nDocument_id:num-1\nContent: conversational data, :func:`~torchtune.datasets.chat_dataset` seems to be a good fit. ..., type='text'), TextContentItem(text='END of knowledge_search tool results.\n', type='text')]
inference> Here is a high-level overview of the text:
**LoRA Finetuning with PyTorch Tune**
PyTorch Tune provides a recipe for LoRA (Low-Rank Adaptation) finetuning, which is a technique to adapt pre-trained models to new tasks. The recipe uses the `lora_finetune_distributed` command.
...
Overall, DORA is a powerful reinforcement learning algorithm that can learn complex tasks from human demonstrations. However, it requires careful consideration of the challenges and limitations to achieve optimal results.
```
</details>
</TabItem>
</Tabs>
**You're Ready to Build Your Own Apps!**
Congrats! 🥳 Now you're ready to [build your own Llama Stack applications](/docs/building-applications)! 🚀

View file

@ -0,0 +1,149 @@
---
description: environments.
sidebar_label: Quickstart
sidebar_position: 1
title: Quickstart
---
Get started with Llama Stack in minutes!
Llama Stack is a stateful service with REST APIs to support the seamless transition of AI applications across different
environments. You can build and test using a local server first and deploy to a hosted endpoint for production.
In this guide, we'll walk through how to build a RAG application locally using Llama Stack with [Ollama](https://ollama.com/)
as the inference [provider](/docs/providers/inference) for a Llama Model.
**💡 Notebook Version:** You can also follow this quickstart guide in a Jupyter notebook format: [quick_start.ipynb](https://github.com/meta-llama/llama-stack/blob/main/docs/quick_start.ipynb)
#### Step 1: Install and setup
1. Install [uv](https://docs.astral.sh/uv/)
2. Run inference on a Llama model with [Ollama](https://ollama.com/download)
```bash
ollama run llama3.2:3b --keepalive 60m
```
#### Step 2: Run the Llama Stack server
We will use `uv` to run the Llama Stack server.
```bash
OLLAMA_URL=http://localhost:11434 \
uv run --with llama-stack llama stack build --distro starter --image-type venv --run
```
#### Step 3: Run the demo
Now open up a new terminal and copy the following script into a file named `demo_script.py`.
```python title="demo_script.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
from llama_stack_client import Agent, AgentEventLogger, RAGDocument, LlamaStackClient
vector_db_id = "my_demo_vector_db"
client = LlamaStackClient(base_url="http://localhost:8321")
models = client.models.list()
# Select the first LLM and first embedding models
model_id = next(m for m in models if m.model_type == "llm").identifier
embedding_model_id = (
em := next(m for m in models if m.model_type == "embedding")
).identifier
embedding_dimension = em.metadata["embedding_dimension"]
vector_db = client.vector_dbs.register(
vector_db_id=vector_db_id,
embedding_model=embedding_model_id,
embedding_dimension=embedding_dimension,
provider_id="faiss",
)
vector_db_id = vector_db.identifier
source = "https://www.paulgraham.com/greatwork.html"
print("rag_tool> Ingesting document:", source)
document = RAGDocument(
document_id="document_1",
content=source,
mime_type="text/html",
metadata={},
)
client.tool_runtime.rag_tool.insert(
documents=[document],
vector_db_id=vector_db_id,
chunk_size_in_tokens=100,
)
agent = Agent(
client,
model=model_id,
instructions="You are a helpful assistant",
tools=[
{
"name": "builtin::rag/knowledge_search",
"args": {"vector_db_ids": [vector_db_id]},
}
],
)
prompt = "How do you do great work?"
print("prompt>", prompt)
use_stream = True
response = agent.create_turn(
messages=[{"role": "user", "content": prompt}],
session_id=agent.create_session("rag_session"),
stream=use_stream,
)
# Only call `AgentEventLogger().log(response)` for streaming responses.
if use_stream:
for log in AgentEventLogger().log(response):
log.print()
else:
print(response)
```
We will use `uv` to run the script
```
uv run --with llama-stack-client,fire,requests demo_script.py
```
And you should see output like below.
```
rag_tool> Ingesting document: https://www.paulgraham.com/greatwork.html
prompt> How do you do great work?
inference> [knowledge_search(query="What is the key to doing great work")]
tool_execution> Tool:knowledge_search Args:{'query': 'What is the key to doing great work'}
tool_execution> Tool:knowledge_search Response:[TextContentItem(text='knowledge_search tool found 5 chunks:\nBEGIN of knowledge_search tool results.\n', type='text'), TextContentItem(text="Result 1:\nDocument_id:docum\nContent: work. Doing great work means doing something important\nso well that you expand people's ideas of what's possible. But\nthere's no threshold for importance. It's a matter of degree, and\noften hard to judge at the time anyway.\n", type='text'), TextContentItem(text="Result 2:\nDocument_id:docum\nContent: work. Doing great work means doing something important\nso well that you expand people's ideas of what's possible. But\nthere's no threshold for importance. It's a matter of degree, and\noften hard to judge at the time anyway.\n", type='text'), TextContentItem(text="Result 3:\nDocument_id:docum\nContent: work. Doing great work means doing something important\nso well that you expand people's ideas of what's possible. But\nthere's no threshold for importance. It's a matter of degree, and\noften hard to judge at the time anyway.\n", type='text'), TextContentItem(text="Result 4:\nDocument_id:docum\nContent: work. Doing great work means doing something important\nso well that you expand people's ideas of what's possible. But\nthere's no threshold for importance. It's a matter of degree, and\noften hard to judge at the time anyway.\n", type='text'), TextContentItem(text="Result 5:\nDocument_id:docum\nContent: work. Doing great work means doing something important\nso well that you expand people's ideas of what's possible. But\nthere's no threshold for importance. It's a matter of degree, and\noften hard to judge at the time anyway.\n", type='text'), TextContentItem(text='END of knowledge_search tool results.\n', type='text')]
inference> Based on the search results, it seems that doing great work means doing something important so well that you expand people's ideas of what's possible. However, there is no clear threshold for importance, and it can be difficult to judge at the time.
To further clarify, I would suggest that doing great work involves:
* Completing tasks with high quality and attention to detail
* Expanding on existing knowledge or ideas
* Making a positive impact on others through your work
* Striving for excellence and continuous improvement
Ultimately, great work is about making a meaningful contribution and leaving a lasting impression.
```
Congratulations! You've successfully built your first RAG application using Llama Stack! 🎉🥳
:::tip HuggingFace access
If you are getting a **401 Client Error** from HuggingFace for the **all-MiniLM-L6-v2** model, try setting **HF_TOKEN** to a valid HuggingFace token in your environment
:::
### Next Steps
Now you're ready to dive deeper into Llama Stack!
- Explore the [Detailed Tutorial](/docs/detailed_tutorial).
- Try the [Getting Started Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb).
- Browse more [Notebooks on GitHub](https://github.com/meta-llama/llama-stack/tree/main/docs/notebooks).
- Learn about Llama Stack [Concepts](/docs/concepts).
- Discover how to [Build Llama Stacks](/docs/distributions).
- Refer to our [References](/docs/references) for details on the Llama CLI and Python SDK.
- Check out the [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repository for example applications and tutorials.

View file

@ -0,0 +1,15 @@
---
description: We have a number of client-side SDKs available for different languages.
sidebar_label: Libraries
sidebar_position: 2
title: Libraries (SDKs)
---
We have a number of client-side SDKs available for different languages.
| **Language** | **Client SDK** | **Package** |
| :----: | :----: | :----: |
| Python | [llama-stack-client-python](https://github.com/meta-llama/llama-stack-client-python) | [![PyPI version](https://img.shields.io/pypi/v/llama_stack_client.svg)](https://pypi.org/project/llama_stack_client/)
| Swift | [llama-stack-client-swift](https://github.com/meta-llama/llama-stack-client-swift/tree/latest-release) | [![Swift Package Index](https://img.shields.io/endpoint?url=https%3A%2F%2Fswiftpackageindex.com%2Fapi%2Fpackages%2Fmeta-llama%2Fllama-stack-client-swift%2Fbadge%3Ftype%3Dswift-versions)](https://swiftpackageindex.com/meta-llama/llama-stack-client-swift)
| Node | [llama-stack-client-node](https://github.com/meta-llama/llama-stack-client-node) | [![NPM version](https://img.shields.io/npm/v/llama-stack-client.svg)](https://npmjs.org/package/llama-stack-client)
| Kotlin | [llama-stack-client-kotlin](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release) | [![Maven version](https://img.shields.io/maven-central/v/com.llama.llamastack/llama-stack-client-kotlin)](https://central.sonatype.com/artifact/com.llama.llamastack/llama-stack-client-kotlin)

101
docs/docs/intro.mdx Normal file
View file

@ -0,0 +1,101 @@
---
sidebar_position: 1
title: Welcome to Llama Stack
description: Llama Stack is the open-source framework for building generative AI applications
sidebar_label: Intro
tags:
- getting-started
- overview
---
# Welcome to Llama Stack
Llama Stack is the open-source framework for building generative AI applications.
:::tip Llama 4 is here!
Check out [Getting Started with Llama 4](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started_llama4.ipynb)
:::
:::tip News
Llama Stack is now available! See the [release notes](https://github.com/meta-llama/llama-stack/releases) for more details.
:::
## What is Llama Stack?
Llama Stack defines and standardizes the core building blocks needed to bring generative AI applications to market. It provides a unified set of APIs with implementations from leading service providers, enabling seamless transitions between development and production environments. More specifically, it provides:
- **Unified API layer** for Inference, RAG, Agents, Tools, Safety, Evals, and Telemetry.
- **Plugin architecture** to support the rich ecosystem of implementations of the different APIs in different environments like local development, on-premises, cloud, and mobile.
- **Prepackaged verified distributions** which offer a one-stop solution for developers to get started quickly and reliably in any environment
- **Multiple developer interfaces** like CLI and SDKs for Python, Node, iOS, and Android
- **Standalone applications** as examples for how to build production-grade AI applications with Llama Stack
<img src="/img/llama-stack.png" alt="Llama Stack" width="400px" />
Our goal is to provide pre-packaged implementations (aka "distributions") which can be run in a variety of deployment environments. LlamaStack can assist you in your entire app development lifecycle - start iterating on local, mobile or desktop and seamlessly transition to on-prem or public cloud deployments. At every point in this transition, the same set of APIs and the same developer experience is available.
## How does Llama Stack work?
Llama Stack consists of a server (with multiple pluggable API providers) and Client SDKs meant to be used in your applications. The server can be run in a variety of environments, including local (inline) development, on-premises, and cloud. The client SDKs are available for Python, Swift, Node, and Kotlin.
## Quick Links
- Ready to build? Check out the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html) to get started.
- Want to contribute? See the [Contributing Guide](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md).
- Explore [Example Applications](https://github.com/meta-llama/llama-stack-apps) built with Llama Stack.
## Rich Ecosystem Support
Llama Stack provides adapters for popular providers across all API categories:
- **Inference**: Meta Reference, Ollama, Fireworks, Together, NVIDIA, vLLM, AWS Bedrock, OpenAI, Anthropic, and more
- **Vector Databases**: FAISS, Chroma, Milvus, Postgres, Weaviate, Qdrant, and others
- **Safety**: Llama Guard, Prompt Guard, Code Scanner, AWS Bedrock
- **Training & Evaluation**: HuggingFace, TorchTune, NVIDIA NEMO
:::info Provider Details
For complete provider compatibility and setup instructions, see our [Providers Documentation](https://llama-stack.readthedocs.io/en/latest/providers/index.html).
:::
## Get Started Today
<div style={{display: 'flex', gap: '1rem', flexWrap: 'wrap', margin: '2rem 0'}}>
<a href="https://llama-stack.readthedocs.io/en/latest/getting_started/index.html"
style={{
background: 'var(--ifm-color-primary)',
color: 'white',
padding: '0.75rem 1.5rem',
borderRadius: '0.5rem',
textDecoration: 'none',
fontWeight: 'bold'
}}>
🚀 Quick Start Guide
</a>
<a href="https://github.com/meta-llama/llama-stack-apps"
style={{
border: '2px solid var(--ifm-color-primary)',
color: 'var(--ifm-color-primary)',
padding: '0.75rem 1.5rem',
borderRadius: '0.5rem',
textDecoration: 'none',
fontWeight: 'bold'
}}>
📚 Example Apps
</a>
<a href="https://github.com/meta-llama/llama-stack"
style={{
border: '2px solid #666',
color: '#666',
padding: '0.75rem 1.5rem',
borderRadius: '0.5rem',
textDecoration: 'none',
fontWeight: 'bold'
}}>
⭐ Star on GitHub
</a>
</div>

View file

@ -0,0 +1,25 @@
---
description: Available providers for the agents API
sidebar_label: Overview
sidebar_position: 1
title: Agents
---
# Agents
## Overview
Agents API for creating and interacting with agentic systems.
Main functionalities provided by this API:
- Create agents with specific instructions and ability to use tools.
- Interactions with agents are grouped into sessions ("threads"), and each interaction is called a "turn".
- Agents can be provided with various tools (see the ToolGroups and ToolRuntime APIs for more details).
- Agents can be provided with various shields (see the Safety API for more details).
- Agents can also use Memory to retrieve information from knowledge bases. See the RAG Tool and Vector IO APIs for more details.
This section contains documentation for all available providers for the **agents** API.
## Providers
- **[Meta Reference](./inline_meta-reference)** - Inline provider

View file

@ -0,0 +1,31 @@
---
description: Meta's reference implementation of an agent system that can use tools,
access vector databases, and perform complex reasoning tasks
sidebar_label: Meta Reference
sidebar_position: 2
title: inline::meta-reference
---
# inline::meta-reference
## Description
Meta's reference implementation of an agent system that can use tools, access vector databases, and perform complex reasoning tasks.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `persistence_store` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | |
| `responses_store` | `utils.sqlstore.sqlstore.SqliteSqlStoreConfig \| utils.sqlstore.sqlstore.PostgresSqlStoreConfig` | No | sqlite | |
## Sample Configuration
```yaml
persistence_store:
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/agents_store.db
responses_store:
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/responses_store.db
```

View file

@ -0,0 +1,27 @@
---
description: Available providers for the batches API
sidebar_label: Overview
sidebar_position: 1
title: Batches
---
# Batches
## Overview
The Batches API enables efficient processing of multiple requests in a single operation,
particularly useful for processing large datasets, batch evaluation workflows, and
cost-effective inference at scale.
The API is designed to allow use of openai client libraries for seamless integration.
This API provides the following extensions:
- idempotent batch creation
Note: This API is currently under active development and may undergo changes.
This section contains documentation for all available providers for the **batches** API.
## Providers
- **[Reference](./inline_reference)** - Inline provider

View file

@ -0,0 +1,28 @@
---
description: Reference implementation of batches API with KVStore persistence
sidebar_label: Reference
sidebar_position: 2
title: inline::reference
---
# inline::reference
## Description
Reference implementation of batches API with KVStore persistence.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Configuration for the key-value store backend. |
| `max_concurrent_batches` | `<class 'int'>` | No | 1 | Maximum number of concurrent batches to process simultaneously. |
| `max_concurrent_requests_per_batch` | `<class 'int'>` | No | 10 | Maximum number of concurrent requests to process per batch. |
## Sample Configuration
```yaml
kvstore:
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/batches.db
```

View file

@ -0,0 +1,18 @@
---
description: Available providers for the datasetio API
sidebar_label: Overview
sidebar_position: 1
title: Datasetio
---
# Datasetio
## Overview
This section contains documentation for all available providers for the **datasetio** API.
## Providers
- **[Localfs](./inline_localfs)** - Inline provider
- **[Huggingface](./remote_huggingface)** - Remote provider
- **[Nvidia](./remote_nvidia)** - Remote provider

View file

@ -0,0 +1,27 @@
---
description: Local filesystem-based dataset I/O provider for reading and writing datasets
to local storage
sidebar_label: Localfs
sidebar_position: 2
title: inline::localfs
---
# inline::localfs
## Description
Local filesystem-based dataset I/O provider for reading and writing datasets to local storage.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | |
## Sample Configuration
```yaml
kvstore:
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/localfs_datasetio.db
```

View file

@ -0,0 +1,27 @@
---
description: HuggingFace datasets provider for accessing and managing datasets from
the HuggingFace Hub
sidebar_label: Huggingface
sidebar_position: 3
title: remote::huggingface
---
# remote::huggingface
## Description
HuggingFace datasets provider for accessing and managing datasets from the HuggingFace Hub.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | |
## Sample Configuration
```yaml
kvstore:
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/huggingface_datasetio.db
```

View file

@ -0,0 +1,31 @@
---
description: NVIDIA's dataset I/O provider for accessing datasets from NVIDIA's data
platform
sidebar_label: Nvidia
sidebar_position: 4
title: remote::nvidia
---
# remote::nvidia
## Description
NVIDIA's dataset I/O provider for accessing datasets from NVIDIA's data platform.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | The NVIDIA API key. |
| `dataset_namespace` | `str \| None` | No | default | The NVIDIA dataset namespace. |
| `project_id` | `str \| None` | No | test-project | The NVIDIA project ID. |
| `datasets_url` | `<class 'str'>` | No | http://nemo.test | Base URL for the NeMo Dataset API |
## Sample Configuration
```yaml
api_key: ${env.NVIDIA_API_KEY:=}
dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default}
project_id: ${env.NVIDIA_PROJECT_ID:=test-project}
datasets_url: ${env.NVIDIA_DATASETS_URL:=http://nemo.test}
```

View file

@ -0,0 +1,19 @@
---
description: Available providers for the eval API
sidebar_label: Overview
sidebar_position: 1
title: Eval
---
# Eval
## Overview
Llama Stack Evaluation API for running evaluations on model and agent candidates.
This section contains documentation for all available providers for the **eval** API.
## Providers
- **[Meta Reference](./inline_meta-reference)** - Inline provider
- **[Nvidia](./remote_nvidia)** - Remote provider

View file

@ -0,0 +1,27 @@
---
description: Meta's reference implementation of evaluation tasks with support for
multiple languages and evaluation metrics
sidebar_label: Meta Reference
sidebar_position: 2
title: inline::meta-reference
---
# inline::meta-reference
## Description
Meta's reference implementation of evaluation tasks with support for multiple languages and evaluation metrics.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | |
## Sample Configuration
```yaml
kvstore:
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/meta_reference_eval.db
```

View file

@ -0,0 +1,25 @@
---
description: NVIDIA's evaluation provider for running evaluation tasks on NVIDIA's
platform
sidebar_label: Nvidia
sidebar_position: 3
title: remote::nvidia
---
# remote::nvidia
## Description
NVIDIA's evaluation provider for running evaluation tasks on NVIDIA's platform.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `evaluator_url` | `<class 'str'>` | No | http://0.0.0.0:7331 | The url for accessing the evaluator service |
## Sample Configuration
```yaml
evaluator_url: ${env.NVIDIA_EVALUATOR_URL:=http://localhost:7331}
```

View file

@ -0,0 +1,350 @@
---
title: Creating External Providers
description: Comprehensive guide for developing custom Llama Stack providers
sidebar_label: Creation Guide
sidebar_position: 2
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Creating External Providers
This guide walks you through creating custom Llama Stack providers that live outside the main codebase.
## Configuration Methods
There are two ways to configure external providers in Llama Stack:
<Tabs>
<TabItem value="module" label="Module Method (Recommended)">
To enable external providers, add `module` into your build yaml, allowing Llama Stack to install the required package corresponding to the external provider.
An example entry in your build.yaml should look like:
```yaml
- provider_type: remote::ramalama
module: ramalama_stack
```
This method automatically handles package installation and provider discovery.
</TabItem>
<TabItem value="directory" label="Directory Method (Deprecated)">
:::warning[Deprecated Method]
This method is in the process of being deprecated in favor of the `module` method.
:::
You can configure the `external_providers_dir` in your Llama Stack configuration:
```yaml
external_providers_dir: ~/.llama/providers.d/
```
The external provider directory should contain your external provider specifications.
</TabItem>
</Tabs>
## Directory Structure
When using the directory method, the external providers directory should follow this structure:
```
providers.d/
remote/
inference/
custom_ollama.yaml
vllm.yaml
vector_io/
qdrant.yaml
safety/
llama-guard.yaml
inline/
inference/
custom_ollama.yaml
vllm.yaml
vector_io/
qdrant.yaml
safety/
llama-guard.yaml
```
Each YAML file in these directories defines a provider specification for that particular API.
## Provider Types
Llama Stack supports two types of external providers:
### 🌐 Remote Providers
Providers that communicate with external services (e.g., cloud APIs)
#### Remote Provider Specification
Here's an example for a custom Ollama provider:
```yaml
adapter:
adapter_type: custom_ollama
pip_packages:
- ollama
- aiohttp
config_class: llama_stack_ollama_provider.config.OllamaImplConfig
module: llama_stack_ollama_provider
api_dependencies: []
optional_api_dependencies: []
```
#### Adapter Configuration
The `adapter` section defines how to load and configure the provider:
- **`adapter_type`**: A unique identifier for this adapter
- **`pip_packages`**: List of Python packages required by the provider
- **`config_class`**: The full path to the configuration class
- **`module`**: The Python module containing the provider implementation
### 🏠 Inline Providers
Providers that run locally within the Llama Stack process.
#### Inline Provider Specification
Here's an example for a custom vector store provider:
```yaml
module: llama_stack_vector_provider
config_class: llama_stack_vector_provider.config.VectorStoreConfig
pip_packages:
- faiss-cpu
- numpy
api_dependencies:
- inference
optional_api_dependencies:
- vector_io
provider_data_validator: llama_stack_vector_provider.validator.VectorStoreValidator
container_image: custom-vector-store:latest # optional
```
#### Inline Provider Fields
- **`module`**: The Python module containing the provider implementation
- **`config_class`**: The full path to the configuration class
- **`pip_packages`**: List of Python packages required by the provider
- **`api_dependencies`**: List of Llama Stack APIs that this provider depends on
- **`optional_api_dependencies`**: List of optional Llama Stack APIs that this provider can use
- **`provider_data_validator`**: Optional validator for provider data
- **`container_image`**: Optional container image to use instead of pip packages
## Required Implementation
### All Providers
All providers must contain a `get_provider_spec` function in their `provider` module. This is a standardized structure that Llama Stack expects and is necessary for getting things such as the config class.
The `get_provider_spec` method returns a structure identical to the `adapter`. An example function may look like:
```python
from llama_stack.providers.datatypes import (
ProviderSpec,
Api,
AdapterSpec,
remote_provider_spec,
)
def get_provider_spec() -> ProviderSpec:
return remote_provider_spec(
api=Api.inference,
adapter=AdapterSpec(
adapter_type="ramalama",
pip_packages=["ramalama>=0.8.5", "pymilvus"],
config_class="ramalama_stack.config.RamalamaImplConfig",
module="ramalama_stack",
),
)
```
### Remote Providers
Remote providers must expose a `get_adapter_impl()` function in their module that takes two arguments:
1. **`config`**: An instance of the provider's config class
2. **`deps`**: A dictionary of API dependencies
This function must return an instance of the provider's adapter class that implements the required protocol for the API.
**Example:**
```python
async def get_adapter_impl(
config: OllamaImplConfig, deps: Dict[Api, Any]
) -> OllamaInferenceAdapter:
return OllamaInferenceAdapter(config)
```
### Inline Providers
Inline providers must expose a `get_provider_impl()` function in their module that takes two arguments:
1. **`config`**: An instance of the provider's config class
2. **`deps`**: A dictionary of API dependencies
**Example:**
```python
async def get_provider_impl(
config: VectorStoreConfig, deps: Dict[Api, Any]
) -> VectorStoreImpl:
impl = VectorStoreImpl(config, deps[Api.inference])
await impl.initialize()
return impl
```
## Dependencies
The provider package must be installed on the system. For example:
```bash
$ uv pip show llama-stack-ollama-provider
Name: llama-stack-ollama-provider
Version: 0.1.0
Location: /path/to/venv/lib/python3.10/site-packages
```
## Best Practices
### 📦 Package Naming
Use the prefix `llama-stack-provider-` for your provider packages to make them easily identifiable.
### 🏷️ Version Management
Keep your provider package versioned and compatible with the Llama Stack version you're using.
### ⚡ Dependencies
Only include the minimum required dependencies in your provider package.
### 📚 Documentation
Include clear documentation in your provider package about:
- Installation requirements
- Configuration options
- Usage examples
- Any limitations or known issues
### 🧪 Testing
Include tests in your provider package to ensure it works correctly with Llama Stack.
You can refer to the [integration tests guide](https://github.com/meta-llama/llama-stack/blob/main/tests/integration/README.md) for more information. Execute the test for the Provider type you are developing.
## Troubleshooting
If your external provider isn't being loaded:
1. **Check Module Path**: Verify that `module` points to a published pip package with a top level `provider` module including `get_provider_spec`.
2. **Check Directory Path**: Verify that the `external_providers_dir` path is correct and accessible.
3. **Validate YAML**: Ensure that the YAML files are properly formatted.
4. **Check Dependencies**: Ensure all required Python packages are installed.
5. **Review Logs**: Check the Llama Stack server logs for any error messages - turn on debug logging to get more information using `LLAMA_STACK_LOGGING=all=debug`.
6. **Verify Installation**: Verify that the provider package is installed in your Python environment if using `external_providers_dir`.
## Complete Examples
### Example 1: Using `external_providers_dir` (Custom Ollama Provider)
Here's a complete example of creating and using a custom Ollama provider:
#### 1. Create the Provider Package
```bash
mkdir -p llama-stack-provider-ollama
cd llama-stack-provider-ollama
git init
uv init
```
#### 2. Configure Package
Edit `pyproject.toml`:
```toml
[project]
name = "llama-stack-provider-ollama"
version = "0.1.0"
description = "Ollama provider for Llama Stack"
requires-python = ">=3.12"
dependencies = ["llama-stack", "pydantic", "ollama", "aiohttp"]
```
#### 3. Create Provider Specification
```yaml
# ~/.llama/providers.d/remote/inference/custom_ollama.yaml
adapter:
adapter_type: custom_ollama
pip_packages: ["ollama", "aiohttp"]
config_class: llama_stack_provider_ollama.config.OllamaImplConfig
module: llama_stack_provider_ollama
api_dependencies: []
optional_api_dependencies: []
```
#### 4. Install the Provider
```bash
uv pip install -e .
```
#### 5. Configure Llama Stack
```yaml
external_providers_dir: ~/.llama/providers.d/
```
The provider will now be available in Llama Stack with the type `remote::custom_ollama`.
### Example 2: Using `module` (ramalama-stack)
[ramalama-stack](https://github.com/containers/ramalama-stack) is a recognized external provider that supports installation via module.
To install Llama Stack with this external provider a user can provide the following build.yaml:
```yaml
version: 2
distribution_spec:
description: Use (an external) Ramalama server for running LLM inference
container_image: null
providers:
inference:
- provider_type: remote::ramalama
module: ramalama_stack==0.3.0a0
image_type: venv
image_name: null
external_providers_dir: null
additional_pip_packages:
- aiosqlite
- sqlalchemy[asyncio]
```
No other steps are required other than `llama stack build` and `llama stack run`. The build process will use `module` to install all of the provider dependencies, retrieve the spec, etc.
The provider will now be available in Llama Stack with the type `remote::ramalama`.
## Next Steps
1. **📖 Study Examples**: Review the [Known External Providers](./external-providers-list) for real-world examples
2. **🔧 Start Building**: Create your first provider using the templates above
3. **🧪 Test Integration**: Use the integration test framework to validate your provider
4. **📦 Publish**: Share your provider with the community via PyPI
5. **📢 Share**: Add your provider to our [community list](./external-providers-list)
## Related Resources
- **[External Providers Overview](./index)** - Understanding external providers
- **[Known External Providers](./external-providers-list)** - Community provider directory
- **[API Protocols](/docs/api/)** - Provider interface specifications
- **[Integration Tests](https://github.com/meta-llama/llama-stack/blob/main/tests/integration/README.md)** - Testing your providers

View file

@ -0,0 +1,116 @@
---
title: Known External Providers
description: Community-contributed external providers for Llama Stack
sidebar_label: Community Providers
sidebar_position: 3
---
# Known External Providers
Here's a list of known external providers that you can use with Llama Stack:
## Available Providers
| Name | Description | API | Type | Repository |
|------|-------------|-----|------|------------|
| **KubeFlow Training** | Train models with KubeFlow | Post Training | Remote | [llama-stack-provider-kft](https://github.com/opendatahub-io/llama-stack-provider-kft) |
| **KubeFlow Pipelines** | Train models with KubeFlow Pipelines | Post Training | Inline **and** Remote | [llama-stack-provider-kfp-trainer](https://github.com/opendatahub-io/llama-stack-provider-kfp-trainer) |
| **RamaLama** | Inference models with RamaLama | Inference | Remote | [ramalama-stack](https://github.com/containers/ramalama-stack) |
| **TrustyAI LM-Eval** | Evaluate models with TrustyAI LM-Eval | Eval | Remote | [llama-stack-provider-lmeval](https://github.com/trustyai-explainability/llama-stack-provider-lmeval) |
| **MongoDB** | VectorIO with MongoDB | Vector IO | Remote | [mongodb-llama-stack](https://github.com/mongodb-partners/mongodb-llama-stack) |
## Using External Providers
To use any of these providers, you can add them to your build configuration:
### Method 1: Module-based Installation (Recommended)
```yaml
version: 2
distribution_spec:
providers:
inference:
- provider_type: remote::ramalama
module: ramalama_stack==0.3.0a0
```
### Method 2: Manual Installation
1. **Install the provider package:**
```bash
pip install provider-package-name
```
2. **Configure in your build.yaml:**
```yaml
providers:
inference:
- provider_type: remote::custom_provider
```
## Provider Categories
### 🚀 **Inference Providers**
- **RamaLama**: Container-native AI model inference
### 🎯 **Post Training Providers**
- **KubeFlow Training**: Enterprise-grade model training
- **KubeFlow Pipelines**: ML pipeline-based training
### 📊 **Evaluation Providers**
- **TrustyAI LM-Eval**: Comprehensive model evaluation suite
### 🗃️ **Vector IO Providers**
- **MongoDB**: Document-based vector storage and retrieval
## Contributing Your Provider
Have you created an external provider? We'd love to include it in this list!
### Submission Process
1. **Create a Pull Request** adding your provider to this list
2. **Include the following information:**
- Provider name and description
- API type (Inference, Safety, etc.)
- Provider type (Remote/Inline)
- Repository link
- Basic usage example
3. **Requirements for listing:**
- ✅ Open source repository
- ✅ Clear documentation
- ✅ Working examples
- ✅ Compatible with current Llama Stack version
### Template Entry
```markdown
| **Your Provider Name** | Brief description | API Type | Remote/Inline | [repository-link](https://github.com/your-org/your-repo) |
```
## Getting Help
### Provider Development
- **📚 [Creation Guide](./external-providers-guide)** - Learn how to build providers
- **🔧 [Provider Architecture](/docs/concepts/api-providers)** - Understanding the system
- **💬 [GitHub Discussions](https://github.com/meta-llama/llama-stack/discussions)** - Ask questions
### Integration Issues
- **🐛 [Issue Tracker](https://github.com/meta-llama/llama-stack/issues)** - Report bugs
- **📖 [Integration Tests](https://github.com/meta-llama/llama-stack/blob/main/tests/integration/README.md)** - Test your provider
## Community
Join the growing ecosystem of Llama Stack providers:
- **Share** your providers with the community
- **Discover** new providers for your use cases
- **Collaborate** on provider development
- **Contribute** to existing provider projects
## Related Resources
- **[External Providers Overview](./index)** - Understanding external providers
- **[Creating External Providers](./external-providers-guide)** - Development guide
- **[Building Distributions](/docs/distributions/building-distro)** - Using providers in distributions

104
docs/docs/providers/external/index.mdx vendored Normal file
View file

@ -0,0 +1,104 @@
---
title: External Providers
description: Create and maintain custom Llama Stack providers outside the main codebase
sidebar_label: Overview
sidebar_position: 1
---
# External Providers
Llama Stack supports external providers that live outside of the main codebase. This allows you to:
- **Create and maintain your own providers independently**
- **Share providers with others** without contributing to the main codebase
- **Keep provider-specific code separate** from the core Llama Stack code
- **Extend functionality** with custom implementations
## What are External Providers?
External providers are custom implementations of Llama Stack APIs that:
- **Live in separate packages** from the core Llama Stack codebase
- **Follow the same interfaces** as built-in providers
- **Can be distributed independently** via PyPI or other package managers
- **Integrate seamlessly** with existing Llama Stack distributions
## Types of External Providers
Llama Stack supports two types of external providers:
### 🌐 **Remote Providers**
Providers that communicate with external services (e.g., cloud APIs, remote servers)
**Examples:**
- Custom cloud inference endpoints
- Third-party safety services
- External vector databases
- Remote evaluation services
### 🏠 **Inline Providers**
Providers that run locally within the Llama Stack process
**Examples:**
- Custom local inference engines
- Local safety filters
- Custom embedding models
- Local vector storage implementations
## Getting Started
### Using External Providers
To use an external provider, you can specify it in your build configuration:
```yaml
version: 2
distribution_spec:
providers:
inference:
- provider_type: remote::custom_provider
module: my-custom-provider==1.0.0
```
### Creating External Providers
Ready to create your own provider? Check out our comprehensive guide:
**📖 [Creating External Providers Guide](./external-providers-guide)**
### Community Providers
Explore providers created by the community:
**🔌 [Known External Providers](./external-providers-list)**
## Benefits
- **🚀 Rapid Development**: Create and iterate on providers independently
- **🔒 Security**: Keep proprietary code separate from open source
- **🌍 Community**: Share and discover community-contributed providers
- **⚡ Performance**: Optimize providers for specific use cases
- **🔧 Flexibility**: Implement custom business logic and integrations
## Architecture
External providers integrate with Llama Stack through a standardized interface:
```
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────────┐
│ Llama Stack │◄──►│ Provider Interface│◄──►│ External Provider │
│ Core │ │ (Protocol) │ │ (Your Code) │
└─────────────────┘ └──────────────────┘ └─────────────────────┘
```
## Next Steps
1. **🔍 [Browse Available Providers](./external-providers-list)** - See what's already available
2. **📚 [Read the Creation Guide](./external-providers-guide)** - Learn how to build your own
3. **💡 Join the Community** - Share your providers and get help
## Related Resources
- **[Provider Architecture](/docs/concepts/api-providers)** - Understanding the provider system
- **[Building Distributions](/docs/distributions/building-distro)** - Using external providers in distributions
- **[API Reference](/docs/api/)** - Provider interface specifications

View file

@ -0,0 +1,17 @@
---
description: Available providers for the files API
sidebar_label: Overview
sidebar_position: 1
title: Files
---
# Files
## Overview
This section contains documentation for all available providers for the **files** API.
## Providers
- **[Localfs](./inline_localfs)** - Inline provider
- **[S3](./remote_s3)** - Remote provider

View file

@ -0,0 +1,30 @@
---
description: Local filesystem-based file storage provider for managing files and documents
locally
sidebar_label: Localfs
sidebar_position: 2
title: inline::localfs
---
# inline::localfs
## Description
Local filesystem-based file storage provider for managing files and documents locally.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `storage_dir` | `<class 'str'>` | No | | Directory to store uploaded files |
| `metadata_store` | `utils.sqlstore.sqlstore.SqliteSqlStoreConfig \| utils.sqlstore.sqlstore.PostgresSqlStoreConfig` | No | sqlite | SQL store configuration for file metadata |
| `ttl_secs` | `<class 'int'>` | No | 31536000 | |
## Sample Configuration
```yaml
storage_dir: ${env.FILES_STORAGE_DIR:=~/.llama/dummy/files}
metadata_store:
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/files_metadata.db
```

View file

@ -0,0 +1,39 @@
---
description: AWS S3-based file storage provider for scalable cloud file management
with metadata persistence
sidebar_label: S3
sidebar_position: 3
title: remote::s3
---
# remote::s3
## Description
AWS S3-based file storage provider for scalable cloud file management with metadata persistence.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `bucket_name` | `<class 'str'>` | No | | S3 bucket name to store files |
| `region` | `<class 'str'>` | No | us-east-1 | AWS region where the bucket is located |
| `aws_access_key_id` | `str \| None` | No | | AWS access key ID (optional if using IAM roles) |
| `aws_secret_access_key` | `str \| None` | No | | AWS secret access key (optional if using IAM roles) |
| `endpoint_url` | `str \| None` | No | | Custom S3 endpoint URL (for MinIO, LocalStack, etc.) |
| `auto_create_bucket` | `<class 'bool'>` | No | False | Automatically create the S3 bucket if it doesn't exist |
| `metadata_store` | `utils.sqlstore.sqlstore.SqliteSqlStoreConfig \| utils.sqlstore.sqlstore.PostgresSqlStoreConfig` | No | sqlite | SQL store configuration for file metadata |
## Sample Configuration
```yaml
bucket_name: ${env.S3_BUCKET_NAME}
region: ${env.AWS_REGION:=us-east-1}
aws_access_key_id: ${env.AWS_ACCESS_KEY_ID:=}
aws_secret_access_key: ${env.AWS_SECRET_ACCESS_KEY:=}
endpoint_url: ${env.S3_ENDPOINT_URL:=}
auto_create_bucket: ${env.S3_AUTO_CREATE_BUCKET:=false}
metadata_store:
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/s3_files_metadata.db
```

View file

@ -0,0 +1,33 @@
---
title: API Providers
description: Ecosystem of providers for swapping implementations across the same API
sidebar_label: Overview
sidebar_position: 1
---
# API Providers
The goal of Llama Stack is to build an ecosystem where users can easily swap out different implementations for the same API. Examples for these include:
- LLM inference providers (e.g., Meta Reference, Ollama, Fireworks, Together, AWS Bedrock, Groq, Cerebras, SambaNova, vLLM, OpenAI, Anthropic, Gemini, WatsonX, etc.),
- Vector databases (e.g., FAISS, SQLite-Vec, ChromaDB, Weaviate, Qdrant, Milvus, PGVector, etc.),
- Safety providers (e.g., Meta's Llama Guard, Prompt Guard, Code Scanner, AWS Bedrock Guardrails, etc.),
- Tool Runtime providers (e.g., RAG Runtime, Brave Search, etc.)
Providers come in two flavors:
- **Remote**: the provider runs as a separate service external to the Llama Stack codebase. Llama Stack contains a small amount of adapter code.
- **Inline**: the provider is fully specified and implemented within the Llama Stack codebase. It may be a simple wrapper around an existing library, or a full fledged implementation within Llama Stack.
Importantly, Llama Stack always strives to provide at least one fully inline provider for each API so you can iterate on a fully featured environment locally.
## Provider Categories
- **[External Providers](./external/)** - Guide for building and using external providers
- **[OpenAI Compatibility](./openai)** - OpenAI API compatibility layer
- **[Inference](./inference/)** - LLM and embedding model providers
- **[Agents](./agents/)** - Agentic system providers
- **[DatasetIO](./datasetio/)** - Dataset and data loader providers
- **[Safety](./safety/)** - Content moderation and safety providers
- **[Telemetry](./telemetry/)** - Monitoring and observability providers
- **[Vector IO](./vector-io/)** - Vector database providers
- **[Tool Runtime](./tool-runtime/)** - Tool and protocol providers
- **[Files](./files/)** - File system and storage providers

View file

@ -0,0 +1,46 @@
---
description: Available providers for the inference API
sidebar_label: Overview
sidebar_position: 1
title: Inference
---
# Inference
## Overview
Llama Stack Inference API for generating completions, chat completions, and embeddings.
This API provides the raw interface to the underlying models. Two kinds of models are supported:
- LLM models: these models generate "raw" and "chat" (conversational) completions.
- Embedding models: these models generate embeddings to be used for semantic search.
This section contains documentation for all available providers for the **inference** API.
## Providers
- **[Meta Reference](./inline_meta-reference)** - Inline provider
- **[Sentence Transformers](./inline_sentence-transformers)** - Inline provider
- **[Anthropic](./remote_anthropic)** - Remote provider
- **[Azure](./remote_azure)** - Remote provider
- **[Bedrock](./remote_bedrock)** - Remote provider
- **[Cerebras](./remote_cerebras)** - Remote provider
- **[Databricks](./remote_databricks)** - Remote provider
- **[Fireworks](./remote_fireworks)** - Remote provider
- **[Gemini](./remote_gemini)** - Remote provider
- **[Groq](./remote_groq)** - Remote provider
- **[Hugging Face Endpoint](./remote_hf_endpoint)** - Remote provider
- **[Hugging Face Serverless](./remote_hf_serverless)** - Remote provider
- **[Llama OpenAI Compatible](./remote_llama-openai-compat)** - Remote provider
- **[Nvidia](./remote_nvidia)** - Remote provider
- **[Ollama](./remote_ollama)** - Remote provider
- **[Openai](./remote_openai)** - Remote provider
- **[Passthrough](./remote_passthrough)** - Remote provider
- **[Runpod](./remote_runpod)** - Remote provider
- **[Sambanova](./remote_sambanova)** - Remote provider
- **[SambaNova OpenAI Compatible](./remote_sambanova-openai-compat)** - Remote provider
- **[Tgi](./remote_tgi)** - Remote provider
- **[Together](./remote_together)** - Remote provider
- **[Vertexai](./remote_vertexai)** - Remote provider
- **[Vllm](./remote_vllm)** - Remote provider
- **[Watsonx](./remote_watsonx)** - Remote provider

View file

@ -0,0 +1,38 @@
---
description: Meta's reference implementation of inference with support for various
model formats and optimization techniques
sidebar_label: Meta Reference
sidebar_position: 2
title: inline::meta-reference
---
# inline::meta-reference
## Description
Meta's reference implementation of inference with support for various model formats and optimization techniques.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `model` | `str \| None` | No | | |
| `torch_seed` | `int \| None` | No | | |
| `max_seq_len` | `<class 'int'>` | No | 4096 | |
| `max_batch_size` | `<class 'int'>` | No | 1 | |
| `model_parallel_size` | `int \| None` | No | | |
| `create_distributed_process_group` | `<class 'bool'>` | No | True | |
| `checkpoint_dir` | `str \| None` | No | | |
| `quantization` | `Bf16QuantizationConfig \| Fp8QuantizationConfig \| Int4QuantizationConfig, annotation=NoneType, required=True, discriminator='type'` | No | | |
## Sample Configuration
```yaml
model: Llama3.2-3B-Instruct
checkpoint_dir: ${env.CHECKPOINT_DIR:=null}
quantization:
type: ${env.QUANTIZATION_TYPE:=bf16}
model_parallel_size: ${env.MODEL_PARALLEL_SIZE:=0}
max_batch_size: ${env.MAX_BATCH_SIZE:=1}
max_seq_len: ${env.MAX_SEQ_LEN:=4096}
```

View file

@ -0,0 +1,23 @@
---
description: Sentence Transformers inference provider for text embeddings and similarity
search
sidebar_label: Sentence Transformers
sidebar_position: 3
title: inline::sentence-transformers
---
# inline::sentence-transformers
## Description
Sentence Transformers inference provider for text embeddings and similarity search.
## Configuration
No configuration options available.
## Sample Configuration
```yaml
{}
```

View file

@ -0,0 +1,25 @@
---
description: Anthropic inference provider for accessing Claude models and Anthropic's
AI services
sidebar_label: Anthropic
sidebar_position: 4
title: remote::anthropic
---
# remote::anthropic
## Description
Anthropic inference provider for accessing Claude models and Anthropic's AI services.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | API key for Anthropic models |
## Sample Configuration
```yaml
api_key: ${env.ANTHROPIC_API_KEY:=}
```

View file

@ -0,0 +1,33 @@
---
description: Azure OpenAI inference provider for accessing GPT models and other Azure
services
sidebar_label: Azure
sidebar_position: 5
title: remote::azure
---
# remote::azure
## Description
Azure OpenAI inference provider for accessing GPT models and other Azure services.
Provider documentation
https://learn.microsoft.com/en-us/azure/ai-foundry/openai/overview
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `<class 'pydantic.types.SecretStr'>` | No | | Azure API key for Azure |
| `api_base` | `<class 'pydantic.networks.HttpUrl'>` | No | | Azure API base for Azure (e.g., https://your-resource-name.openai.azure.com) |
| `api_version` | `str \| None` | No | | Azure API version for Azure (e.g., 2024-12-01-preview) |
| `api_type` | `str \| None` | No | azure | Azure API type for Azure (e.g., azure) |
## Sample Configuration
```yaml
api_key: ${env.AZURE_API_KEY:=}
api_base: ${env.AZURE_API_BASE:=}
api_version: ${env.AZURE_API_VERSION:=}
api_type: ${env.AZURE_API_TYPE:=}
```

View file

@ -0,0 +1,34 @@
---
description: AWS Bedrock inference provider for accessing various AI models through
AWS's managed service
sidebar_label: Bedrock
sidebar_position: 6
title: remote::bedrock
---
# remote::bedrock
## Description
AWS Bedrock inference provider for accessing various AI models through AWS's managed service.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `aws_access_key_id` | `str \| None` | No | | The AWS access key to use. Default use environment variable: AWS_ACCESS_KEY_ID |
| `aws_secret_access_key` | `str \| None` | No | | The AWS secret access key to use. Default use environment variable: AWS_SECRET_ACCESS_KEY |
| `aws_session_token` | `str \| None` | No | | The AWS session token to use. Default use environment variable: AWS_SESSION_TOKEN |
| `region_name` | `str \| None` | No | | The default AWS Region to use, for example, us-west-1 or us-west-2.Default use environment variable: AWS_DEFAULT_REGION |
| `profile_name` | `str \| None` | No | | The profile name that contains credentials to use.Default use environment variable: AWS_PROFILE |
| `total_max_attempts` | `int \| None` | No | | An integer representing the maximum number of attempts that will be made for a single request, including the initial attempt. Default use environment variable: AWS_MAX_ATTEMPTS |
| `retry_mode` | `str \| None` | No | | A string representing the type of retries Boto3 will perform.Default use environment variable: AWS_RETRY_MODE |
| `connect_timeout` | `float \| None` | No | 60.0 | The time in seconds till a timeout exception is thrown when attempting to make a connection. The default is 60 seconds. |
| `read_timeout` | `float \| None` | No | 60.0 | The time in seconds till a timeout exception is thrown when attempting to read from a connection.The default is 60 seconds. |
| `session_ttl` | `int \| None` | No | 3600 | The time in seconds till a session expires. The default is 3600 seconds (1 hour). |
## Sample Configuration
```yaml
{}
```

View file

@ -0,0 +1,26 @@
---
description: Cerebras inference provider for running models on Cerebras Cloud platform
sidebar_label: Cerebras
sidebar_position: 7
title: remote::cerebras
---
# remote::cerebras
## Description
Cerebras inference provider for running models on Cerebras Cloud platform.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `base_url` | `<class 'str'>` | No | https://api.cerebras.ai | Base URL for the Cerebras API |
| `api_key` | `pydantic.types.SecretStr \| None` | No | | Cerebras API Key |
## Sample Configuration
```yaml
base_url: https://api.cerebras.ai
api_key: ${env.CEREBRAS_API_KEY:=}
```

View file

@ -0,0 +1,27 @@
---
description: Databricks inference provider for running models on Databricks' unified
analytics platform
sidebar_label: Databricks
sidebar_position: 8
title: remote::databricks
---
# remote::databricks
## Description
Databricks inference provider for running models on Databricks' unified analytics platform.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `<class 'str'>` | No | | The URL for the Databricks model serving endpoint |
| `api_token` | `<class 'str'>` | No | | The Databricks API token |
## Sample Configuration
```yaml
url: ${env.DATABRICKS_URL:=}
api_token: ${env.DATABRICKS_API_TOKEN:=}
```

View file

@ -0,0 +1,28 @@
---
description: Fireworks AI inference provider for Llama models and other AI models
on the Fireworks platform
sidebar_label: Fireworks
sidebar_position: 9
title: remote::fireworks
---
# remote::fireworks
## Description
Fireworks AI inference provider for Llama models and other AI models on the Fireworks platform.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `allowed_models` | `list[str \| None` | No | | List of models that should be registered with the model registry. If None, all models are allowed. |
| `url` | `<class 'str'>` | No | https://api.fireworks.ai/inference/v1 | The URL for the Fireworks server |
| `api_key` | `pydantic.types.SecretStr \| None` | No | | The Fireworks.ai API Key |
## Sample Configuration
```yaml
url: https://api.fireworks.ai/inference/v1
api_key: ${env.FIREWORKS_API_KEY:=}
```

View file

@ -0,0 +1,25 @@
---
description: Google Gemini inference provider for accessing Gemini models and Google's
AI services
sidebar_label: Gemini
sidebar_position: 10
title: remote::gemini
---
# remote::gemini
## Description
Google Gemini inference provider for accessing Gemini models and Google's AI services.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | API key for Gemini models |
## Sample Configuration
```yaml
api_key: ${env.GEMINI_API_KEY:=}
```

View file

@ -0,0 +1,26 @@
---
description: Groq inference provider for ultra-fast inference using Groq's LPU technology
sidebar_label: Groq
sidebar_position: 11
title: remote::groq
---
# remote::groq
## Description
Groq inference provider for ultra-fast inference using Groq's LPU technology.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | The Groq API key |
| `url` | `<class 'str'>` | No | https://api.groq.com | The URL for the Groq AI server |
## Sample Configuration
```yaml
url: https://api.groq.com
api_key: ${env.GROQ_API_KEY:=}
```

View file

@ -0,0 +1,26 @@
---
description: HuggingFace Inference Endpoints provider for dedicated model serving
sidebar_label: Hugging Face Endpoint
sidebar_position: 12
title: remote::hf::endpoint
---
# remote::hf::endpoint
## Description
HuggingFace Inference Endpoints provider for dedicated model serving.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `endpoint_name` | `<class 'str'>` | No | | The name of the Hugging Face Inference Endpoint in the format of `{namespace}/{endpoint_name}` (e.g. 'my-cool-org/meta-llama-3-1-8b-instruct-rce'). Namespace is optional and will default to the user account if not provided. |
| `api_token` | `pydantic.types.SecretStr or None` | No | | Your Hugging Face user access token (will default to locally saved token if not provided) |
## Sample Configuration
```yaml
endpoint_name: ${env.INFERENCE_ENDPOINT_NAME}
api_token: ${env.HF_API_TOKEN}
```

View file

@ -0,0 +1,26 @@
---
description: HuggingFace Inference API serverless provider for on-demand model inference
sidebar_label: Hugging Face Serverless
sidebar_position: 13
title: remote::hf::serverless
---
# remote::hf::serverless
## Description
HuggingFace Inference API serverless provider for on-demand model inference.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `huggingface_repo` | `<class 'str'>` | No | | The model ID of the model on the Hugging Face Hub (e.g. 'meta-llama/Meta-Llama-3.1-70B-Instruct') |
| `api_token` | `pydantic.types.SecretStr \| None` | No | | Your Hugging Face user access token (will default to locally saved token if not provided) |
## Sample Configuration
```yaml
huggingface_repo: ${env.INFERENCE_MODEL}
api_token: ${env.HF_API_TOKEN}
```

View file

@ -0,0 +1,27 @@
---
description: Llama OpenAI-compatible provider for using Llama models with OpenAI API
format
sidebar_label: Llama OpenAI Compatible
sidebar_position: 14
title: remote::llama-openai-compat
---
# remote::llama-openai-compat
## Description
Llama OpenAI-compatible provider for using Llama models with OpenAI API format.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | The Llama API key |
| `openai_compat_api_base` | `<class 'str'>` | No | https://api.llama.com/compat/v1/ | The URL for the Llama API server |
## Sample Configuration
```yaml
openai_compat_api_base: https://api.llama.com/compat/v1/
api_key: ${env.LLAMA_API_KEY}
```

View file

@ -0,0 +1,29 @@
---
description: NVIDIA inference provider for accessing NVIDIA NIM models and AI services
sidebar_label: Nvidia
sidebar_position: 15
title: remote::nvidia
---
# remote::nvidia
## Description
NVIDIA inference provider for accessing NVIDIA NIM models and AI services.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `<class 'str'>` | No | https://integrate.api.nvidia.com | A base url for accessing the NVIDIA NIM |
| `api_key` | `pydantic.types.SecretStr \| None` | No | | The NVIDIA API key, only needed of using the hosted service |
| `timeout` | `<class 'int'>` | No | 60 | Timeout for the HTTP requests |
| `append_api_version` | `<class 'bool'>` | No | True | When set to false, the API version will not be appended to the base_url. By default, it is true. |
## Sample Configuration
```yaml
url: ${env.NVIDIA_BASE_URL:=https://integrate.api.nvidia.com}
api_key: ${env.NVIDIA_API_KEY:=}
append_api_version: ${env.NVIDIA_APPEND_API_VERSION:=True}
```

View file

@ -0,0 +1,26 @@
---
description: Ollama inference provider for running local models through the Ollama
runtime
sidebar_label: Ollama
sidebar_position: 16
title: remote::ollama
---
# remote::ollama
## Description
Ollama inference provider for running local models through the Ollama runtime.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `<class 'str'>` | No | http://localhost:11434 | |
| `refresh_models` | `<class 'bool'>` | No | False | Whether to refresh models periodically |
## Sample Configuration
```yaml
url: ${env.OLLAMA_URL:=http://localhost:11434}
```

View file

@ -0,0 +1,26 @@
---
description: OpenAI inference provider for accessing GPT models and other OpenAI services
sidebar_label: Openai
sidebar_position: 17
title: remote::openai
---
# remote::openai
## Description
OpenAI inference provider for accessing GPT models and other OpenAI services.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | API key for OpenAI models |
| `base_url` | `<class 'str'>` | No | https://api.openai.com/v1 | Base URL for OpenAI API |
## Sample Configuration
```yaml
api_key: ${env.OPENAI_API_KEY:=}
base_url: ${env.OPENAI_BASE_URL:=https://api.openai.com/v1}
```

View file

@ -0,0 +1,27 @@
---
description: Passthrough inference provider for connecting to any external inference
service not directly supported
sidebar_label: Passthrough
sidebar_position: 18
title: remote::passthrough
---
# remote::passthrough
## Description
Passthrough inference provider for connecting to any external inference service not directly supported.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `<class 'str'>` | No | | The URL for the passthrough endpoint |
| `api_key` | `pydantic.types.SecretStr \| None` | No | | API Key for the passthrouth endpoint |
## Sample Configuration
```yaml
url: ${env.PASSTHROUGH_URL}
api_key: ${env.PASSTHROUGH_API_KEY}
```

View file

@ -0,0 +1,26 @@
---
description: RunPod inference provider for running models on RunPod's cloud GPU platform
sidebar_label: Runpod
sidebar_position: 19
title: remote::runpod
---
# remote::runpod
## Description
RunPod inference provider for running models on RunPod's cloud GPU platform.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `str \| None` | No | | The URL for the Runpod model serving endpoint |
| `api_token` | `str \| None` | No | | The API token |
## Sample Configuration
```yaml
url: ${env.RUNPOD_URL:=}
api_token: ${env.RUNPOD_API_TOKEN}
```

View file

@ -0,0 +1,27 @@
---
description: SambaNova OpenAI-compatible provider for using SambaNova models with
OpenAI API format
sidebar_label: SambaNova OpenAI Compatible
sidebar_position: 21
title: remote::sambanova-openai-compat
---
# remote::sambanova-openai-compat
## Description
SambaNova OpenAI-compatible provider for using SambaNova models with OpenAI API format.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | The SambaNova API key |
| `openai_compat_api_base` | `<class 'str'>` | No | https://api.sambanova.ai/v1 | The URL for the SambaNova API server |
## Sample Configuration
```yaml
openai_compat_api_base: https://api.sambanova.ai/v1
api_key: ${env.SAMBANOVA_API_KEY:=}
```

View file

@ -0,0 +1,27 @@
---
description: SambaNova inference provider for running models on SambaNova's dataflow
architecture
sidebar_label: Sambanova
sidebar_position: 20
title: remote::sambanova
---
# remote::sambanova
## Description
SambaNova inference provider for running models on SambaNova's dataflow architecture.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `<class 'str'>` | No | https://api.sambanova.ai/v1 | The URL for the SambaNova AI server |
| `api_key` | `pydantic.types.SecretStr \| None` | No | | The SambaNova cloud API Key |
## Sample Configuration
```yaml
url: https://api.sambanova.ai/v1
api_key: ${env.SAMBANOVA_API_KEY:=}
```

View file

@ -0,0 +1,24 @@
---
description: Text Generation Inference (TGI) provider for HuggingFace model serving
sidebar_label: Tgi
sidebar_position: 22
title: remote::tgi
---
# remote::tgi
## Description
Text Generation Inference (TGI) provider for HuggingFace model serving.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `<class 'str'>` | No | | The URL for the TGI serving endpoint |
## Sample Configuration
```yaml
url: ${env.TGI_URL:=}
```

View file

@ -0,0 +1,28 @@
---
description: Together AI inference provider for open-source models and collaborative
AI development
sidebar_label: Together
sidebar_position: 23
title: remote::together
---
# remote::together
## Description
Together AI inference provider for open-source models and collaborative AI development.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `allowed_models` | `list[str \| None` | No | | List of models that should be registered with the model registry. If None, all models are allowed. |
| `url` | `<class 'str'>` | No | https://api.together.xyz/v1 | The URL for the Together AI server |
| `api_key` | `pydantic.types.SecretStr \| None` | No | | The Together AI API Key |
## Sample Configuration
```yaml
url: https://api.together.xyz/v1
api_key: ${env.TOGETHER_API_KEY:=}
```

View file

@ -0,0 +1,56 @@
---
description: "Google Vertex AI inference provider enables you to use Google's Gemini\
\ models through Google Cloud's Vertex AI platform, providing several advantages:\n\
\n\u2022 Enterprise-grade security: Uses Google Cloud's security controls and IAM\n\
\u2022 Better integration: Seamless integration with other Google Cloud services\n\
\u2022 Advanced features: Access to additional Vertex AI features like model tuning\
\ and monitoring\n\u2022 Authentication: Uses Google Cloud Application Default Credentials\
\ (ADC) instead of API keys\n\nConfiguration:\n- Set VERTEX_AI_PROJECT environment\
\ variable (required)\n- Set VERTEX_AI_LOCATION environment variable (optional,\
\ defaults to us-central1)\n- Use Google Cloud Application Default Credentials or\
\ service account key\n\nAuthentication Setup:\nOption 1 (Recommended): gcloud auth\
\ application-default login\nOption 2: Set GOOGLE_APPLICATION_CREDENTIALS to service\
\ account key path\n\nAvailable Models:\n- vertex_ai/gemini-2"
sidebar_label: Vertexai
sidebar_position: 24
title: remote::vertexai
---
# remote::vertexai
## Description
Google Vertex AI inference provider enables you to use Google's Gemini models through Google Cloud's Vertex AI platform, providing several advantages:
• Enterprise-grade security: Uses Google Cloud's security controls and IAM
• Better integration: Seamless integration with other Google Cloud services
• Advanced features: Access to additional Vertex AI features like model tuning and monitoring
• Authentication: Uses Google Cloud Application Default Credentials (ADC) instead of API keys
Configuration:
- Set VERTEX_AI_PROJECT environment variable (required)
- Set VERTEX_AI_LOCATION environment variable (optional, defaults to us-central1)
- Use Google Cloud Application Default Credentials or service account key
Authentication Setup:
Option 1 (Recommended): gcloud auth application-default login
Option 2: Set GOOGLE_APPLICATION_CREDENTIALS to service account key path
Available Models:
- vertex_ai/gemini-2.0-flash
- vertex_ai/gemini-2.5-flash
- vertex_ai/gemini-2.5-pro
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `project` | `<class 'str'>` | No | | Google Cloud project ID for Vertex AI |
| `location` | `<class 'str'>` | No | us-central1 | Google Cloud location for Vertex AI |
## Sample Configuration
```yaml
project: ${env.VERTEX_AI_PROJECT:=}
location: ${env.VERTEX_AI_LOCATION:=us-central1}
```

View file

@ -0,0 +1,31 @@
---
description: Remote vLLM inference provider for connecting to vLLM servers
sidebar_label: Vllm
sidebar_position: 25
title: remote::vllm
---
# remote::vllm
## Description
Remote vLLM inference provider for connecting to vLLM servers.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `str \| None` | No | | The URL for the vLLM model serving endpoint |
| `max_tokens` | `<class 'int'>` | No | 4096 | Maximum number of tokens to generate. |
| `api_token` | `str \| None` | No | fake | The API token |
| `tls_verify` | `bool \| str` | No | True | Whether to verify TLS certificates. Can be a boolean or a path to a CA certificate file. |
| `refresh_models` | `<class 'bool'>` | No | False | Whether to refresh models periodically |
## Sample Configuration
```yaml
url: ${env.VLLM_URL:=}
max_tokens: ${env.VLLM_MAX_TOKENS:=4096}
api_token: ${env.VLLM_API_TOKEN:=fake}
tls_verify: ${env.VLLM_TLS_VERIFY:=true}
```

View file

@ -0,0 +1,30 @@
---
description: IBM WatsonX inference provider for accessing AI models on IBM's WatsonX
platform
sidebar_label: Watsonx
sidebar_position: 26
title: remote::watsonx
---
# remote::watsonx
## Description
IBM WatsonX inference provider for accessing AI models on IBM's WatsonX platform.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `<class 'str'>` | No | https://us-south.ml.cloud.ibm.com | A base url for accessing the watsonx.ai |
| `api_key` | `pydantic.types.SecretStr \| None` | No | | The watsonx API key, only needed of using the hosted service |
| `project_id` | `str \| None` | No | | The Project ID key, only needed of using the hosted service |
| `timeout` | `<class 'int'>` | No | 60 | Timeout for the HTTP requests |
## Sample Configuration
```yaml
url: ${env.WATSONX_BASE_URL:=https://us-south.ml.cloud.ibm.com}
api_key: ${env.WATSONX_API_KEY:=}
project_id: ${env.WATSONX_PROJECT_ID:=}
```

View file

@ -0,0 +1,332 @@
---
title: OpenAI API Compatibility
description: Use OpenAI clients and libraries with Llama Stack for seamless integration
sidebar_label: OpenAI Compatibility
sidebar_position: 11
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# OpenAI API Compatibility
Llama Stack provides OpenAI-compatible API endpoints, allowing you to use existing OpenAI clients and libraries with Llama Stack servers.
## Server Path
Llama Stack exposes an OpenAI-compatible API endpoint at `/v1/openai/v1`.
For a Llama Stack server running locally on port `8321`, the full URL to the OpenAI-compatible API endpoint is:
```
http://localhost:8321/v1/openai/v1
```
## Client Configuration
You can use any client that speaks OpenAI APIs with Llama Stack. We regularly test with the official Llama Stack clients as well as OpenAI's official Python client.
<Tabs>
<TabItem value="llamastack" label="Llama Stack Client">
When using the Llama Stack client, set the `base_url` to the root of your Llama Stack server. It will automatically route OpenAI-compatible requests to the right server endpoint for you.
```python
from llama_stack_client import LlamaStackClient
client = LlamaStackClient(base_url="http://localhost:8321")
```
</TabItem>
<TabItem value="openai" label="OpenAI Client">
When using an OpenAI client, set the `base_url` to the `/v1/openai/v1` path on your Llama Stack server.
```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8321/v1/openai/v1",
api_key="none" # API key not required for local Llama Stack
)
```
</TabItem>
</Tabs>
Regardless of the client you choose, the following code examples should all work the same.
## Available APIs
### Models
Many of the APIs require you to pass in a model parameter. To see the list of models available in your Llama Stack server:
```python
models = client.models.list()
for model in models:
print(f"Model ID: {model.id}")
```
### Responses
:::info[Development Status]
The Responses API implementation is still in active development. While it is quite usable, there are still unimplemented parts of the API. We'd love feedback on any use-cases you try that do not work to help prioritize the pieces left to implement. Please open issues in the [meta-llama/llama-stack](https://github.com/meta-llama/llama-stack) GitHub repository with details of anything that does not work.
:::
#### Simple Inference
Request:
```python
response = client.responses.create(
model="meta-llama/Llama-3.2-3B-Instruct",
input="Write a haiku about coding."
)
print(response.output_text)
```
**Example output:**
```text
Pixels dancing slow
Syntax whispers secrets sweet
Code's gentle silence
```
#### Structured Output
Request:
```python
response = client.responses.create(
model="meta-llama/Llama-3.2-3B-Instruct",
input=[
{
"role": "system",
"content": "Extract the participants from the event information.",
},
{
"role": "user",
"content": "Alice and Bob are going to a science fair on Friday.",
},
],
text={
"format": {
"type": "json_schema",
"name": "participants",
"schema": {
"type": "object",
"properties": {
"participants": {"type": "array", "items": {"type": "string"}}
},
"required": ["participants"],
},
}
},
)
print(response.output_text)
```
**Example output:**
```json
{ "participants": ["Alice", "Bob"] }
```
### Chat Completions
#### Simple Inference
Request:
```python
chat_completion = client.chat.completions.create(
model="meta-llama/Llama-3.2-3B-Instruct",
messages=[{"role": "user", "content": "Write a haiku about coding."}],
)
print(chat_completion.choices[0].message.content)
```
**Example output:**
```text
Lines of code unfold
Logic flows like a river
Code's gentle beauty
```
#### Structured Output
Request:
```python
chat_completion = client.chat.completions.create(
model="meta-llama/Llama-3.2-3B-Instruct",
messages=[
{
"role": "system",
"content": "Extract the participants from the event information.",
},
{
"role": "user",
"content": "Alice and Bob are going to a science fair on Friday.",
},
],
response_format={
"type": "json_schema",
"json_schema": {
"name": "participants",
"schema": {
"type": "object",
"properties": {
"participants": {"type": "array", "items": {"type": "string"}}
},
"required": ["participants"],
},
},
},
)
print(chat_completion.choices[0].message.content)
```
**Example output:**
```json
{ "participants": ["Alice", "Bob"] }
```
#### Streaming Responses
Request:
```python
stream = client.chat.completions.create(
model="meta-llama/Llama-3.2-3B-Instruct",
messages=[{"role": "user", "content": "Count from 1 to 10."}],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content is not None:
print(chunk.choices[0].delta.content, end="")
```
### Completions
#### Simple Inference
Request:
```python
completion = client.completions.create(
model="meta-llama/Llama-3.2-3B-Instruct",
prompt="Write a haiku about coding."
)
print(completion.choices[0].text)
```
**Example output:**
```text
Lines of code unfurl
Logic whispers in the dark
Art in hidden form
```
## Migration from OpenAI
### Quick Migration Steps
1. **Update base URL**: Change from OpenAI's API endpoint to your Llama Stack server
2. **Remove API key requirement**: Local Llama Stack doesn't require authentication by default
3. **Update model names**: Use Llama Stack model identifiers
4. **Test functionality**: Verify that your existing code works as expected
### Migration Example
<Tabs>
<TabItem value="before" label="Before (OpenAI)">
```python
from openai import OpenAI
client = OpenAI(api_key="your-openai-api-key")
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "Hello!"}]
)
```
</TabItem>
<TabItem value="after" label="After (Llama Stack)">
```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8321/v1/openai/v1",
api_key="none" # Not required for local instance
)
response = client.chat.completions.create(
model="meta-llama/Llama-3.2-3B-Instruct", # Use Llama model
messages=[{"role": "user", "content": "Hello!"}]
)
```
</TabItem>
</Tabs>
## Benefits
### 🔄 **Seamless Integration**
Use existing OpenAI-compatible code with minimal changes
### 🏠 **Local Control**
Run models locally while keeping familiar API patterns
### 📚 **Library Compatibility**
Works with existing tools and libraries built for OpenAI APIs
### 🔧 **Easy Migration**
Gradually migrate from OpenAI to Llama Stack without rewriting applications
## Limitations
While Llama Stack strives for full OpenAI compatibility, some features may have differences:
- **Model-specific capabilities**: Different models may support different features
- **Response format variations**: Some response fields may differ slightly
- **Rate limiting**: Different rate limiting behavior compared to OpenAI
- **Authentication**: Local instances may not require authentication
## Troubleshooting
### Common Issues
**Connection Errors:**
- Verify the Llama Stack server is running
- Check the base URL includes the correct path (`/v1/openai/v1`)
- Ensure the port number is correct
**Model Not Found:**
- Use `client.models.list()` to see available models
- Verify the model is loaded in your Llama Stack configuration
**API Differences:**
- Check the Llama Stack logs for detailed error information
- Refer to the [API documentation](/docs/api/) for Llama Stack-specific details
## Next Steps
1. **🚀 [Set up a Llama Stack server](/docs/distributions/)** if you haven't already
2. **📖 [Explore the full API reference](/docs/api/)** for advanced features
3. **🔧 [Configure your models](/docs/distributions/configuration)** for optimal performance
4. **💬 [Join the community](https://github.com/meta-llama/llama-stack/discussions)** for support and tips
## Related Resources
- **[Llama Stack Client](/docs/getting-started/libraries)** - Official Python client
- **[Provider Documentation](/docs/providers/)** - Understanding the provider system
- **[Configuration Guide](/docs/distributions/configuration)** - Server configuration options

View file

@ -0,0 +1,22 @@
---
description: Available providers for the post_training API
sidebar_label: Overview
sidebar_position: 1
title: Post_Training
---
# Post_Training
## Overview
This section contains documentation for all available providers for the **post_training** API.
## Providers
- **[Huggingface](./inline_huggingface)** - Inline provider
- **[Huggingface Cpu](./inline_huggingface-cpu)** - Inline provider
- **[Huggingface Gpu](./inline_huggingface-gpu)** - Inline provider
- **[Torchtune](./inline_torchtune)** - Inline provider
- **[Torchtune Cpu](./inline_torchtune-cpu)** - Inline provider
- **[Torchtune Gpu](./inline_torchtune-gpu)** - Inline provider
- **[Nvidia](./remote_nvidia)** - Remote provider

View file

@ -0,0 +1,44 @@
---
description: HuggingFace-based post-training provider for fine-tuning models using
the HuggingFace ecosystem
sidebar_label: Huggingface Cpu
sidebar_position: 3
title: inline::huggingface-cpu
---
# inline::huggingface-cpu
## Description
HuggingFace-based post-training provider for fine-tuning models using the HuggingFace ecosystem.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `device` | `<class 'str'>` | No | cuda | |
| `distributed_backend` | `Literal['fsdp', 'deepspeed']` | No | | |
| `checkpoint_format` | `Literal['full_state', 'huggingface']` | No | huggingface | |
| `chat_template` | `<class 'str'>` | No | `<\|user\|>{input}<\|assistant\|>{output}` | |
| `model_specific_config` | `<class 'dict'>` | No | `{'trust_remote_code': True, 'attn_implementation': 'sdpa'}` | |
| `max_seq_length` | `<class 'int'>` | No | 2048 | |
| `gradient_checkpointing` | `<class 'bool'>` | No | False | |
| `save_total_limit` | `<class 'int'>` | No | 3 | |
| `logging_steps` | `<class 'int'>` | No | 10 | |
| `warmup_ratio` | `<class 'float'>` | No | 0.1 | |
| `weight_decay` | `<class 'float'>` | No | 0.01 | |
| `dataloader_num_workers` | `<class 'int'>` | No | 4 | |
| `dataloader_pin_memory` | `<class 'bool'>` | No | True | |
| `dpo_beta` | `<class 'float'>` | No | 0.1 | |
| `use_reference_model` | `<class 'bool'>` | No | True | |
| `dpo_loss_type` | `Literal['sigmoid', 'hinge', 'ipo', 'kto_pair'` | No | sigmoid | |
| `dpo_output_dir` | `<class 'str'>` | No | | |
## Sample Configuration
```yaml
checkpoint_format: huggingface
distributed_backend: null
device: cpu
dpo_output_dir: ~/.llama/dummy/dpo_output
```

View file

@ -0,0 +1,44 @@
---
description: HuggingFace-based post-training provider for fine-tuning models using
the HuggingFace ecosystem
sidebar_label: Huggingface Gpu
sidebar_position: 4
title: inline::huggingface-gpu
---
# inline::huggingface-gpu
## Description
HuggingFace-based post-training provider for fine-tuning models using the HuggingFace ecosystem.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `device` | `<class 'str'>` | No | cuda | |
| `distributed_backend` | `Literal['fsdp', 'deepspeed']` | No | | |
| `checkpoint_format` | `Literal['full_state', 'huggingface']` | No | huggingface | |
| `chat_template` | `<class 'str'>` | No | `<\|user\|>{input}<\|assistant\|>{output}` | |
| `model_specific_config` | `<class 'dict'>` | No | `{'trust_remote_code': True, 'attn_implementation': 'sdpa'}` | |
| `max_seq_length` | `<class 'int'>` | No | 2048 | |
| `gradient_checkpointing` | `<class 'bool'>` | No | False | |
| `save_total_limit` | `<class 'int'>` | No | 3 | |
| `logging_steps` | `<class 'int'>` | No | 10 | |
| `warmup_ratio` | `<class 'float'>` | No | 0.1 | |
| `weight_decay` | `<class 'float'>` | No | 0.01 | |
| `dataloader_num_workers` | `<class 'int'>` | No | 4 | |
| `dataloader_pin_memory` | `<class 'bool'>` | No | True | |
| `dpo_beta` | `<class 'float'>` | No | 0.1 | |
| `use_reference_model` | `<class 'bool'>` | No | True | |
| `dpo_loss_type` | `Literal['sigmoid', 'hinge', 'ipo', 'kto_pair'` | No | sigmoid | |
| `dpo_output_dir` | `<class 'str'>` | No | | |
## Sample Configuration
```yaml
checkpoint_format: huggingface
distributed_backend: null
device: cpu
dpo_output_dir: ~/.llama/dummy/dpo_output
```

View file

@ -0,0 +1,44 @@
---
description: HuggingFace-based post-training provider for fine-tuning models using
the HuggingFace ecosystem
sidebar_label: Huggingface
sidebar_position: 2
title: inline::huggingface
---
# inline::huggingface
## Description
HuggingFace-based post-training provider for fine-tuning models using the HuggingFace ecosystem.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `device` | `<class 'str'>` | No | cuda | |
| `distributed_backend` | `Literal['fsdp', 'deepspeed']` | No | | |
| `checkpoint_format` | `Literal['full_state', 'huggingface']` | No | huggingface | |
| `chat_template` | `<class 'str'>` | No | `<\|user\|>{input}<\|assistant\|>{output}` | |
| `model_specific_config` | `<class 'dict'>` | No | `{'trust_remote_code': True, 'attn_implementation': 'sdpa'}` | |
| `max_seq_length` | `<class 'int'>` | No | 2048 | |
| `gradient_checkpointing` | `<class 'bool'>` | No | False | |
| `save_total_limit` | `<class 'int'>` | No | 3 | |
| `logging_steps` | `<class 'int'>` | No | 10 | |
| `warmup_ratio` | `<class 'float'>` | No | 0.1 | |
| `weight_decay` | `<class 'float'>` | No | 0.01 | |
| `dataloader_num_workers` | `<class 'int'>` | No | 4 | |
| `dataloader_pin_memory` | `<class 'bool'>` | No | True | |
| `dpo_beta` | `<class 'float'>` | No | 0.1 | |
| `use_reference_model` | `<class 'bool'>` | No | True | |
| `dpo_loss_type` | `Literal['sigmoid', 'hinge', 'ipo', 'kto_pair'` | No | sigmoid | |
| `dpo_output_dir` | `<class 'str'>` | No | | |
## Sample Configuration
```yaml
checkpoint_format: huggingface
distributed_backend: null
device: cpu
dpo_output_dir: ~/.llama/dummy/dpo_output
```

View file

@ -0,0 +1,26 @@
---
description: TorchTune-based post-training provider for fine-tuning and optimizing
models using Meta's TorchTune framework
sidebar_label: Torchtune Cpu
sidebar_position: 6
title: inline::torchtune-cpu
---
# inline::torchtune-cpu
## Description
TorchTune-based post-training provider for fine-tuning and optimizing models using Meta's TorchTune framework.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `torch_seed` | `int \| None` | No | | |
| `checkpoint_format` | `Literal['meta', 'huggingface'` | No | meta | |
## Sample Configuration
```yaml
checkpoint_format: meta
```

View file

@ -0,0 +1,26 @@
---
description: TorchTune-based post-training provider for fine-tuning and optimizing
models using Meta's TorchTune framework
sidebar_label: Torchtune Gpu
sidebar_position: 7
title: inline::torchtune-gpu
---
# inline::torchtune-gpu
## Description
TorchTune-based post-training provider for fine-tuning and optimizing models using Meta's TorchTune framework.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `torch_seed` | `int \| None` | No | | |
| `checkpoint_format` | `Literal['meta', 'huggingface'` | No | meta | |
## Sample Configuration
```yaml
checkpoint_format: meta
```

View file

@ -0,0 +1,26 @@
---
description: TorchTune-based post-training provider for fine-tuning and optimizing
models using Meta's TorchTune framework
sidebar_label: Torchtune
sidebar_position: 5
title: inline::torchtune
---
# inline::torchtune
## Description
TorchTune-based post-training provider for fine-tuning and optimizing models using Meta's TorchTune framework.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `torch_seed` | `int \| None` | No | | |
| `checkpoint_format` | `Literal['meta', 'huggingface'` | No | meta | |
## Sample Configuration
```yaml
checkpoint_format: meta
```

View file

@ -0,0 +1,33 @@
---
description: NVIDIA's post-training provider for fine-tuning models on NVIDIA's platform
sidebar_label: Nvidia
sidebar_position: 8
title: remote::nvidia
---
# remote::nvidia
## Description
NVIDIA's post-training provider for fine-tuning models on NVIDIA's platform.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | The NVIDIA API key. |
| `dataset_namespace` | `str \| None` | No | default | The NVIDIA dataset namespace. |
| `project_id` | `str \| None` | No | test-example-model@v1 | The NVIDIA project ID. |
| `customizer_url` | `str \| None` | No | | Base URL for the NeMo Customizer API |
| `timeout` | `<class 'int'>` | No | 300 | Timeout for the NVIDIA Post Training API |
| `max_retries` | `<class 'int'>` | No | 3 | Maximum number of retries for the NVIDIA Post Training API |
| `output_model_dir` | `<class 'str'>` | No | test-example-model@v1 | Directory to save the output model |
## Sample Configuration
```yaml
api_key: ${env.NVIDIA_API_KEY:=}
dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default}
project_id: ${env.NVIDIA_PROJECT_ID:=test-project}
customizer_url: ${env.NVIDIA_CUSTOMIZER_URL:=http://nemo.test}
```

Some files were not shown because too many files have changed in this diff Show more