Merge remote-tracking branch 'upstream/main' into cdgamarose/add_nvidia_distro

merged with upstream
This commit is contained in:
Chantal D Gama Rose 2025-01-10 21:53:16 +05:30
commit 10faffcb44
404 changed files with 36136 additions and 8936 deletions

View file

@ -0,0 +1,167 @@
# Benchmark Evaluations
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing)
Llama Stack provides the building blocks needed to run benchmark and application evaluations. This guide will walk you through how to use these components to run open benchmark evaluations. Visit our [Evaluation Concepts](../concepts/evaluation_concepts.md) guide for more details on how evaluations work in Llama Stack, and our [Evaluation Reference](../references/evals_reference/index.md) guide for a comprehensive reference on the APIs. Check out our [Colab notebook](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing) on working examples on how you can use Llama Stack for running benchmark evaluations.
### 1. Open Benchmark Model Evaluation
This first example walks you through how to evaluate a model candidate served by Llama Stack on open benchmarks. We will use the following benchmark:
- [MMMU](https://arxiv.org/abs/2311.16502) (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI): Benchmark designed to evaluate multimodal models.
- [SimpleQA](https://openai.com/index/introducing-simpleqa/): Benchmark designed to access models to answer short, fact-seeking questions.
#### 1.1 Running MMMU
- We will use a pre-processed MMMU dataset from [llamastack/mmmu](https://huggingface.co/datasets/llamastack/mmmu). The preprocessing code is shown in in this [Github Gist](https://gist.github.com/yanxi0830/118e9c560227d27132a7fd10e2c92840). The dataset is obtained by transforming the original [MMMU/MMMU](https://huggingface.co/datasets/MMMU/MMMU) dataset into correct format by `inference/chat-completion` API.
```python
import datasets
ds = datasets.load_dataset(path="llamastack/mmmu", name="Agriculture", split="dev")
ds = ds.select_columns(["chat_completion_input", "input_query", "expected_answer"])
eval_rows = ds.to_pandas().to_dict(orient="records")
```
- Next, we will run evaluation on an model candidate, we will need to:
- Define a system prompt
- Define an EvalCandidate
- Run evaluate on the dataset
```python
SYSTEM_PROMPT_TEMPLATE = """
You are an expert in Agriculture whose job is to answer questions from the user using images.
First, reason about the correct answer.
Then write the answer in the following format where X is exactly one of A,B,C,D:
Answer: X
Make sure X is one of A,B,C,D.
If you are uncertain of the correct answer, guess the most likely one.
"""
system_message = {
"role": "system",
"content": SYSTEM_PROMPT_TEMPLATE,
}
client.eval_tasks.register(
eval_task_id="meta-reference::mmmu",
dataset_id=f"mmmu-{subset}-{split}",
scoring_functions=["basic::regex_parser_multiple_choice_answer"]
)
response = client.eval.evaluate_rows(
task_id="meta-reference::mmmu",
input_rows=eval_rows,
scoring_functions=["basic::regex_parser_multiple_choice_answer"],
task_config={
"type": "benchmark",
"eval_candidate": {
"type": "model",
"model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
"sampling_params": {
"temperature": 0.0,
"max_tokens": 4096,
"top_p": 0.9,
"repeat_penalty": 1.0,
},
"system_message": system_message
}
}
)
```
#### 1.2. Running SimpleQA
- We will use a pre-processed SimpleQA dataset from [llamastack/evals](https://huggingface.co/datasets/llamastack/evals/viewer/evals__simpleqa) which is obtained by transforming the input query into correct format accepted by `inference/chat-completion` API.
- Since we will be using this same dataset in our next example for Agentic evaluation, we will register it using the `/datasets` API, and interact with it through `/datasetio` API.
```python
simpleqa_dataset_id = "huggingface::simpleqa"
_ = client.datasets.register(
dataset_id=simpleqa_dataset_id,
provider_id="huggingface",
url={"uri": "https://huggingface.co/datasets/llamastack/evals"},
metadata={
"path": "llamastack/evals",
"name": "evals__simpleqa",
"split": "train",
},
dataset_schema={
"input_query": {"type": "string"},
"expected_answer": {"type": "string"},
"chat_completion_input": {"type": "chat_completion_input"},
}
)
eval_rows = client.datasetio.get_rows_paginated(
dataset_id=simpleqa_dataset_id,
rows_in_page=5,
)
```
```python
client.eval_tasks.register(
eval_task_id="meta-reference::simpleqa",
dataset_id=simpleqa_dataset_id,
scoring_functions=["llm-as-judge::405b-simpleqa"]
)
response = client.eval.evaluate_rows(
task_id="meta-reference::simpleqa",
input_rows=eval_rows.rows,
scoring_functions=["llm-as-judge::405b-simpleqa"],
task_config={
"type": "benchmark",
"eval_candidate": {
"type": "model",
"model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
"sampling_params": {
"temperature": 0.0,
"max_tokens": 4096,
"top_p": 0.9,
"repeat_penalty": 1.0,
},
}
}
)
```
### 2. Agentic Evaluation
- In this example, we will demonstrate how to evaluate a agent candidate served by Llama Stack via `/agent` API.
- We will continue to use the SimpleQA dataset we used in previous example.
- Instead of running evaluation on model, we will run the evaluation on a Search Agent with access to search tool. We will define our agent evaluation candidate through `AgentConfig`.
```python
agent_config = {
"model": "meta-llama/Llama-3.1-405B-Instruct",
"instructions": "You are a helpful assistant",
"sampling_params": {
"strategy": "greedy",
"temperature": 0.0,
"top_p": 0.95,
},
"tools": [
{
"type": "brave_search",
"engine": "tavily",
"api_key": userdata.get("TAVILY_SEARCH_API_KEY")
}
],
"tool_choice": "auto",
"tool_prompt_format": "json",
"input_shields": [],
"output_shields": [],
"enable_session_persistence": False
}
response = client.eval.evaluate_rows(
task_id="meta-reference::simpleqa",
input_rows=eval_rows.rows,
scoring_functions=["llm-as-judge::405b-simpleqa"],
task_config={
"type": "benchmark",
"eval_candidate": {
"type": "agent",
"config": agent_config,
}
}
)
```

View file

@ -1,15 +1,421 @@
# Building Applications
# Building AI Applications
```{admonition} Work in Progress
:class: warning
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1F2ksmkoGQPa4pzRjMOE6BXWeOxWFIW6n?usp=sharing)
## What can you do with the Stack?
Llama Stack provides all the building blocks needed to create sophisticated AI applications. This guide will walk you through how to use these components effectively. Check out our Colab notebook on to follow along working examples on how you can build LLM-powered agentic applications using Llama Stack.
- Agents
- what is a turn? session?
- inference
- memory / RAG; pre-ingesting content or attaching content in a turn
- how does tool calling work
- can you do evaluation?
## Basic Inference
The foundation of any AI application is the ability to interact with LLM models. Llama Stack provides a simple interface for both completion and chat-based inference:
```python
from llama_stack_client import LlamaStackClient
client = LlamaStackClient(base_url="http://localhost:5001")
# List available models
models = client.models.list()
# Simple chat completion
response = client.inference.chat_completion(
model_id="Llama3.2-3B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a haiku about coding"}
]
)
print(response.completion_message.content)
```
## Adding Memory & RAG
Memory enables your applications to reference and recall information from previous interactions or external documents. Llama Stack's memory system is built around the concept of Memory Banks:
1. **Vector Memory Banks**: For semantic search and retrieval
2. **Key-Value Memory Banks**: For structured data storage
3. **Keyword Memory Banks**: For basic text search
4. **Graph Memory Banks**: For relationship-based retrieval
Here's how to set up a vector memory bank for RAG:
```python
# Register a memory bank
bank_id = "my_documents"
response = client.memory_banks.register(
memory_bank_id=bank_id,
params={
"memory_bank_type": "vector",
"embedding_model": "all-MiniLM-L6-v2",
"chunk_size_in_tokens": 512
}
)
# Insert documents
documents = [
{
"document_id": "doc1",
"content": "Your document text here",
"mime_type": "text/plain"
}
]
client.memory.insert(bank_id, documents)
# Query documents
results = client.memory.query(
bank_id=bank_id,
query="What do you know about...",
)
```
## Implementing Safety Guardrails
Safety is a critical component of any AI application. Llama Stack provides a Shield system that can be applied at multiple touchpoints:
```python
# Register a safety shield
shield_id = "content_safety"
client.shields.register(
shield_id=shield_id,
provider_shield_id="llama-guard-basic"
)
# Run content through shield
response = client.safety.run_shield(
shield_id=shield_id,
messages=[{"role": "user", "content": "User message here"}]
)
if response.violation:
print(f"Safety violation detected: {response.violation.user_message}")
```
## Building Agents
Agents are the heart of complex AI applications. They combine inference, memory, safety, and tool usage into coherent workflows. At its core, an agent follows a sophisticated execution loop that enables multi-step reasoning, tool usage, and safety checks.
### The Agent Execution Loop
Each agent turn follows these key steps:
1. **Initial Safety Check**: The user's input is first screened through configured safety shields
2. **Context Retrieval**:
- If RAG is enabled, the agent queries relevant documents from memory banks
- For new documents, they are first inserted into the memory bank
- Retrieved context is augmented to the user's prompt
3. **Inference Loop**: The agent enters its main execution loop:
- The LLM receives the augmented prompt (with context and/or previous tool outputs)
- The LLM generates a response, potentially with tool calls
- If tool calls are present:
- Tool inputs are safety-checked
- Tools are executed (e.g., web search, code execution)
- Tool responses are fed back to the LLM for synthesis
- The loop continues until:
- The LLM provides a final response without tool calls
- Maximum iterations are reached
- Token limit is exceeded
4. **Final Safety Check**: The agent's final response is screened through safety shields
```{mermaid}
sequenceDiagram
participant U as User
participant E as Executor
participant M as Memory Bank
participant L as LLM
participant T as Tools
participant S as Safety Shield
Note over U,S: Agent Turn Start
U->>S: 1. Submit Prompt
activate S
S->>E: Input Safety Check
deactivate S
E->>M: 2.1 Query Context
M-->>E: 2.2 Retrieved Documents
loop Inference Loop
E->>L: 3.1 Augment with Context
L-->>E: 3.2 Response (with/without tool calls)
alt Has Tool Calls
E->>S: Check Tool Input
S->>T: 4.1 Execute Tool
T-->>E: 4.2 Tool Response
E->>L: 5.1 Tool Response
L-->>E: 5.2 Synthesized Response
end
opt Stop Conditions
Note over E: Break if:
Note over E: - No tool calls
Note over E: - Max iterations reached
Note over E: - Token limit exceeded
end
end
E->>S: Output Safety Check
S->>U: 6. Final Response
```
Each step in this process can be monitored and controlled through configurations. Here's an example that demonstrates monitoring the agent's execution:
```python
from llama_stack_client.lib.agents.event_logger import EventLogger
agent_config = AgentConfig(
model="Llama3.2-3B-Instruct",
instructions="You are a helpful assistant",
# Enable both RAG and tool usage
tools=[
{
"type": "memory",
"memory_bank_configs": [{
"type": "vector",
"bank_id": "my_docs"
}],
"max_tokens_in_context": 4096
},
{
"type": "code_interpreter",
"enable_inline_code_execution": True
}
],
# Configure safety
input_shields=["content_safety"],
output_shields=["content_safety"],
# Control the inference loop
max_infer_iters=5,
sampling_params={
"temperature": 0.7,
"max_tokens": 2048
}
)
agent = Agent(client, agent_config)
session_id = agent.create_session("monitored_session")
# Stream the agent's execution steps
response = agent.create_turn(
messages=[{"role": "user", "content": "Analyze this code and run it"}],
attachments=[{
"content": "https://raw.githubusercontent.com/example/code.py",
"mime_type": "text/plain"
}],
session_id=session_id
)
# Monitor each step of execution
for log in EventLogger().log(response):
if log.event.step_type == "memory_retrieval":
print("Retrieved context:", log.event.retrieved_context)
elif log.event.step_type == "inference":
print("LLM output:", log.event.model_response)
elif log.event.step_type == "tool_execution":
print("Tool call:", log.event.tool_call)
print("Tool response:", log.event.tool_response)
elif log.event.step_type == "shield_call":
if log.event.violation:
print("Safety violation:", log.event.violation)
```
This example shows how an agent can: Llama Stack provides a high-level agent framework:
```python
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.types.agent_create_params import AgentConfig
# Configure an agent
agent_config = AgentConfig(
model="Llama3.2-3B-Instruct",
instructions="You are a helpful assistant",
tools=[
{
"type": "memory",
"memory_bank_configs": [],
"query_generator_config": {
"type": "default",
"sep": " "
}
}
],
input_shields=["content_safety"],
output_shields=["content_safety"],
enable_session_persistence=True
)
# Create an agent
agent = Agent(client, agent_config)
session_id = agent.create_session("my_session")
# Run agent turns
response = agent.create_turn(
messages=[{"role": "user", "content": "Your question here"}],
session_id=session_id
)
```
### Adding Tools to Agents
Agents can be enhanced with various tools:
1. **Search**: Web search capabilities through providers like Brave
2. **Code Interpreter**: Execute code snippets
3. **RAG**: Memory and document retrieval
4. **Function Calling**: Custom function execution
5. **WolframAlpha**: Mathematical computations
6. **Photogen**: Image generation
Example of configuring an agent with tools:
```python
agent_config = AgentConfig(
model="Llama3.2-3B-Instruct",
tools=[
{
"type": "brave_search",
"api_key": "YOUR_API_KEY",
"engine": "brave"
},
{
"type": "code_interpreter",
"enable_inline_code_execution": True
}
],
tool_choice="auto",
tool_prompt_format="json"
)
```
## Building RAG-Enhanced Agents
One of the most powerful patterns is combining agents with RAG capabilities. Here's a complete example:
```python
from llama_stack_client.types import Attachment
# Create attachments from documents
attachments = [
Attachment(
content="https://raw.githubusercontent.com/example/doc.rst",
mime_type="text/plain"
)
]
# Configure agent with memory
agent_config = AgentConfig(
model="Llama3.2-3B-Instruct",
instructions="You are a helpful assistant",
tools=[{
"type": "memory",
"memory_bank_configs": [],
"query_generator_config": {"type": "default", "sep": " "},
"max_tokens_in_context": 4096,
"max_chunks": 10
}],
enable_session_persistence=True
)
agent = Agent(client, agent_config)
session_id = agent.create_session("rag_session")
# Initial document ingestion
response = agent.create_turn(
messages=[{
"role": "user",
"content": "I am providing some documents for reference."
}],
attachments=attachments,
session_id=session_id
)
# Query with RAG
response = agent.create_turn(
messages=[{
"role": "user",
"content": "What are the key topics in the documents?"
}],
session_id=session_id
)
```
## Testing & Evaluation
Llama Stack provides built-in tools for evaluating your applications:
1. **Benchmarking**: Test against standard datasets
2. **Application Evaluation**: Score your application's outputs
3. **Custom Metrics**: Define your own evaluation criteria
Here's how to set up basic evaluation:
```python
# Create an evaluation task
response = client.eval_tasks.register(
eval_task_id="my_eval",
dataset_id="my_dataset",
scoring_functions=["accuracy", "relevance"]
)
# Run evaluation
job = client.eval.run_eval(
task_id="my_eval",
task_config={
"type": "app",
"eval_candidate": {
"type": "agent",
"config": agent_config
}
}
)
# Get results
result = client.eval.job_result(
task_id="my_eval",
job_id=job.job_id
)
```
## Debugging & Monitoring
Llama Stack includes comprehensive telemetry for debugging and monitoring your applications:
1. **Tracing**: Track request flows across components
2. **Metrics**: Measure performance and usage
3. **Logging**: Debug issues and track behavior
The telemetry system supports multiple output formats:
- OpenTelemetry for visualization in tools like Jaeger
- SQLite for local storage and querying
- Console output for development
Example of querying traces:
```python
# Query traces for a session
traces = client.telemetry.query_traces(
attribute_filters=[{
"key": "session_id",
"op": "eq",
"value": session_id
}]
)
# Get spans within the root span; indexed by ID
# Use parent_span_id to build a tree out of it
spans_by_id = client.telemetry.get_span_tree(
span_id=traces[0].root_span_id
)
```
For details on how to use the telemetry system to debug your applications, export traces to a dataset, and run evaluations, see the [Telemetry](telemetry) section.
```{toctree}
:hidden:
:maxdepth: 3
telemetry
```

View file

@ -0,0 +1,242 @@
# Telemetry
```{note}
The telemetry system is currently experimental and subject to change. We welcome feedback and contributions to help improve it.
```
The Llama Stack telemetry system provides comprehensive tracing, metrics, and logging capabilities. It supports multiple sink types including OpenTelemetry, SQLite, and Console output.
## Key Concepts
### Events
The telemetry system supports three main types of events:
- **Unstructured Log Events**: Free-form log messages with severity levels
```python
unstructured_log_event = UnstructuredLogEvent(
message="This is a log message",
severity=LogSeverity.INFO
)
```
- **Metric Events**: Numerical measurements with units
```python
metric_event = MetricEvent(
metric="my_metric",
value=10,
unit="count"
)
```
- **Structured Log Events**: System events like span start/end. Extensible to add more structured log types.
```python
structured_log_event = SpanStartPayload(
name="my_span",
parent_span_id="parent_span_id"
)
```
### Spans and Traces
- **Spans**: Represent operations with timing and hierarchical relationships
- **Traces**: Collection of related spans forming a complete request flow
### Sinks
- **OpenTelemetry**: Send events to an OpenTelemetry Collector. This is useful for visualizing traces in a tool like Jaeger.
- **SQLite**: Store events in a local SQLite database. This is needed if you want to query the events later through the Llama Stack API.
- **Console**: Print events to the console.
## APIs
The telemetry API is designed to be flexible for different user flows like debugging/visualization in UI, monitoring, and saving traces to datasets.
The telemetry system exposes the following HTTP endpoints:
### Log Event
```http
POST /telemetry/log-event
```
Logs a telemetry event (unstructured log, metric, or structured log) with optional TTL.
### Query Traces
```http
POST /telemetry/query-traces
```
Retrieves traces based on filters with pagination support. Parameters:
- `attribute_filters`: List of conditions to filter traces
- `limit`: Maximum number of traces to return (default: 100)
- `offset`: Number of traces to skip (default: 0)
- `order_by`: List of fields to sort by
### Get Span Tree
```http
POST /telemetry/get-span-tree
```
Retrieves a hierarchical view of spans starting from a specific span. Parameters:
- `span_id`: ID of the root span to retrieve
- `attributes_to_return`: Optional list of specific attributes to include
- `max_depth`: Optional maximum depth of the span tree to return
### Query Spans
```http
POST /telemetry/query-spans
```
Retrieves spans matching specified filters and returns selected attributes. Parameters:
- `attribute_filters`: List of conditions to filter traces
- `attributes_to_return`: List of specific attributes to include in results
- `max_depth`: Optional maximum depth of spans to traverse (default: no limit)
Returns a flattened list of spans with requested attributes.
### Save Spans to Dataset
This is useful for saving traces to a dataset for running evaluations. For example, you can save the input/output of each span that is part of an agent session/turn to a dataset and then run an eval task on it. See example in [Example: Save Spans to Dataset](#example-save-spans-to-dataset).
```http
POST /telemetry/save-spans-to-dataset
```
Queries spans and saves their attributes to a dataset. Parameters:
- `attribute_filters`: List of conditions to filter traces
- `attributes_to_save`: List of span attributes to save to the dataset
- `dataset_id`: ID of the dataset to save to
- `max_depth`: Optional maximum depth of spans to traverse (default: no limit)
## Providers
### Meta-Reference Provider
Currently, only the meta-reference provider is implemented. It can be configured to send events to three sink types:
1) OpenTelemetry Collector
2) SQLite
3) Console
## Configuration
Here's an example that sends telemetry signals to all three sink types. Your configuration might use only one.
```yaml
telemetry:
- provider_id: meta-reference
provider_type: inline::meta-reference
config:
sinks: ['console', 'sqlite', 'otel']
otel_endpoint: "http://localhost:4318/v1/traces"
sqlite_db_path: "/path/to/telemetry.db"
```
## Jaeger to visualize traces
The `otel` sink works with any service compatible with the OpenTelemetry collector. Let's use Jaeger to visualize this data.
Start a Jaeger instance with the OTLP HTTP endpoint at 4318 and the Jaeger UI at 16686 using the following command:
```bash
$ docker run --rm --name jaeger \
-p 16686:16686 -p 4318:4318 \
jaegertracing/jaeger:2.1.0
```
Once the Jaeger instance is running, you can visualize traces by navigating to http://localhost:16686/.
## Querying Traces Stored in SQLIte
The `sqlite` sink allows you to query traces without an external system. Here are some example queries:
Querying Traces for a agent session
The client SDK is not updated to support the new telemetry API. It will be updated soon. You can manually query traces using the following curl command:
``` bash
curl -X POST 'http://localhost:5000/alpha/telemetry/query-traces' \
-H 'Content-Type: application/json' \
-d '{
"attribute_filters": [
{
"key": "session_id",
"op": "eq",
"value": "dd667b87-ca4b-4d30-9265-5a0de318fc65" }],
"limit": 100,
"offset": 0,
"order_by": ["start_time"]
[
{
"trace_id": "6902f54b83b4b48be18a6f422b13e16f",
"root_span_id": "5f37b85543afc15a",
"start_time": "2024-12-04T08:08:30.501587",
"end_time": "2024-12-04T08:08:36.026463"
},
........
]
}'
```
Querying spans for a specifc root span id
``` bash
curl -X POST 'http://localhost:5000/alpha/telemetry/get-span-tree' \
-H 'Content-Type: application/json' \
-d '{ "span_id" : "6cceb4b48a156913", "max_depth": 2 }'
{
"span_id": "6cceb4b48a156913",
"trace_id": "dafa796f6aaf925f511c04cd7c67fdda",
"parent_span_id": "892a66d726c7f990",
"name": "retrieve_rag_context",
"start_time": "2024-12-04T09:28:21.781995",
"end_time": "2024-12-04T09:28:21.913352",
"attributes": {
"input": [
"{\"role\":\"system\",\"content\":\"You are a helpful assistant\"}",
"{\"role\":\"user\",\"content\":\"What are the top 5 topics that were explained in the documentation? Only list succinct bullet points.\",\"context\":null}"
]
},
"children": [
{
"span_id": "1a2df181854064a8",
"trace_id": "dafa796f6aaf925f511c04cd7c67fdda",
"parent_span_id": "6cceb4b48a156913",
"name": "MemoryRouter.query_documents",
"start_time": "2024-12-04T09:28:21.787620",
"end_time": "2024-12-04T09:28:21.906512",
"attributes": {
"input": null
},
"children": [],
"status": "ok"
}
],
"status": "ok"
}
```
## Example: Save Spans to Dataset
Save all spans for a specific agent session to a dataset.
``` bash
curl -X POST 'http://localhost:5000/alpha/telemetry/save-spans-to-dataset' \
-H 'Content-Type: application/json' \
-d '{
"attribute_filters": [
{
"key": "session_id",
"op": "eq",
"value": "dd667b87-ca4b-4d30-9265-5a0de318fc65"
}
],
"attributes_to_save": ["input", "output"],
"dataset_id": "my_dataset",
"max_depth": 10
}'
```
Save all spans for a specific agent turn to a dataset.
```bash
curl -X POST 'http://localhost:5000/alpha/telemetry/save-spans-to-dataset' \
-H 'Content-Type: application/json' \
-d '{
"attribute_filters": [
{
"key": "turn_id",
"op": "eq",
"value": "123e4567-e89b-12d3-a456-426614174000"
}
],
"attributes_to_save": ["input", "output"],
"dataset_id": "my_dataset",
"max_depth": 10
}'
```

View file

@ -0,0 +1,40 @@
# Evaluation Concepts
The Llama Stack Evaluation flow allows you to run evaluations on your GenAI application datasets or pre-registered benchmarks.
We introduce a set of APIs in Llama Stack for supporting running evaluations of LLM applications.
- `/datasetio` + `/datasets` API
- `/scoring` + `/scoring_functions` API
- `/eval` + `/eval_tasks` API
This guide goes over the sets of APIs and developer experience flow of using Llama Stack to run evaluations for different use cases. Checkout our Colab notebook on working examples with evaluations [here](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing).
## Evaluation Concepts
The Evaluation APIs are associated with a set of Resources as shown in the following diagram. Please visit the Resources section in our [Core Concepts](../concepts/index.md) guide for better high-level understanding.
![Eval Concepts](../references/evals_reference/resources/eval-concept.png)
- **DatasetIO**: defines interface with datasets and data loaders.
- Associated with `Dataset` resource.
- **Scoring**: evaluate outputs of the system.
- Associated with `ScoringFunction` resource. We provide a suite of out-of-the box scoring functions and also the ability for you to add custom evaluators. These scoring functions are the core part of defining an evaluation task to output evaluation metrics.
- **Eval**: generate outputs (via Inference or Agents) and perform scoring.
- Associated with `EvalTask` resource.
Use the following decision tree to decide how to use LlamaStack Evaluation flow.
![Eval Flow](../references/evals_reference/resources/eval-flow.png)
```{admonition} Note on Benchmark v.s. Application Evaluation
:class: tip
- **Benchmark Evaluation** is a well-defined eval-task consisting of `dataset` and `scoring_function`. The generation (inference or agent) will be done as part of evaluation.
- **Application Evaluation** assumes users already have app inputs & generated outputs. Evaluation will purely focus on scoring the generated outputs via scoring functions (e.g. LLM-as-judge).
```
## What's Next?
- Check out our Colab notebook on working examples with evaluations [here](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing).
- Check out our [Evaluation Reference](../references/evals_reference/index.md) for more details on the APIs.

View file

@ -58,7 +58,17 @@ While there is a lot of flexibility to mix-and-match providers, often users will
**Remotely Hosted Distro**: These are the simplest to consume from a user perspective. You can simply obtain the API key for these providers, point to a URL and have _all_ Llama Stack APIs working out of the box. Currently, [Fireworks](https://fireworks.ai/) and [Together](https://together.xyz/) provide such easy-to-consume Llama Stack distributions.
**Locally Hosted Distro**: You may want to run Llama Stack on your own hardware. Typically though, you still need to use Inference via an external service. You can use providers like HuggingFace TGI, Cerebras, Fireworks, Together, etc. for this purpose. Or you may have access to GPUs and can run a [vLLM](https://github.com/vllm-project/vllm) instance. If you "just" have a regular desktop machine, you can use [Ollama](https://ollama.com/) for inference. To provide convenient quick access to these options, we provide a number of such pre-configured locally-hosted Distros.
**Locally Hosted Distro**: You may want to run Llama Stack on your own hardware. Typically though, you still need to use Inference via an external service. You can use providers like HuggingFace TGI, Cerebras, Fireworks, Together, etc. for this purpose. Or you may have access to GPUs and can run a [vLLM](https://github.com/vllm-project/vllm) or [NVIDIA NIM](https://build.nvidia.com/nim?filters=nimType%3Anim_type_run_anywhere&q=llama) instance. If you "just" have a regular desktop machine, you can use [Ollama](https://ollama.com/) for inference. To provide convenient quick access to these options, we provide a number of such pre-configured locally-hosted Distros.
**On-device Distro**: Finally, you may want to run Llama Stack directly on an edge device (mobile phone or a tablet.) We provide Distros for iOS and Android (coming soon.)
## More Concepts
- [Evaluation Concepts](evaluation_concepts.md)
```{toctree}
:maxdepth: 1
:hidden:
evaluation_concepts
```

View file

@ -28,6 +28,8 @@ extensions = [
"sphinx_tabs.tabs",
"sphinx_design",
"sphinxcontrib.redoc",
"sphinxcontrib.mermaid",
"sphinxcontrib.video",
]
myst_enable_extensions = ["colon_fence"]
@ -47,6 +49,7 @@ exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
myst_enable_extensions = [
"amsmath",
"attrs_inline",
"attrs_block",
"colon_fence",
"deflist",
"dollarmath",
@ -65,6 +68,7 @@ myst_substitutions = {
"docker_hub": "https://hub.docker.com/repository/docker/llamastack",
}
# Copy button settings
copybutton_prompt_text = "$ " # for bash prompts
copybutton_prompt_is_regexp = True

View file

@ -3,7 +3,7 @@
This guide contains references to walk you through adding a new API provider.
1. First, decide which API your provider falls into (e.g. Inference, Safety, Agents, Memory).
2. Decide whether your provider is a remote provider, or inline implmentation. A remote provider is a provider that makes a remote request to an service. An inline provider is a provider where implementation is executed locally. Checkout the examples, and follow the structure to add your own API provider. Please find the following code pointers:
2. Decide whether your provider is a remote provider, or inline implementation. A remote provider is a provider that makes a remote request to a service. An inline provider is a provider where implementation is executed locally. Checkout the examples, and follow the structure to add your own API provider. Please find the following code pointers:
- {repopath}`Remote Providers::llama_stack/providers/remote`
- {repopath}`Inline Providers::llama_stack/providers/inline`
@ -15,7 +15,7 @@ This guide contains references to walk you through adding a new API provider.
1. Start with an _integration test_ for your provider. That means we will instantiate the real provider, pass it real configuration and if it is a remote service, we will actually hit the remote service. We **strongly** discourage mocking for these tests at the provider level. Llama Stack is first and foremost about integration so we need to make sure stuff works end-to-end. See {repopath}`llama_stack/providers/tests/inference/test_text_inference.py` for an example.
2. In addition, if you want to unit test functionality within your provider, feel free to do so. You can find some tests in `tests/` but they aren't well supported so far.
2. In addition, if you want to unit test functionality within your provider, feel free to do so. You can find some tests in `tests/` but they aren't well-supported so far.
3. Test with a client-server Llama Stack setup. (a) Start a Llama Stack server with your own distribution which includes the new provider. (b) Send a client request to the server. See `llama_stack/apis/<api>/client.py` for how this is done. These client scripts can serve as lightweight tests.

View file

@ -1,123 +0,0 @@
# Evaluations
The Llama Stack Evaluation flow allows you to run evaluations on your GenAI application datasets or pre-registered benchmarks.
We introduce a set of APIs in Llama Stack for supporting running evaluations of LLM applications.
- `/datasetio` + `/datasets` API
- `/scoring` + `/scoring_functions` API
- `/eval` + `/eval_tasks` API
This guide goes over the sets of APIs and developer experience flow of using Llama Stack to run evaluations for different use cases.
## Evaluation Concepts
The Evaluation APIs are associated with a set of Resources as shown in the following diagram. Please visit the Resources section in our [Core Concepts](../concepts/index.md) guide for better high-level understanding.
![Eval Concepts](./resources/eval-concept.png)
- **DatasetIO**: defines interface with datasets and data loaders.
- Associated with `Dataset` resource.
- **Scoring**: evaluate outputs of the system.
- Associated with `ScoringFunction` resource. We provide a suite of out-of-the box scoring functions and also the ability for you to add custom evaluators. These scoring functions are the core part of defining an evaluation task to output evaluation metrics.
- **Eval**: generate outputs (via Inference or Agents) and perform scoring.
- Associated with `EvalTask` resource.
## Running Evaluations
Use the following decision tree to decide how to use LlamaStack Evaluation flow.
![Eval Flow](./resources/eval-flow.png)
```{admonition} Note on Benchmark v.s. Application Evaluation
:class: tip
- **Benchmark Evaluation** is a well-defined eval-task consisting of `dataset` and `scoring_function`. The generation (inference or agent) will be done as part of evaluation.
- **Application Evaluation** assumes users already have app inputs & generated outputs. Evaluation will purely focus on scoring the generated outputs via scoring functions (e.g. LLM-as-judge).
```
The following examples give the quick steps to start running evaluations using the llama-stack-client CLI.
#### Benchmark Evaluation CLI
Usage: There are 2 inputs necessary for running a benchmark eval
- `eval-task-id`: the identifier associated with the eval task. Each `EvalTask` is parametrized by
- `dataset_id`: the identifier associated with the dataset.
- `List[scoring_function_id]`: list of scoring function identifiers.
- `eval-task-config`: specifies the configuration of the model / agent to evaluate on.
```
llama-stack-client eval run_benchmark <eval-task-id> \
--eval-task-config ~/eval_task_config.json \
--visualize
```
#### Application Evaluation CLI
Usage: For running application evals, you will already have available datasets in hand from your application. You will need to specify:
- `scoring-fn-id`: List of ScoringFunction identifiers you wish to use to run on your application.
- `Dataset` used for evaluation:
- (1) `--dataset-path`: path to local file system containing datasets to run evaluation on
- (2) `--dataset-id`: pre-registered dataset in Llama Stack
- (Optional) `--scoring-params-config`: optionally parameterize scoring functions with custom params (e.g. `judge_prompt`, `judge_model`, `parsing_regexes`).
```
llama-stack-client eval run_scoring <scoring_fn_id_1> <scoring_fn_id_2> ... <scoring_fn_id_n>
--dataset-path <path-to-local-dataset> \
--output-dir ./
```
#### Defining EvalTaskConfig
The `EvalTaskConfig` are user specified config to define:
1. `EvalCandidate` to run generation on:
- `ModelCandidate`: The model will be used for generation through LlamaStack /inference API.
- `AgentCandidate`: The agentic system specified by AgentConfig will be used for generation through LlamaStack /agents API.
2. Optionally scoring function params to allow customization of scoring function behaviour. This is useful to parameterize generic scoring functions such as LLMAsJudge with custom `judge_model` / `judge_prompt`.
**Example Benchmark EvalTaskConfig**
```json
{
"type": "benchmark",
"eval_candidate": {
"type": "model",
"model": "Llama3.2-3B-Instruct",
"sampling_params": {
"strategy": "greedy",
"temperature": 0,
"top_p": 0.95,
"top_k": 0,
"max_tokens": 0,
"repetition_penalty": 1.0
}
}
}
```
**Example Application EvalTaskConfig**
```json
{
"type": "app",
"eval_candidate": {
"type": "model",
"model": "Llama3.1-405B-Instruct",
"sampling_params": {
"strategy": "greedy",
"temperature": 0,
"top_p": 0.95,
"top_k": 0,
"max_tokens": 0,
"repetition_penalty": 1.0
}
},
"scoring_params": {
"llm-as-judge::llm_as_judge_base": {
"type": "llm_as_judge",
"judge_model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt_template": "Your job is to look at a question, a gold target ........",
"judge_score_regexes": [
"(A|B|C)"
]
}
}
}
```

View file

@ -1,9 +0,0 @@
# Cookbooks
- [Evaluations Flow](evals.md)
```{toctree}
:maxdepth: 2
:hidden:
evals.md
```

View file

@ -66,121 +66,247 @@ llama stack build --list-templates
```
```
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
| Template Name | Providers | Description |
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
| hf-serverless | { | Like local, but use Hugging Face Inference API (serverless) for running LLM |
| | "inference": "remote::hf::serverless", | inference. |
| | "memory": "meta-reference", | See https://hf.co/docs/api-inference. |
| | "safety": "meta-reference", | |
| | "agents": "meta-reference", | |
| | "telemetry": "meta-reference" | |
| | } | |
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
| together | { | Use Together.ai for running LLM inference |
| | "inference": "remote::together", | |
| | "memory": [ | |
| | "meta-reference", | |
| | "remote::weaviate" | |
| | ], | |
| | "safety": "meta-reference", | |
| | "agents": "meta-reference", | |
| | "telemetry": "meta-reference" | |
| | } | |
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
| fireworks | { | Use Fireworks.ai for running LLM inference |
| | "inference": "remote::fireworks", | |
| | "memory": [ | |
| | "meta-reference", | |
| | "remote::weaviate", | |
| | "remote::chromadb", | |
| | "remote::pgvector" | |
| | ], | |
| | "safety": "meta-reference", | |
| | "agents": "meta-reference", | |
| | "telemetry": "meta-reference" | |
| | } | |
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
| databricks | { | Use Databricks for running LLM inference |
| | "inference": "remote::databricks", | |
| | "memory": "meta-reference", | |
| | "safety": "meta-reference", | |
| | "agents": "meta-reference", | |
| | "telemetry": "meta-reference" | |
| | } | |
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
| vllm | { | Like local, but use vLLM for running LLM inference |
| | "inference": "vllm", | |
| | "memory": "meta-reference", | |
| | "safety": "meta-reference", | |
| | "agents": "meta-reference", | |
| | "telemetry": "meta-reference" | |
| | } | |
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
| tgi | { | Use TGI for running LLM inference |
| | "inference": "remote::tgi", | |
| | "memory": [ | |
| | "meta-reference", | |
| | "remote::chromadb", | |
| | "remote::pgvector" | |
| | ], | |
| | "safety": "meta-reference", | |
| | "agents": "meta-reference", | |
| | "telemetry": "meta-reference" | |
| | } | |
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
| bedrock | { | Use Amazon Bedrock APIs. |
| | "inference": "remote::bedrock", | |
| | "memory": "meta-reference", | |
| | "safety": "meta-reference", | |
| | "agents": "meta-reference", | |
| | "telemetry": "meta-reference" | |
| | } | |
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
| meta-reference-gpu | { | Use code from `llama_stack` itself to serve all llama stack APIs |
| | "inference": "meta-reference", | |
| | "memory": [ | |
| | "meta-reference", | |
| | "remote::chromadb", | |
| | "remote::pgvector" | |
| | ], | |
| | "safety": "meta-reference", | |
| | "agents": "meta-reference", | |
| | "telemetry": "meta-reference" | |
| | } | |
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
| meta-reference-quantized-gpu | { | Use code from `llama_stack` itself to serve all llama stack APIs |
| | "inference": "meta-reference-quantized", | |
| | "memory": [ | |
| | "meta-reference", | |
| | "remote::chromadb", | |
| | "remote::pgvector" | |
| | ], | |
| | "safety": "meta-reference", | |
| | "agents": "meta-reference", | |
| | "telemetry": "meta-reference" | |
| | } | |
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
| ollama | { | Use ollama for running LLM inference |
| | "inference": "remote::ollama", | |
| | "memory": [ | |
| | "meta-reference", | |
| | "remote::chromadb", | |
| | "remote::pgvector" | |
| | ], | |
| | "safety": "meta-reference", | |
| | "agents": "meta-reference", | |
| | "telemetry": "meta-reference" | |
| | } | |
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
| hf-endpoint | { | Like local, but use Hugging Face Inference Endpoints for running LLM inference. |
| | "inference": "remote::hf::endpoint", | See https://hf.co/docs/api-endpoints. |
| | "memory": "meta-reference", | |
| | "safety": "meta-reference", | |
| | "agents": "meta-reference", | |
| | "telemetry": "meta-reference" | |
| | } | |
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| Template Name | Providers | Description |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| tgi | { | Use (an external) TGI server for running LLM inference |
| | "inference": [ | |
| | "remote::tgi" | |
| | ], | |
| | "memory": [ | |
| | "inline::faiss", | |
| | "remote::chromadb", | |
| | "remote::pgvector" | |
| | ], | |
| | "safety": [ | |
| | "inline::llama-guard" | |
| | ], | |
| | "agents": [ | |
| | "inline::meta-reference" | |
| | ], | |
| | "telemetry": [ | |
| | "inline::meta-reference" | |
| | ] | |
| | } | |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| remote-vllm | { | Use (an external) vLLM server for running LLM inference |
| | "inference": [ | |
| | "remote::vllm" | |
| | ], | |
| | "memory": [ | |
| | "inline::faiss", | |
| | "remote::chromadb", | |
| | "remote::pgvector" | |
| | ], | |
| | "safety": [ | |
| | "inline::llama-guard" | |
| | ], | |
| | "agents": [ | |
| | "inline::meta-reference" | |
| | ], | |
| | "telemetry": [ | |
| | "inline::meta-reference" | |
| | ] | |
| | } | |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| vllm-gpu | { | Use a built-in vLLM engine for running LLM inference |
| | "inference": [ | |
| | "inline::vllm" | |
| | ], | |
| | "memory": [ | |
| | "inline::faiss", | |
| | "remote::chromadb", | |
| | "remote::pgvector" | |
| | ], | |
| | "safety": [ | |
| | "inline::llama-guard" | |
| | ], | |
| | "agents": [ | |
| | "inline::meta-reference" | |
| | ], | |
| | "telemetry": [ | |
| | "inline::meta-reference" | |
| | ] | |
| | } | |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| meta-reference-quantized-gpu | { | Use Meta Reference with fp8, int4 quantization for running LLM inference |
| | "inference": [ | |
| | "inline::meta-reference-quantized" | |
| | ], | |
| | "memory": [ | |
| | "inline::faiss", | |
| | "remote::chromadb", | |
| | "remote::pgvector" | |
| | ], | |
| | "safety": [ | |
| | "inline::llama-guard" | |
| | ], | |
| | "agents": [ | |
| | "inline::meta-reference" | |
| | ], | |
| | "telemetry": [ | |
| | "inline::meta-reference" | |
| | ] | |
| | } | |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| meta-reference-gpu | { | Use Meta Reference for running LLM inference |
| | "inference": [ | |
| | "inline::meta-reference" | |
| | ], | |
| | "memory": [ | |
| | "inline::faiss", | |
| | "remote::chromadb", | |
| | "remote::pgvector" | |
| | ], | |
| | "safety": [ | |
| | "inline::llama-guard" | |
| | ], | |
| | "agents": [ | |
| | "inline::meta-reference" | |
| | ], | |
| | "telemetry": [ | |
| | "inline::meta-reference" | |
| | ] | |
| | } | |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| hf-serverless | { | Use (an external) Hugging Face Inference Endpoint for running LLM inference |
| | "inference": [ | |
| | "remote::hf::serverless" | |
| | ], | |
| | "memory": [ | |
| | "inline::faiss", | |
| | "remote::chromadb", | |
| | "remote::pgvector" | |
| | ], | |
| | "safety": [ | |
| | "inline::llama-guard" | |
| | ], | |
| | "agents": [ | |
| | "inline::meta-reference" | |
| | ], | |
| | "telemetry": [ | |
| | "inline::meta-reference" | |
| | ] | |
| | } | |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| together | { | Use Together.AI for running LLM inference |
| | "inference": [ | |
| | "remote::together" | |
| | ], | |
| | "memory": [ | |
| | "inline::faiss", | |
| | "remote::chromadb", | |
| | "remote::pgvector" | |
| | ], | |
| | "safety": [ | |
| | "inline::llama-guard" | |
| | ], | |
| | "agents": [ | |
| | "inline::meta-reference" | |
| | ], | |
| | "telemetry": [ | |
| | "inline::meta-reference" | |
| | ] | |
| | } | |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| ollama | { | Use (an external) Ollama server for running LLM inference |
| | "inference": [ | |
| | "remote::ollama" | |
| | ], | |
| | "memory": [ | |
| | "inline::faiss", | |
| | "remote::chromadb", | |
| | "remote::pgvector" | |
| | ], | |
| | "safety": [ | |
| | "inline::llama-guard" | |
| | ], | |
| | "agents": [ | |
| | "inline::meta-reference" | |
| | ], | |
| | "telemetry": [ | |
| | "inline::meta-reference" | |
| | ] | |
| | } | |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| bedrock | { | Use AWS Bedrock for running LLM inference and safety |
| | "inference": [ | |
| | "remote::bedrock" | |
| | ], | |
| | "memory": [ | |
| | "inline::faiss", | |
| | "remote::chromadb", | |
| | "remote::pgvector" | |
| | ], | |
| | "safety": [ | |
| | "remote::bedrock" | |
| | ], | |
| | "agents": [ | |
| | "inline::meta-reference" | |
| | ], | |
| | "telemetry": [ | |
| | "inline::meta-reference" | |
| | ] | |
| | } | |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| hf-endpoint | { | Use (an external) Hugging Face Inference Endpoint for running LLM inference |
| | "inference": [ | |
| | "remote::hf::endpoint" | |
| | ], | |
| | "memory": [ | |
| | "inline::faiss", | |
| | "remote::chromadb", | |
| | "remote::pgvector" | |
| | ], | |
| | "safety": [ | |
| | "inline::llama-guard" | |
| | ], | |
| | "agents": [ | |
| | "inline::meta-reference" | |
| | ], | |
| | "telemetry": [ | |
| | "inline::meta-reference" | |
| | ] | |
| | } | |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| fireworks | { | Use Fireworks.AI for running LLM inference |
| | "inference": [ | |
| | "remote::fireworks" | |
| | ], | |
| | "memory": [ | |
| | "inline::faiss", | |
| | "remote::chromadb", | |
| | "remote::pgvector" | |
| | ], | |
| | "safety": [ | |
| | "inline::llama-guard" | |
| | ], | |
| | "agents": [ | |
| | "inline::meta-reference" | |
| | ], | |
| | "telemetry": [ | |
| | "inline::meta-reference" | |
| | ] | |
| | } | |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| cerebras | { | Use Cerebras for running LLM inference |
| | "inference": [ | |
| | "remote::cerebras" | |
| | ], | |
| | "safety": [ | |
| | "inline::llama-guard" | |
| | ], | |
| | "memory": [ | |
| | "inline::meta-reference" | |
| | ], | |
| | "agents": [ | |
| | "inline::meta-reference" | |
| | ], | |
| | "telemetry": [ | |
| | "inline::meta-reference" | |
| | ] | |
| | } | |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
```
You may then pick a template to build your distribution with providers fitted to your liking.
@ -212,8 +338,8 @@ distribution_spec:
inference: remote::ollama
memory: inline::faiss
safety: inline::llama-guard
agents: meta-reference
telemetry: meta-reference
agents: inline::meta-reference
telemetry: inline::meta-reference
image_type: conda
```

View file

@ -1,6 +1,6 @@
# Configuring a Stack
The Llama Stack runtime configuration is specified as a YAML file. Here is a simplied version of an example configuration file for the Ollama distribution:
The Llama Stack runtime configuration is specified as a YAML file. Here is a simplified version of an example configuration file for the Ollama distribution:
```{dropdown} Sample Configuration File
@ -81,6 +81,8 @@ A few things to note:
- The configuration dictionary is provider-specific. Notice that configuration can reference environment variables (with default values), which are expanded at runtime. When you run a stack server (via docker or via `llama stack run`), you can specify `--env OLLAMA_URL=http://my-server:11434` to override the default value.
## Resources
```
Finally, let's look at the `models` section:
```yaml
models:

View file

@ -21,7 +21,7 @@ print(response)
```python
response = await client.inference.chat_completion(
messages=[UserMessage(content="What is the capital of France?", role="user")],
model="Llama3.1-8B-Instruct",
model_id="Llama3.1-8B-Instruct",
stream=False,
)
print("\nChat completion response:")

View file

@ -8,10 +8,6 @@ building_distro
configuration
```
<!-- self_hosted_distro/index -->
<!-- remote_hosted_distro/index -->
<!-- ondevice_distro/index -->
You can instantiate a Llama Stack in one of the following ways:
- **As a Library**: this is the simplest, especially if you are using an external inference service. See [Using Llama Stack as a Library](importing_as_library)
- **Docker**: we provide a number of pre-built Docker containers so you can start a Llama Stack server instantly. You can also build your own custom Docker container.
@ -31,11 +27,15 @@ If so, we suggest:
- {dockerhub}`distribution-ollama` ([Guide](self_hosted_distro/ollama))
- **Do you have an API key for a remote inference provider like Fireworks, Together, etc.?** If so, we suggest:
- {dockerhub}`distribution-together` ([Guide](remote_hosted_distro/index))
- {dockerhub}`distribution-fireworks` ([Guide](remote_hosted_distro/index))
- {dockerhub}`distribution-together` ([Guide](self_hosted_distro/together))
- {dockerhub}`distribution-fireworks` ([Guide](self_hosted_distro/fireworks))
- **Do you want to run Llama Stack inference on your iOS / Android device** If so, we suggest:
- [iOS SDK](ondevice_distro/ios_sdk)
- Android (coming soon)
- [Android](ondevice_distro/android_sdk)
- **Do you want a hosted Llama Stack endpoint?** If so, we suggest:
- [Remote-Hosted Llama Stack Endpoints](remote_hosted_distro/index)
You can also build your own [custom distribution](building_distro).

View file

@ -0,0 +1,264 @@
# Llama Stack Client Kotlin API Library
We are excited to share a guide for a Kotlin Library that brings front the benefits of Llama Stack to your Android device. This library is a set of SDKs that provide a simple and effective way to integrate AI capabilities into your Android app whether it is local (on-device) or remote inference.
Features:
- Local Inferencing: Run Llama models purely on-device with real-time processing. We currently utilize ExecuTorch as the local inference distributor and may support others in the future.
- [ExecuTorch](https://github.com/pytorch/executorch/tree/main) is a complete end-to-end solution within the PyTorch framework for inferencing capabilities on-device with high portability and seamless performance.
- Remote Inferencing: Perform inferencing tasks remotely with Llama models hosted on a remote connection (or serverless localhost).
- Simple Integration: With easy-to-use APIs, a developer can quickly integrate Llama Stack in their Android app. The difference with local vs remote inferencing is also minimal.
Latest Release Notes: [v0.0.58](https://github.com/meta-llama/llama-stack-client-kotlin/releases/tag/v0.0.58)
*Tagged releases are stable versions of the project. While we strive to maintain a stable main branch, it's not guaranteed to be free of bugs or issues.*
## Android Demo App
Check out our demo app to see how to integrate Llama Stack into your Android app: [Android Demo App](https://github.com/meta-llama/llama-stack-apps/tree/android-kotlin-app-latest/examples/android_app)
The key files in the app are `ExampleLlamaStackLocalInference.kt`, `ExampleLlamaStackRemoteInference.kts`, and `MainActivity.java`. With encompassed business logic, the app shows how to use Llama Stack for both the environments.
## Quick Start
### Add Dependencies
#### Kotlin Library
Add the following dependency in your `build.gradle.kts` file:
```
dependencies {
implementation("com.llama.llamastack:llama-stack-client-kotlin:0.0.58")
}
```
This will download jar files in your gradle cache in a directory like `~/.gradle/caches/modules-2/files-2.1/com.llama.llamastack/`
If you plan on doing remote inferencing this is sufficient to get started.
#### Dependency for Local
For local inferencing, it is required to include the ExecuTorch library into your app.
Include the ExecuTorch library by:
1. Download the `download-prebuilt-et-lib.sh` script file from the [llama-stack-client-kotlin-client-local](https://github.com/meta-llama/llama-stack-client-kotlin/blob/release/0.0.58/llama-stack-client-kotlin-client-local/download-prebuilt-et-lib.sh) directory to your local machine.
2. Move the script to the top level of your Android app where the app directory resides:
<p align="center">
<img src="https://raw.githubusercontent.com/meta-llama/llama-stack-client-kotlin/refs/heads/release/0.0.58/doc/img/example_android_app_directory.png" style="width:300px">
</p>
3. Run `sh download-prebuilt-et-lib.sh` to create an `app/libs` directory and download the `executorch.aar` in that path. This generates an ExecuTorch library for the XNNPACK delegate with commit: [0a12e33](https://github.com/pytorch/executorch/commit/0a12e33d22a3d44d1aa2af5f0d0673d45b962553).
4. Add the `executorch.aar` dependency in your `build.gradle.kts` file:
```
dependencies {
...
implementation(files("libs/executorch.aar"))
...
}
```
## Llama Stack APIs in Your Android App
Breaking down the demo app, this section will show the core pieces that are used to initialize and run inference with Llama Stack using the Kotlin library.
### Setup Remote Inferencing
Start a Llama Stack server on localhost. Here is an example of how you can do this using the firework.ai distribution:
```
conda create -n stack-fireworks python=3.10
conda activate stack-fireworks
pip install llama-stack=0.0.58
llama stack build --template fireworks --image-type conda
export FIREWORKS_API_KEY=<SOME_KEY>
llama stack run /Users/<your_username>/.llama/distributions/llamastack-fireworks/fireworks-run.yaml --port=5050
```
Ensure the Llama Stack server version is the same as the Kotlin SDK Library for maximum compatibility.
Other inference providers: [Table](https://llama-stack.readthedocs.io/en/latest/index.html#supported-llama-stack-implementations)
How to set remote localhost in Demo App: [Settings](https://github.com/meta-llama/llama-stack-apps/tree/main/examples/android_app#settings)
### Initialize the Client
A client serves as the primary interface for interacting with a specific inference type and its associated parameters. Only after client is initialized then you can configure and start inferences.
<table>
<tr>
<th>Local Inference</th>
<th>Remote Inference</th>
</tr>
<tr>
<td>
```
client = LlamaStackClientLocalClient
.builder()
.modelPath(modelPath)
.tokenizerPath(tokenizerPath)
.temperature(temperature)
.build()
```
</td>
<td>
```
// remoteURL is a string like "http://localhost:5050"
client = LlamaStackClientOkHttpClient
.builder()
.baseUrl(remoteURL)
.build()
```
</td>
</tr>
</table>
### Run Inference
With the Kotlin Library managing all the major operational logic, there are minimal to no changes when running simple chat inference for local or remote:
```
val result = client!!.inference().chatCompletion(
InferenceChatCompletionParams.builder()
.modelId(modelName)
.messages(listOfMessages)
.build()
)
// response contains string with response from model
var response = result.asChatCompletionResponse().completionMessage().content().string();
```
[Remote only] For inference with a streaming response:
```
val result = client!!.inference().chatCompletionStreaming(
InferenceChatCompletionParams.builder()
.modelId(modelName)
.messages(listOfMessages)
.build()
)
// Response can be received as a asChatCompletionResponseStreamChunk as part of a callback.
// See Android demo app for a detailed implementation example.
```
### Setup Custom Tool Calling
Android demo app for more details: [Custom Tool Calling](https://github.com/meta-llama/llama-stack-apps/tree/main/examples/android_app#tool-calling)
## Advanced Users
The purpose of this section is to share more details with users that would like to dive deeper into the Llama Stack Kotlin Library. Whether youre interested in contributing to the open source library, debugging or just want to learn more, this section is for you!
### Prerequisite
You must complete the following steps:
1. Clone the repo (`git clone https://github.com/meta-llama/llama-stack-client-kotlin.git -b release/0.0.58`)
2. Port the appropriate ExecuTorch libraries over into your Llama Stack Kotlin library environment.
```
cd llama-stack-client-kotlin-client-local
sh download-prebuilt-et-lib.sh --unzip
```
Now you will notice that the `jni/` , `libs/`, and `AndroidManifest.xml` files from the `executorch.aar` file are present in the local module. This way the local client module will be able to realize the ExecuTorch SDK.
### Building for Development/Debugging
If youd like to contribute to the Kotlin library via development, debug, or add play around with the library with various print statements, run the following command in your terminal under the llama-stack-client-kotlin directory.
```
sh build-libs.sh
```
Output: .jar files located in the build-jars directory
Copy the .jar files over to the lib directory in your Android app. At the same time make sure to remove the llama-stack-client-kotlin dependency within your build.gradle.kts file in your app (or if you are using the demo app) to avoid having multiple llama stack client dependencies.
### Additional Options for Local Inferencing
Currently we provide additional properties support with local inferencing. In order to get the tokens/sec metric for each inference call, add the following code in your Android app after you run your chatCompletion inference function. The Reference app has this implementation as well:
```
var tps = (result.asChatCompletionResponse()._additionalProperties()["tps"] as JsonNumber).value as Float
```
We will be adding more properties in the future.
### Additional Options for Remote Inferencing
#### Network options
##### Retries
Requests that experience certain errors are automatically retried 2 times by default, with a short exponential backoff. Connection errors (for example, due to a network connectivity problem), 408 Request Timeout, 409 Conflict, 429 Rate Limit, and >=500 Internal errors will all be retried by default.
You can provide a `maxRetries` on the client builder to configure this:
```kotlin
val client = LlamaStackClientOkHttpClient.builder()
.fromEnv()
.maxRetries(4)
.build()
```
##### Timeouts
Requests time out after 1 minute by default. You can configure this on the client builder:
```kotlin
val client = LlamaStackClientOkHttpClient.builder()
.fromEnv()
.timeout(Duration.ofSeconds(30))
.build()
```
##### Proxies
Requests can be routed through a proxy. You can configure this on the client builder:
```kotlin
val client = LlamaStackClientOkHttpClient.builder()
.fromEnv()
.proxy(new Proxy(
Type.HTTP,
new InetSocketAddress("proxy.com", 8080)
))
.build()
```
##### Environments
Requests are made to the production environment by default. You can connect to other environments, like `sandbox`, via the client builder:
```kotlin
val client = LlamaStackClientOkHttpClient.builder()
.fromEnv()
.sandbox()
.build()
```
### Error Handling
This library throws exceptions in a single hierarchy for easy handling:
- **`LlamaStackClientException`** - Base exception for all exceptions
- **`LlamaStackClientServiceException`** - HTTP errors with a well-formed response body we were able to parse. The exception message and the `.debuggingRequestId()` will be set by the server.
| 400 | BadRequestException |
| ------ | ----------------------------- |
| 401 | AuthenticationException |
| 403 | PermissionDeniedException |
| 404 | NotFoundException |
| 422 | UnprocessableEntityException |
| 429 | RateLimitException |
| 5xx | InternalServerException |
| others | UnexpectedStatusCodeException |
- **`LlamaStackClientIoException`** - I/O networking errors
- **`LlamaStackClientInvalidDataException`** - any other exceptions on the client side, e.g.:
- We failed to serialize the request body
- We failed to parse the response body (has access to response code and body)
## Reporting Issues
If you encountered any bugs or issues following this guide please file a bug/issue on our [Github issue tracker](https://github.com/meta-llama/llama-stack-client-kotlin/issues).
## Known Issues
We're aware of the following issues and are working to resolve them:
1. Streaming response is a work-in-progress for local and remote inference
2. Due to #1, agents are not supported at the time. LS agents only work in streaming mode
3. Changing to another model is a work in progress for local and remote platforms
## Thanks
We'd like to extend our thanks to the ExecuTorch team for providing their support as we integrated ExecuTorch as one of the local inference distributors for Llama Stack. Checkout [ExecuTorch Github repo](https://github.com/pytorch/executorch/tree/main) for more information.
---
The API interface is generated using the OpenAPI standard with [Stainless](https://www.stainlessapi.com/).

View file

@ -1,6 +1,3 @@
---
orphan: true
---
# Bedrock Distribution
```{toctree}
@ -15,10 +12,14 @@ The `llamastack/distribution-bedrock` distribution consists of the following pro
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `remote::bedrock` |
| memory | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
| safety | `remote::bedrock` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::code-interpreter`, `inline::memory-runtime` |
@ -28,6 +29,13 @@ The following environment variables can be configured:
- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
### Models
The following models are available by default:
- `meta-llama/Llama-3.1-8B-Instruct (meta.llama3-1-8b-instruct-v1:0)`
- `meta-llama/Llama-3.1-70B-Instruct (meta.llama3-1-70b-instruct-v1:0)`
- `meta-llama/Llama-3.1-405B-Instruct-FP8 (meta.llama3-1-405b-instruct-v1:0)`
### Prerequisite: API Keys

View file

@ -0,0 +1,62 @@
# Cerebras Distribution
The `llamastack/distribution-cerebras` distribution consists of the following provider configurations.
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| inference | `remote::cerebras` |
| memory | `inline::meta-reference` |
| safety | `inline::llama-guard` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::code-interpreter`, `inline::memory-runtime` |
### Environment Variables
The following environment variables can be configured:
- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
- `CEREBRAS_API_KEY`: Cerebras API Key (default: ``)
### Models
The following models are available by default:
- `meta-llama/Llama-3.1-8B-Instruct (llama3.1-8b)`
- `meta-llama/Llama-3.3-70B-Instruct (llama-3.3-70b)`
### Prerequisite: API Keys
Make sure you have access to a Cerebras API Key. You can get one by visiting [cloud.cerebras.ai](https://cloud.cerebras.ai/).
## Running Llama Stack with Cerebras
You can do this via Conda (build code) or Docker which has a pre-built image.
### Via Docker
This method allows you to get started quickly without having to build the distribution code.
```bash
LLAMA_STACK_PORT=5001
docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ./run.yaml:/root/my-run.yaml \
llamastack/distribution-cerebras \
--yaml-config /root/my-run.yaml \
--port $LLAMA_STACK_PORT \
--env CEREBRAS_API_KEY=$CEREBRAS_API_KEY
```
### Via Conda
```bash
llama stack build --template cerebras --image-type conda
llama stack run ./run.yaml \
--port 5001 \
--env CEREBRAS_API_KEY=$CEREBRAS_API_KEY
```

View file

@ -15,10 +15,14 @@ The `llamastack/distribution-fireworks` distribution consists of the following p
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `remote::fireworks` |
| memory | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::code-interpreter`, `inline::memory-runtime` |
### Environment Variables
@ -39,6 +43,7 @@ The following models are available by default:
- `meta-llama/Llama-3.2-3B-Instruct (fireworks/llama-v3p2-3b-instruct)`
- `meta-llama/Llama-3.2-11B-Vision-Instruct (fireworks/llama-v3p2-11b-vision-instruct)`
- `meta-llama/Llama-3.2-90B-Vision-Instruct (fireworks/llama-v3p2-90b-vision-instruct)`
- `meta-llama/Llama-3.3-70B-Instruct (fireworks/llama-v3p3-70b-instruct)`
- `meta-llama/Llama-Guard-3-8B (fireworks/llama-guard-3-8b)`
- `meta-llama/Llama-Guard-3-11B-Vision (fireworks/llama-guard-3-11b-vision)`

View file

@ -15,10 +15,14 @@ The `llamastack/distribution-meta-reference-gpu` distribution consists of the fo
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `inline::meta-reference` |
| memory | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::code-interpreter`, `inline::memory-runtime` |
Note that you need access to nvidia GPUs to run this distribution. This distribution is not compatible with CPU-only machines or machines with AMD GPUs.
@ -57,6 +61,7 @@ LLAMA_STACK_PORT=5001
docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
llamastack/distribution-meta-reference-gpu \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
@ -68,6 +73,7 @@ If you are using Llama Stack Safety / Shield APIs, use:
docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
llamastack/distribution-meta-reference-gpu \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \

View file

@ -15,10 +15,14 @@ The `llamastack/distribution-meta-reference-quantized-gpu` distribution consists
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `inline::meta-reference-quantized` |
| memory | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::code-interpreter`, `inline::memory-runtime` |
The only difference vs. the `meta-reference-gpu` distribution is that it has support for more efficient inference -- with fp8, int4 quantization, etc.
@ -57,6 +61,7 @@ LLAMA_STACK_PORT=5001
docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
llamastack/distribution-meta-reference-quantized-gpu \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
@ -68,6 +73,7 @@ If you are using Llama Stack Safety / Shield APIs, use:
docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
llamastack/distribution-meta-reference-quantized-gpu \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \

View file

@ -15,10 +15,14 @@ The `llamastack/distribution-ollama` distribution consists of the following prov
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `remote::ollama` |
| memory | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::code-interpreter`, `inline::memory-runtime` |
You should use this distribution if you have a regular desktop machine without very powerful GPUs. Of course, if you have powerful GPUs, you can still continue using this distribution since Ollama supports GPU acceleration.### Environment Variables
@ -119,7 +123,7 @@ llama stack run ./run-with-safety.yaml \
### (Optional) Update Model Serving Configuration
```{note}
Please check the [model_aliases](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/inference/ollama/ollama.py#L45) variable for supported Ollama models.
Please check the [model_aliases](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/inference/ollama/ollama.py#L45) for the supported Ollama models.
```
To serve a new model with `ollama`

View file

@ -18,6 +18,7 @@ The `llamastack/distribution-remote-vllm` distribution consists of the following
| memory | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
| safety | `inline::llama-guard` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::code-interpreter`, `inline::memory-runtime` |
You can use this distribution if you have GPUs and want to run an independent vLLM server container for running inference.
@ -28,7 +29,7 @@ The following environment variables can be configured:
- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
- `INFERENCE_MODEL`: Inference model loaded into the vLLM server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `VLLM_URL`: URL of the vLLM server with the main inference model (default: `http://host.docker.internal:5100}/v1`)
- `VLLM_URL`: URL of the vLLM server with the main inference model (default: `http://host.docker.internal:5100/v1`)
- `MAX_TOKENS`: Maximum number of tokens for generation (default: `4096`)
- `SAFETY_VLLM_URL`: URL of the vLLM server with the safety model (default: `http://host.docker.internal:5101/v1`)
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)

View file

@ -16,10 +16,14 @@ The `llamastack/distribution-tgi` distribution consists of the following provide
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `remote::tgi` |
| memory | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::code-interpreter`, `inline::memory-runtime` |
You can use this distribution if you have GPUs and want to run an independent TGI server container for running inference.

View file

@ -15,10 +15,14 @@ The `llamastack/distribution-together` distribution consists of the following pr
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `remote::together` |
| memory | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::code-interpreter`, `inline::memory-runtime` |
### Environment Variables
@ -38,6 +42,7 @@ The following models are available by default:
- `meta-llama/Llama-3.2-3B-Instruct`
- `meta-llama/Llama-3.2-11B-Vision-Instruct`
- `meta-llama/Llama-3.2-90B-Vision-Instruct`
- `meta-llama/Llama-3.3-70B-Instruct`
- `meta-llama/Llama-Guard-3-8B`
- `meta-llama/Llama-Guard-3-11B-Vision`

View file

@ -19,16 +19,17 @@ export LLAMA_STACK_PORT=5001
ollama run $OLLAMA_INFERENCE_MODEL --keepalive 60m
```
By default, Ollama keeps the model loaded in memory for 5 minutes which can be too short. We set the `--keepalive` flag to 60 minutes to enspagents/agenure the model remains loaded for sometime.
By default, Ollama keeps the model loaded in memory for 5 minutes which can be too short. We set the `--keepalive` flag to 60 minutes to ensure the model remains loaded for sometime.
### 2. Start the Llama Stack server
Llama Stack is based on a client-server architecture. It consists of a server which can be configured very flexibly so you can mix-and-match various providers for its individual API components -- beyond Inference, these include Memory, Agents, Telemetry, Evals and so forth.
To get started quickly, we provide various Docker images for the server component that work with different inference providers out of the box. For this guide, we will use `llamastack/distribution-ollama` as the Docker image.
```bash
docker run \
-it \
docker run -it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
llamastack/distribution-ollama \
@ -42,8 +43,7 @@ Configuration for this is available at `distributions/ollama/run.yaml`.
### 3. Use the Llama Stack client SDK
You can interact with the Llama Stack server using the `llama-stack-client` CLI or via the Python SDK.
You can interact with the Llama Stack server using various client SDKs. We will use the Python SDK which you can install using the following command. Note that you must be using Python 3.10 or newer:
```bash
pip install llama-stack-client
```
@ -51,7 +51,8 @@ pip install llama-stack-client
Let's use the `llama-stack-client` CLI to check the connectivity to the server.
```bash
llama-stack-client --endpoint http://localhost:$LLAMA_STACK_PORT models list
llama-stack-client configure --endpoint http://localhost:$LLAMA_STACK_PORT
llama-stack-client models list
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ identifier ┃ provider_id ┃ provider_resource_id ┃ metadata ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
@ -61,8 +62,8 @@ llama-stack-client --endpoint http://localhost:$LLAMA_STACK_PORT models list
You can test basic Llama inference completion using the CLI too.
```bash
llama-stack-client --endpoint http://localhost:$LLAMA_STACK_PORT \
inference chat_completion \
llama-stack-client \
inference chat-completion \
--message "hello, what model are you?"
```
@ -118,11 +119,11 @@ async def run_main():
model=os.environ["INFERENCE_MODEL"],
instructions="You are a helpful assistant",
tools=[{"type": "memory"}], # enable Memory aka RAG
enable_session_persistence=True,
)
agent = Agent(client, agent_config)
session_id = agent.create_session("test-session")
print(f"Created session_id={session_id} for Agent({agent.agent_id})")
user_prompts = [
(
"I am attaching documentation for Torchtune. Help me answer questions I will ask next.",
@ -139,7 +140,7 @@ async def run_main():
attachments=attachments,
session_id=session_id,
)
async for log in EventLogger().log(response):
for log in EventLogger().log(response):
log.print()

View file

@ -13,55 +13,16 @@ Our goal is to provide pre-packaged implementations which can be operated in a v
The Stack APIs are rapidly improving but still a work-in-progress. We invite feedback as well as direct contributions.
```
## Philosophy
## Quick Links
### Service-oriented design
- New to Llama Stack? Start with the [Introduction](introduction/index) to understand our motivation and vision.
- Ready to build? Check out the [Quick Start](getting_started/index) to get started.
- Need specific providers? Browse [Distributions](distributions/index) to see all the options available.
- Want to contribute? See the [Contributing](contributing/index) guide.
Unlike other frameworks, Llama Stack is built with a service-oriented, REST API-first approach. Such a design not only allows for seamless transitions from a local to remote deployments, but also forces the design to be more declarative. We believe this restriction can result in a much simpler, robust developer experience. This will necessarily trade-off against expressivity however if we get the APIs right, it can lead to a very powerful platform.
## Available SDKs
### Composability
We expect the set of APIs we design to be composable. An Agent abstractly depends on { Inference, Memory, Safety } APIs but does not care about the actual implementation details. Safety itself may require model inference and hence can depend on the Inference API.
### Turnkey one-stop solutions
We expect to provide turnkey solutions for popular deployment scenarios. It should be easy to deploy a Llama Stack server on AWS or on a private data center. Either of these should allow a developer to get started with powerful agentic apps, model evaluations or fine-tuning services in a matter of minutes. They should all result in the same uniform observability and developer experience.
### Focus on Llama models
As a Meta initiated project, we have started by explicitly focusing on Meta's Llama series of models. Supporting the broad set of open models is no easy task and we want to start with models we understand best.
### Supporting the Ecosystem
There is a vibrant ecosystem of Providers which provide efficient inference or scalable vector stores or powerful observability solutions. We want to make sure it is easy for developers to pick and choose the best implementations for their use cases. We also want to make sure it is easy for new Providers to onboard and participate in the ecosystem.
Additionally, we have designed every element of the Stack such that APIs as well as Resources (like Models) can be federated.
## Supported Llama Stack Implementations
Llama Stack already has a number of "adapters" available for some popular Inference and Memory (Vector Store) providers. For other APIs (particularly Safety and Agents), we provide *reference implementations* you can use to get started. We expect this list to grow over time. We are slowly onboarding more providers to the ecosystem as we get more confidence in the APIs.
| **API Provider** | **Environments** | **Agents** | **Inference** | **Memory** | **Safety** | **Telemetry** |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| Meta Reference | Single Node | Y | Y | Y | Y | Y |
| Fireworks | Hosted | Y | Y | Y | | |
| AWS Bedrock | Hosted | | Y | | Y | |
| Together | Hosted | Y | Y | | Y | |
| Ollama | Single Node | | Y | | |
| TGI | Hosted and Single Node | | Y | | |
| Chroma | Single Node | | | Y | | |
| Postgres | Single Node | | | Y | | |
| PyTorch ExecuTorch | On-device iOS | Y | Y | | |
## Dive In
- Look at [Quick Start](getting_started/index) section to get started with Llama Stack.
- Learn more about [Llama Stack Concepts](concepts/index) to understand how different components fit together.
- Check out [Zero to Hero](https://github.com/meta-llama/llama-stack/tree/main/docs/zero_to_hero_guide) guide to learn in details about how to build your first agent.
- See how you can use [Llama Stack Distributions](distributions/index) to get started with popular inference and other service providers.
We also provide a number of Client side SDKs to make it easier to connect to Llama Stack server in your preferred language.
We have a number of client-side SDKs available for different languages.
| **Language** | **Client SDK** | **Package** |
| :----: | :----: | :----: |
@ -70,17 +31,36 @@ We also provide a number of Client side SDKs to make it easier to connect to Lla
| Node | [llama-stack-client-node](https://github.com/meta-llama/llama-stack-client-node) | [![NPM version](https://img.shields.io/npm/v/llama-stack-client.svg)](https://npmjs.org/package/llama-stack-client)
| Kotlin | [llama-stack-client-kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) | [![Maven version](https://img.shields.io/maven-central/v/com.llama.llamastack/llama-stack-client-kotlin)](https://central.sonatype.com/artifact/com.llama.llamastack/llama-stack-client-kotlin)
You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo.
## Supported Llama Stack Implementations
A number of "adapters" are available for some popular Inference and Memory (Vector Store) providers. For other APIs (particularly Safety and Agents), we provide *reference implementations* you can use to get started. We expect this list to grow over time. We are slowly onboarding more providers to the ecosystem as we get more confidence in the APIs.
| **API Provider** | **Environments** | **Agents** | **Inference** | **Memory** | **Safety** | **Telemetry** |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| Meta Reference | Single Node | Y | Y | Y | Y | Y |
| Cerebras | Single Node | | Y | | | |
| Fireworks | Hosted | Y | Y | Y | | |
| AWS Bedrock | Hosted | | Y | | Y | |
| Together | Hosted | Y | Y | | Y | |
| Ollama | Single Node | | Y | | |
| TGI | Hosted and Single Node | | Y | | |
| [NVIDIA NIM](https://build.nvidia.com/nim?filters=nimType%3Anim_type_run_anywhere&q=llama) | Hosted and Single Node | | Y | | |
| Chroma | Single Node | | | Y | | |
| Postgres | Single Node | | | Y | | |
| PyTorch ExecuTorch | On-device iOS | Y | Y | | |
| PyTorch ExecuTorch | On-device Android | | Y | | |
```{toctree}
:hidden:
:maxdepth: 3
introduction/index
getting_started/index
concepts/index
distributions/index
building_applications/index
benchmark_evaluations/index
playground/index
contributing/index
references/index
cookbooks/index
```

View file

@ -0,0 +1,95 @@
# Why Llama Stack?
Building production AI applications today requires solving multiple challenges:
**Infrastructure Complexity**
- Running large language models efficiently requires specialized infrastructure.
- Different deployment scenarios (local development, cloud, edge) need different solutions.
- Moving from development to production often requires significant rework.
**Essential Capabilities**
- Safety guardrails and content filtering are necessary in an enterprise setting.
- Just model inference is not enough - Knowledge retrieval and RAG capabilities are required.
- Nearly any application needs composable multi-step workflows.
- Finally, without monitoring, observability and evaluation, you end up operating in the dark.
**Lack of Flexibility and Choice**
- Directly integrating with multiple providers creates tight coupling.
- Different providers have different APIs and abstractions.
- Changing providers requires significant code changes.
### The Vision: A Universal Stack
```{image} ../../_static/llama-stack.png
:alt: Llama Stack
:width: 400px
```
Llama Stack defines and standardizes the core building blocks needed to bring generative AI applications to market. These building blocks are presented as interoperable APIs with a broad set of Service Providers providing their implementations.
#### Service-oriented Design
Unlike other frameworks, Llama Stack is built with a service-oriented, REST API-first approach. Such a design not only allows for seamless transitions from local to remote deployments but also forces the design to be more declarative. This restriction can result in a much simpler, robust developer experience. The same code works across different environments:
- Local development with CPU-only setups
- Self-hosted with GPU acceleration
- Cloud-hosted on providers like AWS, Fireworks, Together
- On-device for iOS and Android
#### Composability
The APIs we design are composable. An Agent abstractly depends on { Inference, Memory, Safety } APIs but does not care about the actual implementation details. Safety itself may require model inference and hence can depend on the Inference API.
#### Turnkey Solutions
We provide turnkey solutions for popular deployment scenarios. It should be easy to deploy a Llama Stack server on AWS or in a private data center. Either of these should allow a developer to get started with powerful agentic apps, model evaluations, or fine-tuning services in minutes.
We have built-in support for critical needs:
- Safety guardrails and content filtering
- Comprehensive evaluation capabilities
- Full observability and monitoring
- Provider federation and fallback
#### Focus on Llama Models
As a Meta-initiated project, we explicitly focus on Meta's Llama series of models. Supporting the broad set of open models is no easy task and we want to start with models we understand best.
#### Supporting the Ecosystem
There is a vibrant ecosystem of Providers which provide efficient inference or scalable vector stores or powerful observability solutions. We want to make sure it is easy for developers to pick and choose the best implementations for their use cases. We also want to make sure it is easy for new Providers to onboard and participate in the ecosystem.
Additionally, we have designed every element of the Stack such that APIs as well as Resources (like Models) can be federated.
#### Rich Provider Ecosystem
```{list-table}
:header-rows: 1
* - Provider
- Local
- Self-hosted
- Cloud
* - Inference
- Ollama
- vLLM, TGI
- Fireworks, Together, AWS
* - Memory
- FAISS
- Chroma, pgvector
- Weaviate
* - Safety
- Llama Guard
- -
- AWS Bedrock
```
### Unified API Layer
Llama Stack provides a consistent interface for:
- **Inference**: Run LLM models efficiently
- **Safety**: Apply content filtering and safety policies
- **Memory**: Store and retrieve knowledge for RAG
- **Agents**: Build multi-step workflows
- **Evaluation**: Test and improve application quality

View file

@ -0,0 +1,109 @@
# Llama Stack Playground
```{note}
The Llama Stack Playground is currently experimental and subject to change. We welcome feedback and contributions to help improve it.
```
The Llama Stack Playground is an simple interface which aims to:
- Showcase **capabilities** and **concepts** of Llama Stack in an interactive environment
- Demo **end-to-end** application code to help users get started to build their own applications
- Provide an **UI** to help users inspect and understand Llama Stack API providers and resources
## Key Features
#### Playground
Interactive pages for users to play with and explore Llama Stack API capabilities.
##### Chatbot
```{eval-rst}
.. video:: https://github.com/user-attachments/assets/8d2ef802-5812-4a28-96e1-316038c84cbf
:autoplay:
:playsinline:
:muted:
:loop:
:width: 100%
```
- **Chat**: Chat with Llama models.
- This page is a simple chatbot that allows you to chat with Llama models. Under the hood, it uses the `/inference/chat-completion` streaming API to send messages to the model and receive responses.
- **RAG**: Uploading documents to memory_banks and chat with RAG agent
- This page allows you to upload documents as a `memory_bank` and then chat with a RAG agent to query information about the uploaded documents.
- Under the hood, it uses Llama Stack's `/agents` API to define and create a RAG agent and chat with it in a session.
##### Evaluations
```{eval-rst}
.. video:: https://github.com/user-attachments/assets/6cc1659f-eba4-49ca-a0a5-7c243557b4f5
:autoplay:
:playsinline:
:muted:
:loop:
:width: 100%
```
- **Evaluations (Scoring)**: Run evaluations on your AI application datasets.
- This page demonstrates the flow evaluation API to run evaluations on your custom AI application datasets. You may upload your own evaluation datasets and run evaluations using available scoring functions.
- Under the hood, it uses Llama Stack's `/scoring` API to run evaluations on selected scoring functions.
```{eval-rst}
.. video:: https://github.com/user-attachments/assets/345845c7-2a2b-4095-960a-9ae40f6a93cf
:autoplay:
:playsinline:
:muted:
:loop:
:width: 100%
```
- **Evaluations (Generation + Scoring)**: Use pre-registered evaluation tasks to evaluate an model or agent candidate
- This page demonstrates the flow for evaluation API to evaluate an model or agent candidate on pre-defined evaluation tasks. An evaluation task is a combination of dataset and scoring functions.
- Under the hood, it uses Llama Stack's `/eval` API to run generations and scorings on specified evaluation configs.
- In order to run this page, you may need to register evaluation tasks and datasets as resources first through the following commands.
```bash
$ llama-stack-client datasets register \
--dataset-id "mmlu" \
--provider-id "huggingface" \
--url "https://huggingface.co/datasets/llamastack/evals" \
--metadata '{"path": "llamastack/evals", "name": "evals__mmlu__details", "split": "train"}' \
--schema '{"input_query": {"type": "string"}, "expected_answer": {"type": "string"}, "chat_completion_input": {"type": "string"}}'
```
```bash
$ llama-stack-client eval_tasks register \
--eval-task-id meta-reference-mmlu \
--provider-id meta-reference \
--dataset-id mmlu \
--scoring-functions basic::regex_parser_multiple_choice_answer
```
##### Inspect
```{eval-rst}
.. video:: https://github.com/user-attachments/assets/01d52b2d-92af-4e3a-b623-a9b8ba22ba99
:autoplay:
:playsinline:
:muted:
:loop:
:width: 100%
```
- **API Providers**: Inspect Llama Stack API providers
- This page allows you to inspect Llama Stack API providers and resources.
- Under the hood, it uses Llama Stack's `/providers` API to get information about the providers.
- **API Resources**: Inspect Llama Stack API resources
- This page allows you to inspect Llama Stack API resources (`models`, `datasets`, `memory_banks`, `eval_tasks`, `shields`).
- Under the hood, it uses Llama Stack's `/<resources>/list` API to get information about each resources.
- Please visit [Core Concepts](https://llama-stack.readthedocs.io/en/latest/concepts/index.html) for more details about the resources.
## Starting the Llama Stack Playground
To start the Llama Stack Playground, run the following commands:
1. Start up the Llama Stack API server
```bash
llama stack build --template together --image-type conda
llama stack run together
```
2. Start Streamlit UI
```bash
cd llama_stack/distribution/ui
pip install -r requirements.txt
streamlit run app.py
```

View file

@ -0,0 +1,359 @@
# Evaluations
The Llama Stack Evaluation flow allows you to run evaluations on your GenAI application datasets or pre-registered benchmarks.
We introduce a set of APIs in Llama Stack for supporting running evaluations of LLM applications.
- `/datasetio` + `/datasets` API
- `/scoring` + `/scoring_functions` API
- `/eval` + `/eval_tasks` API
This guide goes over the sets of APIs and developer experience flow of using Llama Stack to run evaluations for different use cases. Checkout our Colab notebook on working examples with evaluations [here](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing).
## Evaluation Concepts
The Evaluation APIs are associated with a set of Resources as shown in the following diagram. Please visit the Resources section in our [Core Concepts](../concepts/index.md) guide for better high-level understanding.
![Eval Concepts](./resources/eval-concept.png)
- **DatasetIO**: defines interface with datasets and data loaders.
- Associated with `Dataset` resource.
- **Scoring**: evaluate outputs of the system.
- Associated with `ScoringFunction` resource. We provide a suite of out-of-the box scoring functions and also the ability for you to add custom evaluators. These scoring functions are the core part of defining an evaluation task to output evaluation metrics.
- **Eval**: generate outputs (via Inference or Agents) and perform scoring.
- Associated with `EvalTask` resource.
Use the following decision tree to decide how to use LlamaStack Evaluation flow.
![Eval Flow](./resources/eval-flow.png)
```{admonition} Note on Benchmark v.s. Application Evaluation
:class: tip
- **Benchmark Evaluation** is a well-defined eval-task consisting of `dataset` and `scoring_function`. The generation (inference or agent) will be done as part of evaluation.
- **Application Evaluation** assumes users already have app inputs & generated outputs. Evaluation will purely focus on scoring the generated outputs via scoring functions (e.g. LLM-as-judge).
```
## Evaluation Examples Walkthrough
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing)
It is best to open this notebook in Colab to follow along with the examples.
### 1. Open Benchmark Model Evaluation
This first example walks you through how to evaluate a model candidate served by Llama Stack on open benchmarks. We will use the following benchmark:
- [MMMU](https://arxiv.org/abs/2311.16502) (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI)]: Benchmark designed to evaluate multimodal models.
- [SimpleQA](https://openai.com/index/introducing-simpleqa/): Benchmark designed to access models to answer short, fact-seeking questions.
#### 1.1 Running MMMU
- We will use a pre-processed MMMU dataset from [llamastack/mmmu](https://huggingface.co/datasets/llamastack/mmmu). The preprocessing code is shown in this [GitHub Gist](https://gist.github.com/yanxi0830/118e9c560227d27132a7fd10e2c92840). The dataset is obtained by transforming the original [MMMU/MMMU](https://huggingface.co/datasets/MMMU/MMMU) dataset into correct format by `inference/chat-completion` API.
```python
import datasets
ds = datasets.load_dataset(path="llamastack/mmmu", name="Agriculture", split="dev")
ds = ds.select_columns(["chat_completion_input", "input_query", "expected_answer"])
eval_rows = ds.to_pandas().to_dict(orient="records")
```
- Next, we will run evaluation on an model candidate, we will need to:
- Define a system prompt
- Define an EvalCandidate
- Run evaluate on the dataset
```python
SYSTEM_PROMPT_TEMPLATE = """
You are an expert in Agriculture whose job is to answer questions from the user using images.
First, reason about the correct answer.
Then write the answer in the following format where X is exactly one of A,B,C,D:
Answer: X
Make sure X is one of A,B,C,D.
If you are uncertain of the correct answer, guess the most likely one.
"""
system_message = {
"role": "system",
"content": SYSTEM_PROMPT_TEMPLATE,
}
client.eval_tasks.register(
eval_task_id="meta-reference::mmmu",
dataset_id=f"mmmu-{subset}-{split}",
scoring_functions=["basic::regex_parser_multiple_choice_answer"]
)
response = client.eval.evaluate_rows(
task_id="meta-reference::mmmu",
input_rows=eval_rows,
scoring_functions=["basic::regex_parser_multiple_choice_answer"],
task_config={
"type": "benchmark",
"eval_candidate": {
"type": "model",
"model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
"sampling_params": {
"temperature": 0.0,
"max_tokens": 4096,
"top_p": 0.9,
"repeat_penalty": 1.0,
},
"system_message": system_message
}
}
)
```
#### 1.2. Running SimpleQA
- We will use a pre-processed SimpleQA dataset from [llamastack/evals](https://huggingface.co/datasets/llamastack/evals/viewer/evals__simpleqa) which is obtained by transforming the input query into correct format accepted by `inference/chat-completion` API.
- Since we will be using this same dataset in our next example for Agentic evaluation, we will register it using the `/datasets` API, and interact with it through `/datasetio` API.
```python
simpleqa_dataset_id = "huggingface::simpleqa"
_ = client.datasets.register(
dataset_id=simpleqa_dataset_id,
provider_id="huggingface",
url={"uri": "https://huggingface.co/datasets/llamastack/evals"},
metadata={
"path": "llamastack/evals",
"name": "evals__simpleqa",
"split": "train",
},
dataset_schema={
"input_query": {"type": "string"},
"expected_answer": {"type": "string"},
"chat_completion_input": {"type": "chat_completion_input"},
}
)
eval_rows = client.datasetio.get_rows_paginated(
dataset_id=simpleqa_dataset_id,
rows_in_page=5,
)
```
```python
client.eval_tasks.register(
eval_task_id="meta-reference::simpleqa",
dataset_id=simpleqa_dataset_id,
scoring_functions=["llm-as-judge::405b-simpleqa"]
)
response = client.eval.evaluate_rows(
task_id="meta-reference::simpleqa",
input_rows=eval_rows.rows,
scoring_functions=["llm-as-judge::405b-simpleqa"],
task_config={
"type": "benchmark",
"eval_candidate": {
"type": "model",
"model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
"sampling_params": {
"temperature": 0.0,
"max_tokens": 4096,
"top_p": 0.9,
"repeat_penalty": 1.0,
},
}
}
)
```
### 2. Agentic Evaluation
- In this example, we will demonstrate how to evaluate a agent candidate served by Llama Stack via `/agent` API.
- We will continue to use the SimpleQA dataset we used in previous example.
- Instead of running evaluation on model, we will run the evaluation on a Search Agent with access to search tool. We will define our agent evaluation candidate through `AgentConfig`.
```python
agent_config = {
"model": "meta-llama/Llama-3.1-405B-Instruct",
"instructions": "You are a helpful assistant",
"sampling_params": {
"strategy": "greedy",
"temperature": 0.0,
"top_p": 0.95,
},
"tools": [
{
"type": "brave_search",
"engine": "tavily",
"api_key": userdata.get("TAVILY_SEARCH_API_KEY")
}
],
"tool_choice": "auto",
"tool_prompt_format": "json",
"input_shields": [],
"output_shields": [],
"enable_session_persistence": False
}
response = client.eval.evaluate_rows(
task_id="meta-reference::simpleqa",
input_rows=eval_rows.rows,
scoring_functions=["llm-as-judge::405b-simpleqa"],
task_config={
"type": "benchmark",
"eval_candidate": {
"type": "agent",
"config": agent_config,
}
}
)
```
### 3. Agentic Application Dataset Scoring
- Llama Stack offers a library of scoring functions and the `/scoring` API, allowing you to run evaluations on your pre-annotated AI application datasets.
- In this example, we will work with an example RAG dataset and couple of scoring functions for evaluation.
- `llm-as-judge::base`: LLM-As-Judge with custom judge prompt & model.
- `braintrust::factuality`: Factuality scorer from [braintrust](https://github.com/braintrustdata/autoevals).
- `basic::subset_of`: Basic checking if generated answer is a subset of expected answer.
- Please checkout our [Llama Stack Playground](https://llama-stack.readthedocs.io/en/latest/playground/index.html) for an interactive interface to upload datasets and run scorings.
```python
judge_model_id = "meta-llama/Llama-3.1-405B-Instruct-FP8"
JUDGE_PROMPT = """
Given a QUESTION and GENERATED_RESPONSE and EXPECTED_RESPONSE.
Compare the factual content of the GENERATED_RESPONSE with the EXPECTED_RESPONSE. Ignore any differences in style, grammar, or punctuation.
The GENERATED_RESPONSE may either be a subset or superset of the EXPECTED_RESPONSE, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The GENERATED_RESPONSE is a subset of the EXPECTED_RESPONSE and is fully consistent with it.
(B) The GENERATED_RESPONSE is a superset of the EXPECTED_RESPONSE and is fully consistent with it.
(C) The GENERATED_RESPONSE contains all the same details as the EXPECTED_RESPONSE.
(D) There is a disagreement between the GENERATED_RESPONSE and the EXPECTED_RESPONSE.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
Give your answer in the format "Answer: One of ABCDE, Explanation: ".
Your actual task:
QUESTION: {input_query}
GENERATED_RESPONSE: {generated_answer}
EXPECTED_RESPONSE: {expected_answer}
"""
input_query = "What are the top 5 topics that were explained? Only list succinct bullet points."
generated_answer = """
Here are the top 5 topics that were explained in the documentation for Torchtune:
* What is LoRA and how does it work?
* Fine-tuning with LoRA: memory savings and parameter-efficient finetuning
* Running a LoRA finetune with Torchtune: overview and recipe
* Experimenting with different LoRA configurations: rank, alpha, and attention modules
* LoRA finetuning
"""
expected_answer = """LoRA"""
dataset_rows = [
{
"input_query": input_query,
"generated_answer": generated_answer,
"expected_answer": expected_answer,
},
]
scoring_params = {
"llm-as-judge::base": {
"judge_model": judge_model_id,
"prompt_template": JUDGE_PROMPT,
"type": "llm_as_judge",
"judge_score_regexes": ["Answer: (A|B|C|D|E)"],
},
"basic::subset_of": None,
"braintrust::factuality": None,
}
response = client.scoring.score(input_rows=dataset_rows, scoring_functions=scoring_params)
```
## Running Evaluations via CLI
The following examples give the quick steps to start running evaluations using the llama-stack-client CLI.
#### Benchmark Evaluation CLI
Usage: There are 2 inputs necessary for running a benchmark eval
- `eval-task-id`: the identifier associated with the eval task. Each `EvalTask` is parametrized by
- `dataset_id`: the identifier associated with the dataset.
- `List[scoring_function_id]`: list of scoring function identifiers.
- `eval-task-config`: specifies the configuration of the model / agent to evaluate on.
```
llama-stack-client eval run_benchmark <eval-task-id> \
--eval-task-config ~/eval_task_config.json \
--visualize
```
#### Application Evaluation CLI
Usage: For running application evals, you will already have available datasets in hand from your application. You will need to specify:
- `scoring-fn-id`: List of ScoringFunction identifiers you wish to use to run on your application.
- `Dataset` used for evaluation:
- (1) `--dataset-path`: path to local file system containing datasets to run evaluation on
- (2) `--dataset-id`: pre-registered dataset in Llama Stack
- (Optional) `--scoring-params-config`: optionally parameterize scoring functions with custom params (e.g. `judge_prompt`, `judge_model`, `parsing_regexes`).
```
llama-stack-client eval run_scoring <scoring_fn_id_1> <scoring_fn_id_2> ... <scoring_fn_id_n>
--dataset-path <path-to-local-dataset> \
--output-dir ./
```
#### Defining EvalTaskConfig
The `EvalTaskConfig` are user specified config to define:
1. `EvalCandidate` to run generation on:
- `ModelCandidate`: The model will be used for generation through LlamaStack /inference API.
- `AgentCandidate`: The agentic system specified by AgentConfig will be used for generation through LlamaStack /agents API.
2. Optionally scoring function params to allow customization of scoring function behaviour. This is useful to parameterize generic scoring functions such as LLMAsJudge with custom `judge_model` / `judge_prompt`.
**Example Benchmark EvalTaskConfig**
```json
{
"type": "benchmark",
"eval_candidate": {
"type": "model",
"model": "Llama3.2-3B-Instruct",
"sampling_params": {
"strategy": "greedy",
"temperature": 0,
"top_p": 0.95,
"top_k": 0,
"max_tokens": 0,
"repetition_penalty": 1.0
}
}
}
```
**Example Application EvalTaskConfig**
```json
{
"type": "app",
"eval_candidate": {
"type": "model",
"model": "Llama3.1-405B-Instruct",
"sampling_params": {
"strategy": "greedy",
"temperature": 0,
"top_p": 0.95,
"top_k": 0,
"max_tokens": 0,
"repetition_penalty": 1.0
}
},
"scoring_params": {
"llm-as-judge::llm_as_judge_base": {
"type": "llm_as_judge",
"judge_model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt_template": "Your job is to look at a question, a gold target ........",
"judge_score_regexes": [
"(A|B|C)"
]
}
}
}
```

View file

Before

Width:  |  Height:  |  Size: 68 KiB

After

Width:  |  Height:  |  Size: 68 KiB

Before After
Before After

View file

Before

Width:  |  Height:  |  Size: 249 KiB

After

Width:  |  Height:  |  Size: 249 KiB

Before After
Before After

View file

@ -14,4 +14,5 @@ python_sdk_reference/index
llama_cli_reference/index
llama_stack_client_cli_reference
llama_cli_reference/download_models
evals_reference/index
```