---
title: Evaluations
description: Evaluate LLM applications with Llama Stack's comprehensive evaluation framework
sidebar_label: Evaluations
sidebar_position: 7
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

This guide walks you through the process of evaluating an LLM application built using Llama Stack. For detailed API reference, check out the [Evaluation Reference](/docs/references/evals-reference) guide that covers the complete set of APIs and developer experience flow.
:::tip[Interactive Examples]
Check out our [Colab notebook](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing) for working examples with evaluations, or try the [Getting Started notebook](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb).
:::
## Application Evaluation Example
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)
Llama Stack offers a library of scoring functions and the `/scoring` API, allowing you to run evaluations on your pre-annotated AI application datasets.
In this example, we will show you how to:
1. **Build an Agent** with Llama Stack
2. **Query the agent's sessions, turns, and steps** to analyze execution
3. **Evaluate the results** using scoring functions
## Step-by-Step Evaluation Process
### 1. Building a Search Agent
First, let's create an agent that can search the web to answer questions:
```python
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger

# HOST and PORT point at your running Llama Stack server, e.g. "localhost" and 8321
client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")

agent = Agent(
    client,
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a helpful assistant. Use the search tool to answer the questions.",
    tools=["builtin::websearch"],
)

# Test prompts for evaluation
user_prompts = [
    "Which teams played in the NBA Western Conference Finals of 2024. Search the web for the answer.",
    "In which episode and season of South Park does Bill Cosby (BSM-471) first appear? Give me the number and title. Search the web for the answer.",
    "What is the British-American kickboxer Andrew Tate's kickboxing name? Search the web for the answer.",
]

session_id = agent.create_session("test-session")

# Execute all prompts in the session
for prompt in user_prompts:
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )
    for log in AgentEventLogger().log(response):
        log.print()
```
### 2. Query Agent Execution Steps
Now, let's analyze the agent's execution steps to understand its performance:
<Tabs>
<TabItem value="session-analysis" label="Session Analysis">
```python
from rich.pretty import pprint

# Query the agent's session to get detailed execution data
session_response = client.agents.session.retrieve(
    session_id=session_id,
    agent_id=agent.agent_id,
)
pprint(session_response)
```
</TabItem>
<TabItem value="tool-validation" label="Tool Usage Validation">
```python
# Sanity check: verify that every user prompt is followed by a tool call.
# With a Brave Search provider, `builtin::websearch` surfaces as a `brave_search` tool call.
num_tool_call = 0
for turn in session_response.turns:
    for step in turn.steps:
        if (
            step.step_type == "tool_execution"
            and step.tool_calls[0].tool_name == "brave_search"
        ):
            num_tool_call += 1

print(
    f"{num_tool_call}/{len(session_response.turns)} user prompts are followed by a tool call to `brave_search`"
)
```
</TabItem>
</Tabs>
### 3. Evaluate Agent Responses
Now we'll evaluate the agent's responses using Llama Stack's scoring API:
<Tabs>
<TabItem value="data-preparation" label="Data Preparation">
```python
# Process agent execution history into evaluation rows
eval_rows = []

# Define expected answers for our test prompts
expected_answers = [
    "Dallas Mavericks and the Minnesota Timberwolves",
    "Season 4, Episode 12",
    "King Cobra",
]

# Create evaluation dataset from agent responses
for i, turn in enumerate(session_response.turns):
    eval_rows.append(
        {
            "input_query": turn.input_messages[0].content,
            "generated_answer": turn.output_message.content,
            "expected_answer": expected_answers[i],
        }
    )

pprint(eval_rows)
```
</TabItem>
<TabItem value="scoring" label="Scoring & Evaluation">
```python
# Configure scoring parameters
scoring_params = {
    "basic::subset_of": None,  # Check if generated answer contains expected answer
}

# Run evaluation using Llama Stack's scoring API
scoring_response = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions=scoring_params,
)
pprint(scoring_response)

# Analyze results: `results` maps each scoring function id to its per-row scores
score_rows = scoring_response.results["basic::subset_of"].score_rows
for i, row in enumerate(score_rows):
    print(f"Query {i+1}:")
    print(f"  Generated: {eval_rows[i]['generated_answer'][:100]}...")
    print(f"  Expected:  {expected_answers[i]}")
    print(f"  Score:     {row['score']}")
    print()
```
</TabItem>
</Tabs>
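In the response, `scoring_response.results` is a mapping keyed by scoring function identifier. Each entry carries the per-row scores (`score_rows`) along with any aggregated metrics (`aggregated_results`), which is why the analysis loop above indexes into `results["basic::subset_of"]`.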
## Available Scoring Functions
Llama Stack provides several built-in scoring functions:
### Basic Scoring Functions
- **`basic::subset_of`**: Checks if the expected answer is contained in the generated response
- **`basic::exact_match`**: Performs exact string matching between expected and generated answers
- **`basic::regex_match`**: Uses regular expressions to match patterns in responses
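As a quick illustration of the difference between the first two checks, here is a minimal sketch that scores the same row with both (it assumes these function ids are registered on your distribution and that `client` is the `LlamaStackClient` from step 1):

```python
# Minimal sketch: contrast strict and lenient string checks on one row.
# Assumes `basic::exact_match` and `basic::subset_of` are available.
rows = [
    {
        "input_query": "Who won the 2024 NBA Western Conference Finals?",
        "generated_answer": "The Dallas Mavericks won the series.",
        "expected_answer": "Dallas Mavericks",
    }
]

response = client.scoring.score(
    input_rows=rows,
    scoring_functions={
        "basic::exact_match": None,  # full-string equality: fails on this row
        "basic::subset_of": None,  # substring containment: passes on this row
    },
)
for fn_id, result in response.results.items():
    print(fn_id, result.score_rows)
```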
### Advanced Scoring Functions
- **`llm_as_judge::accuracy`**: Uses an LLM to judge response accuracy
- **`llm_as_judge::helpfulness`**: Evaluates how helpful the response is
- **`llm_as_judge::safety`**: Assesses response safety and appropriateness
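Judge-based functions are parameterized by the model that does the judging. The sketch below is an assumption about the parameter shape (a `judge_model` plus a discriminator `type`); check the scoring function schema on your distribution before relying on it:

```python
# Hedged sketch of judge-based scoring; the exact parameter schema
# (the "type" and "judge_model" keys) is an assumption to verify.
judge_response = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions={
        "llm_as_judge::accuracy": {
            "type": "llm_as_judge",
            "judge_model": "meta-llama/Llama-3.3-70B-Instruct",
        },
    },
)
pprint(judge_response.results["llm_as_judge::accuracy"].aggregated_results)
```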
### Custom Scoring Functions
You can also create custom scoring functions for domain-specific evaluation needs.
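For example, a domain-specific check can be registered under a new function id via the scoring functions API; the sketch below is hypothetical (the exact `register` arguments vary by version, so treat the argument names and the `custom::` namespace as assumptions):

```python
# Hypothetical registration sketch — argument names and the "custom::"
# namespace are assumptions, not a confirmed API surface.
client.scoring_functions.register(
    scoring_fn_id="custom::contains_citation",
    description="Checks that the generated answer cites at least one source",
    return_type={"type": "boolean"},
    provider_id="basic",
)
```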
## Evaluation Workflow Best Practices
### 🎯 **Dataset Preparation**
- Use diverse test cases that cover edge cases and common scenarios
- Include clear expected answers or success criteria
- Balance your dataset across different difficulty levels
### 📊 **Metrics Selection**
- Choose appropriate scoring functions for your use case
- Combine multiple metrics for comprehensive evaluation
- Consider both automated and human evaluation metrics
### 🔄 **Iterative Improvement**
- Run evaluations regularly during development
- Use evaluation results to identify areas for improvement
- Track performance changes over time
### 📈 **Analysis & Reporting**
- Analyze failures to understand model limitations
- Generate comprehensive evaluation reports
- Share results with stakeholders for informed decision-making
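To make the tracking and reporting practices above concrete, one lightweight approach is to append each run's aggregated metrics to a timestamped log; a sketch, assuming the `aggregated_results` field on each scoring result and the `scoring_response` from step 3:

```python
import json
import time

# Append one summary line per evaluation run (a simple JSONL history).
record = {
    "timestamp": time.time(),
    "metrics": {
        fn_id: result.aggregated_results
        for fn_id, result in scoring_response.results.items()
    },
}
with open("eval_history.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```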
## Advanced Evaluation Scenarios
### Batch Evaluation
For evaluating large datasets efficiently:
```python
# Prepare a large evaluation dataset. `queries`, `generated_answers`, and
# `expected_answers` are assumed to be parallel lists you collected earlier;
# scoring compares each generated answer against its expected answer.
large_eval_dataset = [
    {
        "input_query": query,
        "generated_answer": generated,
        "expected_answer": expected,
    }
    for query, generated, expected in zip(queries, generated_answers, expected_answers)
]

# Run batch evaluation
batch_results = client.scoring.score(
    input_rows=large_eval_dataset,
    scoring_functions={
        "basic::subset_of": None,
        "llm_as_judge::accuracy": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
    },
)
```
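Each `score` call is a single synchronous request, so for very large datasets it can help to split `input_rows` into chunks and merge the per-chunk results client-side.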
### Multi-Metric Evaluation
Combining different scoring approaches:
```python
# Keys are scoring function identifiers; values are optional per-function parameters
comprehensive_scoring = {
    "basic::exact_match": None,
    "basic::subset_of": None,
    "llm_as_judge::accuracy": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
    "llm_as_judge::safety": None,
}

results = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions=comprehensive_scoring,
)
```
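The returned `results` object is keyed by the same scoring function identifiers, so each metric can be read out independently, exactly as in the analysis loop in step 3 above.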
## Related Resources
- **[Agents](./agent)** - Building agents for evaluation
- **[Tools Integration](./tools)** - Using tools in evaluated agents
- **[Evaluation Reference](/docs/references/evals-reference)** - Complete API reference for evaluations
- **[Getting Started Notebook](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)** - Interactive examples
- **[Evaluation Examples](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing)** - Additional evaluation scenarios