---
title: Evaluations
description: Evaluate LLM applications with Llama Stack's comprehensive evaluation framework
sidebar_label: Evaluations
sidebar_position: 7
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

This guide walks you through the process of evaluating an LLM application built using Llama Stack. For detailed API reference, check out the [Evaluation Reference](/docs/references/evals-reference) guide that covers the complete set of APIs and developer experience flow.

:::tip[Interactive Examples]
Check out our [Colab notebook](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing) for working examples with evaluations, or try the [Getting Started notebook](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb).
:::

## Application Evaluation Example

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)

Llama Stack offers a library of scoring functions and the `/scoring` API, allowing you to run evaluations on your pre-annotated AI application datasets.
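
As a quick preview, the sketch below scores a single pre-annotated row. It assumes a `client` already connected to a running Llama Stack server (Step 1 below shows how to create one), and the row contents are purely illustrative; the only requirement is that each row carries the fields your chosen scoring functions expect (here `generated_answer` and `expected_answer`).

```python
# Minimal sketch: score one pre-annotated row with a built-in scoring function.
# Assumes `client` is a LlamaStackClient connected to a running server (see Step 1).
response = client.scoring.score(
    input_rows=[
        {
            "input_query": "What is the capital of France?",
            "generated_answer": "The capital of France is Paris.",
            "expected_answer": "Paris",
        }
    ],
    scoring_functions={"basic::subset_of": None},
)
print(response)
```
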

In this example, we will show you how to:
1. **Build an Agent** with Llama Stack
2. **Query the agent's sessions, turns, and steps** to analyze its execution
3. **Evaluate the results** using scoring functions

## Step-by-Step Evaluation Process

### 1. Building a Search Agent

First, let's create an agent that can search the web to answer questions:

```python
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger

# Point the client at your running Llama Stack server
# (8321 is the default port; adjust HOST and PORT for your deployment)
HOST = "localhost"
PORT = 8321

client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")

agent = Agent(
    client,
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a helpful assistant. Use the search tool to answer the questions.",
    tools=["builtin::websearch"],
)

# Test prompts for evaluation
user_prompts = [
    "Which teams played in the NBA Western Conference Finals of 2024? Search the web for the answer.",
    "In which episode and season of South Park does Bill Cosby (BSM-471) first appear? Give me the number and title. Search the web for the answer.",
    "What is the British-American kickboxer Andrew Tate's kickboxing name? Search the web for the answer.",
]

session_id = agent.create_session("test-session")

# Execute all prompts in the same session
for prompt in user_prompts:
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )

    for log in AgentEventLogger().log(response):
        log.print()
```

### 2. Query Agent Execution Steps

Now, let's analyze the agent's execution steps to understand its performance:

<Tabs>
<TabItem value="session-analysis" label="Session Analysis">

```python
from rich.pretty import pprint

# Query the agent's session to get detailed execution data
session_response = client.agents.session.retrieve(
    session_id=session_id,
    agent_id=agent.agent_id,
)

pprint(session_response)
```

</TabItem>
<TabItem value="tool-validation" label="Tool Usage Validation">

```python
# Sanity check: verify that every user prompt is followed by a tool call
num_tool_call = 0
for turn in session_response.turns:
    for step in turn.steps:
        if (
            step.step_type == "tool_execution"
            and step.tool_calls[0].tool_name == "brave_search"
        ):
            num_tool_call += 1

print(
    f"{num_tool_call}/{len(session_response.turns)} user prompts are followed by a tool call to `brave_search`"
)
```

</TabItem>
</Tabs>

### 3. Evaluate Agent Responses

Now we'll evaluate the agent's responses using Llama Stack's scoring API:

<Tabs>
<TabItem value="data-preparation" label="Data Preparation">

```python
# Process agent execution history into evaluation rows
eval_rows = []

# Define expected answers for our test prompts
expected_answers = [
    "Dallas Mavericks and the Minnesota Timberwolves",
    "Season 4, Episode 12",
    "King Cobra",
]

# Create an evaluation dataset from the agent's responses
for i, turn in enumerate(session_response.turns):
    eval_rows.append(
        {
            "input_query": turn.input_messages[0].content,
            "generated_answer": turn.output_message.content,
            "expected_answer": expected_answers[i],
        }
    )

pprint(eval_rows)
```

</TabItem>
<TabItem value="scoring" label="Scoring & Evaluation">

```python
# Configure scoring parameters
scoring_params = {
    "basic::subset_of": None,  # check if the expected answer is contained in the generated answer
}

# Run evaluation using Llama Stack's scoring API
scoring_response = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions=scoring_params,
)

pprint(scoring_response)

# Analyze results: each scoring function reports one score row per input row
for fn_name, result in scoring_response.results.items():
    print(f"Scoring function: {fn_name}")
    for i, row in enumerate(result.score_rows):
        print(f"  Query {i + 1}")
        print(f"    Generated: {eval_rows[i]['generated_answer'][:100]}...")
        print(f"    Expected:  {expected_answers[i]}")
        print(f"    Score:     {row['score']}")
        print()
```

</TabItem>
</Tabs>

## Available Scoring Functions

Llama Stack provides several built-in scoring functions:

### Basic Scoring Functions
- **`basic::subset_of`**: Checks if the expected answer is contained in the generated response
- **`basic::exact_match`**: Performs exact string matching between expected and generated answers
- **`basic::regex_match`**: Uses regular expressions to match patterns in responses
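
To see how the basic scorers differ, you can score the same rows with several of them in one call. The sketch below reuses the `eval_rows` built in Step 3 and passes `None` to use each function's default parameters, following the convention shown earlier; the result handling assumes each scoring function returns one score row per input row.

```python
# Compare two basic scoring functions on the same evaluation rows
basic_comparison = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions={
        "basic::subset_of": None,    # expected answer contained in the generated answer
        "basic::exact_match": None,  # generated answer must match the expected answer exactly
    },
)

# Each scoring function reports its own per-row scores
for fn_name, result in basic_comparison.results.items():
    print(fn_name, [row["score"] for row in result.score_rows])
```
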
### Advanced Scoring Functions
- **`llm_as_judge::accuracy`**: Uses an LLM to judge response accuracy
- **`llm_as_judge::helpfulness`**: Evaluates how helpful the response is
- **`llm_as_judge::safety`**: Assesses response safety and appropriateness
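
The LLM-as-judge scorers need a judge model. Following the parameter convention used in the batch evaluation example later in this guide (exact parameter names may vary by provider), a sketch looks like:

```python
# Use an LLM judge to grade each generated answer against the expected answer.
# The judge model parameter follows the convention used in the batch example below.
judge_response = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions={
        "llm_as_judge::accuracy": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
    },
)
pprint(judge_response)
```
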
### Custom Scoring Functions
You can also create custom scoring functions for domain-specific evaluation needs.
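
To see which scoring functions, built-in or custom, are actually registered with your running distribution, you can list them through the client. This assumes the `scoring_functions` API is enabled in your stack configuration:

```python
# List the scoring functions registered with the running distribution
# (assumes the scoring_functions API is enabled in your stack configuration)
for scoring_fn in client.scoring_functions.list():
    print(scoring_fn.identifier, "-", scoring_fn.description)
```
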
## Evaluation Workflow Best Practices

### 🎯 **Dataset Preparation**
- Use diverse test cases that cover edge cases and common scenarios
- Include clear expected answers or success criteria
- Balance your dataset across different difficulty levels

### 📊 **Metrics Selection**
- Choose appropriate scoring functions for your use case
- Combine multiple metrics for comprehensive evaluation
- Consider both automated and human evaluation metrics

### 🔄 **Iterative Improvement**
- Run evaluations regularly during development
- Use evaluation results to identify areas for improvement
- Track performance changes over time

### 📈 **Analysis & Reporting**
- Analyze failures to understand model limitations
- Generate comprehensive evaluation reports
- Share results with stakeholders for informed decision-making
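
As a concrete starting point for the tracking and reporting practices above, the sketch below averages the per-row scores from the scoring response in Step 3 and appends them, with a timestamp, to a JSONL history file. The file name and record layout are illustrative, not part of Llama Stack.

```python
import json
import time

# Average each scoring function's per-row scores and append them to a run history
# (file name and record layout are illustrative)
record = {
    "timestamp": time.time(),
    "average_scores": {
        fn_name: sum(row["score"] for row in result.score_rows) / len(result.score_rows)
        for fn_name, result in scoring_response.results.items()
    },
}
with open("eval_history.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```
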
## Advanced Evaluation Scenarios

### Batch Evaluation
For evaluating large datasets efficiently:

```python
# Prepare a large evaluation dataset
# (queries, generated_answers, and expected_answers are your own pre-collected lists)
large_eval_dataset = [
    {
        "input_query": query,
        "generated_answer": generated,
        "expected_answer": expected,
    }
    for query, generated, expected in zip(queries, generated_answers, expected_answers)
]

# Run batch evaluation across multiple scoring functions
batch_results = client.scoring.score(
    input_rows=large_eval_dataset,
    scoring_functions={
        "basic::subset_of": None,
        "llm_as_judge::accuracy": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
    },
)
```
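
If the dataset is too large to send in a single request, one simple approach is to score it in chunks and concatenate the per-row results client-side. The chunk size below is arbitrary; tune it for your deployment.

```python
# Score a large dataset in chunks and collect per-row results client-side
chunk_size = 100  # arbitrary; tune for your deployment
collected_rows: dict[str, list] = {}

for start in range(0, len(large_eval_dataset), chunk_size):
    chunk = large_eval_dataset[start : start + chunk_size]
    chunk_response = client.scoring.score(
        input_rows=chunk,
        scoring_functions={"basic::subset_of": None},
    )
    for fn_name, result in chunk_response.results.items():
        collected_rows.setdefault(fn_name, []).extend(result.score_rows)

print({fn_name: len(rows) for fn_name, rows in collected_rows.items()})
```
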

### Multi-Metric Evaluation
Combining different scoring approaches:

```python
# Map each scoring function id to its parameters (None = defaults),
# following the same convention as the earlier examples
comprehensive_scoring = {
    "basic::exact_match": None,
    "basic::subset_of": None,
    "llm_as_judge::accuracy": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
    "llm_as_judge::safety": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
}

results = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions=comprehensive_scoring,
)
```

## Related Resources

- **[Agents](./agent)** - Building agents for evaluation
- **[Tools Integration](./tools)** - Using tools in evaluated agents
- **[Evaluation Reference](/docs/references/evals-reference)** - Complete API reference for evaluations
- **[Getting Started Notebook](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)** - Interactive examples
- **[Evaluation Examples](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing)** - Additional evaluation scenarios