---
title: Evaluations
description: Evaluate LLM applications with Llama Stack's comprehensive evaluation framework
sidebar_label: Evaluations
sidebar_position: 7
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

This guide walks you through the process of evaluating an LLM application built using Llama Stack. For detailed API reference, check out the [Evaluation Reference](/docs/references/evals-reference) guide that covers the complete set of APIs and developer experience flow.

:::tip[Interactive Examples]
Check out our [Colab notebook](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing) for working examples with evaluations, or try the [Getting Started notebook](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb).
:::

## Application Evaluation Example

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)

Llama Stack offers a library of scoring functions and the `/scoring` API, allowing you to run evaluations on your pre-annotated AI application datasets.
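
As a quick preview, the sketch below scores a single pre-annotated row. It assumes a `client` already connected to a running Llama Stack server (Step 1 below shows how to create one), and the row contents are purely illustrative; the only requirement is that each row carries the fields your chosen scoring functions expect (here `generated_answer` and `expected_answer`).

```python
# Minimal sketch: score one pre-annotated row with a built-in scoring function.
# Assumes `client` is a LlamaStackClient connected to a running server (see Step 1).
response = client.scoring.score(
    input_rows=[
        {
            "input_query": "What is the capital of France?",
            "generated_answer": "The capital of France is Paris.",
            "expected_answer": "Paris",
        }
    ],
    scoring_functions={"basic::subset_of": None},
)
print(response)
```
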

In this example, we will show you how to:
1. **Build an Agent** with Llama Stack
2. **Query the agent's sessions, turns, and steps** to analyze its execution
3. **Evaluate the results** using scoring functions

## Step-by-Step Evaluation Process

### 1. Building a Search Agent

First, let's create an agent that can search the web to answer questions:

```python
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger

# Point the client at your running Llama Stack server
# (8321 is the default port; adjust HOST and PORT for your deployment)
HOST = "localhost"
PORT = 8321

client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")

agent = Agent(
    client,
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a helpful assistant. Use the search tool to answer the questions.",
    tools=["builtin::websearch"],
)

# Test prompts for evaluation
user_prompts = [
    "Which teams played in the NBA Western Conference Finals of 2024? Search the web for the answer.",
    "In which episode and season of South Park does Bill Cosby (BSM-471) first appear? Give me the number and title. Search the web for the answer.",
    "What is the British-American kickboxer Andrew Tate's kickboxing name? Search the web for the answer.",
]

session_id = agent.create_session("test-session")

# Execute all prompts in the same session
for prompt in user_prompts:
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )

    for log in AgentEventLogger().log(response):
        log.print()
```

### 2. Query Agent Execution Steps

Now, let's analyze the agent's execution steps to understand its performance:

<Tabs>
<TabItem value="session-analysis" label="Session Analysis">

```python
from rich.pretty import pprint

# Query the agent's session to get detailed execution data
session_response = client.agents.session.retrieve(
    session_id=session_id,
    agent_id=agent.agent_id,
)

pprint(session_response)
```

</TabItem>
<TabItem value="tool-validation" label="Tool Usage Validation">

```python
# Sanity check: verify that every user prompt is followed by a tool call
num_tool_call = 0
for turn in session_response.turns:
    for step in turn.steps:
        if (
            step.step_type == "tool_execution"
            and step.tool_calls[0].tool_name == "brave_search"
        ):
            num_tool_call += 1

print(
    f"{num_tool_call}/{len(session_response.turns)} user prompts are followed by a tool call to `brave_search`"
)
```

</TabItem>
</Tabs>

### 3. Evaluate Agent Responses

Now we'll evaluate the agent's responses using Llama Stack's scoring API:

<Tabs>
<TabItem value="data-preparation" label="Data Preparation">

```python
# Process agent execution history into evaluation rows
eval_rows = []

# Define expected answers for our test prompts
expected_answers = [
    "Dallas Mavericks and the Minnesota Timberwolves",
    "Season 4, Episode 12",
    "King Cobra",
]

# Create an evaluation dataset from the agent's responses
for i, turn in enumerate(session_response.turns):
    eval_rows.append(
        {
            "input_query": turn.input_messages[0].content,
            "generated_answer": turn.output_message.content,
            "expected_answer": expected_answers[i],
        }
    )

pprint(eval_rows)
```

</TabItem>
<TabItem value="scoring" label="Scoring & Evaluation">

```python
# Configure scoring parameters
scoring_params = {
    "basic::subset_of": None,  # check if the expected answer is contained in the generated answer
}

# Run evaluation using Llama Stack's scoring API
scoring_response = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions=scoring_params,
)

pprint(scoring_response)

# Analyze results: each scoring function reports one score row per input row
for fn_name, result in scoring_response.results.items():
    print(f"Scoring function: {fn_name}")
    for i, row in enumerate(result.score_rows):
        print(f"  Query {i + 1}")
        print(f"    Generated: {eval_rows[i]['generated_answer'][:100]}...")
        print(f"    Expected:  {expected_answers[i]}")
        print(f"    Score:     {row['score']}")
        print()
```

</TabItem>
</Tabs>

## Available Scoring Functions

Llama Stack provides several built-in scoring functions:

### Basic Scoring Functions
- **`basic::subset_of`**: Checks if the expected answer is contained in the generated response
- **`basic::exact_match`**: Performs exact string matching between expected and generated answers
- **`basic::regex_match`**: Uses regular expressions to match patterns in responses
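
To see how the basic scorers differ, you can score the same rows with several of them in one call. The sketch below reuses the `eval_rows` built in Step 3 and passes `None` to use each function's default parameters, following the convention shown earlier; the result handling assumes each scoring function returns one score row per input row.

```python
# Compare two basic scoring functions on the same evaluation rows
basic_comparison = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions={
        "basic::subset_of": None,    # expected answer contained in the generated answer
        "basic::exact_match": None,  # generated answer must match the expected answer exactly
    },
)

# Each scoring function reports its own per-row scores
for fn_name, result in basic_comparison.results.items():
    print(fn_name, [row["score"] for row in result.score_rows])
```
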
### Advanced Scoring Functions
- **`llm_as_judge::accuracy`**: Uses an LLM to judge response accuracy
- **`llm_as_judge::helpfulness`**: Evaluates how helpful the response is
- **`llm_as_judge::safety`**: Assesses response safety and appropriateness
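
The LLM-as-judge scorers need a judge model. Following the parameter convention used in the batch evaluation example later in this guide (exact parameter names may vary by provider), a sketch looks like:

```python
# Use an LLM judge to grade each generated answer against the expected answer.
# The judge model parameter follows the convention used in the batch example below.
judge_response = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions={
        "llm_as_judge::accuracy": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
    },
)
pprint(judge_response)
```
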
### Custom Scoring Functions
You can also create custom scoring functions for domain-specific evaluation needs.
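
To see which scoring functions, built-in or custom, are actually registered with your running distribution, you can list them through the client. This assumes the `scoring_functions` API is enabled in your stack configuration:

```python
# List the scoring functions registered with the running distribution
# (assumes the scoring_functions API is enabled in your stack configuration)
for scoring_fn in client.scoring_functions.list():
    print(scoring_fn.identifier, "-", scoring_fn.description)
```
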
## Evaluation Workflow Best Practices

### 🎯 **Dataset Preparation**
- Use diverse test cases that cover edge cases and common scenarios
- Include clear expected answers or success criteria
- Balance your dataset across different difficulty levels

### 📊 **Metrics Selection**
- Choose appropriate scoring functions for your use case
- Combine multiple metrics for comprehensive evaluation
- Consider both automated and human evaluation metrics

### 🔄 **Iterative Improvement**
- Run evaluations regularly during development
- Use evaluation results to identify areas for improvement
- Track performance changes over time

### 📈 **Analysis & Reporting**
- Analyze failures to understand model limitations
- Generate comprehensive evaluation reports
- Share results with stakeholders for informed decision-making
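
As a concrete starting point for the tracking and reporting practices above, the sketch below averages the per-row scores from the scoring response in Step 3 and appends them, with a timestamp, to a JSONL history file. The file name and record layout are illustrative, not part of Llama Stack.

```python
import json
import time

# Average each scoring function's per-row scores and append them to a run history
# (file name and record layout are illustrative)
record = {
    "timestamp": time.time(),
    "average_scores": {
        fn_name: sum(row["score"] for row in result.score_rows) / len(result.score_rows)
        for fn_name, result in scoring_response.results.items()
    },
}
with open("eval_history.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```
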
## Advanced Evaluation Scenarios

### Batch Evaluation
For evaluating large datasets efficiently:

```python
# Prepare a large evaluation dataset
# (queries, generated_answers, and expected_answers are your own pre-collected lists)
large_eval_dataset = [
    {
        "input_query": query,
        "generated_answer": generated,
        "expected_answer": expected,
    }
    for query, generated, expected in zip(queries, generated_answers, expected_answers)
]

# Run batch evaluation across multiple scoring functions
batch_results = client.scoring.score(
    input_rows=large_eval_dataset,
    scoring_functions={
        "basic::subset_of": None,
        "llm_as_judge::accuracy": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
    },
)
```
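
If the dataset is too large to send in a single request, one simple approach is to score it in chunks and concatenate the per-row results client-side. The chunk size below is arbitrary; tune it for your deployment.

```python
# Score a large dataset in chunks and collect per-row results client-side
chunk_size = 100  # arbitrary; tune for your deployment
collected_rows: dict[str, list] = {}

for start in range(0, len(large_eval_dataset), chunk_size):
    chunk = large_eval_dataset[start : start + chunk_size]
    chunk_response = client.scoring.score(
        input_rows=chunk,
        scoring_functions={"basic::subset_of": None},
    )
    for fn_name, result in chunk_response.results.items():
        collected_rows.setdefault(fn_name, []).extend(result.score_rows)

print({fn_name: len(rows) for fn_name, rows in collected_rows.items()})
```
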

### Multi-Metric Evaluation
Combining different scoring approaches:

```python
# Map each scoring function id to its parameters (None = defaults),
# following the same convention as the earlier examples
comprehensive_scoring = {
    "basic::exact_match": None,
    "basic::subset_of": None,
    "llm_as_judge::accuracy": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
    "llm_as_judge::safety": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
}

results = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions=comprehensive_scoring,
)
```

## Related Resources

- **[Agents](./agent)** - Building agents for evaluation
- **[Tools Integration](./tools)** - Using tools in evaluated agents
- **[Evaluation Reference](/docs/references/evals-reference)** - Complete API reference for evaluations
- **[Getting Started Notebook](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)** - Interactive examples
- **[Evaluation Examples](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing)** - Additional evaluation scenarios