docs: update eval doc (#1453)

# What does this PR do?
- Update eval doc to reflect latest changes
- Closes https://github.com/meta-llama/llama-stack/issues/1441

## Test Plan
Read-through of the updated docs.

Author: Xi Yan · 2025-03-06 14:14:10 -08:00 · committed by GitHub
Parent: db4ee7a9ff · Commit: 564977c646
4 changed files with 140 additions and 244 deletions


@@ -1,169 +1,128 @@

**Removed — the previous "Evals" open-benchmark walkthrough:**

# Evals

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing)

Llama Stack provides the building blocks needed to run benchmark and application evaluations. This guide will walk you through how to use these components to run open benchmark evaluations. Visit our [Evaluation Concepts](../concepts/evaluation_concepts.md) guide for more details on how evaluations work in Llama Stack, and our [Evaluation Reference](../references/evals_reference/index.md) guide for a comprehensive reference on the APIs.

### 1. Open Benchmark Model Evaluation

This first example walks you through how to evaluate a model candidate served by Llama Stack on open benchmarks. We will use the following benchmarks:
- [MMMU](https://arxiv.org/abs/2311.16502) (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI): a benchmark designed to evaluate multimodal models.
- [SimpleQA](https://openai.com/index/introducing-simpleqa/): a benchmark designed to assess a model's ability to answer short, fact-seeking questions.

#### 1.1 Running MMMU
- We will use a pre-processed MMMU dataset from [llamastack/mmmu](https://huggingface.co/datasets/llamastack/mmmu). The preprocessing code is shown in this [GitHub Gist](https://gist.github.com/yanxi0830/118e9c560227d27132a7fd10e2c92840). The dataset is obtained by transforming the original [MMMU/MMMU](https://huggingface.co/datasets/MMMU/MMMU) dataset into the format expected by the `inference/chat-completion` API.
```python
import datasets

ds = datasets.load_dataset(path="llamastack/mmmu", name="Agriculture", split="dev")
ds = ds.select_columns(["chat_completion_input", "input_query", "expected_answer"])
eval_rows = ds.to_pandas().to_dict(orient="records")
```

- Next, we will run an evaluation on a model candidate. We will need to:
  - Define a system prompt
  - Define an EvalCandidate
  - Run evaluate on the dataset

```python
SYSTEM_PROMPT_TEMPLATE = """
You are an expert in Agriculture whose job is to answer questions from the user using images.
First, reason about the correct answer.
Then write the answer in the following format where X is exactly one of A,B,C,D:
Answer: X
Make sure X is one of A,B,C,D.
If you are uncertain of the correct answer, guess the most likely one.
"""

system_message = {
    "role": "system",
    "content": SYSTEM_PROMPT_TEMPLATE,
}

client.benchmarks.register(
    benchmark_id="meta-reference::mmmu",
    dataset_id=f"mmmu-{subset}-{split}",
    scoring_functions=["basic::regex_parser_multiple_choice_answer"],
)

response = client.eval.evaluate_rows(
    benchmark_id="meta-reference::mmmu",
    input_rows=eval_rows,
    scoring_functions=["basic::regex_parser_multiple_choice_answer"],
    benchmark_config={
        "type": "benchmark",
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
            "sampling_params": {
                "strategy": {
                    "type": "greedy",
                },
                "max_tokens": 4096,
                "repeat_penalty": 1.0,
            },
            "system_message": system_message,
        },
    },
)
```

#### 1.2 Running SimpleQA
- We will use a pre-processed SimpleQA dataset from [llamastack/evals](https://huggingface.co/datasets/llamastack/evals/viewer/evals__simpleqa), obtained by transforming the input queries into the format accepted by the `inference/chat-completion` API.
- Since we will use this same dataset in the next example for agentic evaluation, we register it using the `/datasets` API and interact with it through the `/datasetio` API.

```python
simpleqa_dataset_id = "huggingface::simpleqa"

_ = client.datasets.register(
    dataset_id=simpleqa_dataset_id,
    provider_id="huggingface",
    url={"uri": "https://huggingface.co/datasets/llamastack/evals"},
    metadata={
        "path": "llamastack/evals",
        "name": "evals__simpleqa",
        "split": "train",
    },
    dataset_schema={
        "input_query": {"type": "string"},
        "expected_answer": {"type": "string"},
        "chat_completion_input": {"type": "chat_completion_input"},
    },
)

eval_rows = client.datasetio.get_rows_paginated(
    dataset_id=simpleqa_dataset_id,
    rows_in_page=5,
)
```

```python
client.benchmarks.register(
    benchmark_id="meta-reference::simpleqa",
    dataset_id=simpleqa_dataset_id,
    scoring_functions=["llm-as-judge::405b-simpleqa"],
)

response = client.eval.evaluate_rows(
    benchmark_id="meta-reference::simpleqa",
    input_rows=eval_rows.rows,
    scoring_functions=["llm-as-judge::405b-simpleqa"],
    benchmark_config={
        "type": "benchmark",
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
            "sampling_params": {
                "strategy": {
                    "type": "greedy",
                },
                "max_tokens": 4096,
                "repeat_penalty": 1.0,
            },
        },
    },
)
```

### 2. Agentic Evaluation
- In this example, we demonstrate how to evaluate an agent candidate served by Llama Stack via the `/agents` API.
- We continue to use the SimpleQA dataset from the previous example.
- Instead of running the evaluation on a model, we run it on a search agent with access to a search tool. We define our agent evaluation candidate through `AgentConfig`.

```python
agent_config = {
    "model": "meta-llama/Llama-3.1-405B-Instruct",
    "instructions": "You are a helpful assistant",
    "sampling_params": {
        "strategy": {
            "type": "greedy",
        },
    },
    "tools": [
        {
            "type": "brave_search",
            "engine": "tavily",
            "api_key": userdata.get("TAVILY_SEARCH_API_KEY"),
        }
    ],
    "tool_choice": "auto",
    "input_shields": [],
    "output_shields": [],
    "enable_session_persistence": False,
}

response = client.eval.evaluate_rows(
    benchmark_id="meta-reference::simpleqa",
    input_rows=eval_rows.rows,
    scoring_functions=["llm-as-judge::405b-simpleqa"],
    benchmark_config={
        "type": "benchmark",
        "eval_candidate": {
            "type": "agent",
            "config": agent_config,
        },
    },
)
```

**Added — the new "Evaluations" application-evaluation guide:**

# Evaluations

Llama Stack provides a set of APIs for running evaluations of LLM applications:
- `/datasetio` + `/datasets` API
- `/scoring` + `/scoring_functions` API
- `/eval` + `/benchmarks` API

This guide walks you through the process of evaluating an LLM application built using Llama Stack. The [Evaluation Reference](../references/evals_reference/index.md) guide goes over the full set of APIs and the developer experience of using Llama Stack to run evaluations for both benchmark and application use cases. Check out our Colab notebook of working evaluation examples [here](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing).

## Application Evaluation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)

Llama Stack offers a library of scoring functions and the `/scoring` API, allowing you to run evaluations on your pre-annotated AI application datasets.

In this example, we will show you how to:
1. Build an Agent with Llama Stack
2. Query the agent's sessions, turns, and steps
3. Evaluate the results

##### Building a Search Agent

```python
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig

agent_config = AgentConfig(
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a helpful assistant. Use the search tool to answer the questions.",
    toolgroups=["builtin::websearch"],
    input_shields=[],
    output_shields=[],
    enable_session_persistence=False,
)

agent = Agent(client, agent_config)
user_prompts = [
    "Which teams played in the NBA western conference finals of 2024. Search the web for the answer.",
    "In which episode and season of South Park does Bill Cosby (BSM-471) first appear? Give me the number and title. Search the web for the answer.",
    "What is the British-American kickboxer Andrew Tate's kickboxing name? Search the web for the answer.",
]

session_id = agent.create_session("test-session")

for prompt in user_prompts:
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )

    for log in EventLogger().log(response):
        log.print()
```

##### Query Agent Execution Steps

Now, let's look deeper into the agent's execution steps and see how well our agent performs.

```python
# query the agent's session
from rich.pretty import pprint

session_response = client.agents.session.retrieve(
    session_id=session_id,
    agent_id=agent.agent_id,
)

pprint(session_response)
```

As a sanity check, we will first check whether each user prompt is followed by a tool call to `brave_search`.

```python
num_tool_call = 0
for turn in session_response.turns:
    for step in turn.steps:
        if (
            step.step_type == "tool_execution"
            and step.tool_calls[0].tool_name == "brave_search"
        ):
            num_tool_call += 1

print(
    f"{num_tool_call}/{len(session_response.turns)} user prompts are followed by a tool call to `brave_search`"
)
```
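A small extension of the sanity check above: since it only counts matches, the following sketch (plain Python over the same `session_response` fields used above) also prints any prompt whose turn never called the search tool, which makes failures easier to spot.

```python
# List the prompts whose turn did not include a web-search tool call,
# using the same turn/step fields as the check above.
missing = []
for turn in session_response.turns:
    called_search = any(
        step.step_type == "tool_execution"
        and any(tc.tool_name == "brave_search" for tc in step.tool_calls)
        for step in turn.steps
    )
    if not called_search:
        missing.append(turn.input_messages[0].content)

if missing:
    print("Prompts with no search tool call:")
    for prompt in missing:
        print(f"  - {prompt}")
else:
    print("All prompts triggered a search tool call.")
```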
##### Evaluate Agent Responses
Now, we want to evaluate the agent's responses to the user prompts.
1. First, we will process the agent's execution history into a list of rows that can be used for evaluation.
2. Next, we will label the rows with the expected answer.
3. Finally, we will use the `/scoring` API to score the agent's responses.
```python
eval_rows = []
expected_answers = [
"Dallas Mavericks and the Minnesota Timberwolves",
"Season 4, Episode 12",
"King Cobra",
]
for i, turn in enumerate(session_response.turns):
eval_rows.append(
{
"input_query": turn.input_messages[0].content,
"generated_answer": turn.output_message.content,
"expected_answer": expected_answers[i],
}
)
pprint(eval_rows)
scoring_params = {
"basic::subset_of": None,
}
scoring_response = client.scoring.score(
input_rows=eval_rows, scoring_functions=scoring_params
)
pprint(scoring_response)
```
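As a follow-up to the scoring call above, here is a quick sketch of turning the response into a single number. The `results` / `score_rows` attribute names are assumptions about the Scoring API response shape, so verify them against your client version.

```python
# Rough sketch: fraction of agent answers that contained the expected
# answer, per the basic::subset_of scoring function. The `results` and
# `score_rows` field names are assumptions -- check the score response
# type in your llama-stack-client version.
subset_result = scoring_response.results["basic::subset_of"]
scores = [row["score"] for row in subset_result.score_rows]
print(f"subset_of accuracy: {sum(scores)}/{len(scores)}")
```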


@@ -1,30 +0,0 @@

**Removed — this 30-line "Testing & Evaluation" quick-start section is deleted entirely:**
## Testing & Evaluation
Llama Stack provides built-in tools for evaluating your applications:
1. **Benchmarking**: Test against standard datasets
2. **Application Evaluation**: Score your application's outputs
3. **Custom Metrics**: Define your own evaluation criteria
Here's how to set up basic evaluation:
```python
# Create an evaluation task
response = client.benchmarks.register(
benchmark_id="my_eval",
dataset_id="my_dataset",
scoring_functions=["accuracy", "relevance"],
)
# Run evaluation
job = client.eval.run_eval(
benchmark_id="my_eval",
benchmark_config={
"type": "app",
"eval_candidate": {"type": "agent", "config": agent_config},
},
)
# Get results
result = client.eval.job_result(benchmark_id="my_eval", job_id=job.job_id)
```
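One gap in the removed snippet is that `run_eval` returns a job that may still be running when `job_result` is called. Below is a hedged sketch of waiting for completion first; the status call (`client.eval.job_status`) and the status values shown here are assumptions rather than confirmed API, so check the Eval API reference for your client version.

```python
import time

# Assumed polling helper -- `client.eval.job_status` and the
# "scheduled"/"in_progress" status values are assumptions, not
# confirmed API for your client version.
while True:
    status = client.eval.job_status(benchmark_id="my_eval", job_id=job.job_id)
    if status not in ("scheduled", "in_progress"):
        break
    time.sleep(5)

result = client.eval.job_result(benchmark_id="my_eval", job_id=job.job_id)
```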


@@ -24,17 +24,8 @@ The Evaluation APIs are associated with a set of Resources as shown in the following
- Associated with `Benchmark` resource.

**Removed:**
Use the following decision tree to decide how to use LlamaStack Evaluation flow.
![Eval Flow](../references/evals_reference/resources/eval-flow.png)
```{admonition} Note on Benchmark v.s. Application Evaluation
:class: tip
- **Benchmark Evaluation** is a well-defined eval-task consisting of `dataset` and `scoring_function`. The generation (inference or agent) will be done as part of evaluation.
- **Application Evaluation** assumes users already have app inputs & generated outputs. Evaluation will purely focus on scoring the generated outputs via scoring functions (e.g. LLM-as-judge).
```
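The removed note above draws the benchmark-vs-application distinction. As a rough sketch in terms of calls that appear elsewhere in this PR (row contents abbreviated, `client` and `eval_rows` taken from the walkthroughs), the difference is whether generation happens inside the eval or has already happened in your app.

```python
# Benchmark evaluation: generation + scoring both happen inside /eval.
# The eval_candidate (a model here) produces an answer for each row.
response = client.eval.evaluate_rows(
    benchmark_id="meta-reference::mmmu",
    input_rows=eval_rows,  # rows contain inputs + expected answers only
    scoring_functions=["basic::regex_parser_multiple_choice_answer"],
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
            "sampling_params": {"strategy": {"type": "greedy"}},
        },
    },
)

# Application evaluation: your app already produced the answers, so
# /scoring only judges the pre-generated outputs.
scoring_response = client.scoring.score(
    input_rows=[
        {
            "input_query": "...",
            "generated_answer": "...",  # produced by your app/agent
            "expected_answer": "...",
        }
    ],
    scoring_functions={"basic::subset_of": None},
)
```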
**Updated — "What's Next" now points to the benchmark evals notebook:**

## What's Next?

- Check out our Colab notebook on working examples with running benchmark evaluations [here](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb#scrollTo=mxLCsP4MvFqP) (previously the general evaluations notebook at https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing).
- Check out our [Building Applications - Evaluation](../building_applications/evals.md) guide for more details on how to use the Evaluation APIs to evaluate your applications.
- Check out our [Evaluation Reference](../references/evals_reference/index.md) for more details on the APIs.


@@ -24,19 +24,9 @@ The Evaluation APIs are associated with a set of Resources as shown in the following
- Associated with `Benchmark` resource.

**Removed:**
Use the following decision tree to decide how to use LlamaStack Evaluation flow.
![Eval Flow](./resources/eval-flow.png)
```{admonition} Note on Benchmark v.s. Application Evaluation
:class: tip
- **Benchmark Evaluation** is a well-defined eval-task consisting of `dataset` and `scoring_function`. The generation (inference or agent) will be done as part of evaluation.
- **Application Evaluation** assumes users already have app inputs & generated outputs. Evaluation will purely focus on scoring the generated outputs via scoring functions (e.g. LLM-as-judge).
```
**Updated:**

## Evaluation Examples Walkthrough

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb)

It is best to open this notebook in Colab to follow along with the examples.
@@ -63,20 +53,29 @@ eval_rows = ds.to_pandas().to_dict(orient="records")

- Run evaluate on the dataset

The system prompt now templates the subject via `{subject}` instead of hard-coding "Agriculture", and `pprint`/`tqdm` imports are added:

```python
from rich.pretty import pprint
from tqdm import tqdm

SYSTEM_PROMPT_TEMPLATE = """
You are an expert in {subject} whose job is to answer questions from the user using images.
First, reason about the correct answer.
Then write the answer in the following format where X is exactly one of A,B,C,D:
Answer: X
Make sure X is one of A,B,C,D.
If you are uncertain of the correct answer, guess the most likely one.
"""

system_message = {
    "role": "system",
    "content": SYSTEM_PROMPT_TEMPLATE.format(subject=subset),
}

# register the evaluation benchmark task with the dataset and scoring function
client.benchmarks.register(
    benchmark_id="meta-reference::mmmu",
    dataset_id=f"mmmu-{subset}-{split}",
```
@@ -88,13 +87,14 @@ response = client.eval.evaluate_rows(

The `"type": "benchmark"` field is dropped from `benchmark_config`, and the sampling strategy switches from `greedy` to `top_p`:

```python
    input_rows=eval_rows,
    scoring_functions=["basic::regex_parser_multiple_choice_answer"],
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
            "sampling_params": {
                "strategy": {
                    "type": "top_p",
                    "temperature": 1.0,
                    "top_p": 0.95,
                },
                "max_tokens": 4096,
                "repeat_penalty": 1.0,
```

@@ -103,6 +103,7 @@ response = client.eval.evaluate_rows(

The call now prints its response:

```python
        },
    },
)
pprint(response)
```
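Since the updated snippet now pretty-prints the response, a small follow-up sketch for pulling an aggregate accuracy out of it may be useful. The `scores`, `score_rows`, and `aggregated_results` field names here are assumptions about the evaluate response shape, so double-check them against your client version.

```python
# Rough sketch: compute accuracy from the evaluate_rows response.
# Assumes the response carries per-scoring-function results with
# binary "score" values in score_rows -- verify the exact field
# names against your llama-stack-client version.
scoring_fn = "basic::regex_parser_multiple_choice_answer"
mmmu_result = response.scores[scoring_fn]

rows = mmmu_result.score_rows
accuracy = sum(row["score"] for row in rows) / len(rows)
print(f"MMMU ({subset}) accuracy over {len(rows)} rows: {accuracy:.2%}")

# The provider may also return pre-computed aggregates.
print(mmmu_result.aggregated_results)
```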
#### 1.2. Running SimpleQA
@@ -115,10 +116,9 @@ simpleqa_dataset_id = "huggingface::simpleqa"

The dataset registration now points at `llamastack/simpleqa` (previously `llamastack/evals` with `"name": "evals__simpleqa"`):

```python
_ = client.datasets.register(
    dataset_id=simpleqa_dataset_id,
    provider_id="huggingface",
    url={"uri": "https://huggingface.co/datasets/llamastack/simpleqa"},
    metadata={
        "path": "llamastack/simpleqa",
        "split": "train",
    },
    dataset_schema={
```
@@ -146,7 +146,6 @@ response = client.eval.evaluate_rows(

As with the MMMU call, `"type": "benchmark"` is dropped from `benchmark_config`:

```python
    input_rows=eval_rows.rows,
    scoring_functions=["llm-as-judge::405b-simpleqa"],
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
```

@@ -160,6 +159,7 @@ response = client.eval.evaluate_rows(

```python
        },
    },
)
pprint(response)
```
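The walkthrough scores a five-row sample inline with `evaluate_rows`; for a full-dataset run you would typically kick off a job instead. Below is a hedged sketch using the `run_eval` / `job_result` calls that appear elsewhere in this PR; treat the exact signatures and job-handling behavior as assumptions and check the Eval API reference for your client version.

```python
# Sketch: run the full SimpleQA benchmark as a job instead of scoring
# a 5-row sample inline. `run_eval` and `job_result` appear in the
# removed quick-start snippet above; signatures are assumptions.
job = client.eval.run_eval(
    benchmark_id="meta-reference::simpleqa",
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
            "sampling_params": {
                "strategy": {"type": "top_p", "temperature": 1.0, "top_p": 0.95},
                "max_tokens": 4096,
            },
        },
    },
)

result = client.eval.job_result(
    benchmark_id="meta-reference::simpleqa", job_id=job.job_id
)
pprint(result)
```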
@@ -170,19 +170,17 @@ response = client.eval.evaluate_rows(

The agent candidate now uses `meta-llama/Llama-3.3-70B-Instruct` with a `top_p` sampling strategy and the `builtin::websearch` toolgroup, replacing the previous `Llama-3.1-405B-Instruct` / `greedy` / `brave_search` (tavily) tool configuration:

```python
agent_config = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "instructions": "You are a helpful assistant that has access to a tool to search the web.",
    "sampling_params": {
        "strategy": {
            "type": "top_p",
            "temperature": 0.5,
            "top_p": 0.9,
        }
    },
    "toolgroups": [
        "builtin::websearch",
    ],
    "tool_choice": "auto",
    "tool_prompt_format": "json",
```
@@ -196,24 +194,21 @@ response = client.eval.evaluate_rows(

Again, `"type": "benchmark"` is dropped and the response is printed:

```python
    input_rows=eval_rows.rows,
    scoring_functions=["llm-as-judge::405b-simpleqa"],
    benchmark_config={
        "eval_candidate": {
            "type": "agent",
            "config": agent_config,
        },
    },
)
pprint(response)
```
### 3. Agentic Application Dataset Scoring

**Removed** — the bullet-point framing and the list of example scoring functions:
- `llm-as-judge::base`: LLM-As-Judge with a custom judge prompt & model.
- `braintrust::factuality`: Factuality scorer from [braintrust](https://github.com/braintrustdata/autoevals).
- `basic::subset_of`: Basic check that the generated answer is a subset of the expected answer.

**Updated:**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)

Llama Stack offers a library of scoring functions and the `/scoring` API, allowing you to run evaluations on your pre-annotated AI application datasets.

In this example, we will work with an example RAG dataset you have built previously, label it with an annotation, and use LLM-As-Judge with a custom judge prompt for scoring. Please check out our [Llama Stack Playground](https://llama-stack.readthedocs.io/en/latest/playground/index.html) for an interactive interface to upload datasets and run scorings.

```python
judge_model_id = "meta-llama/Llama-3.1-405B-Instruct-FP8"
```
@@ -317,28 +312,9 @@ The `BenchmarkConfig` is a user-specified config to define:

2. Optionally, scoring function params to allow customization of scoring function behaviour. This is useful to parameterize generic scoring functions such as LLMAsJudge with a custom `judge_model` / `judge_prompt` (see the sketch below).
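Item 2 above is easiest to see with a concrete sketch: passing judge parameters for a generic `llm-as-judge::base` scorer when calling `evaluate_rows`. The `scoring_params` key and the parameter field names (`judge_model`, `prompt_template`, `judge_score_regexes`) are assumptions about the scoring-function params schema, so verify them against the Scoring API reference for your distribution.

```python
# Sketch: parameterizing a generic LLM-as-judge scoring function.
# Field names below are assumptions -- verify against the
# scoring_functions reference for your distribution.
JUDGE_PROMPT = """
You are judging whether the generated answer matches the expected answer.
Expected: {expected_answer}
Generated: {generated_answer}
Reply with 'Answer: A' if correct and 'Answer: B' if incorrect.
"""

response = client.eval.evaluate_rows(
    benchmark_id="meta-reference::simpleqa",
    input_rows=eval_rows.rows,
    scoring_functions=["llm-as-judge::base"],
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
            "sampling_params": {"strategy": {"type": "greedy"}},
        },
        "scoring_params": {
            "llm-as-judge::base": {
                "type": "llm_as_judge",
                "judge_model": "meta-llama/Llama-3.1-405B-Instruct-FP8",
                "prompt_template": JUDGE_PROMPT,
                "judge_score_regexes": ["Answer: (A|B)"],
            }
        },
    },
)
```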
**Removed — the two typed examples, along with the `"type"` field:**

**Example Benchmark BenchmarkConfig**
```json
{
    "type": "benchmark",
    "eval_candidate": {
        "type": "model",
        "model": "Llama3.2-3B-Instruct",
        "sampling_params": {
            "strategy": {
                "type": "greedy"
            },
            "max_tokens": 0,
            "repetition_penalty": 1.0
        }
    }
}
```

**Example Application BenchmarkConfig** — this heading and its `"type": "app"` field were removed; the rest of that example is what remains below.

**Updated — a single example remains:**

**Example BenchmarkConfig**
```json
{
    "eval_candidate": {
        "type": "model",
        "model": "Llama3.1-405B-Instruct",