Updates to ReadTheDocs (#859)

Move evals section to AI Agents section drop from top level and other minor fixes
2025-01-23 12:42:15 -08:00 · 2025-01-23 12:42:15 -08:00 · a6a4270eef
commit a6a4270eef
parent d78027f3b5
6 changed files with 18 additions and 70 deletions
--- a/docs/source/building_applications/agent_execution_loop.md
+++ b/docs/source/building_applications/agent_execution_loop.md
@ -1,4 +1,4 @@
-# Agent Execution Loop
+## Agent Execution Loop

 Agents are the heart of complex AI applications. They combine inference, memory, safety, and tool usage into coherent workflows. At its core, an agent follows a sophisticated execution loop that enables multi-step reasoning, tool usage, and safety checks.

--- a/docs/source/building_applications/evals.md
+++ b/docs/source/building_applications/evals.md
@ -0,0 +1,169 @@
+# Evals
+
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing)
+
+Llama Stack provides the building blocks needed to run benchmark and application evaluations. This guide will walk you through how to use these components to run open benchmark evaluations. Visit our [Evaluation Concepts](../concepts/evaluation_concepts.md) guide for more details on how evaluations work in Llama Stack, and our [Evaluation Reference](../references/evals_reference/index.md) guide for a comprehensive reference on the APIs.
+
+### 1. Open Benchmark Model Evaluation
+
+This first example walks you through how to evaluate a model candidate served by Llama Stack on open benchmarks. We will use the following benchmark:
+- [MMMU](https://arxiv.org/abs/2311.16502) (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI): Benchmark designed to evaluate multimodal models.
+- [SimpleQA](https://openai.com/index/introducing-simpleqa/): Benchmark designed to access models to answer short, fact-seeking questions.
+
+#### 1.1 Running MMMU
+- We will use a pre-processed MMMU dataset from [llamastack/mmmu](https://huggingface.co/datasets/llamastack/mmmu). The preprocessing code is shown in in this [Github Gist](https://gist.github.com/yanxi0830/118e9c560227d27132a7fd10e2c92840). The dataset is obtained by transforming the original [MMMU/MMMU](https://huggingface.co/datasets/MMMU/MMMU) dataset into correct format by `inference/chat-completion` API.
+
+```python
+import datasets
+ds = datasets.load_dataset(path="llamastack/mmmu", name="Agriculture", split="dev")
+ds = ds.select_columns(["chat_completion_input", "input_query", "expected_answer"])
+eval_rows = ds.to_pandas().to_dict(orient="records")
+```
+
+- Next, we will run evaluation on an model candidate, we will need to:
+  - Define a system prompt
+  - Define an EvalCandidate
+  - Run evaluate on the dataset
+
+```python
+SYSTEM_PROMPT_TEMPLATE = """
+You are an expert in Agriculture whose job is to answer questions from the user using images.
+First, reason about the correct answer.
+Then write the answer in the following format where X is exactly one of A,B,C,D:
+Answer: X
+Make sure X is one of A,B,C,D.
+If you are uncertain of the correct answer, guess the most likely one.
+"""
+
+system_message = {
+    "role": "system",
+    "content": SYSTEM_PROMPT_TEMPLATE,
+}
+
+client.eval_tasks.register(
+    eval_task_id="meta-reference::mmmu",
+    dataset_id=f"mmmu-{subset}-{split}",
+    scoring_functions=["basic::regex_parser_multiple_choice_answer"]
+)
+
+response = client.eval.evaluate_rows(
+    task_id="meta-reference::mmmu",
+    input_rows=eval_rows,
+    scoring_functions=["basic::regex_parser_multiple_choice_answer"],
+    task_config={
+        "type": "benchmark",
+        "eval_candidate": {
+            "type": "model",
+            "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
+            "sampling_params": {
+                "strategy": {
+                    "type": "greedy",
+                },
+                "max_tokens": 4096,
+                "repeat_penalty": 1.0,
+            },
+            "system_message": system_message
+        }
+    }
+)
+```
+
+#### 1.2. Running SimpleQA
+- We will use a pre-processed SimpleQA dataset from [llamastack/evals](https://huggingface.co/datasets/llamastack/evals/viewer/evals__simpleqa) which is obtained by transforming the input query into correct format accepted by `inference/chat-completion` API.
+- Since we will be using this same dataset in our next example for Agentic evaluation, we will register it using the `/datasets` API, and interact with it through `/datasetio` API.
+
+```python
+simpleqa_dataset_id = "huggingface::simpleqa"
+
+_ = client.datasets.register(
+    dataset_id=simpleqa_dataset_id,
+    provider_id="huggingface",
+    url={"uri": "https://huggingface.co/datasets/llamastack/evals"},
+    metadata={
+        "path": "llamastack/evals",
+        "name": "evals__simpleqa",
+        "split": "train",
+    },
+    dataset_schema={
+        "input_query": {"type": "string"},
+        "expected_answer": {"type": "string"},
+        "chat_completion_input": {"type": "chat_completion_input"},
+    }
+)
+
+eval_rows = client.datasetio.get_rows_paginated(
+    dataset_id=simpleqa_dataset_id,
+    rows_in_page=5,
+)
+```
+
+```python
+client.eval_tasks.register(
+    eval_task_id="meta-reference::simpleqa",
+    dataset_id=simpleqa_dataset_id,
+    scoring_functions=["llm-as-judge::405b-simpleqa"]
+)
+
+response = client.eval.evaluate_rows(
+    task_id="meta-reference::simpleqa",
+    input_rows=eval_rows.rows,
+    scoring_functions=["llm-as-judge::405b-simpleqa"],
+    task_config={
+        "type": "benchmark",
+        "eval_candidate": {
+            "type": "model",
+            "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
+            "sampling_params": {
+                "strategy": {
+                    "type": "greedy",
+                },
+                "max_tokens": 4096,
+                "repeat_penalty": 1.0,
+            },
+        }
+    }
+)
+```
+
+
+### 2. Agentic Evaluation
+- In this example, we will demonstrate how to evaluate a agent candidate served by Llama Stack via `/agent` API.
+- We will continue to use the SimpleQA dataset we used in previous example.
+- Instead of running evaluation on model, we will run the evaluation on a Search Agent with access to search tool. We will define our agent evaluation candidate through `AgentConfig`.
+
+```python
+agent_config = {
+    "model": "meta-llama/Llama-3.1-405B-Instruct",
+    "instructions": "You are a helpful assistant",
+    "sampling_params": {
+        "strategy": {
+            "type": "greedy",
+        },
+    },
+    "tools": [
+        {
+            "type": "brave_search",
+            "engine": "tavily",
+            "api_key": userdata.get("TAVILY_SEARCH_API_KEY")
+        }
+    ],
+    "tool_choice": "auto",
+    "tool_prompt_format": "json",
+    "input_shields": [],
+    "output_shields": [],
+    "enable_session_persistence": False
+}
+
+response = client.eval.evaluate_rows(
+    task_id="meta-reference::simpleqa",
+    input_rows=eval_rows.rows,
+    scoring_functions=["llm-as-judge::405b-simpleqa"],
+    task_config={
+        "type": "benchmark",
+        "eval_candidate": {
+            "type": "agent",
+            "config": agent_config,
+        }
+    }
+)
+```
--- a/docs/source/building_applications/index.md
+++ b/docs/source/building_applications/index.md
@ -6,12 +6,14 @@ The best way to get started is to look at this notebook which walks through the

 **Notebook**: [Building AI Applications](docs/notebooks/Llama_Stack_Building_AI_Applications.ipynb)

-## Agentic Concepts
+Here are some key topics that will help you build effective agents:
+
 - **[Agent Execution Loop](agent_execution_loop)**
 - **[RAG](rag)**
 - **[Safety](safety)**
 - **[Tools](tools)**
 - **[Telemetry](telemetry)**
+- **[Evals](evals)**


 ```{toctree}
@ -23,4 +25,5 @@ rag
 safety
 tools
 telemetry
+evals
 ```
--- a/docs/source/building_applications/telemetry.md
+++ b/docs/source/building_applications/telemetry.md
@ -1,11 +1,10 @@
-# Telemetry
-
+## Telemetry

 The Llama Stack telemetry system provides comprehensive tracing, metrics, and logging capabilities. It supports multiple sink types including OpenTelemetry, SQLite, and Console output.

-## Key Concepts
+#### Key Concepts

-### Events
+#### Events
 The telemetry system supports three main types of events:

 - **Unstructured Log Events**: Free-form log messages with severity levels
@ -31,24 +30,24 @@ structured_log_event = SpanStartPayload(
 )
 ```

-### Spans and Traces
+#### Spans and Traces
 - **Spans**: Represent operations with timing and hierarchical relationships
 - **Traces**: Collection of related spans forming a complete request flow

-### Sinks
+#### Sinks
 - **OpenTelemetry**: Send events to an OpenTelemetry Collector. This is useful for visualizing traces in a tool like Jaeger.
 - **SQLite**: Store events in a local SQLite database. This is needed if you want to query the events later through the Llama Stack API.
 - **Console**: Print events to the console.

-## Providers
+#### Providers

-### Meta-Reference Provider
+#### Meta-Reference Provider
 Currently, only the meta-reference provider is implemented. It can be configured to send events to three sink types:
 1) OpenTelemetry Collector
 2) SQLite
 3) Console

-## Configuration
+#### Configuration

 Here's an example that sends telemetry signals to all three sink types. Your configuration might use only one.
 ```yaml
@ -61,7 +60,7 @@ Here's an example that sends telemetry signals to all three sink types. Your con
      sqlite_db_path: "/path/to/telemetry.db"
 ```

-## Jaeger to visualize traces
+#### Jaeger to visualize traces

 The `otel` sink works with any service compatible with the OpenTelemetry collector. Let's use Jaeger to visualize this data.

@ -75,6 +74,6 @@ $ docker run --rm --name jaeger \

 Once the Jaeger instance is running, you can visualize traces by navigating to http://localhost:16686/.

-## Querying Traces Stored in SQLIte
+#### Querying Traces Stored in SQLIte

 The `sqlite` sink allows you to query traces without an external system. Here are some example queries. Refer to the notebook at [Llama Stack Building AI Applications](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) for more examples on how to query traces and spaces.