[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb)

# Llama Stack Benchmark Evals

This notebook walks you through the main Llama Stack APIs for running benchmark evaluations, with working examples that show what Llama Stack makes possible.

Read more about Llama Stack: https://llama-stack.readthedocs.io/en/latest/index.html

## 0. Bootstrapping Llama Stack Library

#### 0.1. Prerequisite: Create a TogetherAI account

To run inference for the Llama models, you will need an inference provider. Llama Stack supports a number of inference [providers](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/remote/inference).

In this showcase, we will use [together.ai](https://www.together.ai/) as the inference provider, so first get an API key from Together if you don't have one already.
You can also use Fireworks.ai or even Ollama if you prefer.


> **Note:**  Set the API Key in the Secrets of this notebook as `TOGETHER_API_KEY`
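
Before starting the stack, it can help to confirm that the keys are actually visible to the Python process. The snippet below is a minimal sketch of such a check; the helper name `missing_env_vars` is hypothetical and not part of Llama Stack, and the variable names are the ones this notebook reads.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the subset of `names` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# The key names match the ones used in this notebook.
missing = missing_env_vars(["TOGETHER_API_KEY", "TAVILY_SEARCH_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this before `client.initialize()` surfaces a missing key early instead of at the first inference call.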

In [None]:
# NBVAL_SKIP
!pip install -U llama-stack

In [None]:
# NBVAL_SKIP
!UV_SYSTEM_PYTHON=1 llama stack build --distro together --image-type venv

In [None]:
import os

try:
    from google.colab import userdata
    os.environ['TOGETHER_API_KEY'] = userdata.get('TOGETHER_API_KEY')
    os.environ['TAVILY_SEARCH_API_KEY'] = userdata.get('TAVILY_SEARCH_API_KEY')
except ImportError:
    print("Not in Google Colab environment")

from llama_stack.core.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient("together")
_ = client.initialize()

Not in Google Colab environment




## 1. Open Benchmark Model Evaluation

The first example walks you through evaluating a model candidate served by Llama Stack on open benchmarks. We will use the following benchmarks:

- [MMMU](https://arxiv.org/abs/2311.16502) (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI): Benchmark designed to evaluate multimodal models.
- [SimpleQA](https://openai.com/index/introducing-simpleqa/): Benchmark designed to assess the ability of models to answer short, fact-seeking questions.

#### 1.1. Running MMMU
- We will use a pre-processed MMMU dataset from [llamastack/mmmu](https://huggingface.co/datasets/llamastack/mmmu). The preprocessing code is shown in this [GitHub Gist](https://gist.github.com/yanxi0830/118e9c560227d27132a7fd10e2c92840). The dataset is obtained by transforming the original [MMMU/MMMU](https://huggingface.co/datasets/MMMU/MMMU) dataset into the format accepted by the `inference/chat-completion` API.

In [None]:
name = "llamastack/mmmu"
subset = "Agriculture"
split = "dev"

In [None]:
import datasets

ds = datasets.load_dataset(path=name, name=subset, split=split)
ds = ds.select_columns(["chat_completion_input", "input_query", "expected_answer"])
eval_rows = ds.to_pandas().to_dict(orient="records")
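
Before sending rows to the evaluator, it is worth looking at what one row contains. The sketch below summarizes a single row using the three columns selected above. It assumes that `chat_completion_input` may be stored as a JSON-encoded string of messages and decodes it if so; `preview_row` and `example_row` are hypothetical names introduced only for illustration.

```python
import json


def preview_row(row, max_chars=80):
    """Return a short, printable summary of one evaluation row."""
    messages = row["chat_completion_input"]
    # Assumption: the column may hold a JSON-encoded string; decode it if so.
    if isinstance(messages, str):
        messages = json.loads(messages)
    return {
        "num_messages": len(messages),
        "roles": [m.get("role") for m in messages],
        "input_query": str(row["input_query"])[:max_chars],
        "expected_answer": row["expected_answer"],
    }


# Made-up row for illustration; real rows come from `eval_rows`.
example_row = {
    "chat_completion_input": json.dumps(
        [{"role": "user", "content": "Which option is correct?"}]
    ),
    "input_query": "Which option is correct?",
    "expected_answer": "B",
}
print(preview_row(example_row))
```

In the notebook you could call `preview_row(eval_rows[0])` to inspect a real row.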


- **Run Evaluation on Model Candidate**
  - Define a System Prompt
  - Define an EvalCandidate
  - Run evaluate on datasets

In [None]:
from rich.pretty import pprint
from tqdm import tqdm

SYSTEM_PROMPT_TEMPLATE = """
You are an expert in {subject} whose job is to answer questions from the user using images.

First, reason about the correct answer.

Then write the answer in the following format where X is exactly one of A,B,C,D:

Answer: X

Make sure X is one of A,B,C,D.

If you are uncertain of the correct answer, guess the most likely one.
"""

system_message = {
    "role": "system",
    "content": SYSTEM_PROMPT_TEMPLATE.format(subject=subset),
}

client.benchmarks.register(
    benchmark_id="meta-reference::mmmu",
    # Note: we can use any value as `dataset_id` because we'll be using the `evaluate_rows` API which accepts the
    # `input_rows` argument and does not fetch data from the dataset.
    dataset_id=f"mmmu-{subset}-{split}",
    # Note: for the same reason as above, we can use any value as `scoring_functions`.
    scoring_functions=[],
)

response = client.eval.evaluate_rows(
    benchmark_id="meta-reference::mmmu",
    input_rows=eval_rows,
    # Note: Here we define the actual scoring functions.
    scoring_functions=["basic::regex_parser_multiple_choice_answer"],
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
            "sampling_params": {
                "strategy": {
                    "type": "top_p",
                    "temperature": 1.0,
                    "top_p": 0.95,
                },
                "max_tokens": 4096,
                "repeat_penalty": 1.0,
            },
            "system_message": system_message,
        },
    },
)
pprint(response)


100%|██████████| 5/5 [00:33<00:00,  6.71s/it]
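
To build intuition for what a regex-based multiple-choice scorer does, here is a minimal, standalone sketch. It is not the implementation behind `basic::regex_parser_multiple_choice_answer`; the pattern and the helper names `extract_choice` and `accuracy` are assumptions chosen to match the `Answer: X` format requested in the system prompt.

```python
import re

# Illustrative pattern only; the built-in scoring function may use different rules.
ANSWER_PATTERN = re.compile(r"Answer\s*:\s*([A-D])", re.IGNORECASE)


def extract_choice(generated_text):
    """Return the last 'Answer: X' letter in the text, or None if absent."""
    matches = ANSWER_PATTERN.findall(generated_text)
    return matches[-1].upper() if matches else None


def accuracy(generations, expected_answers):
    """Fraction of generations whose extracted letter equals the expected one."""
    if not generations:
        return 0.0
    correct = sum(
        extract_choice(text) == expected
        for text, expected in zip(generations, expected_answers)
    )
    return correct / len(generations)


print(extract_choice("The leaf shows rust spots.\nAnswer: C"))
```

The sketch takes the last match because the prompt asks the model to reason first and state the answer at the end.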


#### 1.2. Running SimpleQA
- We will use a pre-processed SimpleQA dataset from [llamastack/evals](https://huggingface.co/datasets/llamastack/evals/viewer/evals__simpleqa), which is obtained by transforming the input query into the format accepted by the `inference/chat-completion` API.
- Since we will use this same dataset in the next example for agentic evaluation, we will register it using the `/datasets` API and interact with it through the `/datasetio` API.
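
The transformation mentioned above can be sketched as wrapping each question in a single user message. This is only an illustration: the helper `to_eval_row` is hypothetical, and the column names are assumed to mirror the ones used for MMMU in section 1.1 (`chat_completion_input`, `input_query`, `expected_answer`), with the messages stored as a JSON string.

```python
import json


def to_eval_row(question, answer):
    """Wrap a question/answer pair as a row with chat-completion style messages."""
    messages = [{"role": "user", "content": question}]
    return {
        "input_query": question,
        "expected_answer": answer,
        # Assumption: messages are stored as a JSON string in this column.
        "chat_completion_input": json.dumps(messages),
    }


# Made-up question, for illustration only.
row = to_eval_row("What is the capital of France?", "Paris")
print(row["chat_completion_input"])
```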

In [None]:
simpleqa_dataset_id = "huggingface::simpleqa"

register_dataset_response = client.datasets.register(
    purpose="eval/messages-answer",
    source={
        "type": "uri",
        "uri": "huggingface://datasets/llamastack/simpleqa?split=train",
    },
    dataset_id=simpleqa_dataset_id,
)

In [None]:
eval_rows = client.datasets.iterrows(
    dataset_id=simpleqa_dataset_id,
    limit=5,
)

In [None]:
# register 405B as LLM Judge model
client.models.register(
    model_id="meta-llama/Llama-3.1-405B-Instruct",
    provider_model_id="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    provider_id="together",
)

client.benchmarks.register(
    benchmark_id="meta-reference::simpleqa",
    dataset_id=simpleqa_dataset_id,
    scoring_functions=["llm-as-judge::405b-simpleqa"],
)

response = client.eval.evaluate_rows(
    benchmark_id="meta-reference::simpleqa",
    input_rows=eval_rows.data,
    scoring_functions=["llm-as-judge::405b-simpleqa"],
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
            "sampling_params": {
                "strategy": {
                    "type": "greedy",
                },
                "max_tokens": 4096,
                "repeat_penalty": 1.0,
            },
        },
    },
)
pprint(response)


100%|██████████| 5/5 [00:13<00:00,  2.71s/it]
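
With an LLM judge, each row receives a grade from the judge model rather than an exact-match score. The sketch below shows one way to tally such grades. It assumes the judge replies with a single letter following the SimpleQA grading convention (A = correct, B = incorrect, C = not attempted); the actual output of `llm-as-judge::405b-simpleqa` may be structured differently, and `parse_grade` and `tally_grades` are hypothetical helpers.

```python
import re
from collections import Counter

# Assumption: the judge replies with a single letter, following the SimpleQA
# grading convention (A = correct, B = incorrect, C = not attempted).
GRADE_LABELS = {"A": "correct", "B": "incorrect", "C": "not_attempted"}


def parse_grade(judge_feedback):
    """Return the label for the first standalone A/B/C in the judge's reply."""
    match = re.search(r"\b([ABC])\b", judge_feedback)
    return GRADE_LABELS[match.group(1)] if match else "unparsed"


def tally_grades(judge_feedbacks):
    """Count how many replies fall under each label."""
    return Counter(parse_grade(text) for text in judge_feedbacks)


# Made-up judge replies, for illustration only.
print(tally_grades(["A", "Grade: B", "C", "A"]))
```

Keeping an explicit `unparsed` bucket makes it visible when the judge's reply does not follow the expected format.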


## 2. Agentic Evaluation

- In this example, we will demonstrate how to evaluate an agent candidate served by Llama Stack via the `/agent` API.

- We will continue to use the SimpleQA dataset from the previous example.

- Instead of running the evaluation on a model, we will run it on a Search Agent with access to a search tool. We will define our agent evaluation candidate through `AgentConfig`.

> You will need to set the `TAVILY_SEARCH_API_KEY` in the Secrets of this notebook.

In [None]:
agent_config = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "instructions": "You are a helpful assistant that have access to tool to search the web. ",
    "sampling_params": {
        "strategy": {
            "type": "top_p",
            "temperature": 0.5,
            "top_p": 0.9,
        }
    },
    "toolgroups": [
        "builtin::websearch",
    ],
    "tool_choice": "auto",
    "tool_prompt_format": "json",
    "input_shields": [],
    "output_shields": [],
    "enable_session_persistence": False,
}

response = client.eval.evaluate_rows(
    benchmark_id="meta-reference::simpleqa",
    input_rows=eval_rows.data,
    scoring_functions=["llm-as-judge::405b-simpleqa"],
    benchmark_config={
        "eval_candidate": {
            "type": "agent",
            "config": agent_config,
        },
    },
)
pprint(response)


5it [00:06,  1.33s/it]
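
The model and agent runs differ only in the `eval_candidate` entry of `benchmark_config`. The helpers below are a small sketch that builds the same dictionaries used in this notebook so the two candidate types are easy to compare side by side; the function names are hypothetical and not part of the Llama Stack client.

```python
def model_candidate(model, sampling_params, system_message=None):
    """Build a model-type eval candidate like the ones in sections 1.1 and 1.2."""
    candidate = {
        "type": "model",
        "model": model,
        "sampling_params": sampling_params,
    }
    # The system message is optional: section 1.1 sets one, section 1.2 does not.
    if system_message is not None:
        candidate["system_message"] = system_message
    return candidate


def agent_candidate(agent_config):
    """Build an agent-type eval candidate like the one in section 2."""
    return {"type": "agent", "config": agent_config}


def benchmark_config(candidate):
    """Wrap a candidate in the structure passed as `benchmark_config`."""
    return {"eval_candidate": candidate}


greedy = {"strategy": {"type": "greedy"}, "max_tokens": 4096, "repeat_penalty": 1.0}
print(benchmark_config(model_candidate("meta-llama/Llama-3.2-90B-Vision-Instruct", greedy)))
```

Passing `benchmark_config(agent_candidate(agent_config))` yields the same structure used in the agent evaluation call above.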
