[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing)

# Llama Stack Evals

This notebook walks you through the main APIs that Llama Stack offers for evaluating your LLM applications, with working examples you can run and adapt.

Read more about Llama Stack: https://llama-stack.readthedocs.io/en/latest/index.html

Read more about the Llama Stack Evaluation flow: https://llama-stack.readthedocs.io/en/latest/cookbooks/evals.html


## 0. Bootstrapping Llama Stack Library

##### 0.1. Prerequisite: Create a TogetherAI account

To run inference for the Llama models, you need an inference provider. Llama Stack supports a number of inference [providers](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/remote/inference).

In this showcase, we use [together.ai](https://www.together.ai/) as the inference provider, so first get an API key from Together if you don't have one already.
You can also use Fireworks.ai or Ollama if you prefer.


> **Note:** Set the API key in the Secrets of this notebook as `TOGETHER_API_KEY`.
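
If you run this notebook outside Colab, `google.colab.userdata` is not available. The following is a minimal sketch of a fallback, assuming the key has been exported as an environment variable. The helper name `get_secret` is hypothetical and is not part of Llama Stack.

```python
import os


def get_secret(name, env=None):
    """Return a secret from Colab Secrets, falling back to environment variables.

    Hypothetical helper for illustration only; not a Llama Stack API.
    """
    env = os.environ if env is None else env
    try:
        # This import only succeeds inside Google Colab.
        from google.colab import userdata

        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not running in Colab, or the secret is missing or not accessible there.
        pass
    value = env.get(name)
    if not value:
        raise KeyError(f"{name} is not set in Colab Secrets or the environment")
    return value
```

With such a helper, the setup cell could call `os.environ['TOGETHER_API_KEY'] = get_secret('TOGETHER_API_KEY')` in either environment.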

In [None]:
!pip install -U llama-stack

In [None]:
!llama stack build --template together --image-type venv

In [None]:
import os
from google.colab import userdata

os.environ['TOGETHER_API_KEY'] = userdata.get('TOGETHER_API_KEY')

from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
client = LlamaStackAsLibraryClient("together")
_ = client.initialize()

# register 405B as LLM Judge model
client.models.register(
    model_id="meta-llama/Llama-3.1-405B-Instruct",
    provider_model_id="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    provider_id="together",
)



Model(identifier='meta-llama/Llama-3.1-405B-Instruct', metadata={}, provider_id='together', provider_resource_id='meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo', type='model', model_type='llm')

## 1. Open Benchmark Model Evaluation

The first example shows how to evaluate a model candidate served by Llama Stack on open benchmarks. We will use the following benchmarks:

- [MMMU](https://arxiv.org/abs/2311.16502) (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI): a benchmark designed to evaluate multimodal models.
- [SimpleQA](https://openai.com/index/introducing-simpleqa/): a benchmark designed to assess a model's ability to answer short, fact-seeking questions.

#### 1.1. Running MMMU
- We will use a pre-processed MMMU dataset from [llamastack/mmmu](https://huggingface.co/datasets/llamastack/mmmu). The preprocessing code is shown in this [GitHub Gist](https://gist.github.com/yanxi0830/118e9c560227d27132a7fd10e2c92840). The dataset is obtained by transforming the original [MMMU/MMMU](https://huggingface.co/datasets/MMMU/MMMU) dataset into the format accepted by the `inference/chat-completion` API.

In [None]:
name = "llamastack/mmmu"
subset = "Agriculture"
split = "dev"

In [None]:
import datasets
ds = datasets.load_dataset(path=name, name=subset, split=split)
ds = ds.select_columns(["chat_completion_input", "input_query", "expected_answer"])
eval_rows = ds.to_pandas().to_dict(orient="records")

README.md:   0%|          | 0.00/36.0k [00:00<?, ?B/s]

dev-00000-of-00001.parquet:   0%|          | 0.00/29.5M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/165M [00:00<?, ?B/s]

test-00000-of-00003.parquet:   0%|          | 0.00/461M [00:00<?, ?B/s]

test-00001-of-00003.parquet:   0%|          | 0.00/454M [00:00<?, ?B/s]

test-00002-of-00003.parquet:   0%|          | 0.00/471M [00:00<?, ?B/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/30 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/287 [00:00<?, ? examples/s]

- **Run Evaluation on Model Candidate**
  - Define a System Prompt
  - Define an EvalCandidate
  - Run evaluate on datasets

In [None]:
from tqdm import tqdm
from rich.pretty import pprint

SYSTEM_PROMPT_TEMPLATE = """
You are an expert in {subject} whose job is to answer questions from the user using images.

First, reason about the correct answer.

Then write the answer in the following format where X is exactly one of A,B,C,D:

Answer: X

Make sure X is one of A,B,C,D.

If you are uncertain of the correct answer, guess the most likely one.
"""

system_message = {
    "role": "system",
    "content": SYSTEM_PROMPT_TEMPLATE.format(subject=subset),
}

client.eval_tasks.register(
    eval_task_id="meta-reference::mmmu",
    dataset_id=f"mmmu-{subset}-{split}",
    scoring_functions=["basic::regex_parser_multiple_choice_answer"]
)

response = client.eval.evaluate_rows(
    task_id="meta-reference::mmmu",
    input_rows=eval_rows,
    scoring_functions=["basic::regex_parser_multiple_choice_answer"],
    task_config={
        "type": "benchmark",
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
            "sampling_params": {
                "temperature": 0.0,
                "max_tokens": 4096,
                "top_p": 0.9,
                "repeat_penalty": 1.0,
            },
            "system_message": system_message
        }
    }
)
pprint(response)

100%|██████████| 5/5 [00:51<00:00, 10.28s/it]
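
The system prompt above asks the model to finish with `Answer: X`, which lets a regex-based scorer compare the extracted letter with `expected_answer`. The following is a rough sketch of that idea only. The pattern and the names `parse_choice` and `score_row` are assumptions made for illustration; this is not the implementation behind `basic::regex_parser_multiple_choice_answer`.

```python
import re

# Assumed pattern: "Answer:" followed by a single letter A-D as a whole word.
ANSWER_PATTERN = re.compile(r"Answer:\s*([A-D])\b")


def parse_choice(generated_answer):
    """Return the last multiple-choice letter found in the text, or None."""
    matches = ANSWER_PATTERN.findall(generated_answer)
    return matches[-1] if matches else None


def score_row(generated_answer, expected_answer):
    """Return 1.0 when a letter was parsed and it equals the expected one, else 0.0."""
    parsed = parse_choice(generated_answer)
    return 1.0 if parsed is not None and parsed == expected_answer else 0.0


# Hand-written sample generations, not real model output.
print(score_row("The leaf shows rust lesions.\n\nAnswer: B", "B"))
print(score_row("I am not sure.", "B"))
```

Taking the last match is a design choice for this sketch: the model is told to reason first, so an earlier `Answer:` inside the reasoning should not win over the final one.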


#### 1.2. Running SimpleQA
- We will use a pre-processed SimpleQA dataset from [llamastack/evals](https://huggingface.co/datasets/llamastack/evals/viewer/evals__simpleqa), which is obtained by transforming the input query into the format accepted by the `inference/chat-completion` API.
- Since we will use this same dataset in the next example for agentic evaluation, we register it with the `/datasets` API and interact with it through the `/datasetio` API.

In [None]:
simpleqa_dataset_id = "huggingface::simpleqa"

_ = client.datasets.register(
    dataset_id=simpleqa_dataset_id,
    provider_id="huggingface",
    url={"uri": "https://huggingface.co/datasets/llamastack/evals"},
    metadata={
        "path": "llamastack/evals",
        "name": "evals__simpleqa",
        "split": "train",
    },
    dataset_schema={
        "input_query": {"type": "string"},
        "expected_answer": {"type": "string"},
        "chat_completion_input": {"type": "chat_completion_input"},
    }
)

In [None]:
eval_rows = client.datasetio.get_rows_paginated(
    dataset_id=simpleqa_dataset_id,
    rows_in_page=5,
)

In [None]:
client.eval_tasks.register(
    eval_task_id="meta-reference::simpleqa",
    dataset_id=simpleqa_dataset_id,
    scoring_functions=["llm-as-judge::405b-simpleqa"]
)

response = client.eval.evaluate_rows(
    task_id="meta-reference::simpleqa",
    input_rows=eval_rows.rows,
    scoring_functions=["llm-as-judge::405b-simpleqa"],
    task_config={
        "type": "benchmark",
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
            "sampling_params": {
                "temperature": 0.0,
                "max_tokens": 4096,
                "top_p": 0.9,
                "repeat_penalty": 1.0,
            },
        }
    }
)
pprint(response)

100%|██████████| 5/5 [00:48<00:00,  9.68s/it]


## 2. Agentic Evaluation

- In this example, we will demonstrate how to evaluate an agent candidate served by Llama Stack via the `/agent` API.

- We will continue to use the SimpleQA dataset from the previous example.

- Instead of running the evaluation on a model, we will run it on a Search Agent with access to a search tool. We will define our agent evaluation candidate through `AgentConfig`.

> You will need to set the `TAVILY_SEARCH_API_KEY` in Secrets of this notebook.

In [None]:
agent_config = {
    "model": "meta-llama/Llama-3.1-405B-Instruct",
    "instructions": "You are a helpful assistant",
    "sampling_params": {
        "strategy": "greedy",
        "temperature": 0.0,
        "top_p": 0.95,
    },
    "tools": [
        {
            "type": "brave_search",
            "engine": "tavily",
            "api_key": userdata.get("TAVILY_SEARCH_API_KEY")
        }
    ],
    "tool_choice": "auto",
    "tool_prompt_format": "json",
    "input_shields": [],
    "output_shields": [],
    "enable_session_persistence": False
}

response = client.eval.evaluate_rows(
    task_id="meta-reference::simpleqa",
    input_rows=eval_rows.rows,
    scoring_functions=["llm-as-judge::405b-simpleqa"],
    task_config={
        "type": "benchmark",
        "eval_candidate": {
            "type": "agent",
            "config": agent_config,
        }
    }
)
pprint(response)

5it [00:26,  5.29s/it]


## 3. Agentic Application Dataset Scoring

Llama Stack offers a library of scoring functions and the `/scoring` API, allowing you to run evaluations on your pre-annotated AI application datasets.

In this example, we will work with an example RAG dataset and use LLM-as-Judge with a custom judge prompt for scoring. Please check out our [Llama Stack Playground](https://llama-stack.readthedocs.io/en/latest/playground/index.html) for an interactive interface to upload datasets and run scoring.

In [None]:
import rich
from rich.pretty import pprint

judge_model_id = "meta-llama/Llama-3.1-405B-Instruct-FP8"

JUDGE_PROMPT = """
Given a QUESTION and GENERATED_RESPONSE and EXPECTED_RESPONSE.

Compare the factual content of the GENERATED_RESPONSE with the EXPECTED_RESPONSE. Ignore any differences in style, grammar, or punctuation.
  The GENERATED_RESPONSE may either be a subset or superset of the EXPECTED_RESPONSE, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
  (A) The GENERATED_RESPONSE is a subset of the EXPECTED_RESPONSE and is fully consistent with it.
  (B) The GENERATED_RESPONSE is a superset of the EXPECTED_RESPONSE and is fully consistent with it.
  (C) The GENERATED_RESPONSE contains all the same details as the EXPECTED_RESPONSE.
  (D) There is a disagreement between the GENERATED_RESPONSE and the EXPECTED_RESPONSE.
  (E) The answers differ, but these differences don't matter from the perspective of factuality.

Give your answer in the format "Answer: One of ABCDE, Explanation: ".

Your actual task:

QUESTION: {input_query}
GENERATED_RESPONSE: {generated_answer}
EXPECTED_RESPONSE: {expected_answer}
"""

input_query = "What are the top 5 topics that were explained? Only list succinct bullet points."
generated_answer = """
Here are the top 5 topics that were explained in the documentation for Torchtune:

* What is LoRA and how does it work?
* Fine-tuning with LoRA: memory savings and parameter-efficient finetuning
* Running a LoRA finetune with Torchtune: overview and recipe
* Experimenting with different LoRA configurations: rank, alpha, and attention modules
* LoRA finetuning
"""
expected_answer = """LoRA"""

dataset_rows = [
    {
        "input_query": input_query,
        "generated_answer": generated_answer,
        "expected_answer": expected_answer,
    },
]

scoring_params = {
    "llm-as-judge::base": {
        "judge_model": judge_model_id,
        "prompt_template": JUDGE_PROMPT,
        "type": "llm_as_judge",
        "judge_score_regexes": ["Answer: (A|B|C|D|E)"],
    },
    "basic::subset_of": None,
}

response = client.scoring.score(input_rows=dataset_rows, scoring_functions=scoring_params)
pprint(response)
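
To make the `judge_score_regexes` parameter concrete, here is a minimal sketch of how a grade could be pulled out of a judge reply. It assumes the first capture group of the first matching pattern is taken as the score. The helper `extract_judge_score` and the sample reply are hypothetical, and the actual provider logic may differ.

```python
import re

# Same pattern as the one passed in scoring_params above.
JUDGE_SCORE_REGEXES = ["Answer: (A|B|C|D|E)"]


def extract_judge_score(judge_reply, regexes=JUDGE_SCORE_REGEXES):
    """Return the first captured grade from the judge reply, or None if nothing matches."""
    for pattern in regexes:
        match = re.search(pattern, judge_reply)
        if match:
            return match.group(1)
    return None


# Hypothetical judge reply written by hand for this sketch, not real model output.
sample_reply = "Answer: B, Explanation: the response mentions LoRA and adds detail."
print(extract_judge_score(sample_reply))  # prints B
```

Because the judge prompt requests the format `Answer: One of ABCDE, Explanation: `, a reply that ignores the format yields no match, which is worth handling explicitly when you aggregate scores.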

## 4. Online Evaluation Dataset Collection Using Telemetry

- Llama Stack offers built-in telemetry to collect traces and data about your agentic application.
- In this example, we will show how to build an agent with Llama Stack and collect the agent's traces into an online dataset that can be used for evaluation.
- Please see our [Llama Stack Showcase](https://colab.research.google.com/drive/1F2ksmkoGQPa4pzRjMOE6BXWeOxWFIW6n) notebook for more examples on building agents.

##### 🚧 Patches 🚧
- The following cells are temporary patches to get `telemetry` working.

In [None]:
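# telemetry needs llama-stack installed from the latest main branch
# -y skips the confirmation prompt so the cell does not wait for input
!pip uninstall -y llama-stack
!pip install git+https://github.com/meta-llama/llama-stack.git@main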

In [None]:
# disable logging for clean server logs
import logging
def remove_root_handlers():
    root_logger = logging.getLogger()
    for handler in root_logger.handlers[:]:
        root_logger.removeHandler(handler)
        print(f"Removed handler {handler.__class__.__name__} from root logger")


remove_root_handlers()

Removed handler StreamHandler from root logger


##### Building a Search Agent

In [None]:
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig
from google.colab import userdata

agent_config = AgentConfig(
    model="meta-llama/Llama-3.1-405B-Instruct",
    instructions="You are a helpful assistant. Use search tool to answer the questions. ",
    tools=(
        [
            {
                "type": "brave_search",
                "engine": "tavily",
                "api_key": userdata.get("TAVILY_SEARCH_API_KEY")
            }
        ]
    ),
    input_shields=[],
    output_shields=[],
    enable_session_persistence=False,
)
agent = Agent(client, agent_config)
user_prompts = [
    "Which teams played in the NBA western conference finals of 2024",
    "In which episode and season of South Park does Bill Cosby (BSM-471) first appear? Give me the number and title.",
    "What is the British-American kickboxer Andrew Tate's kickboxing name?",
]

session_id = agent.create_session("test-session")

for prompt in user_prompts:
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )

    for log in EventLogger().log(response):
        log.print()

inference> Let me check the latest sports news.
inference> bravy_search.call(query="Bill Cosby South Park episode")
CustomTool> Unknown tool `bravy_search` was called.
inference> brave_search.call(query="Andrew Tate kickboxing name")
tool_execution> Tool:brave_search Args:{'query': 'Andrew Tate kickboxing name'}
tool_execution> Tool:brave_search Response:{"query": "Andrew Tate kickboxing name", "top_k": [{"title": "Andrew Tate kickboxing record: How many championships ... - FirstSportz", "url": "https://firstsportz.com/mma-how-many-championships-does-andrew-tate-have/", "content": "Andrew Tate's Kickboxing career. During his kickboxing career, he used the nickname \"King Cobra,\" which he currently uses as his Twitter name. Tate had an unorthodox style of movement inside the ring. He kept his hands down most of the time and relied on quick jabs and an overhand right to land significant strikes.", "score": 0.9996244, "raw_content": null}, {"title": "Andrew Tate: Kickboxing Record, Facts

##### Query Telemetry

In [None]:
print(f"Getting traces for session_id={session_id}")
import json
from rich.pretty import pprint

agent_logs = []

for span in client.telemetry.query_spans(
    attribute_filters=[
      {"key": "session_id", "op": "eq", "value": session_id},
    ],
    attributes_to_return=["input", "output"]
  ):
  if span.attributes["output"] != "no shields":
    agent_logs.append(span.attributes)

pprint(agent_logs)

Getting traces for session_id=ac651ce8-2281-47f2-8814-ef947c066e40


##### Post-Process Telemetry Results & Evaluate

- Now, we want to run an evaluation on the online traces to check that our search agent successfully calls `brave_search`.
- We first post-process the agent's telemetry logs and then run the evaluation.

In [None]:
# post-process telemetry spans and prepare data for eval
# in this case, we want to assert that every user prompt is followed by a tool call
import ast
import json

eval_rows = []

for log in agent_logs:
  last_msg = log['input'][-1]
  if "\"role\":\"user\"" in last_msg:
    eval_rows.append(
        {
            "input_query": last_msg,
            "generated_answer": log["output"],
            # check if generated_answer uses tools brave_search
            "expected_answer": "brave_search",
        },
    )

pprint(eval_rows)
scoring_params = {
    "basic::subset_of": None,
}
scoring_response = client.scoring.score(input_rows=eval_rows, scoring_functions=scoring_params)
pprint(scoring_response)
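
As a rough mental model of this check, the sketch below marks a row as passing when `expected_answer` occurs as a substring of `generated_answer`, then averages the results. This is an assumption about how a subset-style scorer behaves, not the code behind `basic::subset_of`. The function `subset_score` is hypothetical, and the sample rows reuse two agent outputs from the log above with placeholder queries.

```python
def subset_score(row):
    """Return 1.0 if the expected answer appears verbatim in the generated answer."""
    return 1.0 if row["expected_answer"] in row["generated_answer"] else 0.0


# Illustrative rows in the same shape as eval_rows.
sample_rows = [
    {
        "input_query": "placeholder user message 1",
        "generated_answer": 'brave_search.call(query="Andrew Tate kickboxing name")',
        "expected_answer": "brave_search",
    },
    {
        "input_query": "placeholder user message 2",
        "generated_answer": "Let me check the latest sports news.",
        "expected_answer": "brave_search",
    },
]

scores = [subset_score(row) for row in sample_rows]
tool_call_rate = sum(scores) / len(scores)
print(scores, tool_call_rate)
```

A substring check is strict about spelling, so a misspelled tool name such as the `bravy_search` call in the log above would count as a miss.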