fix!: update eval-tasks -> benchmarks (#1032)

# What does this PR do?

- Update `/eval-tasks` to `/benchmarks`
- ⚠️ Remove differentiation between `app` v.s. `benchmark` eval task
config. Now we only have `BenchmarkConfig`. The overloaded `benchmark`
is confusing and do not add any value. Backward compatibility is being
kept as the "type" is not being used anywhere.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
- This change is backward compatible 
- Run notebook test with

```
pytest -v -s --nbval-lax ./docs/getting_started.ipynb
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
```

<img width="846" alt="image"
src="https://github.com/user-attachments/assets/d2fc06a7-593a-444f-bc1f-10ab9b0c843d"
/>



[//]: # (## Documentation)
[//]: # (- [ ] Added a Changelog entry if the change is significant)

---------

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
Co-authored-by: Ben Browning <ben324@gmail.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Reid <61492567+reidliu41@users.noreply.github.com>
Co-authored-by: reidliu <reid201711@gmail.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
This commit is contained in:
Xi Yan 2025-02-13 16:40:58 -08:00 committed by GitHub
parent 225dd38e5c
commit 8b655e3cd2
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
60 changed files with 2622 additions and 1910 deletions

View file

@ -5,7 +5,7 @@ The Llama Stack Evaluation flow allows you to run evaluations on your GenAI appl
We introduce a set of APIs in Llama Stack for supporting running evaluations of LLM applications.
- `/datasetio` + `/datasets` API
- `/scoring` + `/scoring_functions` API
- `/eval` + `/eval_tasks` API
- `/eval` + `/benchmarks` API
This guide goes over the sets of APIs and developer experience flow of using Llama Stack to run evaluations for different use cases. Checkout our Colab notebook on working examples with evaluations [here](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing).
@ -21,7 +21,7 @@ The Evaluation APIs are associated with a set of Resources as shown in the follo
- **Scoring**: evaluate outputs of the system.
- Associated with `ScoringFunction` resource. We provide a suite of out-of-the box scoring functions and also the ability for you to add custom evaluators. These scoring functions are the core part of defining an evaluation task to output evaluation metrics.
- **Eval**: generate outputs (via Inference or Agents) and perform scoring.
- Associated with `EvalTask` resource.
- Associated with `Benchmark` resource.
Use the following decision tree to decide how to use LlamaStack Evaluation flow.
@ -77,14 +77,14 @@ system_message = {
"content": SYSTEM_PROMPT_TEMPLATE,
}
client.eval_tasks.register(
eval_task_id="meta-reference::mmmu",
client.benchmarks.register(
benchmark_id="meta-reference::mmmu",
dataset_id=f"mmmu-{subset}-{split}",
scoring_functions=["basic::regex_parser_multiple_choice_answer"],
)
response = client.eval.evaluate_rows(
task_id="meta-reference::mmmu",
benchmark_id="meta-reference::mmmu",
input_rows=eval_rows,
scoring_functions=["basic::regex_parser_multiple_choice_answer"],
task_config={
@ -135,14 +135,14 @@ eval_rows = client.datasetio.get_rows_paginated(
```
```python
client.eval_tasks.register(
eval_task_id="meta-reference::simpleqa",
client.benchmarks.register(
benchmark_id="meta-reference::simpleqa",
dataset_id=simpleqa_dataset_id,
scoring_functions=["llm-as-judge::405b-simpleqa"],
)
response = client.eval.evaluate_rows(
task_id="meta-reference::simpleqa",
benchmark_id="meta-reference::simpleqa",
input_rows=eval_rows.rows,
scoring_functions=["llm-as-judge::405b-simpleqa"],
task_config={
@ -192,7 +192,7 @@ agent_config = {
}
response = client.eval.evaluate_rows(
task_id="meta-reference::simpleqa",
benchmark_id="meta-reference::simpleqa",
input_rows=eval_rows.rows,
scoring_functions=["llm-as-judge::405b-simpleqa"],
task_config={
@ -281,7 +281,7 @@ The following examples give the quick steps to start running evaluations using t
#### Benchmark Evaluation CLI
Usage: There are 2 inputs necessary for running a benchmark eval
- `eval-task-id`: the identifier associated with the eval task. Each `EvalTask` is parametrized by
- `eval-task-id`: the identifier associated with the eval task. Each `Benchmark` is parametrized by
- `dataset_id`: the identifier associated with the dataset.
- `List[scoring_function_id]`: list of scoring function identifiers.
- `eval-task-config`: specifies the configuration of the model / agent to evaluate on.
@ -289,7 +289,7 @@ Usage: There are 2 inputs necessary for running a benchmark eval
```
llama-stack-client eval run_benchmark <eval-task-id> \
--eval-task-config ~/eval_task_config.json \
--eval-task-config ~/benchmark_config.json \
--visualize
```
@ -309,15 +309,15 @@ llama-stack-client eval run_scoring <scoring_fn_id_1> <scoring_fn_id_2> ... <sco
--output-dir ./
```
#### Defining EvalTaskConfig
The `EvalTaskConfig` are user specified config to define:
#### Defining BenchmarkConfig
The `BenchmarkConfig` are user specified config to define:
1. `EvalCandidate` to run generation on:
- `ModelCandidate`: The model will be used for generation through LlamaStack /inference API.
- `AgentCandidate`: The agentic system specified by AgentConfig will be used for generation through LlamaStack /agents API.
2. Optionally scoring function params to allow customization of scoring function behaviour. This is useful to parameterize generic scoring functions such as LLMAsJudge with custom `judge_model` / `judge_prompt`.
**Example Benchmark EvalTaskConfig**
**Example Benchmark BenchmarkConfig**
```json
{
"type": "benchmark",
@ -335,7 +335,7 @@ The `EvalTaskConfig` are user specified config to define:
}
```
**Example Application EvalTaskConfig**
**Example Application BenchmarkConfig**
```json
{
"type": "app",