feat: open benchmark template and doc (#1465)
## What does this PR do?
- Provide a distro template to let developers easily run the open benchmarks Llama Stack supports on llama and non-llama models.
- Provide docs on how to run open benchmark evals via CLI and an open benchmark contributing guide.

Closes #1375

## Test Plan
Open benchmark eval results on llama, gpt, gemini and claude:
<img width="771" alt="Screenshot 2025-03-06 at 7 33 05 PM" src="https://github.com/user-attachments/assets/1bd85456-b9b9-4b37-af76-4ce1d2bac00e" />

Doc preview:
<img width="944" alt="Screenshot 2025-03-06 at 7 33 58 PM" src="https://github.com/user-attachments/assets/f4e5866d-b395-4c40-aa8b-080edeb5cdb6" />
<img width="955" alt="Screenshot 2025-03-06 at 7 34 04 PM" src="https://github.com/user-attachments/assets/629defb6-d5e4-473c-aa03-308bce386fb4" />
<img width="965" alt="Screenshot 2025-03-06 at 7 35 29 PM" src="https://github.com/user-attachments/assets/c21ff96c-9e8c-4c54-b6b8-25883125f4cf" />
<img width="957" alt="Screenshot 2025-03-06 at 7 35 37 PM" src="https://github.com/user-attachments/assets/47571c90-1381-4e2c-bbed-c4f3a60578d0" />
Parent: 290cc843fc
Commit: 4dccf916d1
7 changed files with 585 additions and 10 deletions
@@ -24,6 +24,56 @@ The Evaluation APIs are associated with a set of Resources as shown in the following
- Associated with `Benchmark` resource.

## Open-benchmark Eval

### List of open-benchmarks Llama Stack supports

Llama Stack pre-registers several popular open-benchmarks so you can easily evaluate model performance via the CLI.

The list of open-benchmarks we currently support:
- [MMLU-COT](https://arxiv.org/abs/2009.03300) (Measuring Massive Multitask Language Understanding): Benchmark designed to comprehensively evaluate the breadth and depth of a model's academic and professional understanding.
- [GPQA-COT](https://arxiv.org/abs/2311.12022) (A Graduate-Level Google-Proof Q&A Benchmark): A challenging benchmark of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.
- [SimpleQA](https://openai.com/index/introducing-simpleqa/): Benchmark designed to assess a model's ability to answer short, fact-seeking questions.
- [MMMU](https://arxiv.org/abs/2311.16502) (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI): Benchmark designed to evaluate multimodal models.

You can follow the contributing guide below to add more open-benchmarks to Llama Stack.

### Run evaluation on open-benchmarks via CLI

We have built-in functionality to run the supported open-benchmarks using the llama-stack-client CLI.

#### Spin up Llama Stack server

Spin up the Llama Stack server with the 'open-benchmark' template:
```
llama stack run llama_stack/templates/open-benchmark/run.yaml
```

#### Run eval CLI

There are 3 necessary inputs to run a benchmark eval:
- `list of benchmark_ids`: The list of benchmark ids to run the evaluation on
- `model-id`: The model id to evaluate on
- `output_dir`: Path to store the evaluation results
```
llama-stack-client eval run-benchmark <benchmark_id_1> <benchmark_id_2> ... \
--model_id <model id to evaluate on> \
--output_dir <directory to store the evaluation results>
```

You can run
```
llama-stack-client eval run-benchmark help
```
to see the description of all the flags that eval run-benchmark has.

In the output log, you can find the file path that has your evaluation results. Open that file to see your aggregate evaluation results.

## What's Next?

- Check out our Colab notebook on working examples with running benchmark evaluations [here](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb#scrollTo=mxLCsP4MvFqP).
@@ -275,18 +275,25 @@ response = client.scoring.score(
The following examples give the quick steps to start running evaluations using the llama-stack-client CLI.

#### Benchmark Evaluation CLI
Usage: There are 2 inputs necessary for running a benchmark eval
- `eval-task-id`: the identifier associated with the eval task. Each `Benchmark` is parametrized by
  - `dataset_id`: the identifier associated with the dataset.
  - `List[scoring_function_id]`: list of scoring function identifiers.
- `eval-task-config`: specifies the configuration of the model / agent to evaluate on.

There are 3 necessary inputs for running a benchmark eval:
- `list of benchmark_ids`: The list of benchmark ids to run the evaluation on
- `model-id`: The model id to evaluate on
- `output_dir`: Path to store the evaluation results
```
llama-stack-client eval run-benchmark <benchmark_id_1> <benchmark_id_2> ... \
--model_id <model id to evaluate on> \
--output_dir <directory to store the evaluation results>
```

You can run
```
llama-stack-client eval run-benchmark help
```
to see the description of all the flags to run a benchmark eval.

```
llama-stack-client eval run_benchmark <eval-task-id> \
--eval-task-config ~/benchmark_config.json \
--visualize
```
In the output log, you can find the path to the file that has your evaluation results. Open that file to see your aggregate evaluation results.

#### Application Evaluation CLI
@@ -338,3 +345,52 @@ The `BenchmarkConfig` are user specified config to define:
}
}
```

## Open-benchmark Contributing Guide

### Create the new dataset for your new benchmark

An eval open-benchmark essentially contains 2 parts:
- `raw data`: The raw dataset associated with the benchmark. You typically need to search the original paper that introduces the benchmark and find the canonical dataset (usually hosted on huggingface).
- `prompt template`: How to ask the candidate model to generate the answer (the prompt template plays a critical role in the evaluation results). Typically, you can find the reference prompt template associated with the benchmark in the benchmark author's repo ([example](https://github.com/idavidrein/gpqa/blob/main/prompts/chain_of_thought.txt)) or in other popular open source repos ([example](https://github.com/openai/simple-evals/blob/0a6e8f62e52bc5ae915f752466be3af596caf392/common.py#L14)).

To create a new open-benchmark in Llama Stack, you need to combine the prompt template and the raw data into the `chat_completion_input` column in the evaluation dataset.

Llama Stack enforces that the evaluation dataset schema contains at least 3 columns:
- `chat_completion_input`: The actual input to the model to run the generation for eval
- `input_query`: The raw input from the raw dataset without the prompt template
- `expected_answer`: The ground truth for scoring functions to calculate the score from

You need to write a script ([example convert script](https://gist.github.com/yanxi0830/118e9c560227d27132a7fd10e2c92840)) to convert the benchmark raw dataset to the Llama Stack format eval dataset and upload the dataset to huggingface ([example benchmark dataset](https://huggingface.co/datasets/llamastack/mmmu)), as sketched below.
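To make the required shape concrete, here is a minimal sketch of such a conversion script using the huggingface `datasets` library. The dataset ids, raw column names, prompt template, and the JSON encoding of `chat_completion_input` are assumptions for illustration, not the canonical values for any particular benchmark.

```
# Minimal conversion sketch. Assumptions (not from the original doc): the raw
# benchmark lives on huggingface with "question" / "answer" columns, and
# chat_completion_input is stored as a JSON-encoded list of messages, following
# the example llamastack datasets; adjust to your benchmark's actual schema.
import json

from datasets import load_dataset

PROMPT_TEMPLATE = (
    "Answer the following question. Think step by step, then finish with "
    "'Answer: <your answer>'.\n\nQuestion: {question}"
)  # placeholder template, not the canonical one for any benchmark


def to_llama_stack_row(row: dict) -> dict:
    prompt = PROMPT_TEMPLATE.format(question=row["question"])
    return {
        # The actual input the model receives during eval generation.
        "chat_completion_input": json.dumps([{"role": "user", "content": prompt}]),
        # The raw input without the prompt template.
        "input_query": row["question"],
        # Ground truth used by the scoring function.
        "expected_answer": row["answer"],
    }


raw = load_dataset("some-org/raw-benchmark", split="test")  # hypothetical dataset id
converted = raw.map(to_llama_stack_row, remove_columns=raw.column_names)
converted.push_to_hub("your-org/your-benchmark-llamastack")  # requires `huggingface-cli login`
```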
### Find the scoring function for your new benchmark

The purpose of a scoring function is to calculate the score for each example based on the candidate model's generation result and the `expected_answer`. It also aggregates the scores over all the examples to generate the final evaluation results.

First, check whether the existing [llama stack scoring functions](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/inline/scoring) can fulfill your need. If not, you need to write a new scoring function based on what the benchmark author or other open source repos describe. A conceptual sketch of what such a function computes is shown below.
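For illustration of what a scoring function computes (a per-example score plus an aggregation over all examples), here is a minimal standalone sketch. It is not the Llama Stack scoring-function provider interface, and the `Answer: <...>` extraction pattern is an assumption about how the candidate model is prompted to format its output.

```
# Conceptual sketch of scoring: per-example score + aggregation.
# This is NOT the llama stack scoring-function provider interface; it only
# illustrates the computation a scoring function performs.
import re
from typing import Optional


def extract_answer(generation: str) -> Optional[str]:
    # Assumes the prompt asks the model to end with "Answer: <...>".
    match = re.search(r"Answer:\s*(.+)", generation)
    return match.group(1).strip() if match else None


def score_example(generation: str, expected_answer: str) -> float:
    # 1.0 if the extracted answer matches the ground truth, else 0.0.
    answer = extract_answer(generation)
    return 1.0 if answer is not None and answer.lower() == expected_answer.strip().lower() else 0.0


def aggregate(scores: list[float]) -> dict:
    # Final evaluation result reported for the benchmark (here: accuracy).
    return {"accuracy": sum(scores) / len(scores) if scores else 0.0, "num_examples": len(scores)}


# Example usage with dummy data:
generations = ["...reasoning... Answer: B", "Answer: Paris"]
expected = ["B", "Berlin"]
print(aggregate([score_example(g, e) for g, e in zip(generations, expected)]))
# -> {'accuracy': 0.5, 'num_examples': 2}
```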
### Add the new benchmark into the template

First, you need to add the evaluation dataset associated with your benchmark under the `datasets` resource in `templates/open-benchmark/run.yaml`.

Second, you need to add the new benchmark you just created under the `benchmarks` resource in the same template. To add the new benchmark, you need (see the sketch after this list):
- `benchmark_id`: identifier of the benchmark
- `dataset_id`: identifier of the dataset associated with your benchmark
- `scoring_functions`: scoring function to calculate the score based on generation results and expected_answer
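To make the three required fields concrete, here is a hedged sketch of registering the same benchmark programmatically via the llama-stack-client Python SDK, as an alternative to editing `run.yaml`. The method and parameter names, the port, and the scoring function id are assumptions based on the SDK of the time and are placeholders; check the open-benchmark template and SDK reference for the exact schema.

```
# Hypothetical programmatic equivalent of the run.yaml entries; identifiers are
# placeholders and the exact SDK signatures may differ between versions.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")  # server URL/port is an assumption

client.benchmarks.register(
    benchmark_id="meta-reference-your-benchmark",  # identifier of the benchmark
    dataset_id="your-benchmark-dataset",           # dataset registered under `datasets`
    scoring_functions=["basic::regex_parser_multiple_choice_answer"],  # placeholder scoring fn id
)
```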
### Test the new benchmark

Spin up the Llama Stack server with the 'open-benchmark' template:
```
llama stack run llama_stack/templates/open-benchmark/run.yaml
```

Run the eval benchmark CLI with your new benchmark id:
```
llama-stack-client eval run-benchmark <new_benchmark_id> \
--model_id <model id to evaluate on> \
--output_dir <directory to store the evaluation results>
```