From ddaf929f76ff0c5c92716002baddedefde6d2d56 Mon Sep 17 00:00:00 2001
From: Botao Chen
Date: Thu, 6 Mar 2025 19:21:33 -0800
Subject: [PATCH] refine

---
 docs/source/concepts/evaluation_concepts.md | 50 ++++++++++++
 .../references/evals_reference/index.md     | 76 ++++++++++++++++---
 2 files changed, 116 insertions(+), 10 deletions(-)

diff --git a/docs/source/concepts/evaluation_concepts.md b/docs/source/concepts/evaluation_concepts.md
index eae606712..61a695d9f 100644
--- a/docs/source/concepts/evaluation_concepts.md
+++ b/docs/source/concepts/evaluation_concepts.md
@@ -24,6 +24,56 @@ The Evaluation APIs are associated with a set of Resources as shown in the follo

 - Associated with `Benchmark` resource.

## Open-benchmark Eval

### List of open benchmarks Llama Stack supports

Llama Stack pre-registers several popular open benchmarks so you can easily evaluate model performance via the CLI.

The list of open benchmarks we currently support:
- [MMLU-COT](https://arxiv.org/abs/2009.03300) (Measuring Massive Multitask Language Understanding): Benchmark designed to comprehensively evaluate the breadth and depth of a model's academic and professional understanding.
- [GPQA-COT](https://arxiv.org/abs/2311.12022) (A Graduate-Level Google-Proof Q&A Benchmark): A challenging benchmark of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.
- [SimpleQA](https://openai.com/index/introducing-simpleqa/): Benchmark designed to assess a model's ability to answer short, fact-seeking questions.
- [MMMU](https://arxiv.org/abs/2311.16502) (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI): Benchmark designed to evaluate multimodal models.

You can follow the contributing guide to add more open benchmarks to Llama Stack.

### Run evaluation on open benchmarks via CLI

We have built-in functionality to run the supported open benchmarks using the llama-stack-client CLI.

#### Spin up Llama Stack server

Spin up the Llama Stack server with the 'open-benchmark' template:
```
llama stack run llama_stack/templates/open-benchmark/run.yaml
```

#### Run eval CLI
There are 3 required inputs to run a benchmark eval:
- `list of benchmark_ids`: The list of benchmark IDs to run the evaluation on
- `model_id`: The model ID to evaluate on
- `output_dir`: Path to store the evaluation results
```
llama-stack-client eval run-benchmark <benchmark_id_1> <benchmark_id_2> ... \
--model_id <model id to evaluate on> \
--output_dir <directory to store the evaluation results>
```

You can run
```
llama-stack-client eval run-benchmark help
```
to see the descriptions of all the flags that `eval run-benchmark` accepts.

In the output log, you can find the path to the file that contains your evaluation results. Open that file to see your aggregate evaluation results.

## What's Next?

- Check out our Colab notebook on working examples with running benchmark evaluations [here](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb#scrollTo=mxLCsP4MvFqP).

diff --git a/docs/source/references/evals_reference/index.md b/docs/source/references/evals_reference/index.md
index 14ce0bf34..d55537c47 100644
--- a/docs/source/references/evals_reference/index.md
+++ b/docs/source/references/evals_reference/index.md
@@ -275,18 +275,25 @@ response = client.scoring.score(

The following examples give the quick steps to start running evaluations using the llama-stack-client CLI.
#### Benchmark Evaluation CLI
-Usage: There are 2 inputs necessary for running a benchmark eval
-- `eval-task-id`: the identifier associated with the eval task. Each `Benchmark` is parametrized by
-  - `dataset_id`: the identifier associated with the dataset.
-  - `List[scoring_function_id]`: list of scoring function identifiers.
-- `eval-task-config`: specifies the configuration of the model / agent to evaluate on.
There are 3 required inputs for running a benchmark eval:
- `list of benchmark_ids`: The list of benchmark IDs to run the evaluation on
- `model_id`: The model ID to evaluate on
- `output_dir`: Path to store the evaluation results
```
llama-stack-client eval run-benchmark <benchmark_id_1> <benchmark_id_2> ... \
--model_id <model id to evaluate on> \
--output_dir <directory to store the evaluation results>
```

You can run
```
llama-stack-client eval run-benchmark help
```
to see the descriptions of all the flags for running a benchmark eval.

-```
-llama-stack-client eval run_benchmark \
---eval-task-config ~/benchmark_config.json \
---visualize
-```

In the output log, you can find the path to the file that contains your evaluation results. Open that file to see your aggregate evaluation results.

#### Application Evaluation CLI

@@ -338,3 +345,52 @@ The `BenchmarkConfig` are user specified config to define:
 }
 }
 ```

## Open-benchmark Contributing Guide

### Create the new dataset for your new benchmark
An open benchmark for eval essentially consists of 2 parts:
- `raw data`: The raw dataset associated with the benchmark. You typically need to search the original paper that introduces the benchmark and find the canonical dataset (usually hosted on Hugging Face).
- `prompt template`: How to ask the candidate model to generate the answer (the prompt template plays a critical role in the evaluation results). Typically, you can find the reference prompt template associated with the benchmark in the benchmark authors' repo ([example](https://github.com/idavidrein/gpqa/blob/main/prompts/chain_of_thought.txt)) or in other popular open source repos ([example](https://github.com/openai/simple-evals/blob/0a6e8f62e52bc5ae915f752466be3af596caf392/common.py#L14)).

To create a new open benchmark in Llama Stack, you need to combine the prompt template and the raw data into the `chat_completion_input` column of the evaluation dataset.

Llama Stack enforces an evaluation dataset schema that contains at least 3 columns:
- `chat_completion_input`: The actual input to the model to run the generation for eval
- `input_query`: The raw input from the raw dataset without the prompt template
- `expected_answer`: The ground truth for scoring functions to calculate the score from

You need to write a script ([example convert script](https://gist.github.com/yanxi0830/118e9c560227d27132a7fd10e2c92840)) to convert the raw benchmark dataset into the Llama Stack eval dataset format and upload the dataset to Hugging Face ([example benchmark dataset](https://huggingface.co/datasets/llamastack/mmmu)). A minimal convert-script sketch is included at the end of this guide.

### Find the scoring function for your new benchmark
The purpose of a scoring function is to calculate a score for each example based on the candidate model's generation and the `expected_answer`. It also aggregates the scores across all examples to produce the final evaluation results.

First, check whether the existing [Llama Stack scoring functions](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/inline/scoring) fulfill your needs. If not, you need to write a new scoring function based on what the benchmark authors or other open source repos describe.
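For intuition, below is a minimal, self-contained sketch of the logic a scoring function implements: score each example by comparing the model's generation against `expected_answer`, then aggregate the per-example scores into the final metrics. This is an illustration only, not the Llama Stack scoring provider interface; the regex-based answer extraction and the `generated_answer`/`expected_answer` row keys are assumptions chosen for the example.

```python
import re
from typing import Optional


def extract_choice(generation: str) -> Optional[str]:
    """Pull a multiple-choice letter (A-D) out of free-form model output."""
    match = re.search(r"(?:answer is|answer:)\s*\(?([A-D])\)?", generation, re.IGNORECASE)
    return match.group(1).upper() if match else None


def score_row(generated_answer: str, expected_answer: str) -> float:
    """Score a single example: 1.0 if the extracted choice matches the ground truth."""
    return 1.0 if extract_choice(generated_answer) == expected_answer.strip().upper() else 0.0


def aggregate(rows: list[dict]) -> dict:
    """Aggregate per-example scores into the final evaluation result."""
    scores = [score_row(r["generated_answer"], r["expected_answer"]) for r in rows]
    return {"accuracy": sum(scores) / len(scores), "num_examples": len(scores)}


if __name__ == "__main__":
    rows = [
        {"generated_answer": "Reasoning... so the answer is (B).", "expected_answer": "B"},
        {"generated_answer": "Answer: C", "expected_answer": "A"},
    ]
    print(aggregate(rows))  # {'accuracy': 0.5, 'num_examples': 2}
```

A real scoring function for a new benchmark would swap the regex for whatever answer-extraction and matching rules the benchmark authors describe.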
### Add new benchmark into template
First, add the evaluation dataset associated with your benchmark under the `datasets` resource in templates/open-benchmark/run.yaml.

Second, add the new benchmark you just created under the `benchmarks` resource in the same template. To add the new benchmark, you need to provide:
- `benchmark_id`: identifier of the benchmark
- `dataset_id`: identifier of the dataset associated with your benchmark
- `scoring_functions`: scoring functions to calculate the score based on the generation results and `expected_answer`

### Test the new benchmark

Spin up the Llama Stack server with the 'open-benchmark' template:
```
llama stack run llama_stack/templates/open-benchmark/run.yaml
```

Run the eval benchmark CLI with your new benchmark ID:
```
llama-stack-client eval run-benchmark <new_benchmark_id> \
--model_id <model id to evaluate on> \
--output_dir <directory to store the evaluation results>
```
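To make the dataset-creation step above concrete, here is a minimal sketch of a convert script. It assumes a hypothetical raw multiple-choice dataset with `question`, `choices`, and `answer` fields; the prompt template, the raw column names, and the exact serialization of `chat_completion_input` (here a JSON-encoded chat message list) are assumptions for illustration only — check the example convert script and example benchmark dataset linked above for the authoritative format.

```python
import json

from datasets import Dataset, load_dataset  # pip install datasets

# Hypothetical prompt template; use the benchmark authors' reference template in practice.
PROMPT_TEMPLATE = (
    "Answer the following multiple choice question. The last line of your response "
    "must be of the form 'Answer: X' where X is one of A, B, C, D.\n\n"
    "{question}\n\nA) {a}\nB) {b}\nC) {c}\nD) {d}"
)


def to_llama_stack_row(row: dict) -> dict:
    """Combine a raw row with the prompt template into the Llama Stack eval dataset schema."""
    query = PROMPT_TEMPLATE.format(
        question=row["question"],
        a=row["choices"][0],
        b=row["choices"][1],
        c=row["choices"][2],
        d=row["choices"][3],
    )
    return {
        # Assumed serialization: a JSON-encoded list of chat messages.
        "chat_completion_input": json.dumps([{"role": "user", "content": query}]),
        "input_query": row["question"],
        "expected_answer": row["answer"],
    }


if __name__ == "__main__":
    # Hypothetical repo ids; replace with the canonical raw dataset and your own target repo.
    raw = load_dataset("some-org/some-raw-benchmark", split="test")
    converted = Dataset.from_list([to_llama_stack_row(r) for r in raw])
    converted.push_to_hub("your-org/your-benchmark-eval")  # requires `huggingface-cli login`
```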