llama-stack

forked from phoenix-oss/llama-stack-mirror

Botao Chen e3edca7739 feat: [new open benchmark] Math 500 (#1538)		2025-03-10 20:38:28 -07:00

## What does this PR do?

Created a new math_500 open benchmark based on OpenAI's [Let's Verify Step by Step](https://arxiv.org/abs/2305.20050) paper and Hugging Face's [HuggingFaceH4/MATH-500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500) dataset.

The challenging part of this benchmark is parsing the generated and expected answers and verifying that they are the same. For the parsing, we refer to [Minerva: Solving Quantitative Reasoning Problems with Language Models](https://research.google/blog/minerva-solving-quantitative-reasoning-problems-with-language-models/). To simplify the parsing logic, as a next step we also plan to follow what [simple-eval](https://github.com/openai/simple-evals) does: use an LLM as a judge to check whether the generated answer matches the expected answer.

## Test Plan

On the server side, spin up a server with the open-benchmark template:

`llama stack run llama_stack/templates/open-benchmark/run.yaml`

On the client side, issue an open benchmark eval request:

`llama-stack-client --endpoint xxx eval run-benchmark "meta-reference-math-500" --model-id "meta-llama/Llama-3.3-70B-Instruct" --output-dir "/home/markchen1015/" --num-examples 20`

and get the aggregated eval results:

<img width="238" alt="Screenshot 2025-03-10 at 7 57 04 PM" src="https://github.com/user-attachments/assets/2c9da042-3b70-470e-a7c4-69f4cc24d1fb" />

Checked the generated answers and the related scoring, and they make sense.
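
The commit message above says the hard part of the benchmark is parsing the generated and expected answers and checking that they agree. The snippet below is a minimal sketch of that idea, not the code added in this PR: it assumes the model writes its final answer inside `\boxed{...}` (the convention used in MATH-style solutions) and applies a few simple string normalizations before comparing. The function names `extract_boxed`, `normalize`, and `is_match` are hypothetical and chosen only for illustration.

```python
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # braces never balanced


def normalize(answer: str) -> str:
    """Apply a few illustrative normalizations (assumed, not exhaustive)."""
    answer = answer.strip().rstrip(".")
    # Drop sizing and spacing commands that do not change the value.
    for token in ("\\left", "\\right", "\\!", "\\,", "$"):
        answer = answer.replace(token, "")
    # Treat \dfrac and \tfrac as \frac.
    answer = answer.replace("dfrac", "frac").replace("tfrac", "frac")
    # Remove all remaining whitespace.
    return re.sub(r"\s+", "", answer)


def is_match(generated: str, expected: str) -> bool:
    """True if the boxed answer in `generated` equals `expected` after normalization."""
    boxed = extract_boxed(generated)
    if boxed is None:
        return False
    return normalize(boxed) == normalize(expected)
```

For example, `is_match("So the answer is \\boxed{\\dfrac{1}{2}}.", "\\frac{1}{2}")` returns `True`. Plain string normalization like this misses mathematically equivalent forms such as `0.5` versus `\frac{1}{2}`, which is one reason an LLM-as-judge comparison is mentioned as a next step.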
apis	fix: Revert "feat: record token usage for inference API (#1300)" (#1476)	2025-03-07 10:16:47 -08:00
cli	feat(server): Use system packages for execution (#1252)	2025-03-10 16:01:03 -07:00
distribution	feat(server): Use system packages for execution (#1252)	2025-03-10 16:01:03 -07:00
models/llama	refactor: move a few tests to top-level tests/ directory	2025-03-03 17:33:39 -08:00
providers	feat: [new open benchmark] Math 500 (#1538)	2025-03-10 20:38:28 -07:00
scripts	refactor(test): introduce --stack-config and simplify options (#1404)	2025-03-05 17:02:02 -08:00
strong_typing	Ensure that deprecations for fields follow through to OpenAPI	2025-02-19 13:54:04 -08:00
templates	feat: [new open benchmark] Math 500 (#1538)	2025-03-10 20:38:28 -07:00
__init__.py	export LibraryClient	2024-12-13 12:08:00 -08:00
env.py	refactor(test): move tools, evals, datasetio, scoring and post training tests (#1401)	2025-03-04 14:53:47 -08:00
log.py	chore: add color to Env Variable message (#1525)	2025-03-10 15:29:40 -07:00
schema_utils.py	ci: add mypy for static type checking (#1101)	2025-02-21 13:15:40 -08:00