llama-stack-mirror/llama_stack/templates
Botao Chen e3edca7739
feat: [new open benchmark] Math 500 (#1538)
## What does this PR do?
Created a new math_500 open-benchmark based on OpenAI's [Let's Verify
Step by Step](https://arxiv.org/abs/2305.20050) paper and Hugging Face's
[HuggingFaceH4/MATH-500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500)
dataset.
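
For context, the underlying dataset can be inspected directly with the Hugging Face `datasets` library. This is a quick illustrative sketch only (not code from this PR); the field names are those published on the hub and may differ from the columns the benchmark registers:

```python
# Quick look at the underlying dataset (illustration only, not code from this PR).
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/MATH-500", split="test")
print(len(ds))        # 500 problems sampled from the MATH benchmark
print(ds[0].keys())   # fields include the problem statement and the expected answer
```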

The challenging part of this benchmark is parsing the generated and
expected answers and verifying whether they are the same. For the parsing, we
refer to [Minerva: Solving Quantitative Reasoning Problems with Language
Models](https://research.google/blog/minerva-solving-quantitative-reasoning-problems-with-language-models/).
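
To give a flavor of what that parsing involves, here is a minimal, hypothetical sketch of the kind of answer normalization such a checker performs before comparing strings. It is illustrative only and is not the parser added in this PR:

```python
import re

def normalize_final_answer(ans: str) -> str:
    """Illustrative only (not the PR's actual parser): strip common LaTeX
    wrappers so superficially different renderings of the same answer match."""
    ans = ans.strip()
    # Pull the contents out of \boxed{...} if present.
    m = re.search(r"\\boxed\{(.*)\}", ans)
    if m:
        ans = m.group(1)
    # Drop math delimiters and \text{...} wrappers.
    ans = ans.replace("$", "")
    ans = re.sub(r"\\text\{(.*?)\}", r"\1", ans)
    # Remove whitespace and trailing punctuation.
    ans = re.sub(r"\s+", "", ans).rstrip(".")
    return ans

def is_correct(generated: str, expected: str) -> bool:
    # Exact match after normalization; the real parser handles many more cases.
    return normalize_final_answer(generated) == normalize_final_answer(expected)
```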

To simplify the parsing logic, as a next step we also plan to follow what
[simple-evals](https://github.com/openai/simple-evals) does: using an LLM as
judge to check whether the generated answer matches the expected answer.
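
A hypothetical sketch of that follow-up is below. The prompt wording and the `chat_fn` callable are placeholders standing in for whatever inference call the stack exposes; none of this is an API defined by this PR:

```python
# Hypothetical sketch of the planned LLM-as-judge check (not implemented in this PR).
JUDGE_PROMPT = """You are grading a math answer.
Question: {question}
Expected answer: {expected}
Submitted answer: {generated}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge_equivalence(chat_fn, question: str, expected: str, generated: str) -> bool:
    """`chat_fn` is any callable that sends a prompt to a model and returns its text reply."""
    reply = chat_fn(JUDGE_PROMPT.format(
        question=question, expected=expected, generated=generated))
    return reply.strip().upper().startswith("CORRECT")
```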


## Test Plan
On the server side, spin up a server with the open-benchmark template: `llama
stack run llama_stack/templates/open-benchmark/run.yaml`

On the client side, issue an open benchmark eval request: `llama-stack-client
--endpoint xxx eval run-benchmark "meta-reference-math-500" --model-id
"meta-llama/Llama-3.3-70B-Instruct" --output-dir "/home/markchen1015/"
--num-examples 20` and get the aggregated eval results
<img width="238" alt="Screenshot 2025-03-10 at 7 57 04 PM" src="https://github.com/user-attachments/assets/2c9da042-3b70-470e-a7c4-69f4cc24d1fb" />

Check the generated answers and the related scoring; they make sense.
2025-03-10 20:38:28 -07:00
| Name | Last commit | Date |
|---|---|---|
| bedrock | refactor(test): unify vector_io tests and make them configurable (#1398) | 2025-03-04 13:37:45 -08:00 |
| cerebras | refactor(test): unify vector_io tests and make them configurable (#1398) | 2025-03-04 13:37:45 -08:00 |
| ci-tests | refactor(test): unify vector_io tests and make them configurable (#1398) | 2025-03-04 13:37:45 -08:00 |
| dell | chore: remove straggler references to llama-models (#1345) | 2025-03-01 14:26:03 -08:00 |
| dev | refactor(test): unify vector_io tests and make them configurable (#1398) | 2025-03-04 13:37:45 -08:00 |
| experimental-post-training | feat: [post training] support save hf safetensor format checkpoint (#845) | 2025-02-25 23:29:08 -08:00 |
| fireworks | refactor(test): move tools, evals, datasetio, scoring and post training tests (#1401) | 2025-03-04 14:53:47 -08:00 |
| groq | fix: register provider model name and HF alias in run.yaml (#1304) | 2025-02-27 16:39:23 -08:00 |
| hf-endpoint | refactor(test): unify vector_io tests and make them configurable (#1398) | 2025-03-04 13:37:45 -08:00 |
| hf-serverless | refactor(test): unify vector_io tests and make them configurable (#1398) | 2025-03-04 13:37:45 -08:00 |
| meta-reference-gpu | refactor(test): unify vector_io tests and make them configurable (#1398) | 2025-03-04 13:37:45 -08:00 |
| meta-reference-quantized-gpu | refactor(test): unify vector_io tests and make them configurable (#1398) | 2025-03-04 13:37:45 -08:00 |
| nvidia | fix: register provider model name and HF alias in run.yaml (#1304) | 2025-02-27 16:39:23 -08:00 |
| ollama | fix: revert to using faiss for ollama distro (#1530) | 2025-03-10 16:15:17 -07:00 |
| open-benchmark | feat: [new open benchmark] Math 500 (#1538) | 2025-03-10 20:38:28 -07:00 |
| passthrough | feat: inference passthrough provider (#1166) | 2025-02-19 21:47:00 -08:00 |
| remote-vllm | refactor(test): move tools, evals, datasetio, scoring and post training tests (#1401) | 2025-03-04 14:53:47 -08:00 |
| sambanova | refactor(test): unify vector_io tests and make them configurable (#1398) | 2025-03-04 13:37:45 -08:00 |
| tgi | refactor(test): unify vector_io tests and make them configurable (#1398) | 2025-03-04 13:37:45 -08:00 |
| together | refactor(test): move tools, evals, datasetio, scoring and post training tests (#1401) | 2025-03-04 14:53:47 -08:00 |
| vllm-gpu | feat: updated inline vllm inference provider (#880) | 2025-03-07 13:38:23 -08:00 |
| __init__.py | Auto-generate distro yamls + docs (#468) | 2024-11-18 14:57:06 -08:00 |
| template.py | refactor(test): unify vector_io tests and make them configurable (#1398) | 2025-03-04 13:37:45 -08:00 |