llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-07-08 14:54:35 +00:00

Author	SHA1	Message	Date
yyymeta	a626b7bce3	feat: [new open benchmark] BFCL_v3 (#1578) # What does this PR do? Create a new dataset BFCL_v3 from https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v3_multi_turn.html. Each question asks the model to perform a task described in natural language and provides a set of available functions and their schemas for the model to choose from. The model must write the function call, including the function name and parameters, that achieves the stated purpose. The results are validated against the provided ground truth by comparing ASTs, to make sure that the generated function call and the ground-truth function call are syntactically and semantically equivalent. ## Test Plan Start the server with ``` llama stack run ./llama_stack/templates/ollama/run.yaml ``` then send traffic ``` llama-stack-client eval run-benchmark "bfcl" --model-id meta-llama/Llama-3.2-3B-Instruct --output-dir /tmp/gpqa --num-examples 2 ```	2025-03-14 12:50:49 -07:00
Ashwin Bharambe	d072b5fa0c	test: add unit test to ensure all config types are instantiable (#1601)	2025-03-12 22:29:58 -07:00
Xi Yan	c7139b0b67	fix: fix precommit (#1594) # What does this PR do? - fix precommit ## Test Plan CI	2025-03-12 11:59:21 -07:00
Botao Chen	0b0be70605	feat: Add open benchmark template codegen (#1579) ## What does this PR do? As the title says, add codegen for the open-benchmark template. ## test Checked the newly generated run.yaml file; it is identical before and after the change. Also add a small improvement to the together template so that a missing TOGETHER_API_KEY won't crash the server, which is consistent with the user experience of other remote providers.	2025-03-12 11:12:08 -07:00
Botao Chen	e3edca7739	feat: [new open benchmark] Math 500 (#1538) ## What does this PR do? Created a new math_500 open benchmark based on OpenAI's [Let's Verify Step by Step](https://arxiv.org/abs/2305.20050) paper and Hugging Face's [HuggingFaceH4/MATH-500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500) dataset. The challenging part of this benchmark is parsing the generated and expected answers and verifying whether they are the same. For the parsing part, we refer to [Minerva: Solving Quantitative Reasoning Problems with Language Models](https://research.google/blog/minerva-solving-quantitative-reasoning-problems-with-language-models/). To simplify the parsing logic, as the next step we also plan to refer to what [simple-evals](https://github.com/openai/simple-evals) does, using an LLM as judge to check whether the generated answer matches the expected answer. ## Test Plan On the server side, spin up a server with the open-benchmark template `llama stack run llama_stack/templates/open-benchmark/run.yaml`. On the client side, issue an open benchmark eval request `llama-stack-client --endpoint xxx eval run-benchmark "meta-reference-math-500" --model-id "meta-llama/Llama-3.3-70B-Instruct" --output-dir "/home/markchen1015/" --num-examples 20` and get the aggregated eval results <img width="238" alt="Screenshot 2025-03-10 at 7 57 04 PM" src="https://github.com/user-attachments/assets/2c9da042-3b70-470e-a7c4-69f4cc24d1fb" /> Checked the generated answers and the related scoring, and they make sense.	2025-03-10 20:38:28 -07:00
Botao Chen	ade76e4a69	fix: update the open benchmark eval doc (#1497) ## What does this PR do? Add proper links to the doc. ## test Previewed the doc <img width="1304" alt="Screenshot 2025-03-07 at 3 03 22 PM" src="https://github.com/user-attachments/assets/0a0e2a3d-2420-4af0-99c3-a4786855fae0" /> <img width="1303" alt="Screenshot 2025-03-07 at 3 03 32 PM" src="https://github.com/user-attachments/assets/e11844e7-ee8a-4a64-8617-abafa02b2868" />	2025-03-07 15:05:27 -08:00
Botao Chen	89e449c2cb	fix: Fix open benchmark template (#1496) ## What does this PR do? Delete the open_benchmark template, which the auto codegen generated by accident.	2025-03-07 14:49:10 -08:00
Botao Chen	4dccf916d1	feat: open benchmark template and doc (#1465) ## What does this PR do? - Provide a distro template that lets developers easily run the open benchmarks llama stack supports on llama and non-llama models. - Provide a doc on how to run open benchmark evals via the CLI, plus an open benchmark contributing guide. (Closes #1375) ## Test Plan Open benchmark eval results on Llama, GPT, Gemini and Claude <img width="771" alt="Screenshot 2025-03-06 at 7 33 05 PM" src="https://github.com/user-attachments/assets/1bd85456-b9b9-4b37-af76-4ce1d2bac00e" /> Doc preview <img width="944" alt="Screenshot 2025-03-06 at 7 33 58 PM" src="https://github.com/user-attachments/assets/f4e5866d-b395-4c40-aa8b-080edeb5cdb6" /> <img width="955" alt="Screenshot 2025-03-06 at 7 34 04 PM" src="https://github.com/user-attachments/assets/629defb6-d5e4-473c-aa03-308bce386fb4" /> <img width="965" alt="Screenshot 2025-03-06 at 7 35 29 PM" src="https://github.com/user-attachments/assets/c21ff96c-9e8c-4c54-b6b8-25883125f4cf" /> <img width="957" alt="Screenshot 2025-03-06 at 7 35 37 PM" src="https://github.com/user-attachments/assets/47571c90-1381-4e2c-bbed-c4f3a60578d0" />	2025-03-07 10:37:55 -08:00
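
The BFCL_v3 commit (a626b7bce3) describes validating a generated function call against a ground-truth call by checking their ASTs. A minimal sketch of that idea, using only Python's standard `ast` module, might look like the following. The helper names `parse_call` and `calls_match` are hypothetical and are not taken from the llama-stack code; the sketch also assumes keyword-only arguments with literal values, which is narrower than what the real benchmark checker handles.

```python
import ast


def parse_call(source: str) -> tuple[str, dict[str, object]]:
    """Parse 'name(k=v, ...)' into (function name, keyword arguments).

    Hypothetical helper: supports keyword arguments with literal values only.
    """
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError(f"not a function call: {source!r}")
    if node.args:
        raise ValueError("this sketch only supports keyword arguments")
    name = ast.unparse(node.func)  # also covers dotted names such as math.sqrt
    # literal_eval raises ValueError for non-literal values (names, calls, ...)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def calls_match(generated: str, expected: str) -> bool:
    """Return True when both strings denote the same call with the same arguments."""
    try:
        return parse_call(generated) == parse_call(expected)
    except (SyntaxError, ValueError):
        return False


if __name__ == "__main__":
    # Argument order and quoting style differ, but the parsed calls are equal.
    print(calls_match('get_weather(city="Paris", days=3)',
                      "get_weather(days=3, city='Paris')"))
```

Comparing parsed structures rather than raw strings means differences in whitespace, quoting, or keyword order do not count as mismatches, while a different function name or argument value does.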

8 commits
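
The Math 500 commit (e3edca7739) notes that the hard part of the benchmark is parsing the generated and expected answers and checking whether they are the same. As an illustration only, the sketch below extracts the last `\boxed{...}` expression from a model response and compares it to the expected answer after a light normalization. It assumes the model was prompted to put its final answer in `\boxed{}`; the function names and normalization rules are hypothetical, far simpler than the Minerva-style parsing the commit refers to, and not the llama-stack implementation.

```python
from typing import Optional


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:  # matching close brace of \boxed{
                return "".join(out)
        out.append(ch)
    return None  # unbalanced braces


def normalize(answer: str) -> str:
    """Apply a few cosmetic LaTeX normalizations (illustrative, not exhaustive)."""
    answer = answer.strip().rstrip(".")
    for token in ("\\left", "\\right", "\\!", "\\,", "$", " "):
        answer = answer.replace(token, "")
    return answer.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")


def is_correct(generated: str, expected: str) -> bool:
    """Compare the last boxed answer in the response to the expected answer."""
    got = last_boxed(generated)
    return got is not None and normalize(got) == normalize(expected)


if __name__ == "__main__":
    response = "Adding the terms gives \\boxed{\\dfrac{1}{2}}."
    print(is_correct(response, "\\frac{1}{2}"))
```

String normalization of this kind cannot tell that `0.5` and `\frac{1}{2}` are the same value, which is one reason an LLM-as-judge comparison, as the commit message proposes for a later step, can be attractive.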