# What does this PR do?

Add a better-performing prompt: the existing prompts expect a generated response that ends in "Answer:". During testing, however, we found that for GPQA the prompt used by Meta's internal genEval, "The best answer is [ABCD]", achieves higher accuracy (an illustrative parsing sketch follows the test plan below).

## Test Plan

```
(myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$ llama-stack-client eval run-benchmark "meta-reference-gpqa-cot" --model-id meta-llama/Llama-4-17B-Llama-API --output-dir /tmp/gpqa --num-examples 20
....
Sending HTTP Request: GET http://localhost:5001/v1/scoring-functions/basic::regex_parser_multiple_choice_answer
100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 [ 0:04:46 < 0:00:00 , 0 it/s ]
✓ Results saved to: /tmp/gpqa/meta-reference-gpqa-cot_results.json!
(myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$ tail /tmp/gpqa/meta-reference-gpqa-cot_results.json
    {
      "score": 0.0
    },
    {
      "accuracy": 0.5,
      "num_correct": 10.0,
      "num_total": 20
    }
  ]
}
(myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$
```

[//]: # (## Documentation)
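For context, here is a minimal sketch of how a choice letter could be extracted from each response style. The regex patterns and the `extract_choice` helper below are illustrative assumptions for this PR description, not necessarily the exact patterns used by `basic::regex_parser_multiple_choice_answer`.

```python
import re

# Illustrative patterns only (assumptions); the real scoring function's
# regexes may differ.
OLD_STYLE = re.compile(r"Answer\s*:\s*([ABCD])")          # e.g. "Answer: C"
NEW_STYLE = re.compile(r"The best answer is\s*([ABCD])")  # e.g. "The best answer is C"

def extract_choice(generation: str):
    """Return the predicted choice letter, or None if no pattern matches."""
    for pattern in (NEW_STYLE, OLD_STYLE):
        match = pattern.search(generation)
        if match:
            return match.group(1)
    return None

print(extract_choice("Step-by-step reasoning... The best answer is B."))  # -> B
print(extract_choice("Step-by-step reasoning... Answer: D"))              # -> D
```

The accuracy gain comes from the prompt itself, not the parser: anchoring the model on the literal phrase "The best answer is" makes the final letter easier to elicit and to match reliably.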