llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-12-07 18:57:21 +00:00

yyymeta 1c6fbd95a5 fix: regex parser to support more answer formats (#1425)	2025-03-05 11:52:07 -08:00

# What does this PR do?

Add a better-performing prompt. Existing prompts expect a generated response that ends in "Answer :". During testing, we found that for GPQA, the prompt used by Meta's internal genEval, "The best answer is [ABCD]", achieves higher accuracy.

## Test Plan

```
(myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$ llama-stack-client eval run-benchmark "meta-reference-gpqa-cot" --model-id meta-llama/Llama-4-17B-Llama-API --output-dir /tmp/gpqa --num-examples 20
....
Sending HTTP Request: GET http://localhost:5001/v1/scoring-functions/basic::regex_parser_multiple_choice_answer
100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 [ 0:04:46 < 0:00:00 , 0 it/s ]
✓ Results saved to: /tmp/gpqa/meta-reference-gpqa-cot_results.json!
(myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$ tail /tmp/gpqa/meta-reference-gpqa-cot_results.json
      {
        "score": 0.0
      },
      {
        "accuracy": 0.5,
        "num_correct": 10.0,
        "num_total": 20
      }
    ]
  }
```
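
The commit message above names two answer formats: a response ending in "Answer :" and one ending in "The best answer is [ABCD]". As a rough illustration of how a parser might accept both, here is a minimal sketch. The patterns and the `parse_choice` helper are hypothetical and written only for this example; they are not the expressions used by `basic::regex_parser_multiple_choice_answer` in llama-stack.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; the real scoring function
# in llama-stack may use different expressions.
ANSWER_PATTERNS = [
    # Matches "Answer: C", "Answer : (D)", and similar.
    re.compile(r"Answer\s*:\s*\(?([A-D])\b"),
    # Matches "The best answer is B." and similar.
    re.compile(r"[Tt]he best answer is\s*\(?([A-D])\b"),
]


def parse_choice(response: str) -> Optional[str]:
    """Return the letter of the last answer statement found, or None."""
    last = None  # (start position, letter) of the latest match so far
    for pattern in ANSWER_PATTERNS:
        for match in pattern.finditer(response):
            if last is None or match.start() > last[0]:
                last = (match.start(), match.group(1))
    return last[1] if last else None
```

Taking the last match rather than the first is a design choice for chain-of-thought output, where the model may mention candidate letters before settling on a final one. The trailing `\b` keeps a word such as "Because" from being read as the choice "B".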
agents	fix: Agent uses the first configured vector_db_id when documents are provided (#1276)	2025-03-04 21:44:13 -08:00
datasetio	build: format codebase imports using ruff linter (#1028)	2025-02-13 10:06:21 -08:00
eval	chore: rename task_config to benchmark_config (#1397)	2025-03-04 12:44:04 -08:00
inference	refactor: move generation.py to llama3	2025-03-03 13:50:19 -08:00
ios/inference	chore: removed executorch submodule (#1265)	2025-02-25 21:57:21 -08:00
post_training	fix: replace eval with json decoding for format_adapter (#1328)	2025-02-28 11:25:23 -08:00
safety	chore: move all Llama Stack types from llama-models to llama-stack (#1098)	2025-02-14 09:10:59 -08:00
scoring	fix: regex parser to support more answer formats (#1425)	2025-03-05 11:52:07 -08:00
telemetry	refactor(test): unify vector_io tests and make them configurable (#1398)	2025-03-04 13:37:45 -08:00
tool_runtime	chore: remove dependency on llama_models completely (#1344)	2025-03-01 12:48:08 -08:00
vector_io	refactor(test): unify vector_io tests and make them configurable (#1398)	2025-03-04 13:37:45 -08:00
__init__.py	`impls` -> `inline`, `adapters` -> `remote` (#381)	2024-11-06 14:54:05 -08:00