fix: regex parser to support more answer formats (#1425)

# What does this PR do?
add better-performance prompt: existing prompts expect a generated
response that ends in "Answer :". But during test, we found that for
GPQA, the prompt used by meta internal genEval "The best answer is
[ABCD]" achieves higher accuracy .


## Test Plan

```

(myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$llama-stack-client eval run-benchmark "meta-reference-gpqa-cot"  --model-id   meta-llama/Llama-4-17B-Llama-API  --output-dir /tmp/gpqa    --num-examples   20

....

Sending HTTP Request: GET http://localhost:5001/v1/scoring-functions/basic::regex_parser_multiple_choice_answer
 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20  [ 0:04:46 < 0:00:00 , 0 it/s ]
✓ Results saved to: /tmp/gpqa/meta-reference-gpqa-cot_results.json!

(myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$
(myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$
(myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$
(myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$ tail /tmp/gpqa/meta-reference-gpqa-cot_results.json
    {
      "score": 0.0
    },
    {
      "accuracy": 0.5,
      "num_correct": 10.0,
      "num_total": 20
    }
  ]
}(myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$
```

[//]: # (## Documentation)
This commit is contained in:
yyymeta 2025-03-05 11:52:07 -08:00 committed by GitHub
parent 00570fde31
commit 1c6fbd95a5
No known key found for this signature in database
GPG key ID: B5690EEEBB952194

View file

@ -12,6 +12,7 @@ from llama_stack.apis.scoring_functions import (
)
MULTILINGUAL_ANSWER_REGEXES = [
r"The best answer is ",
r"Answer\s*:",
r"Answer\s*:", # Korean invisible character
r"উত্তর\s*:",