llama-stack-mirror/llama_stack/providers/inline
yyymeta 1c6fbd95a5
fix: regex parser to support more answer formats (#1425)
# What does this PR do?
Adds a better-performing prompt. The existing prompts expect the generated
response to end in "Answer :". During testing, we found that for GPQA
the prompt used by Meta's internal genEval, "The best answer is
[ABCD]", achieves higher accuracy.
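
As a rough illustration of the kind of matching involved, the sketch below pulls a multiple-choice letter out of a response that ends in either format. The patterns and the `extract_choice` helper are hypothetical and written only for this note; they are not the regexes used by `basic::regex_parser_multiple_choice_answer`.

```python
import re
from typing import Optional

# Hypothetical patterns, one per answer format mentioned above.
# The phrase is matched case-insensitively; the letter must be an
# uppercase A-D standing alone (optionally wrapped in parentheses).
ANSWER_PATTERNS = [
    re.compile(r"(?i:Answer)\s*:\s*\(?([ABCD])\b\)?"),
    re.compile(r"(?i:The best answer is)\s*\(?([ABCD])\b\)?"),
]


def extract_choice(response: str) -> Optional[str]:
    """Return the letter from the last matching answer phrase, or None."""
    best = None  # (start position, letter) of the latest match seen
    for pattern in ANSWER_PATTERNS:
        for match in pattern.finditer(response):
            if best is None or match.start() > best[0]:
                best = (match.start(), match.group(1))
    return best[1] if best else None


if __name__ == "__main__":
    print(extract_choice("Some reasoning...\nAnswer: B"))
    print(extract_choice("Some reasoning. The best answer is C."))
```

Taking the last match rather than the first is a design choice for chain-of-thought output, where the model may mention candidate answers before committing to a final one.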


## Test Plan

```

(myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$ llama-stack-client eval run-benchmark "meta-reference-gpqa-cot" --model-id meta-llama/Llama-4-17B-Llama-API --output-dir /tmp/gpqa --num-examples 20

....

Sending HTTP Request: GET http://localhost:5001/v1/scoring-functions/basic::regex_parser_multiple_choice_answer
 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20  [ 0:04:46 < 0:00:00 , 0 it/s ]
✓ Results saved to: /tmp/gpqa/meta-reference-gpqa-cot_results.json!

(myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$ tail /tmp/gpqa/meta-reference-gpqa-cot_results.json
    {
      "score": 0.0
    },
    {
      "accuracy": 0.5,
      "num_correct": 10.0,
      "num_total": 20
    }
  ]
}(myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$
```
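
The aggregate at the end of the results file is consistent with a simple mean over per-example scores. The sketch below shows that arithmetic under the assumption that each row carries a `score` of 0.0 or 1.0; `aggregate_accuracy` is a hypothetical helper, not the scoring provider's code, and the rows are synthetic, chosen only to reproduce the totals shown above.

```python
from typing import Dict, List


def aggregate_accuracy(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Summarise per-example scores (assumed 0.0 or 1.0) into accuracy counts."""
    num_total = len(rows)
    num_correct = sum(row["score"] for row in rows)
    # Guard against an empty run to avoid dividing by zero.
    accuracy = num_correct / num_total if num_total else 0.0
    return {
        "accuracy": accuracy,
        "num_correct": num_correct,
        "num_total": num_total,
    }


if __name__ == "__main__":
    # Synthetic rows: 10 correct and 10 incorrect out of 20 examples.
    rows = [{"score": 1.0}] * 10 + [{"score": 0.0}] * 10
    print(aggregate_accuracy(rows))
    # {'accuracy': 0.5, 'num_correct': 10.0, 'num_total': 20}
```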

[//]: # (## Documentation)
2025-03-05 11:52:07 -08:00

| Name | Last commit | Date |
| --- | --- | --- |
| agents | fix: Agent uses the first configured vector_db_id when documents are provided (#1276) | 2025-03-04 21:44:13 -08:00 |
| datasetio | build: format codebase imports using ruff linter (#1028) | 2025-02-13 10:06:21 -08:00 |
| eval | chore: rename task_config to benchmark_config (#1397) | 2025-03-04 12:44:04 -08:00 |
| inference | refactor: move generation.py to llama3 | 2025-03-03 13:50:19 -08:00 |
| ios/inference | chore: removed executorch submodule (#1265) | 2025-02-25 21:57:21 -08:00 |
| post_training | fix: replace eval with json decoding for format_adapter (#1328) | 2025-02-28 11:25:23 -08:00 |
| safety | chore: move all Llama Stack types from llama-models to llama-stack (#1098) | 2025-02-14 09:10:59 -08:00 |
| scoring | fix: regex parser to support more answer formats (#1425) | 2025-03-05 11:52:07 -08:00 |
| telemetry | refactor(test): unify vector_io tests and make them configurable (#1398) | 2025-03-04 13:37:45 -08:00 |
| tool_runtime | chore: remove dependency on llama_models completely (#1344) | 2025-03-01 12:48:08 -08:00 |
| vector_io | refactor(test): unify vector_io tests and make them configurable (#1398) | 2025-03-04 13:37:45 -08:00 |
| `__init__.py` | impls -> inline, adapters -> remote (#381) | 2024-11-06 14:54:05 -08:00 |