feat: [New Eval Benchmark] IfEval (#1708)

# What does this PR do?
This PR adds a new open eval benchmark, IfEval, based on the paper
https://arxiv.org/abs/2311.07911, to measure a model's
instruction-following capability.
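
For context, here is a minimal sketch of the kind of scoring the IFEval paper describes: each prompt carries one or more programmatically verifiable instructions, and accuracy is reported per prompt (all instructions followed) and per instruction. The checker functions `min_words` and `contains_keyword`, the `score` helper, and the sample responses are hypothetical illustrations, not the code added in this PR.

```python
from typing import Callable, Dict, List

# Hypothetical checkers: each returns a function that takes a model
# response and reports whether one verifiable instruction was followed.
def min_words(n: int) -> Callable[[str], bool]:
    return lambda response: len(response.split()) >= n

def contains_keyword(keyword: str) -> Callable[[str], bool]:
    return lambda response: keyword.lower() in response.lower()

def score(results: List[List[bool]]) -> Dict[str, float]:
    """results[i][j] says whether response i followed its j-th instruction."""
    flat = [ok for per_prompt in results for ok in per_prompt]
    return {
        # A prompt counts only if every one of its instructions is followed.
        "prompt_level_accuracy": sum(all(r) for r in results) / len(results),
        # Each instruction counts on its own.
        "instruction_level_accuracy": sum(flat) / len(flat),
    }

# Made-up responses paired with the instructions they were asked to follow.
examples = [
    ("Llama Stack is a framework for building AI apps.",
     [min_words(5), contains_keyword("llama")]),
    ("Too short.",
     [min_words(5), contains_keyword("short")]),
]
results = [[check(resp) for check in checks] for resp, checks in examples]
metrics = score(results)
print(metrics)
```

The second response meets only one of its two instructions, so it lowers the prompt-level score more than the instruction-level score.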


## Test Plan
Spin up a Llama Stack server with the open-benchmark template.

Run `llama-stack-client --endpoint xxx eval run-benchmark
"meta-reference-ifeval" --model-id "meta-llama/Llama-3.3-70B-Instruct"
--output-dir "/home/markchen1015/" --num-examples 20` on the client side
to get the aggregate eval results.
Botao Chen 2025-03-19 16:39:59 -07:00 committed by GitHub
parent a7008dc15d
commit f369871083
13 changed files with 3520 additions and 1 deletions

@@ -6268,6 +6268,7 @@
"type": "string",
"enum": [
"average",
"weighted_average",
"median",
"categorical_count",
"accuracy"