llama-stack/llama_stack/apis
Botao Chen f369871083
feat: [New Eval Benchamark] IfEval (#1708)
# What does this PR do?
This PR adds a new open eval benchmark, IfEval, based on the paper
https://arxiv.org/abs/2311.07911, to measure a model's
instruction-following capability.


## Test Plan
Spin up a Llama Stack server with the open-benchmark template.

Run `llama-stack-client --endpoint xxx eval run-benchmark
"meta-reference-ifeval" --model-id "meta-llama/Llama-3.3-70B-Instruct"
--output-dir "/home/markchen1015/" --num-examples 20` on the client side
to get the aggregate eval results.
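
To script the client-side step, the command can be assembled programmatically. The following is a minimal sketch, not part of the PR: the helper name `build_ifeval_command` is hypothetical, and it uses only the subcommand and flags that appear in the test plan. It builds and prints the command line without executing it, so it makes no assumption about a running server.

```python
import shlex


def build_ifeval_command(endpoint, model_id, output_dir, num_examples):
    """Assemble the argument list for the run-benchmark call from the test plan.

    Hypothetical helper for illustration; only flags shown in the test plan are used.
    """
    return [
        "llama-stack-client",
        "--endpoint", endpoint,
        "eval", "run-benchmark", "meta-reference-ifeval",
        "--model-id", model_id,
        "--output-dir", output_dir,
        "--num-examples", str(num_examples),
    ]


cmd = build_ifeval_command(
    endpoint="xxx",  # placeholder kept as-is from the test plan
    model_id="meta-llama/Llama-3.3-70B-Instruct",
    output_dir="/home/markchen1015/",
    num_examples=20,
)

# Print a shell-safe rendering of the command rather than running it.
print(shlex.join(cmd))
```

Passing the list form (`cmd`) to `subprocess.run` instead of a shell string would avoid quoting issues if the endpoint or output directory contains spaces.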
2025-03-19 16:39:59 -07:00
agents chore: deprecate ToolResponseMessage in agent.resume API (#1566) 2025-03-12 12:10:21 -07:00
batch_inference fix: solve ruff B008 warnings (#1444) 2025-03-06 16:48:35 -08:00
benchmarks fix: return 4xx for non-existent resources in GET requests (#1635) 2025-03-18 14:06:53 -07:00
common ci: add mypy for static type checking (#1101) 2025-02-21 13:15:40 -08:00
datasetio feat(api): (1/n) datasets api clean up (#1573) 2025-03-17 16:55:45 -07:00
datasets fix: fix open-benchmark template (#1695) 2025-03-19 11:27:11 -07:00
eval fix: return 4xx for non-existent resources in GET requests (#1635) 2025-03-18 14:06:53 -07:00
files fix: return 4xx for non-existent resources in GET requests (#1635) 2025-03-18 14:06:53 -07:00
inference feat(api): remove tool_name from ToolResponseMessage (#1599) 2025-03-12 19:41:48 -07:00
inspect feat: add provider API for listing and inspecting provider info (#1429) 2025-03-13 15:07:21 -07:00
models fix: return 4xx for non-existent resources in GET requests (#1635) 2025-03-18 14:06:53 -07:00
post_training chore: fix mypy violations in post_training modules (#1548) 2025-03-18 14:58:16 -07:00
providers fix: OpenAPI with provider get (#1627) 2025-03-13 19:56:32 -07:00
safety chore: move all Llama Stack types from llama-models to llama-stack (#1098) 2025-02-14 09:10:59 -08:00
scoring docs: api documentation for agents/eval/scoring/datasets (#1400) 2025-03-05 09:40:24 -08:00
scoring_functions feat: [New Eval Benchamark] IfEval (#1708) 2025-03-19 16:39:59 -07:00
shields fix: return 4xx for non-existent resources in GET requests (#1635) 2025-03-18 14:06:53 -07:00
synthetic_data_generation chore: move all Llama Stack types from llama-models to llama-stack (#1098) 2025-02-14 09:10:59 -08:00
telemetry feat: Add new compact MetricInResponse type (#1593) 2025-03-12 15:45:44 -07:00
tools docs: add documentation for RAGDocument (#1693) 2025-03-19 10:16:00 -07:00
vector_dbs fix: return 4xx for non-existent resources in GET requests (#1635) 2025-03-18 14:06:53 -07:00
vector_io chore: move all Llama Stack types from llama-models to llama-stack (#1098) 2025-02-14 09:10:59 -08:00
__init__.py API Updates (#73) 2024-09-17 19:51:35 -07:00
datatypes.py feat: add provider API for listing and inspecting provider info (#1429) 2025-03-13 15:07:21 -07:00
resource.py fix!: update eval-tasks -> benchmarks (#1032) 2025-02-13 16:40:58 -08:00
version.py llama-stack version alpha -> v1 2025-01-15 05:58:09 -08:00