# NVIDIA NeMo Evaluator Eval Provider

## Overview
For the first integration, Benchmarks are mapped to Evaluation Configs in the NeMo Evaluator. The full evaluation config object is provided as part of the metadata. The `dataset_id` and `scoring_functions` fields are not used.
Below are a few examples of how to register a benchmark (which in turn creates an evaluation config in NeMo Evaluator) and how to trigger an evaluation.
### Example: registering an academic benchmark
```
POST /eval/benchmarks
{
    "benchmark_id": "mmlu",
    "dataset_id": "",
    "scoring_functions": [],
    "metadata": {
        "type": "mmlu"
    }
}
```
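A minimal Python sketch of the registration call above, assuming a Llama Stack server reachable at `http://localhost:8321` (the base URL is deployment-specific):

```python
import requests

# Assumption: a Llama Stack server with the NVIDIA eval provider running locally.
BASE_URL = "http://localhost:8321"

payload = {
    "benchmark_id": "mmlu",
    "dataset_id": "",
    "scoring_functions": [],
    "metadata": {"type": "mmlu"},
}

# Registering the benchmark creates the matching evaluation config in NeMo Evaluator.
resp = requests.post(f"{BASE_URL}/eval/benchmarks", json=payload)
resp.raise_for_status()
```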
### Example: registering a custom evaluation
```
POST /eval/benchmarks
{
    "benchmark_id": "my-custom-benchmark",
    "dataset_id": "",
    "scoring_functions": [],
    "metadata": {
        "type": "custom",
        "params": {
            "parallelism": 8
        },
        "tasks": {
            "qa": {
                "type": "completion",
                "params": {
                    "template": {
                        "prompt": "{{prompt}}",
                        "max_tokens": 200
                    }
                },
                "dataset": {
                    "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
                },
                "metrics": {
                    "bleu": {
                        "type": "bleu",
                        "params": {
                            "references": [
                                "{{ideal_response}}"
                            ]
                        }
                    }
                }
            }
        }
    }
}
```
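For larger custom configs like this one, it can be convenient to keep the body in a file. A sketch under the same local-server assumption, where `custom_benchmark.json` is a hypothetical file holding the JSON above:

```python
import json

import requests

BASE_URL = "http://localhost:8321"  # assumption: local Llama Stack server

# "custom_benchmark.json" is a hypothetical file containing the request body shown above.
with open("custom_benchmark.json") as f:
    payload = json.load(f)

resp = requests.post(f"{BASE_URL}/eval/benchmarks", json=payload)
resp.raise_for_status()
```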
### Example: triggering a benchmark/custom evaluation
```
POST /eval/benchmarks/{benchmark_id}/jobs
{
    "benchmark_id": "my-custom-benchmark",
    "benchmark_config": {
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama3.1-8B-Instruct",
            "sampling_params": {
                "max_tokens": 100,
                "temperature": 0.7
            }
        },
        "scoring_params": {}
    }
}
```
Response example:
```json
{
    "job_id": "eval-1234",
    "status": "in_progress"
}
```
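A sketch of triggering the job and capturing the returned `job_id`, under the same local-server assumption:

```python
import requests

BASE_URL = "http://localhost:8321"  # assumption: local Llama Stack server
benchmark_id = "my-custom-benchmark"

body = {
    "benchmark_id": benchmark_id,
    "benchmark_config": {
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama3.1-8B-Instruct",
            "sampling_params": {"max_tokens": 100, "temperature": 0.7},
        },
        "scoring_params": {},
    },
}

resp = requests.post(f"{BASE_URL}/eval/benchmarks/{benchmark_id}/jobs", json=body)
resp.raise_for_status()
job_id = resp.json()["job_id"]  # e.g. "eval-1234"
```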
### Example: getting the status of a job
```
GET /eval/benchmarks/{benchmark_id}/jobs/{job_id}
```
Response example:
```json
{
    "job_id": "eval-1234",
    "status": "in_progress"
}
```
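Since evaluation jobs run asynchronously, a common pattern is to poll this endpoint until the job leaves the `in_progress` state. A sketch under the same assumptions:

```python
import time

import requests

BASE_URL = "http://localhost:8321"  # assumption: local Llama Stack server
benchmark_id = "my-custom-benchmark"
job_id = "eval-1234"

# Poll the status endpoint until the job is no longer in progress.
while True:
    resp = requests.get(f"{BASE_URL}/eval/benchmarks/{benchmark_id}/jobs/{job_id}")
    resp.raise_for_status()
    status = resp.json()["status"]
    if status != "in_progress":
        break
    time.sleep(10)

print(f"Job {job_id} finished with status: {status}")
```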
### Example: cancelling a job
```
POST /eval/benchmarks/{benchmark_id}/jobs/{job_id}/cancel
```
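A minimal sketch of the cancel call above (no request body is shown for this endpoint), under the same assumptions:

```python
import requests

BASE_URL = "http://localhost:8321"  # assumption: local Llama Stack server

# Cancel a running evaluation job, using the IDs from the examples above.
resp = requests.post(f"{BASE_URL}/eval/benchmarks/my-custom-benchmark/jobs/eval-1234/cancel")
resp.raise_for_status()
```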
### Example: getting the results
```
GET /eval/benchmarks/{benchmark_id}/results
```
Response example:
```json
{
    "generations": [],
    "scores": {
        "{benchmark_id}": {
            "score_rows": [],
            "aggregated_results": {
                "tasks": {},
                "groups": {}
            }
        }
    }
}
```
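A sketch of fetching the results once the job has completed and pulling out the aggregated metrics; the field paths follow the response shape above:

```python
import requests

BASE_URL = "http://localhost:8321"  # assumption: local Llama Stack server
benchmark_id = "my-custom-benchmark"

resp = requests.get(f"{BASE_URL}/eval/benchmarks/{benchmark_id}/results")
resp.raise_for_status()
results = resp.json()

# Per the response shape above, scores are keyed by benchmark_id.
aggregated = results["scores"][benchmark_id]["aggregated_results"]
print(aggregated)
```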