# What does this PR do?
This PR adds support for NVIDIA's NeMo Evaluator API to the Llama Stack eval module. The integration enables users to evaluate models via the Llama Stack interface.

## Test Plan
1. Added unit tests and successfully ran from the root of the project: `./scripts/unit-tests.sh tests/unit/providers/nvidia/test_eval.py`
   ```
   tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_cancel PASSED
   tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_result PASSED
   tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_status PASSED
   tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_register_benchmark PASSED
   tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_run_eval PASSED
   ```
2. Verified I could build the Llama Stack image: `LLAMA_STACK_DIR=$(pwd) llama stack build --template nvidia --image-type venv`

Documentation added to `llama_stack/providers/remote/eval/nvidia/README.md`.

---------

Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# NVIDIA NeMo Evaluator Eval Provider
## Overview
For the first integration, Benchmarks are mapped to Evaluation Configs in the NeMo Evaluator. The full evaluation config object is provided as part of the metadata. The `dataset_id` and `scoring_functions` fields are not used.
Below are a few examples of how to register a benchmark (which in turn creates an evaluation config in NeMo Evaluator) and how to trigger an evaluation.
### Example for registering an academic benchmark
```
POST /eval/benchmarks
```
```json
{
  "benchmark_id": "mmlu",
  "dataset_id": "",
  "scoring_functions": [],
  "metadata": {
    "type": "mmlu"
  }
}
```
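
For reference, here is a minimal sketch of the same registration call issued from Python with `requests`. The base URL (and any route prefix such as `/v1`) is an assumption and depends on how your Llama Stack server is deployed:

```python
import requests

# Assumed server address; adjust to your Llama Stack deployment.
BASE_URL = "http://localhost:8321"

# Registration payload from the example above; only `metadata.type` is
# meaningful for this provider, `dataset_id` and `scoring_functions` are unused.
payload = {
    "benchmark_id": "mmlu",
    "dataset_id": "",
    "scoring_functions": [],
    "metadata": {"type": "mmlu"},
}

resp = requests.post(f"{BASE_URL}/eval/benchmarks", json=payload)
resp.raise_for_status()
```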
### Example for registering a custom evaluation
```
POST /eval/benchmarks
```
```json
{
  "benchmark_id": "my-custom-benchmark",
  "dataset_id": "",
  "scoring_functions": [],
  "metadata": {
    "type": "custom",
    "params": {
      "parallelism": 8
    },
    "tasks": {
      "qa": {
        "type": "completion",
        "params": {
          "template": {
            "prompt": "{{prompt}}",
            "max_tokens": 200
          }
        },
        "dataset": {
          "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
        },
        "metrics": {
          "bleu": {
            "type": "bleu",
            "params": {
              "references": [
                "{{ideal_response}}"
              ]
            }
          }
        }
      }
    }
  }
}
```
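
Since a custom evaluation config can get long, one option is to keep the JSON body above in a file and post it unchanged. A sketch under the same base-URL assumption; the file name is illustrative:

```python
import json

import requests

BASE_URL = "http://localhost:8321"  # assumed; adjust to your deployment

# Hypothetical file holding the JSON body shown above.
with open("my_custom_benchmark.json") as f:
    payload = json.load(f)

resp = requests.post(f"{BASE_URL}/eval/benchmarks", json=payload)
resp.raise_for_status()
```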
### Example for triggering a benchmark/custom evaluation
```
POST /eval/benchmarks/{benchmark_id}/jobs
```
```json
{
  "benchmark_id": "my-custom-benchmark",
  "benchmark_config": {
    "eval_candidate": {
      "type": "model",
      "model": "meta-llama/Llama3.1-8B-Instruct",
      "sampling_params": {
        "max_tokens": 100,
        "temperature": 0.7
      }
    },
    "scoring_params": {}
  }
}
```
Response example:
```json
{
  "job_id": "eval-1234",
  "status": "in_progress"
}
```
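
A sketch of triggering the job from Python and capturing the returned `job_id`, again assuming a local Llama Stack server:

```python
import requests

BASE_URL = "http://localhost:8321"  # assumed; adjust to your deployment
benchmark_id = "my-custom-benchmark"

# Request body from the example above.
body = {
    "benchmark_id": benchmark_id,
    "benchmark_config": {
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama3.1-8B-Instruct",
            "sampling_params": {"max_tokens": 100, "temperature": 0.7},
        },
        "scoring_params": {},
    },
}

resp = requests.post(f"{BASE_URL}/eval/benchmarks/{benchmark_id}/jobs", json=body)
resp.raise_for_status()
job_id = resp.json()["job_id"]  # e.g. "eval-1234"
```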
### Example for getting the status of a job
```
GET /eval/benchmarks/{benchmark_id}/jobs/{job_id}
```
Response example:
```json
{
  "job_id": "eval-1234",
  "status": "in_progress"
}
```
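
Since evaluations run asynchronously, the status endpoint is typically polled until the job leaves `in_progress`. A sketch; the polling interval is an arbitrary choice:

```python
import time

import requests

BASE_URL = "http://localhost:8321"  # assumed; adjust to your deployment


def wait_for_job(benchmark_id: str, job_id: str, poll_seconds: float = 30) -> str:
    """Poll the status endpoint until the job is no longer in progress."""
    while True:
        resp = requests.get(f"{BASE_URL}/eval/benchmarks/{benchmark_id}/jobs/{job_id}")
        resp.raise_for_status()
        status = resp.json()["status"]
        if status != "in_progress":
            return status
        time.sleep(poll_seconds)
```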
### Example for cancelling a job
```
POST /eval/benchmarks/{benchmark_id}/jobs/{job_id}/cancel
```
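
Cancellation is a plain POST with no request body; for example:

```python
import requests

BASE_URL = "http://localhost:8321"  # assumed; adjust to your deployment

# Cancel the job started earlier (identifiers taken from the examples above).
resp = requests.post(
    f"{BASE_URL}/eval/benchmarks/my-custom-benchmark/jobs/eval-1234/cancel"
)
resp.raise_for_status()
```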
### Example for getting the results
```
GET /eval/benchmarks/{benchmark_id}/results
```
```json
{
  "generations": [],
  "scores": {
    "{benchmark_id}": {
      "score_rows": [],
      "aggregated_results": {
        "tasks": {},
        "groups": {}
      }
    }
  }
}
```
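
And a sketch of fetching the results once the job has completed, pulling out the aggregated scores keyed by the benchmark id as shown in the response above:

```python
import requests

BASE_URL = "http://localhost:8321"  # assumed; adjust to your deployment
benchmark_id = "my-custom-benchmark"

resp = requests.get(f"{BASE_URL}/eval/benchmarks/{benchmark_id}/results")
resp.raise_for_status()
results = resp.json()

# Per-benchmark aggregated task/group results returned by NeMo Evaluator.
aggregated = results["scores"][benchmark_id]["aggregated_results"]
print(aggregated)
```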