# NVIDIA NeMo Evaluator Eval Provider

## Overview
For the first integration, Benchmarks are mapped to Evaluation Configs in the NeMo Evaluator. The full evaluation config object is provided as part of the metadata. The `dataset_id` and `scoring_functions` fields are not used.
Below are a few examples of how to register a benchmark (which in turn creates an evaluation config in NeMo Evaluator) and how to trigger an evaluation.
### Example: registering an academic benchmark
`POST /eval/benchmarks`

```json
{
  "benchmark_id": "mmlu",
  "dataset_id": "",
  "scoring_functions": [],
  "metadata": {
    "type": "mmlu"
  }
}
```
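As a minimal sketch, the registration request above can be issued with plain HTTP from Python. The base URL is an assumption and depends on where your Llama Stack server is running; the same endpoint also accepts the custom-evaluation payload shown in the next example.

```python
import requests

# Assumed address of a locally running Llama Stack server.
BASE_URL = "http://localhost:8321"

payload = {
    "benchmark_id": "mmlu",
    "dataset_id": "",
    "scoring_functions": [],
    "metadata": {"type": "mmlu"},
}

# Registers the benchmark, which creates an evaluation config in NeMo Evaluator.
resp = requests.post(f"{BASE_URL}/eval/benchmarks", json=payload)
resp.raise_for_status()
```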
### Example: registering a custom evaluation
`POST /eval/benchmarks`

```json
{
  "benchmark_id": "my-custom-benchmark",
  "dataset_id": "",
  "scoring_functions": [],
  "metadata": {
    "type": "custom",
    "params": {
      "parallelism": 8
    },
    "tasks": {
      "qa": {
        "type": "completion",
        "params": {
          "template": {
            "prompt": "{{prompt}}",
            "max_tokens": 200
          }
        },
        "dataset": {
          "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
        },
        "metrics": {
          "bleu": {
            "type": "bleu",
            "params": {
              "references": [
                "{{ideal_response}}"
              ]
            }
          }
        }
      }
    }
  }
}
```
### Example: triggering a benchmark/custom evaluation
`POST /eval/benchmarks/{benchmark_id}/jobs`

```json
{
  "benchmark_id": "my-custom-benchmark",
  "benchmark_config": {
    "eval_candidate": {
      "type": "model",
      "model": "meta-llama/Llama3.1-8B-Instruct",
      "sampling_params": {
        "max_tokens": 100,
        "temperature": 0.7
      }
    },
    "scoring_params": {}
  }
}
```
Response example:
```json
{
  "job_id": "eval-1234",
  "status": "in_progress"
}
```
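A hedged Python equivalent of the request above, again assuming the server address; the returned `job_id` is what the status and cancel calls below operate on.

```python
import requests

BASE_URL = "http://localhost:8321"  # assumed server address
benchmark_id = "my-custom-benchmark"

body = {
    "benchmark_id": benchmark_id,
    "benchmark_config": {
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama3.1-8B-Instruct",
            "sampling_params": {"max_tokens": 100, "temperature": 0.7},
        },
        "scoring_params": {},
    },
}

# Start the evaluation job and capture its id, e.g. "eval-1234".
resp = requests.post(f"{BASE_URL}/eval/benchmarks/{benchmark_id}/jobs", json=body)
resp.raise_for_status()
job_id = resp.json()["job_id"]
```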
### Example: getting the status of a job
`GET /eval/benchmarks/{benchmark_id}/jobs/{job_id}`
Response example:
```json
{
  "job_id": "eval-1234",
  "status": "in_progress"
}
```
### Example: cancelling a job
`POST /eval/benchmarks/{benchmark_id}/jobs/{job_id}/cancel`
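Putting the status and cancel endpoints together, here is a sketch of a polling loop that waits for the job to leave `in_progress` and cancels it past a deadline; the timeout and poll interval are arbitrary choices, not values from the provider.

```python
import time

import requests

BASE_URL = "http://localhost:8321"  # assumed server address
benchmark_id = "my-custom-benchmark"
job_id = "eval-1234"

deadline = time.time() + 30 * 60  # arbitrary 30-minute budget
while True:
    job = requests.get(
        f"{BASE_URL}/eval/benchmarks/{benchmark_id}/jobs/{job_id}"
    ).json()
    if job["status"] != "in_progress":
        break
    if time.time() > deadline:
        # Give up and cancel the long-running job.
        requests.post(
            f"{BASE_URL}/eval/benchmarks/{benchmark_id}/jobs/{job_id}/cancel"
        )
        break
    time.sleep(10)  # poll interval
```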
### Example: getting the results
`GET /eval/benchmarks/{benchmark_id}/results`

```json
{
  "generations": [],
  "scores": {
    "{benchmark_id}": {
      "score_rows": [],
      "aggregated_results": {
        "tasks": {},
        "groups": {}
      }
    }
  }
}
```
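And a short sketch for pulling the results once the job has finished, assuming the same server address; the nested keys follow the response shape shown above.

```python
import requests

BASE_URL = "http://localhost:8321"  # assumed server address
benchmark_id = "my-custom-benchmark"

results = requests.get(f"{BASE_URL}/eval/benchmarks/{benchmark_id}/results").json()

# Per the response shape above, scores are keyed by benchmark id.
aggregated = results["scores"][benchmark_id]["aggregated_results"]
print(aggregated["tasks"], aggregated["groups"])
```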