# NVIDIA NeMo Evaluator Eval Provider

## Overview
For the first integration, Benchmarks are mapped to Evaluation Configs in the NeMo Evaluator. The full evaluation config object is provided as part of the metadata. The `dataset_id` and `scoring_functions` are not used.
Below are a few examples of how to register a benchmark (which in turn creates an evaluation config in NeMo Evaluator) and how to trigger an evaluation.
### Example: registering an academic benchmark
`POST /eval/benchmarks`

```json
{
    "benchmark_id": "mmlu",
    "dataset_id": "",
    "scoring_functions": [],
    "metadata": {
        "type": "mmlu"
    }
}
```
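For illustration, here is a minimal sketch of issuing this request from Python with the `requests` library. The server address (including the default port `8321`) and the `register_benchmark` helper are assumptions for the example, not part of the provider.

```python
import requests

# Assumed address of a running Llama Stack server; adjust for your deployment.
BASE_URL = "http://localhost:8321"

def register_benchmark(benchmark_id: str, metadata: dict) -> None:
    """Register a benchmark, which creates an evaluation config in NeMo Evaluator."""
    payload = {
        "benchmark_id": benchmark_id,
        "dataset_id": "",          # not used by this provider
        "scoring_functions": [],   # not used by this provider
        "metadata": metadata,
    }
    resp = requests.post(f"{BASE_URL}/eval/benchmarks", json=payload)
    resp.raise_for_status()

# Academic benchmark: the metadata "type" selects the evaluation in NeMo Evaluator.
register_benchmark("mmlu", {"type": "mmlu"})
```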
### Example: registering a custom evaluation
`POST /eval/benchmarks`

```json
{
    "benchmark_id": "my-custom-benchmark",
    "dataset_id": "",
    "scoring_functions": [],
    "metadata": {
        "type": "custom",
        "params": {
            "parallelism": 8
        },
        "tasks": {
            "qa": {
                "type": "completion",
                "params": {
                    "template": {
                        "prompt": "{{prompt}}",
                        "max_tokens": 200
                    }
                },
                "dataset": {
                    "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
                },
                "metrics": {
                    "bleu": {
                        "type": "bleu",
                        "params": {
                            "references": [
                                "{{ideal_response}}"
                            ]
                        }
                    }
                }
            }
        }
    }
}
```
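Continuing the sketch above, the custom evaluation config is just a nested `metadata` dictionary passed through the same registration call (the `register_benchmark` helper is the illustrative one defined earlier):

```python
# Custom evaluation: tasks, datasets, and metrics are forwarded verbatim
# to NeMo Evaluator as part of the benchmark metadata.
custom_metadata = {
    "type": "custom",
    "params": {"parallelism": 8},
    "tasks": {
        "qa": {
            "type": "completion",
            "params": {"template": {"prompt": "{{prompt}}", "max_tokens": 200}},
            "dataset": {
                "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
            },
            "metrics": {
                "bleu": {
                    "type": "bleu",
                    "params": {"references": ["{{ideal_response}}"]},
                }
            },
        }
    },
}

register_benchmark("my-custom-benchmark", custom_metadata)
```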
### Example: triggering a benchmark/custom evaluation
`POST /eval/benchmarks/{benchmark_id}/jobs`

```json
{
    "benchmark_id": "my-custom-benchmark",
    "benchmark_config": {
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama3.1-8B-Instruct",
            "sampling_params": {
                "max_tokens": 100,
                "temperature": 0.7
            }
        },
        "scoring_params": {}
    }
}
```
Response example:
```json
{
    "job_id": "eval-1234",
    "status": "in_progress"
}
```
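A sketch of starting the job and capturing the returned `job_id`, continuing the same illustrative `requests`/`BASE_URL` setup:

```python
def run_eval(benchmark_id: str, benchmark_config: dict) -> str:
    """Start an evaluation job and return its job_id."""
    resp = requests.post(
        f"{BASE_URL}/eval/benchmarks/{benchmark_id}/jobs",
        json={"benchmark_id": benchmark_id, "benchmark_config": benchmark_config},
    )
    resp.raise_for_status()
    return resp.json()["job_id"]

job_id = run_eval(
    "my-custom-benchmark",
    {
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama3.1-8B-Instruct",
            "sampling_params": {"max_tokens": 100, "temperature": 0.7},
        },
        "scoring_params": {},
    },
)
print(job_id)  # e.g. "eval-1234"
```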
### Example: getting the status of a job
`GET /eval/benchmarks/{benchmark_id}/jobs/{job_id}`
Response example:
```json
{
    "job_id": "eval-1234",
    "status": "in_progress"
}
```
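The status endpoint can be polled until the job leaves the in-progress states; a simple polling sketch, where the set of non-terminal status values checked is an assumption based on the example responses:

```python
import time

def wait_for_job(benchmark_id: str, job_id: str, poll_seconds: float = 10.0) -> str:
    """Poll the job status endpoint until the job reaches a terminal state."""
    while True:
        resp = requests.get(f"{BASE_URL}/eval/benchmarks/{benchmark_id}/jobs/{job_id}")
        resp.raise_for_status()
        status = resp.json()["status"]
        if status not in ("scheduled", "in_progress"):
            return status
        time.sleep(poll_seconds)

final_status = wait_for_job("my-custom-benchmark", job_id)
```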
### Example: cancelling a job
`POST /eval/benchmarks/{benchmark_id}/jobs/{job_id}/cancel`
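Cancellation is a single POST against the job; for example, with the same illustrative setup:

```python
# Cancel a running evaluation job.
resp = requests.post(f"{BASE_URL}/eval/benchmarks/my-custom-benchmark/jobs/{job_id}/cancel")
resp.raise_for_status()
```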
### Example: getting the results
`GET /eval/benchmarks/{benchmark_id}/results`

```json
{
    "generations": [],
    "scores": {
        "{benchmark_id}": {
            "score_rows": [],
            "aggregated_results": {
                "tasks": {},
                "groups": {}
            }
        }
    }
}
```
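Once the job has finished, the results can be fetched and the aggregated metrics read from the `scores` map, which is keyed by benchmark id; a minimal sketch, continuing the setup above:

```python
def get_results(benchmark_id: str) -> dict:
    """Fetch evaluation results for a benchmark."""
    resp = requests.get(f"{BASE_URL}/eval/benchmarks/{benchmark_id}/results")
    resp.raise_for_status()
    return resp.json()

results = get_results("my-custom-benchmark")
aggregated = results["scores"]["my-custom-benchmark"]["aggregated_results"]
print(aggregated["tasks"], aggregated["groups"])
```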