# NVIDIA NeMo Evaluator Eval Provider

## Overview

For the first integration, Benchmarks are mapped to Evaluation Configs in NeMo Evaluator. The full evaluation config object is provided as part of the benchmark metadata; the `dataset_id` and `scoring_functions` fields are not used.

Below are a few examples of how to register a benchmark (which in turn creates an evaluation config in NeMo Evaluator) and how to trigger an evaluation.

### Example: registering an academic benchmark

```
POST /eval/benchmarks
{
  "benchmark_id": "mmlu",
  "dataset_id": "",
  "scoring_functions": [],
  "metadata": {
    "type": "mmlu"
  }
}
```
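
For convenience, here is a minimal Python sketch of the same registration call made with the `requests` library. The base URL assumes a locally running Llama Stack server, and the routes may sit behind a version prefix (e.g. `/v1`) depending on your deployment; neither is specified by this README.

```python
# Minimal sketch: register the "mmlu" academic benchmark over HTTP.
import requests

BASE_URL = "http://localhost:8321"  # assumed local Llama Stack server; adjust as needed

payload = {
    "benchmark_id": "mmlu",
    "dataset_id": "",
    "scoring_functions": [],
    "metadata": {"type": "mmlu"},
}

resp = requests.post(f"{BASE_URL}/eval/benchmarks", json=payload)
resp.raise_for_status()
```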

### Example: registering a custom evaluation

```
POST /eval/benchmarks
{
  "benchmark_id": "my-custom-benchmark",
  "dataset_id": "",
  "scoring_functions": [],
  "metadata": {
    "type": "custom",
    "params": {
      "parallelism": 8
    },
    "tasks": {
      "qa": {
        "type": "completion",
        "params": {
          "template": {
            "prompt": "{{prompt}}",
            "max_tokens": 200
          }
        },
        "dataset": {
          "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
        },
        "metrics": {
          "bleu": {
            "type": "bleu",
            "params": {
              "references": [
                "{{ideal_response}}"
              ]
            }
          }
        }
      }
    }
  }
}
```
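
The custom registration can be scripted the same way; the sketch below reuses the metadata object shown above verbatim, with the same assumptions about the server address.

```python
# Sketch: register the custom benchmark defined above.
import requests

BASE_URL = "http://localhost:8321"  # assumed local Llama Stack server

custom_metadata = {
    "type": "custom",
    "params": {"parallelism": 8},
    "tasks": {
        "qa": {
            "type": "completion",
            "params": {"template": {"prompt": "{{prompt}}", "max_tokens": 200}},
            "dataset": {
                "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
            },
            "metrics": {
                "bleu": {"type": "bleu", "params": {"references": ["{{ideal_response}}"]}}
            },
        }
    },
}

resp = requests.post(
    f"{BASE_URL}/eval/benchmarks",
    json={
        "benchmark_id": "my-custom-benchmark",
        "dataset_id": "",
        "scoring_functions": [],
        "metadata": custom_metadata,
    },
)
resp.raise_for_status()
```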

### Example: triggering a benchmark/custom evaluation

```
POST /eval/benchmarks/{benchmark_id}/jobs
{
  "benchmark_id": "my-custom-benchmark",
  "benchmark_config": {
    "eval_candidate": {
      "type": "model",
      "model": "meta-llama/Llama3.1-8B-Instruct",
      "sampling_params": {
        "max_tokens": 100,
        "temperature": 0.7
      }
    },
    "scoring_params": {}
  }
}
```

Response example:

```
{
  "job_id": "eval-1234",
  "status": "in_progress"
}
```
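
A sketch of submitting this job from Python and capturing the returned `job_id`; `BASE_URL` and the benchmark ID are carried over from the earlier sketches and remain assumptions.

```python
# Sketch: submit an evaluation job and capture the returned job_id.
import requests

BASE_URL = "http://localhost:8321"  # assumed local Llama Stack server
benchmark_id = "my-custom-benchmark"

resp = requests.post(
    f"{BASE_URL}/eval/benchmarks/{benchmark_id}/jobs",
    json={
        "benchmark_id": benchmark_id,
        "benchmark_config": {
            "eval_candidate": {
                "type": "model",
                "model": "meta-llama/Llama3.1-8B-Instruct",
                "sampling_params": {"max_tokens": 100, "temperature": 0.7},
            },
            "scoring_params": {},
        },
    },
)
resp.raise_for_status()
job_id = resp.json()["job_id"]  # e.g. "eval-1234"
```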

### Example: getting the status of a job

```
GET /eval/benchmarks/{benchmark_id}/jobs/{job_id}
```

Response example:

```
{
  "job_id": "eval-1234",
  "status": "in_progress"
}
```
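
Since the job runs asynchronously in NeMo Evaluator, a simple polling loop is usually enough to wait for completion. The sketch below keeps polling while the status is `in_progress`; the interval and terminal-status handling are choices, not part of the API.

```python
# Sketch: poll the job status until it leaves "in_progress".
import time
import requests

BASE_URL = "http://localhost:8321"  # assumed local Llama Stack server
benchmark_id = "my-custom-benchmark"
job_id = "eval-1234"  # value returned when the job was created

while True:
    resp = requests.get(f"{BASE_URL}/eval/benchmarks/{benchmark_id}/jobs/{job_id}")
    resp.raise_for_status()
    status = resp.json()["status"]
    if status != "in_progress":
        break
    time.sleep(10)  # polling interval is an arbitrary choice

print("final status:", status)
```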

### Example: cancelling a job

```
POST /eval/benchmarks/{benchmark_id}/jobs/{job_id}/cancel
```
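
The equivalent call from Python, with the same assumed base URL:

```python
# Sketch: cancel a running evaluation job.
import requests

BASE_URL = "http://localhost:8321"  # assumed local Llama Stack server
resp = requests.post(f"{BASE_URL}/eval/benchmarks/my-custom-benchmark/jobs/eval-1234/cancel")
resp.raise_for_status()
```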

### Example: getting the results

```
GET /eval/benchmarks/{benchmark_id}/results
{
  "generations": [],
  "scores": {
    "{benchmark_id}": {
      "score_rows": [],
      "aggregated_results": {
        "tasks": {},
        "groups": {}
      }
    }
  }
}
```
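
Once the job has finished, the results can be fetched and the aggregated scores pulled out of the response; the sketch below assumes the same local server and benchmark ID used throughout these examples.

```python
# Sketch: fetch results and read the aggregated scores for the benchmark.
import requests

BASE_URL = "http://localhost:8321"  # assumed local Llama Stack server
benchmark_id = "my-custom-benchmark"

resp = requests.get(f"{BASE_URL}/eval/benchmarks/{benchmark_id}/results")
resp.raise_for_status()
results = resp.json()
aggregated = results["scores"][benchmark_id]["aggregated_results"]
print(aggregated)
```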