mirror of https://github.com/meta-llama/llama-stack.git synced 2025-07-21 03:59:42 +00:00

History

Jash Gulabrai c04ab0133d In-progress: e2e notebook with partial Eval integration		2025-04-08 14:08:01 -04:00
..
__init__.py	In-progress: e2e notebook with partial Eval integration	2025-04-08 14:08:01 -04:00
config.py	In-progress: e2e notebook with partial Eval integration	2025-04-08 14:08:01 -04:00
eval.py	In-progress: e2e notebook with partial Eval integration	2025-04-08 14:08:01 -04:00
README.md	In-progress: e2e notebook with partial Eval integration	2025-04-08 14:08:01 -04:00

README.md

NVIDIA NeMo Evaluator Eval Provider

Overview

For the first integration, Benchmarks are mapped to Evaluation Configs on in the NeMo Evaluator. The full evaluation config object is provided as part of the meta-data. The dataset_id and scoring_functions are not used.

Below are a few examples of how to register a benchmark, which in turn will create an evaluation config in NeMo Evaluator and how to trigger an evaluation.

Example for register an academic benchmark

POST /eval/benchmarks

{
  "benchmark_id": "mmlu",
  "dataset_id": "",
  "scoring_functions": [],
  "metadata": {
    "type": "mmlu"
  }
}

Example for register a custom evaluation

POST /eval/benchmarks

{
  "benchmark_id": "my-custom-benchmark",
  "dataset_id": "",
  "scoring_functions": [],
  "metadata": {
    "type": "custom",
    "params": {
      "parallelism": 8
    },
    "tasks": {
      "qa": {
        "type": "completion",
        "params": {
          "template": {
            "prompt": "{{prompt}}",
            "max_tokens": 200
          }
        },
        "dataset": {
          "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
        },
        "metrics": {
          "bleu": {
            "type": "bleu",
            "params": {
              "references": [
                "{{ideal_response}}"
              ]
            }
          }
        }
      }
    }
  }
}

Example for triggering a benchmark/custom evaluation

POST /eval/benchmarks/{benchmark_id}/jobs

{
  "benchmark_id": "my-custom-benchmark",
  "benchmark_config": {
    "eval_candidate": {
      "type": "model",
      "model": "meta/llama-3.1-8b-instruct",
      "sampling_params": {
        "max_tokens": 100,
        "temperature": 0.7
      }
    },
    "scoring_params": {}
  }
}

Response example:

{
    "job_id": "1234",
    "status": "in_progress"
}

Example for getting the status of a job

GET /eval/benchmarks/{benchmark_id}/jobs/{job_id}

Example for cancelling a job

POST /eval/benchmarks/{benchmark_id}/jobs/{job_id}/cancel

Example for getting the results

GET /eval/benchmarks/{benchmark_id}/results

{
  "generations": [],
  "scores": {
    "{benchmark_id}": {
      "score_rows": [],
      "aggregated_results": {
        "tasks": {},
        "groups": {}
      }
    }
  }
}