
# NVIDIA NeMo Evaluator Eval Provider

## Overview

For the first integration, benchmarks are mapped to evaluation configs in NeMo Evaluator. The full evaluation config object is provided as part of the benchmark's `metadata`. The `dataset_id` and `scoring_functions` fields are not used.

Below are a few examples of how to register a benchmark, which in turn creates an evaluation config in NeMo Evaluator, and how to trigger an evaluation.

### Example: registering an academic benchmark

```
POST /eval/benchmarks
{
  "benchmark_id": "mmlu",
  "dataset_id": "",
  "scoring_functions": [],
  "metadata": {
    "type": "mmlu"
  }
}
```
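
The same registration can be scripted against a running Llama Stack server. Below is a minimal sketch using `requests`; the base URL (and any API-version prefix your deployment adds to the path) is an assumption of this example, not part of the provider's contract.

```python
# Minimal sketch: register the "mmlu" academic benchmark over HTTP.
# BASE_URL is an assumed local Llama Stack endpoint; adjust it (and any
# API-version prefix in the path) to match your deployment.
import requests

BASE_URL = "http://localhost:8321"

payload = {
    "benchmark_id": "mmlu",
    "dataset_id": "",
    "scoring_functions": [],
    "metadata": {"type": "mmlu"},
}

resp = requests.post(f"{BASE_URL}/eval/benchmarks", json=payload, timeout=30)
resp.raise_for_status()
print("registered benchmark:", payload["benchmark_id"])
```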

### Example: registering a custom evaluation

```
POST /eval/benchmarks
{
  "benchmark_id": "my-custom-benchmark",
  "dataset_id": "",
  "scoring_functions": [],
  "metadata": {
    "type": "custom",
    "params": {
      "parallelism": 8
    },
    "tasks": {
      "qa": {
        "type": "completion",
        "params": {
          "template": {
            "prompt": "{{prompt}}",
            "max_tokens": 200
          }
        },
        "dataset": {
          "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
        },
        "metrics": {
          "bleu": {
            "type": "bleu",
            "params": {
              "references": [
                "{{ideal_response}}"
              ]
            }
          }
        }
      }
    }
  }
}
```
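
Registering the custom evaluation is the same call with a richer `metadata` block. One option is to keep the request body in a JSON file and post it as-is; a sketch under the same assumptions as above (the file name `custom_benchmark.json` is illustrative):

```python
# Minimal sketch: register a custom benchmark whose request body (as shown
# above) is stored in a local JSON file. The file name is illustrative.
import json

import requests

BASE_URL = "http://localhost:8321"  # assumed local Llama Stack endpoint

with open("custom_benchmark.json") as f:
    payload = json.load(f)

resp = requests.post(f"{BASE_URL}/eval/benchmarks", json=payload, timeout=30)
resp.raise_for_status()
print("registered benchmark:", payload["benchmark_id"])
```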

### Example: triggering a benchmark/custom evaluation

```
POST /eval/benchmarks/{benchmark_id}/jobs
{
  "benchmark_id": "my-custom-benchmark",
  "benchmark_config": {
    "eval_candidate": {
      "type": "model",
      "model": "meta-llama/Llama3.1-8B-Instruct",
      "sampling_params": {
        "max_tokens": 100,
        "temperature": 0.7
      }
    },
    "scoring_params": {}
  }
}
```

Response example:

```json
{
  "job_id": "eval-1234",
  "status": "in_progress"
}
```
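
A client-side sketch that starts the run and captures the returned job id (same assumed base URL as above):

```python
# Minimal sketch: start an evaluation job and capture its id.
import requests

BASE_URL = "http://localhost:8321"  # assumed local Llama Stack endpoint
benchmark_id = "my-custom-benchmark"

body = {
    "benchmark_id": benchmark_id,
    "benchmark_config": {
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama3.1-8B-Instruct",
            "sampling_params": {"max_tokens": 100, "temperature": 0.7},
        },
        "scoring_params": {},
    },
}

resp = requests.post(
    f"{BASE_URL}/eval/benchmarks/{benchmark_id}/jobs", json=body, timeout=30
)
resp.raise_for_status()
job_id = resp.json()["job_id"]  # e.g. "eval-1234"
print("started job:", job_id)
```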

### Example: getting the status of a job

```
GET /eval/benchmarks/{benchmark_id}/jobs/{job_id}
```

Response example:

```json
{
  "job_id": "eval-1234",
  "status": "in_progress"
}
```
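
Evaluations run asynchronously, so a caller typically polls this endpoint until the status is no longer `in_progress`. A sketch of such a loop (the polling interval and the assumption that any other status is terminal are choices of this example, not guarantees of the API):

```python
# Minimal sketch: poll a job until it leaves the "in_progress" status.
import time

import requests

BASE_URL = "http://localhost:8321"  # assumed local Llama Stack endpoint
benchmark_id = "my-custom-benchmark"
job_id = "eval-1234"

while True:
    resp = requests.get(
        f"{BASE_URL}/eval/benchmarks/{benchmark_id}/jobs/{job_id}", timeout=30
    )
    resp.raise_for_status()
    status = resp.json()["status"]
    print("status:", status)
    if status != "in_progress":  # treated as terminal for this sketch
        break
    time.sleep(10)  # polling interval is an arbitrary choice
```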

### Example: cancelling a job

```
POST /eval/benchmarks/{benchmark_id}/jobs/{job_id}/cancel
```

### Example: getting the results

```
GET /eval/benchmarks/{benchmark_id}/results
```

Response example:

```json
{
  "generations": [],
  "scores": {
    "{benchmark_id}": {
      "score_rows": [],
      "aggregated_results": {
        "tasks": {},
        "groups": {}
      }
    }
  }
}
```
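
Finally, a sketch that fetches the results once the job has finished, again against the assumed local endpoint:

```python
# Minimal sketch: fetch aggregated results for a finished benchmark run.
import requests

BASE_URL = "http://localhost:8321"  # assumed local Llama Stack endpoint
benchmark_id = "my-custom-benchmark"

resp = requests.get(f"{BASE_URL}/eval/benchmarks/{benchmark_id}/results", timeout=30)
resp.raise_for_status()
results = resp.json()

# Aggregated metrics are keyed by benchmark id under "scores".
aggregated = results["scores"][benchmark_id]["aggregated_results"]
print(aggregated)
```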