NVIDIA DatasetIO Provider for LlamaStack

This provider enables dataset management using NVIDIA's NeMo Customizer service.

Features

  • Register datasets for fine-tuning LLMs
  • Unregister datasets

Getting Started

Prerequisites

  • LlamaStack with NVIDIA configuration
  • Access to a hosted NVIDIA NeMo Microservice
  • API key for authentication with the NVIDIA service

Setup

Build the NVIDIA environment:

llama stack build --distro nvidia --image-type venv

Basic Usage with the LlamaStack Python Client

Initialize the client

import os

# Configure the NVIDIA provider; the values below are placeholders for your environment.
os.environ["NVIDIA_API_KEY"] = "your-api-key"
os.environ["NVIDIA_CUSTOMIZER_URL"] = "http://nemo.test"
os.environ["NVIDIA_DATASET_NAMESPACE"] = "default"
os.environ["NVIDIA_PROJECT_ID"] = "test-project"

from llama_stack.core.library_client import LlamaStackAsLibraryClient

# Create an in-process client backed by the NVIDIA distribution and load its providers.
client = LlamaStackAsLibraryClient("nvidia")
client.initialize()
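
The same calls also work against a running stack server instead of the in-process library client. A minimal sketch, assuming the NVIDIA distribution is already serving (for example via llama stack run nvidia) and that the llama_stack_client package is installed; the base URL and port are placeholders, not values from this README:

from llama_stack_client import LlamaStackClient

# Connect to an already-running stack server (URL and port are placeholders).
client = LlamaStackClient(base_url="http://localhost:8321")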

Register a dataset

client.datasets.register(
    purpose="post-training/messages",
    dataset_id="my-training-dataset",
    source={"type": "uri", "uri": "hf://datasets/default/sample-dataset"},
    metadata={
        "format": "json",
        "description": "Dataset for LLM fine-tuning",
        "provider": "nvidia",
    },
)
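
Registration only records the dataset with the stack; a quick way to confirm it is visible before using it for fine-tuning is to check the listing (a small sketch that relies only on the list() call shown in the next section):

registered = client.datasets.list()
assert any(d.identifier == "my-training-dataset" for d in registered), "dataset not registered"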

Get a list of all registered datasets

datasets = client.datasets.list()
for dataset in datasets:
    print(f"Dataset ID: {dataset.identifier}")
    print(f"Description: {dataset.metadata.get('description', '')}")
    print(f"Source: {dataset.source.uri}")
    print("---")

Unregister a dataset

client.datasets.unregister(dataset_id="my-training-dataset")
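
When experimenting, it is easy to leave scratch datasets behind. A minimal sketch of a cleanup helper built only from the register and unregister calls above (the dataset ID and URI passed in are placeholders chosen by the caller):

from contextlib import contextmanager


@contextmanager
def temporary_dataset(client, dataset_id, uri):
    # Register the dataset for the duration of the block, then always unregister it.
    client.datasets.register(
        purpose="post-training/messages",
        dataset_id=dataset_id,
        source={"type": "uri", "uri": uri},
    )
    try:
        yield dataset_id
    finally:
        client.datasets.unregister(dataset_id=dataset_id)


# Example: the dataset exists only while the block runs.
with temporary_dataset(client, "scratch-dataset", "hf://datasets/default/sample-dataset"):
    pass  # e.g. launch a fine-tuning job that references "scratch-dataset"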