llama-stack-mirror/llama_stack/providers/remote/datasetio/nvidia
Sébastien Han ac5fd57387
chore: remove nested imports (#2515)
# What does this PR do?

* Given that our API packages use "import *" in `__init.py__` we don't
need to do `from llama_stack.apis.models.models` but simply from
llama_stack.apis.models. The decision to use `import *` is debatable and
should probably be revisited at one point.

* Remove unneeded Ruff F401 rule
* Consolidate Ruff F403 rule in the pyprojectfrom
llama_stack.apis.models.models

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-06-26 08:01:05 +05:30
..
__init__.py feat: Add NVIDIA NeMo datastore (#1852) 2025-04-28 09:41:59 -07:00
config.py chore: enable pyupgrade fixes (#1806) 2025-05-01 14:23:50 -07:00
datasetio.py chore: remove nested imports (#2515) 2025-06-26 08:01:05 +05:30
README.md feat: Add Nvidia e2e beginner notebook and tool calling notebook (#1964) 2025-06-16 11:29:01 -04:00

NVIDIA DatasetIO Provider for LlamaStack

This provider enables dataset management using NVIDIA's NeMo Customizer service.

Features

  • Register datasets for fine-tuning LLMs
  • Unregister datasets

Getting Started

Prerequisites

  • LlamaStack with NVIDIA configuration
  • Access to Hosted NVIDIA NeMo Microservice
  • API key for authentication with the NVIDIA service

Setup

Build the NVIDIA environment:

llama stack build --template nvidia --image-type conda

Basic Usage using the LlamaStack Python Client

Initialize the client

import os

os.environ["NVIDIA_API_KEY"] = "your-api-key"
os.environ["NVIDIA_CUSTOMIZER_URL"] = "http://nemo.test"
os.environ["NVIDIA_DATASET_NAMESPACE"] = "default"
os.environ["NVIDIA_PROJECT_ID"] = "test-project"
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient("nvidia")
client.initialize()

Register a dataset

client.datasets.register(
    purpose="post-training/messages",
    dataset_id="my-training-dataset",
    source={"type": "uri", "uri": "hf://datasets/default/sample-dataset"},
    metadata={
        "format": "json",
        "description": "Dataset for LLM fine-tuning",
        "provider": "nvidia",
    },
)

Get a list of all registered datasets

datasets = client.datasets.list()
for dataset in datasets:
    print(f"Dataset ID: {dataset.identifier}")
    print(f"Description: {dataset.metadata.get('description', '')}")
    print(f"Source: {dataset.source.uri}")
    print("---")

Unregister a dataset

client.datasets.unregister(dataset_id="my-training-dataset")