llama-stack-mirror/llama_stack/providers/remote/datasetio/nvidia
Sébastien Han bc64635835
feat: load config class when doing variable substitution
When using bash style substitution env variable in distribution
template, we are processing the string and convert it to the type
associated with the provider's config class. This allows us to return
the proper type. This is crucial for api key since they are not strings
anymore but SecretStr. If the key is unset we will get an empty string
which will result in a Pydantic error like:

```
ERROR    2025-09-25 21:40:44,565 __main__:527 core::server: Error creating app: 1 validation error for AnthropicConfig
         api_key
           Input should be a valid string
             For further information visit
             https://errors.pydantic.dev/2.11/v/string_type
```

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-09-29 09:55:19 +02:00
..
__init__.py feat: Add NVIDIA NeMo datastore (#1852) 2025-04-28 09:41:59 -07:00
config.py feat: load config class when doing variable substitution 2025-09-29 09:55:19 +02:00
datasetio.py chore: remove nested imports (#2515) 2025-06-26 08:01:05 +05:30
README.md chore: rename templates to distributions (#3035) 2025-08-04 11:34:17 -07:00

NVIDIA DatasetIO Provider for LlamaStack

This provider enables dataset management using NVIDIA's NeMo Customizer service.

Features

  • Register datasets for fine-tuning LLMs
  • Unregister datasets

Getting Started

Prerequisites

  • LlamaStack with NVIDIA configuration
  • Access to Hosted NVIDIA NeMo Microservice
  • API key for authentication with the NVIDIA service

Setup

Build the NVIDIA environment:

llama stack build --distro nvidia --image-type venv

Basic Usage using the LlamaStack Python Client

Initialize the client

import os

os.environ["NVIDIA_API_KEY"] = "your-api-key"
os.environ["NVIDIA_CUSTOMIZER_URL"] = "http://nemo.test"
os.environ["NVIDIA_DATASET_NAMESPACE"] = "default"
os.environ["NVIDIA_PROJECT_ID"] = "test-project"
from llama_stack.core.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient("nvidia")
client.initialize()

Register a dataset

client.datasets.register(
    purpose="post-training/messages",
    dataset_id="my-training-dataset",
    source={"type": "uri", "uri": "hf://datasets/default/sample-dataset"},
    metadata={
        "format": "json",
        "description": "Dataset for LLM fine-tuning",
        "provider": "nvidia",
    },
)

Get a list of all registered datasets

datasets = client.datasets.list()
for dataset in datasets:
    print(f"Dataset ID: {dataset.identifier}")
    print(f"Description: {dataset.metadata.get('description', '')}")
    print(f"Source: {dataset.source.uri}")
    print("---")

Unregister a dataset

client.datasets.unregister(dataset_id="my-training-dataset")