mirror of https://github.com/meta-llama/llama-stack.git synced 2025-12-28 06:41:59 +00:00


NVIDIA DatasetIO Provider for LlamaStack

This provider enables dataset management using NVIDIA's NeMo Customizer service.

Features

Register datasets for fine-tuning LLMs
Unregister datasets

Getting Started

Prerequisites

LlamaStack with NVIDIA configuration
Access to a hosted NVIDIA NeMo microservice
API key for authentication with the NVIDIA service

Setup

Build the NVIDIA environment:

llama stack build --template nvidia --image-type conda

Basic Usage with the LlamaStack Python Client

Initialize the client

import os

# Provider settings are read from the environment, so set them before creating the client.
os.environ["NVIDIA_API_KEY"] = "your-api-key"
os.environ["NVIDIA_CUSTOMIZER_URL"] = "http://nemo.test"
os.environ["NVIDIA_USER_ID"] = "llama-stack-user"
os.environ["NVIDIA_DATASET_NAMESPACE"] = "default"
os.environ["NVIDIA_PROJECT_ID"] = "test-project"

from llama_stack.distribution.library_client import LlamaStackAsLibraryClient

# Load the "nvidia" distribution in-process as a library client.
client = LlamaStackAsLibraryClient("nvidia")
client.initialize()
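
Before initializing the client, it can help to confirm that the settings above are actually present. The following is a minimal sketch, not part of the provider: the helper name `missing_nvidia_vars` is hypothetical, and treating all five variables as required is an assumption based only on the snippet above.

```python
import os

# Variable names copied from the initialization snippet above.
# Assumption: all five must be set; the provider may define defaults for some.
REQUIRED_VARS = (
    "NVIDIA_API_KEY",
    "NVIDIA_CUSTOMIZER_URL",
    "NVIDIA_USER_ID",
    "NVIDIA_DATASET_NAMESPACE",
    "NVIDIA_PROJECT_ID",
)


def missing_nvidia_vars(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Report any gaps before calling client.initialize().
missing = missing_nvidia_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Failing early on a missing variable tends to be easier to diagnose than an authentication or connection error raised later by the service.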

Register a dataset

client.datasets.register(
    purpose="post-training/messages",
    dataset_id="my-training-dataset",
    source={
        "type": "uri",
        "uri": "hf://datasets/default/sample-dataset"
    },
    metadata={
        "format": "json",
        "description": "Dataset for LLM fine-tuning",
        "provider": "nvidia"
    }
)
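
When registering several datasets, the arguments can be assembled in one place. This is a sketch only: `build_register_args` is a hypothetical helper, and it assumes the URI follows the `hf://datasets/<namespace>/<name>` shape seen in the example above, with `default` being the dataset namespace.

```python
def build_register_args(dataset_id, namespace, name, description=""):
    """Build keyword arguments for client.datasets.register().

    Hypothetical helper for illustration. The URI layout
    hf://datasets/<namespace>/<name> is inferred from the example above.
    """
    if not dataset_id or not namespace or not name:
        raise ValueError("dataset_id, namespace, and name must be non-empty")
    return {
        "purpose": "post-training/messages",
        "dataset_id": dataset_id,
        "source": {
            "type": "uri",
            "uri": f"hf://datasets/{namespace}/{name}",
        },
        "metadata": {
            "format": "json",
            "description": description,
            "provider": "nvidia",
        },
    }


args = build_register_args(
    "my-training-dataset",
    "default",
    "sample-dataset",
    description="Dataset for LLM fine-tuning",
)
# With an initialized client: client.datasets.register(**args)
```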

Get a list of all registered datasets

datasets = client.datasets.list()
for dataset in datasets:
    print(f"Dataset ID: {dataset.identifier}")
    print(f"Description: {dataset.metadata.get('description', '')}")
    print(f"Source: {dataset.source.uri}")
    print("---")

Unregister a dataset

client.datasets.unregister(dataset_id="my-training-dataset")
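
To avoid unregistering an ID that was never registered, the list and unregister calls shown above can be combined. This is a minimal sketch: `unregister_if_present` is a hypothetical helper, and it assumes `client` is the initialized client from the earlier step. How the service responds to an unknown ID is not covered here.

```python
def unregister_if_present(client, dataset_id):
    """Unregister dataset_id only if it appears in client.datasets.list().

    Hypothetical helper for illustration. Returns True when an unregister
    call was made and False when the dataset was not listed.
    """
    registered = {dataset.identifier for dataset in client.datasets.list()}
    if dataset_id not in registered:
        return False
    client.datasets.unregister(dataset_id=dataset_id)
    return True


# With an initialized client:
# unregister_if_present(client, "my-training-dataset")
```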