mirror of https://github.com/meta-llama/llama-stack.git synced 2025-10-04 12:07:34 +00:00


NVIDIA DatasetIO Provider for LlamaStack

This provider enables dataset management using NVIDIA's NeMo Customizer service.

Features

Register datasets for fine-tuning LLMs
Unregister datasets

Getting Started

Prerequisites

LlamaStack with NVIDIA configuration
Access to a hosted NVIDIA NeMo Microservice
API key for authentication with the NVIDIA service

Setup

Build the NVIDIA environment:

llama stack build --distro nvidia --image-type venv

Basic Usage with the LlamaStack Python Client

Initialize the client

import os

# Placeholder values; replace them with your own key, endpoint, namespace, and project.
os.environ["NVIDIA_API_KEY"] = "your-api-key"
os.environ["NVIDIA_CUSTOMIZER_URL"] = "http://nemo.test"
os.environ["NVIDIA_DATASET_NAMESPACE"] = "default"
os.environ["NVIDIA_PROJECT_ID"] = "test-project"
from llama_stack.core.library_client import LlamaStackAsLibraryClient

# Create the client for the "nvidia" distribution after the variables are set.
client = LlamaStackAsLibraryClient("nvidia")
client.initialize()
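
Before initializing the client, it can help to confirm that the environment variables used above are actually set. The following is a minimal sketch, not part of the provider: the helper name `missing_nvidia_vars` is hypothetical, and treating all four variables as required is an assumption based only on the example above.

```python
import os

# Variable names come from the initialization example in this README.
# Assumption: all four are required; the provider may accept defaults for some.
EXPECTED_VARS = (
    "NVIDIA_API_KEY",
    "NVIDIA_CUSTOMIZER_URL",
    "NVIDIA_DATASET_NAMESPACE",
    "NVIDIA_PROJECT_ID",
)


def missing_nvidia_vars(env=None):
    """Return the expected variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED_VARS if not env.get(name)]


missing = missing_nvidia_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.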

Register a dataset

client.datasets.register(
    purpose="post-training/messages",
    dataset_id="my-training-dataset",
    source={"type": "uri", "uri": "hf://datasets/default/sample-dataset"},
    metadata={
        "format": "json",
        "description": "Dataset for LLM fine-tuning",
        "provider": "nvidia",
    },
)
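
When registering several datasets, the arguments above can be assembled by a small helper. This is a sketch under stated assumptions: `build_register_args` is a hypothetical name, and the `hf://datasets/<namespace>/<name>` layout is inferred from the single example URI above rather than from provider documentation.

```python
def build_register_args(dataset_id, namespace, name, description=""):
    """Assemble keyword arguments shaped like the register call above."""
    for label, value in (("dataset_id", dataset_id), ("namespace", namespace), ("name", name)):
        if not value:
            raise ValueError(f"{label} must be a non-empty string")
    return {
        "purpose": "post-training/messages",
        "dataset_id": dataset_id,
        # Assumption: the URI follows hf://datasets/<namespace>/<name>,
        # mirroring the example in this README.
        "source": {"type": "uri", "uri": f"hf://datasets/{namespace}/{name}"},
        "metadata": {
            "format": "json",
            "description": description,
            "provider": "nvidia",
        },
    }


args = build_register_args(
    "my-training-dataset", "default", "sample-dataset", "Dataset for LLM fine-tuning"
)
# With an initialized client: client.datasets.register(**args)
```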

Get a list of all registered datasets

datasets = client.datasets.list()
for dataset in datasets:
    print(f"Dataset ID: {dataset.identifier}")
    print(f"Description: {dataset.metadata.get('description', '')}")
    print(f"Source: {dataset.source.uri}")
    print("---")
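
The loop above assumes every dataset has a `uri` on its source. A slightly more defensive variant might collect the same fields into a dictionary and tolerate a missing `uri`. The helper `summarize_dataset` is hypothetical, and the stand-in object below only imitates the attributes used in the loop; real entries would come from `client.datasets.list()`.

```python
from types import SimpleNamespace


def summarize_dataset(dataset):
    """Return a plain dict of the fields printed in the listing loop."""
    metadata = getattr(dataset, "metadata", None) or {}
    source = getattr(dataset, "source", None)
    return {
        "id": dataset.identifier,
        "description": metadata.get("description", ""),
        # Assumption: some sources may not expose a `uri` attribute.
        "uri": getattr(source, "uri", None),
    }


# Stand-in object for demonstration only.
example = SimpleNamespace(
    identifier="my-training-dataset",
    metadata={"description": "Dataset for LLM fine-tuning"},
    source=SimpleNamespace(type="uri", uri="hf://datasets/default/sample-dataset"),
)
print(summarize_dataset(example))
```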

Unregister a dataset

client.datasets.unregister(dataset_id="my-training-dataset")
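
If the dataset may already have been removed, one option is to check the registered list first. This is a sketch rather than documented provider behavior: `unregister_if_present` is a hypothetical helper built only on the `list` and `unregister` calls shown above, and the stub client exists purely so the example runs without a NeMo endpoint.

```python
from types import SimpleNamespace


def unregister_if_present(client, dataset_id):
    """Unregister dataset_id if it is listed; return True when a call was made."""
    known_ids = {d.identifier for d in client.datasets.list()}
    if dataset_id not in known_ids:
        return False
    client.datasets.unregister(dataset_id=dataset_id)
    return True


class _StubDatasets:
    """In-memory stand-in for client.datasets, for demonstration only."""

    def __init__(self, ids):
        self._ids = list(ids)

    def list(self):
        return [SimpleNamespace(identifier=i) for i in self._ids]

    def unregister(self, dataset_id):
        self._ids.remove(dataset_id)


stub_client = SimpleNamespace(datasets=_StubDatasets(["my-training-dataset"]))
```

With a real client, `unregister_if_present(client, "my-training-dataset")` would issue one list call followed by at most one unregister call.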