llama-stack/llama_stack/providers/remote/datasetio/nvidia/README.md
Rashmi Pawar e6bbf8d20b
feat: Add NVIDIA NeMo datastore (#1852)
# What does this PR do?
Implemetation of NeMO Datastore register, unregister API.

Open Issues: 
- provider_id gets set to `localfs` in client.datasets.register() as it
is specified in routing_tables.py: DatasetsRoutingTable
see: #1860

Currently I have passed `"provider_id":"nvidia"` in metadata and have
parsed that in `DatasetsRoutingTable`
(Not the best approach, but just a quick workaround to make it work for
now.)

## Test Plan
- Unit test cases: `pytest
tests/unit/providers/nvidia/test_datastore.py`
```bash
========================================================== test session starts ===========================================================
platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0, asyncio-0.26.0, nbval-0.11.0, metadata-3.1.1, html-4.1.1, cov-6.1.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 2 items                                                                                                                        

tests/unit/providers/nvidia/test_datastore.py ..                                                                                   [100%]

============================================================ warnings summary ============================================================

====================================================== 2 passed, 1 warning in 0.84s ======================================================
```

cc: @dglogo, @mattf, @yanxi0830
2025-04-28 09:41:59 -07:00

74 lines
1.7 KiB
Markdown

# NVIDIA DatasetIO Provider for LlamaStack
This provider enables dataset management using NVIDIA's NeMo Customizer service.
## Features
- Register datasets for fine-tuning LLMs
- Unregister datasets
## Getting Started
### Prerequisites
- LlamaStack with NVIDIA configuration
- Access to Hosted NVIDIA NeMo Microservice
- API key for authentication with the NVIDIA service
### Setup
Build the NVIDIA environment:
```bash
llama stack build --template nvidia --image-type conda
```
### Basic Usage using the LlamaStack Python Client
#### Initialize the client
```python
import os
os.environ["NVIDIA_API_KEY"] = "your-api-key"
os.environ["NVIDIA_CUSTOMIZER_URL"] = "http://nemo.test"
os.environ["NVIDIA_USER_ID"] = "llama-stack-user"
os.environ["NVIDIA_DATASET_NAMESPACE"] = "default"
os.environ["NVIDIA_PROJECT_ID"] = "test-project"
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
client = LlamaStackAsLibraryClient("nvidia")
client.initialize()
```
#### Register a dataset
```python
client.datasets.register(
purpose="post-training/messages",
dataset_id="my-training-dataset",
source={"type": "uri", "uri": "hf://datasets/default/sample-dataset"},
metadata={
"format": "json",
"description": "Dataset for LLM fine-tuning",
"provider": "nvidia",
},
)
```
#### Get a list of all registered datasets
```python
datasets = client.datasets.list()
for dataset in datasets:
print(f"Dataset ID: {dataset.identifier}")
print(f"Description: {dataset.metadata.get('description', '')}")
print(f"Source: {dataset.source.uri}")
print("---")
```
#### Unregister a dataset
```python
client.datasets.unregister(dataset_id="my-training-dataset")
```