mirror of https://github.com/meta-llama/llama-stack.git synced 2025-12-03 18:00:36 +00:00

History

Jash Gulabrai 8713d67ce3 fix: Correctly parse algorithm_config when launching NVIDIA customization job; fix internal request handler (#2025 ) # What does this PR do? This addresses 2 bugs I ran into when launching a fine-tuning job with the NVIDIA Adapter: 1. Session handling in `_make_request` helper function returns an error. ``` INFO: 127.0.0.1:55831 - "POST /v1/post-training/supervised-fine-tune HTTP/1.1" 500 Internal Server Error 16:11:45.643 [END] /v1/post-training/supervised-fine-tune [StatusCode.OK] (270.44ms) 16:11:45.643 [ERROR] Error executing endpoint route='/v1/post-training/supervised-fine-tune' method='post' Traceback (most recent call last): File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 201, in endpoint return await maybe_await(value) File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 161, in maybe_await return await value File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/providers/remote/post_training/nvidia/post_training.py", line 408, in supervised_fine_tune response = await self._make_request( File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/providers/remote/post_training/nvidia/post_training.py", line 98, in _make_request async with self.session.request(method, url, params=params, json=json, **kwargs) as response: File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/aiohttp/client.py", line 1425, in __aenter__ self._resp: _RetType = await self._coro File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/aiohttp/client.py", line 579, in _request handle = tm.start() File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/aiohttp/helpers.py", line 587, in start return self._loop.call_at(when, self.__call__) File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/base_events.py", line 724, in call_at self._check_closed() File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/base_events.py", line 510, in _check_closed raise RuntimeError('Event loop is closed') RuntimeError: Event loop is closed ``` Note: This only occurred when initializing the client like so: ``` client = LlamaStackClient( base_url="http://0.0.0.0:8321" ) response = client.post_training.supervised_fine_tune(...) # Returns error ``` I didn't run into this issue when using the library client: ``` client = LlamaStackAsLibraryClient("nvidia") client.initialize() response = client.post_training.supervised_fine_tune(...) # Works fine ``` 2. The `algorithm_config` param in `supervised_fine_tune` is parsed as a `dict` when run from unit tests, but a Pydantic model when invoked using the Llama Stack client. So, the call fails outside of unit tests: ``` INFO: 127.0.0.1:54024 - "POST /v1/post-training/supervised-fine-tune HTTP/1.1" 500 Internal Server Error 21:14:02.315 [END] /v1/post-training/supervised-fine-tune [StatusCode.OK] (71.18ms) 21:14:02.314 [ERROR] Error executing endpoint route='/v1/post-training/supervised-fine-tune' method='post' Traceback (most recent call last): File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 205, in endpoint return await maybe_await(value) File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 164, in maybe_await return await value File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/providers/remote/post_training/nvidia/post_training.py", line 407, in supervised_fine_tune "adapter_dim": algorithm_config.get("adapter_dim"), File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/pydantic/main.py", line 891, in __getattr__ raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}') AttributeError: 'LoraFinetuningConfig' object has no attribute 'get' ``` The code assumes `algorithm_config` should be `dict`, so I just handle both cases. [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan 1. I ran a local Llama Stack server with the necessary env vars: ``` lama stack run llama_stack/templates/nvidia/run.yaml --port 8321 --env ... ``` And invoked `supervised_fine_tune` to confirm neither of the errors above occur. ``` client = LlamaStackClient( base_url="http://0.0.0.0:8321" ) response = client.post_training.supervised_fine_tune(...) ``` 2. I confirmed the unit tests still pass: `./scripts/unit-tests.sh tests/unit/providers/nvidia/test_supervised_fine_tuning.py` [//]: # (## Documentation) --------- Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>		2025-04-25 13:21:50 -07:00
..
__init__.py	feat: Add nemo customizer (#1448 )	2025-03-25 11:01:10 -07:00
config.py	feat: Add nemo customizer (#1448 )	2025-03-25 11:01:10 -07:00
models.py	fix: Add llama-3.2-1b-instruct to NVIDIA fine-tuned model list (#1975 )	2025-04-16 15:02:08 -07:00
post_training.py	fix: Correctly parse algorithm_config when launching NVIDIA customization job; fix internal request handler (#2025 )	2025-04-25 13:21:50 -07:00
README.md	feat: NVIDIA allow non-llama model registration (#1859 )	2025-04-24 17:13:33 -07:00
utils.py	feat: Add nemo customizer (#1448 )	2025-03-25 11:01:10 -07:00

README.md

NVIDIA Post-Training Provider for LlamaStack

This provider enables fine-tuning of LLMs using NVIDIA's NeMo Customizer service.

Features

Supervised fine-tuning of Llama models
LoRA fine-tuning support
Job management and status tracking

Getting Started

Prerequisites

LlamaStack with NVIDIA configuration
Access to Hosted NVIDIA NeMo Customizer service
Dataset registered in the Hosted NVIDIA NeMo Customizer service
Base model downloaded and available in the Hosted NVIDIA NeMo Customizer service

Setup

Build the NVIDIA environment:

llama stack build --template nvidia --image-type conda

Basic Usage using the LlamaStack Python Client

Create Customization Job

Initialize the client

import os

os.environ["NVIDIA_API_KEY"] = "your-api-key"
os.environ["NVIDIA_CUSTOMIZER_URL"] = "http://nemo.test"
os.environ["NVIDIA_DATASET_NAMESPACE"] = "default"
os.environ["NVIDIA_PROJECT_ID"] = "test-project"
os.environ["NVIDIA_OUTPUT_MODEL_DIR"] = "test-example-model@v1"

from llama_stack.distribution.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient("nvidia")
client.initialize()

Configure fine-tuning parameters

from llama_stack_client.types.post_training_supervised_fine_tune_params import (
    TrainingConfig,
    TrainingConfigDataConfig,
    TrainingConfigOptimizerConfig,
)
from llama_stack_client.types.algorithm_config_param import LoraFinetuningConfig

Set up LoRA configuration

algorithm_config = LoraFinetuningConfig(type="LoRA", adapter_dim=16)

Configure training data

data_config = TrainingConfigDataConfig(
    dataset_id="your-dataset-id",  # Use client.datasets.list() to see available datasets
    batch_size=16,
)

Configure optimizer

optimizer_config = TrainingConfigOptimizerConfig(
    lr=0.0001,
)

Set up training configuration

training_config = TrainingConfig(
    n_epochs=2,
    data_config=data_config,
    optimizer_config=optimizer_config,
)

Start fine-tuning job

training_job = client.post_training.supervised_fine_tune(
    job_uuid="unique-job-id",
    model="meta-llama/Llama-3.1-8B-Instruct",
    checkpoint_dir="",
    algorithm_config=algorithm_config,
    training_config=training_config,
    logger_config={},
    hyperparam_search_config={},
)

List all jobs

jobs = client.post_training.job.list()

Check job status

job_status = client.post_training.job.status(job_uuid="your-job-id")

Cancel a job

client.post_training.job.cancel(job_uuid="your-job-id")

Inference with the fine-tuned model

1. Register the model

from llama_stack.apis.models import Model, ModelType

client.models.register(
    model_id="test-example-model@v1",
    provider_id="nvidia",
    provider_model_id="test-example-model@v1",
    model_type=ModelType.llm,
)

2. Inference with the fine-tuned model

response = client.inference.completion(
    content="Complete the sentence using one word: Roses are red, violets are ",
    stream=False,
    model_id="test-example-model@v1",
    sampling_params={
        "max_tokens": 50,
    },
)
print(response.content)