Clean up instructions and implementation; reorganize notebooks

This commit is contained in:
Jash Gulabrai 2025-04-18 16:27:19 -04:00
parent 0d9d333a4e
commit 4131e8146f
29 changed files with 2756 additions and 89 deletions

# Beginner Fine-tuning, Inference, and Evaluation with NVIDIA NeMo Microservices and NIM
## Introduction
This notebook contains the Llama Stack implementation of an end-to-end workflow for running inference on, customizing, and evaluating LLMs using the NVIDIA provider. The NVIDIA provider builds on the NeMo Microservices platform, a collection of microservices for building AI workflows on your Kubernetes cluster, on-premises or in the cloud.
### About NVIDIA NeMo Microservices
The NVIDIA NeMo Microservices platform provides a flexible foundation for building AI workflows, such as fine-tuning, evaluation, inference, and guardrails, on your Kubernetes cluster, on-premises or in the cloud. Refer to the [NeMo Microservices documentation](https://docs.nvidia.com/nemo/microservices/latest/about/index.html) for further information.
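As a quick orientation, the sketch below shows how a client could exercise the NVIDIA inference provider through Llama Stack. It is a minimal sketch, assuming `pip install llama-stack-client` and a Llama Stack server already running the NVIDIA distribution at `http://localhost:8321`; the base URL and model ID are assumptions, not part of this repository's configuration.
```python
# Minimal sketch: route a chat completion through a Llama Stack server
# backed by the NVIDIA provider. The base URL and model ID below are
# assumptions; adjust them to match your deployment.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.completion_message.content)
```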
## Objectives
This end-to-end tutorial shows how to use the NeMo Microservices platform to customize Llama-3.1-8B-Instruct with data from the Stanford Question Answering Dataset (SQuAD). SQuAD is a reading comprehension dataset consisting of questions posed on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding passage, or the question is unanswerable.
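To make the customization data concrete, the sketch below reshapes one SQuAD record into a prompt/completion pair, the general shape used for supervised fine-tuning. The record and the `prompt`/`completion` field names are illustrative; the exact schema expected by NeMo Customizer is defined in the tutorial notebook.
```python
# Illustrative only: reshape one SQuAD record into a prompt/completion pair.
# The record and output field names are hypothetical examples.
squad_record = {
    "context": "The Normans were the people who in the 10th and 11th centuries "
               "gave their name to Normandy, a region in France.",
    "question": "In what country is Normandy located?",
    "answers": {"text": ["France"]},
}

training_example = {
    "prompt": (
        f"Context: {squad_record['context']}\n"
        f"Question: {squad_record['question']}\nAnswer:"
    ),
    # SQuAD v2 marks unanswerable questions with an empty answers list.
    "completion": (
        squad_record["answers"]["text"][0]
        if squad_record["answers"]["text"]
        else "unanswerable"
    ),
}
print(training_example)
```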
## Prerequisites
### Deploy NeMo Microservices
Ensure the NeMo Microservices platform is up and running, including the model downloading step for `meta/llama-3.1-8b-instruct`. Please refer to the [installation guide](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-platform/index.html) for instructions.
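One way to verify this prerequisite is to list the models served through the platform's NIM proxy and confirm the base model is live. The sketch below assumes the NIM proxy is reachable at `http://nim.test`, the example hostname used in the deployment guide, and that it exposes the standard OpenAI-compatible `/v1/models` listing; adjust the URL to your cluster.
```python
# Sanity check (a sketch under the assumptions above): confirm the base
# model is being served before starting the tutorial.
import requests

NIM_URL = "http://nim.test"  # assumption: the deployment guide's example hostname

models = requests.get(f"{NIM_URL}/v1/models").json()
model_ids = [model["id"] for model in models.get("data", [])]
assert "meta/llama-3.1-8b-instruct" in model_ids, f"Base model not found: {model_ids}"
print("meta/llama-3.1-8b-instruct is deployed and ready.")
```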
`NOTE`: The Guardrails step uses the `llama-3.1-nemoguard-8b-content-safety` model to add content safety guardrails to user input. You can either replace this with another model you've already deployed, or deploy this NIM using the NeMo Deployment Management Service. This step is similar to the [NIM deployment instructions](https://docs.nvidia.com/nemo/microservices/latest/get-started/tutorials/deploy-nims.html#deploy-nim-for-llama-3-1-8b-instruct) in the documentation, but uses the following values:
```bash
# URL of the NeMo Deployment Management Service
export NEMO_URL="http://nemo.test"

curl --location "$NEMO_URL/v1/deployment/model-deployments" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "llama-3.1-nemoguard-8b-content-safety",
    "namespace": "nvidia",
    "config": {
      "model": "nvidia/llama-3.1-nemoguard-8b-content-safety",
      "nim_deployment": {
        "image_name": "nvcr.io/nim/nvidia/llama-3.1-nemoguard-8b-content-safety",
        "image_tag": "1.0.0",
        "pvc_size": "25Gi",
        "gpu": 1,
        "additional_envs": {
          "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
        }
      }
    }
  }'
```
The NIM deployment described above should take approximately 10 minutes to go live. You can continue with the remaining steps while the deployment is in progress.
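If you prefer to poll rather than wait, the sketch below issues a GET against the same deployment management path used in the request above. The GET-by-name path and the `status_details.status` field are assumptions inferred from the POST payload and may differ across NeMo Microservices versions.
```python
# Poll the guardrails NIM deployment until it reports ready. The GET path
# and the "status_details" response field are assumptions; check your
# NeMo Microservices version's API reference for the exact schema.
import time

import requests

NEMO_URL = "http://nemo.test"
DEPLOYMENT_PATH = "v1/deployment/model-deployments/nvidia/llama-3.1-nemoguard-8b-content-safety"

while True:
    deployment = requests.get(f"{NEMO_URL}/{DEPLOYMENT_PATH}").json()
    status = deployment.get("status_details", {}).get("status", "unknown")
    print(f"Deployment status: {status}")
    if status == "ready":
        break
    time.sleep(30)  # the deployment takes roughly 10 minutes to go live
```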
### Client-Side Requirements
Ensure you have access to:
1. A Python-enabled machine capable of running Jupyter Lab.
2. Network access to the NeMo Microservices IP and ports.
## Get Started
Navigate to the [beginner E2E tutorial](./Llama_Stack_NVIDIA_E2E_Flow.ipynb) to get started.