Mirror of https://github.com/meta-llama/llama-stack.git, synced 2025-10-25 17:11:12 +00:00

# What does this PR do?

This PR contains two sets of notebooks that serve as reference material for developers getting started with Llama Stack using the NVIDIA Provider. Developers should be able to execute these notebooks end-to-end, pointing to their NeMo Microservices deployment.

1. `beginner_e2e/`: Notebook that walks through a beginner end-to-end workflow that covers creating datasets, running inference, customizing and evaluating models, and running safety checks.
2. `tool_calling/`: Notebook that is ported over from the [Data Flywheel & Tool Calling notebook](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/nemo/data-flywheel) that is referenced in the NeMo Microservices docs. I updated the notebook to use the Llama Stack client wherever possible, and added relevant instructions.

## Test Plan

- Both notebook folders contain READMEs with pre-requisites. To manually test these notebooks, you'll need to have a deployment of the NeMo Microservices Platform and update the `config.py` file with your deployment's information.
- I've run through these notebooks manually end-to-end to verify each step works.

---------

Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>

# Beginner Fine-tuning, Inference, and Evaluation with NVIDIA NeMo Microservices and NIM

## Introduction

This notebook contains the Llama Stack implementation of an end-to-end workflow for running inference on, customizing, and evaluating LLMs with the NVIDIA provider. The NVIDIA provider builds on the NeMo Microservices platform, a collection of microservices for building AI workflows on your Kubernetes cluster, on-premises or in the cloud.

### About NVIDIA NeMo Microservices

The NVIDIA NeMo Microservices platform provides a flexible foundation for building AI workflows such as fine-tuning, evaluation, running inference, or applying guardrails to AI models on your Kubernetes cluster, on-premises or in the cloud. Refer to the [documentation](https://docs.nvidia.com/nemo/microservices/latest/about/index.html) for more information.

## Objectives

This end-to-end tutorial shows how to use the NeMo Microservices platform to customize Llama-3.1-8B-Instruct with data from the Stanford Question Answering Dataset (SQuAD). SQuAD is a reading comprehension dataset of questions posed on a set of Wikipedia articles. The answer to each question is a segment of text from the corresponding passage, or the question is unanswerable.

## Prerequisites

### Deploy NeMo Microservices

Ensure the NeMo Microservices platform is up and running, including the model downloading step for `meta/llama-3.1-8b-instruct`. Refer to the [installation guide](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-platform/index.html) for instructions.

`NOTE`: The Guardrails step uses the `llama-3.1-nemoguard-8b-content-safety` model to add content safety guardrails to user input. You can either replace this model with another one you've already deployed, or deploy this NIM using the NeMo Deployment Management Service. The deployment is similar to the [NIM deployment instructions](https://docs.nvidia.com/nemo/microservices/latest/get-started/tutorials/deploy-nims.html#deploy-nim-for-llama-3-1-8b-instruct) in the documentation, but uses the following values:

```bash
# URL to NeMo deployment management service
export NEMO_URL="http://nemo.test"

curl --location "$NEMO_URL/v1/deployment/model-deployments" \
   -H 'accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
      "name": "llama-3.1-nemoguard-8b-content-safety",
      "namespace": "nvidia",
      "config": {
         "model": "nvidia/llama-3.1-nemoguard-8b-content-safety",
         "nim_deployment": {
            "image_name": "nvcr.io/nim/nvidia/llama-3.1-nemoguard-8b-content-safety",
            "image_tag": "1.0.0",
            "pvc_size": "25Gi",
            "gpu": 1,
            "additional_envs": {
               "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
            }
         }
      }
   }'
```

The NIM deployment described above takes approximately 10 minutes to go live. You can continue with the remaining steps while the deployment is in progress.

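While waiting, the deployment's state can be polled from the same service. The following is a minimal sketch, not taken from this README: it assumes the Deployment Management service answers `GET` requests at `/v1/deployment/model-deployments/{namespace}/{name}`, mirroring the `POST` path used to create the deployment, and the helper names `deployment_url` and `show_deployment` are hypothetical. Verify the path and the response format against the documentation for your release.

```bash
# Hypothetical helpers. The GET path is an ASSUMPTION that mirrors the
# POST path used above to create the deployment.

# Build the URL for one deployment from its namespace and name.
# A trailing slash on NEMO_URL is dropped so the path is not doubled.
deployment_url() {
   local namespace="$1" name="$2"
   printf '%s/v1/deployment/model-deployments/%s/%s\n' "${NEMO_URL%/}" "$namespace" "$name"
}

# Print the raw JSON the service returns for the deployment.
# --fail makes curl exit non-zero on HTTP 4xx/5xx responses.
show_deployment() {
   curl --silent --show-error --fail --max-time 10 \
      -H 'accept: application/json' \
      "$(deployment_url "$1" "$2")"
}
```

With `NEMO_URL` exported as above, running `show_deployment nvidia llama-3.1-nemoguard-8b-content-safety` every minute or so and inspecting the returned JSON should indicate when the NIM is ready.
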
### Client-Side Requirements

Ensure you have access to:

1. A Python-enabled machine capable of running Jupyter Lab.
2. Network access to the NeMo Microservices IP and ports.

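The second requirement, network access, can be checked from the client machine before opening the notebook. This is a minimal sketch using standard `curl` options. The function names `check_reachable` and `report_reachable` are hypothetical, and the check only confirms that something answers HTTP at the given URL, not that every microservice is healthy.

```bash
# Hypothetical helper: succeed if an HTTP server answers at the given URL.
# Any HTTP status counts as reachable; only connection-level failures
# (DNS lookup, refused connection, timeout) make curl exit non-zero here.
check_reachable() {
   local url="$1"
   curl --silent --output /dev/null --max-time 5 "$url"
}

# Report the result for one URL in a readable form.
report_reachable() {
   if check_reachable "$1"; then
      echo "reachable: $1"
   else
      echo "NOT reachable: $1"
   fi
}
```

For example, `report_reachable "$NEMO_URL"` checks the base URL exported in the deployment step. Repeat the call for any other host and port combination your deployment exposes.
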
## Get Started
Open the [beginner E2E tutorial](./Llama_Stack_NVIDIA_E2E_Flow.ipynb) to get started.