# What does this PR do?

This PR contains two sets of notebooks that serve as reference material for developers getting started with Llama Stack using the NVIDIA provider. Developers should be able to execute these notebooks end-to-end, pointing at their NeMo Microservices deployment.

1. `beginner_e2e/`: Notebook that walks through a beginner end-to-end workflow covering creating datasets, running inference, customizing and evaluating models, and running safety checks.
2. `tool_calling/`: Notebook ported over from the [Data Flywheel & Tool Calling notebook](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/nemo/data-flywheel) referenced in the NeMo Microservices docs. I updated the notebook to use the Llama Stack client wherever possible, and added relevant instructions.

## Test Plan

- Both notebook folders contain READMEs with prerequisites. To manually test these notebooks, you'll need a deployment of the NeMo Microservices platform; update the `config.py` file with your deployment's information.
- I've run through these notebooks manually end-to-end to verify each step works.

---
# Beginner Fine-tuning, Inference, and Evaluation with NVIDIA NeMo Microservices and NIM

## Introduction

This notebook contains the Llama Stack implementation of an end-to-end workflow for running inference on, customizing, and evaluating LLMs using the NVIDIA provider. The NVIDIA provider leverages the NeMo Microservices platform, a collection of microservices that you can use to build AI workflows on your Kubernetes cluster, on-premises or in the cloud.
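For orientation, here is a minimal sketch of how the notebook talks to Llama Stack. It assumes a Llama Stack distribution configured with the NVIDIA provider is already running; the base URL and port below are assumptions and should match your own deployment.

```python
# Minimal sketch: connect a Llama Stack client to a running distribution
# that is configured with the NVIDIA provider.
from llama_stack_client import LlamaStackClient

# Assumption: the Llama Stack server is reachable at this address.
client = LlamaStackClient(base_url="http://localhost:8321")

# List the models registered with the distribution to confirm connectivity.
for model in client.models.list():
    print(model.identifier)
```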
## About NVIDIA NeMo Microservices

The NVIDIA NeMo Microservices platform provides a flexible foundation for building AI workflows such as fine-tuning, evaluation, inference, and applying guardrails to AI models on your Kubernetes cluster, on-premises or in the cloud. Refer to the NeMo Microservices documentation for further information.
## Objectives

This end-to-end tutorial shows how to leverage the NeMo Microservices platform to customize `Llama-3.1-8B-Instruct` using data from the Stanford Question Answering Dataset (SQuAD), a reading-comprehension dataset consisting of questions posed on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding passage, or the question is unanswerable.
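To make the customization data concrete, the sketch below converts one illustrative SQuAD-style record into a prompt/completion pair in JSONL form. The record itself and the `prompt`/`completion` field names are illustrative assumptions, not the tutorial's actual preprocessing code.

```python
import json

# One illustrative SQuAD-style record: a context passage, a question, and
# an answer that is a span of the passage.
record = {
    "context": "The Amazon rainforest covers much of the Amazon basin of South America.",
    "question": "What does the Amazon rainforest cover?",
    "answer": "much of the Amazon basin of South America",
}

# Assumption: the fine-tuning data is formatted as prompt/completion JSONL.
example = {
    "prompt": f"Context: {record['context']}\nQuestion: {record['question']}\nAnswer:",
    "completion": " " + record["answer"],
}
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```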
## Prerequisites

### Deploy NeMo Microservices

Ensure the NeMo Microservices platform is up and running, including the model downloading step for `meta/llama-3.1-8b-instruct`. Please refer to the installation guide for instructions.

**NOTE:** The Guardrails step uses the `llama-3.1-nemoguard-8b-content-safety` model to add content safety guardrails to user input. You can either replace this with another model you've already deployed, or deploy this NIM using the NeMo Deployment Management Service. This step is similar to the NIM deployment instructions in the documentation, but with the following values:
```bash
# URL to NeMo deployment management service
export NEMO_URL="http://nemo.test"

curl --location "$NEMO_URL/v1/deployment/model-deployments" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "llama-3.1-nemoguard-8b-content-safety",
    "namespace": "nvidia",
    "config": {
      "model": "nvidia/llama-3.1-nemoguard-8b-content-safety",
      "nim_deployment": {
        "image_name": "nvcr.io/nim/nvidia/llama-3.1-nemoguard-8b-content-safety",
        "image_tag": "1.0.0",
        "pvc_size": "25Gi",
        "gpu": 1,
        "additional_envs": {
          "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
        }
      }
    }
  }'
```
The NIM deployment described above should take approximately 10 minutes to go live. You can continue with the remaining steps while the deployment is in progress.
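While you wait, you can poll the deployment's status. The sketch below assumes the Deployment Management Service exposes a `GET` endpoint at `/v1/deployment/model-deployments/{namespace}/{name}` mirroring the `POST` above, and that the response carries a status field; adjust both to your deployment's API reference if they differ.

```python
import os
import time

import requests

# Same NEMO_URL as in the deployment request above; placeholder value.
NEMO_URL = os.environ.get("NEMO_URL", "http://nemo.test")

# Assumed GET endpoint mirroring the POST used to create the deployment.
url = f"{NEMO_URL}/v1/deployment/model-deployments/nvidia/llama-3.1-nemoguard-8b-content-safety"

for _ in range(30):  # poll for up to ~15 minutes
    deployment = requests.get(url).json()
    # The exact shape of the status field is an assumption; print the raw
    # response if these keys don't match your deployment.
    status = deployment.get("status_details", {}).get("status", "unknown")
    print(f"Deployment status: {status}")
    if status in ("ready", "failed"):
        break
    time.sleep(30)
```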
### Client-Side Requirements

Ensure you have access to:

- A Python-enabled machine capable of running Jupyter Lab.
- Network access to the NeMo Microservices IP and ports (see the connectivity check sketched after this list).
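To verify network access before launching the notebooks, you can query the NIM's OpenAI-compatible `/v1/models` endpoint, as sketched below; the `NIM_URL` value is a placeholder for your deployment's NIM endpoint.

```python
import os

import requests

# Placeholder: replace with the base URL of your NIM / NIM proxy endpoint.
NIM_URL = os.environ.get("NIM_URL", "http://nim.test")

# NIM serves an OpenAI-compatible API, so /v1/models lists the models
# it is currently serving.
resp = requests.get(f"{NIM_URL}/v1/models")
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])
```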
## Get Started

Navigate to the beginner E2E tutorial to get started.