From 2ae1d7f4e6d2649deb0a2262bd04eb5393fe7acf Mon Sep 17 00:00:00 2001 From: Jash Gulabrai <37194352+JashG@users.noreply.github.com> Date: Thu, 17 Apr 2025 08:54:30 -0400 Subject: [PATCH] docs: Add NVIDIA platform distro docs (#1971) # What does this PR do? Add NVIDIA platform docs that serve as a starting point for Llama Stack users and explains all supported microservices. [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan [Describe the tests you ran to verify your changes with result summaries. *Provide clear instructions so the plan can be easily re-executed.*] [//]: # (## Documentation) --------- Co-authored-by: Jash Gulabrai --- .../remote_hosted_distro/nvidia.md | 88 ----------------- .../self_hosted_distro/nvidia.md | 96 ++++++++++++++++++- .../remote/inference/nvidia/NVIDIA.md | 85 ++++++++++++++++ .../providers/remote/safety/nvidia/README.md | 77 +++++++++++++++ llama_stack/templates/nvidia/doc_template.md | 96 ++++++++++++++++++- llama_stack/templates/nvidia/nvidia.py | 2 +- 6 files changed, 347 insertions(+), 97 deletions(-) delete mode 100644 docs/source/distributions/remote_hosted_distro/nvidia.md create mode 100644 llama_stack/providers/remote/inference/nvidia/NVIDIA.md create mode 100644 llama_stack/providers/remote/safety/nvidia/README.md diff --git a/docs/source/distributions/remote_hosted_distro/nvidia.md b/docs/source/distributions/remote_hosted_distro/nvidia.md deleted file mode 100644 index 58731392d..000000000 --- a/docs/source/distributions/remote_hosted_distro/nvidia.md +++ /dev/null @@ -1,88 +0,0 @@ - -# NVIDIA Distribution - -The `llamastack/distribution-nvidia` distribution consists of the following provider configurations. - -| API | Provider(s) | -|-----|-------------| -| agents | `inline::meta-reference` | -| datasetio | `inline::localfs` | -| eval | `inline::meta-reference` | -| inference | `remote::nvidia` | -| post_training | `remote::nvidia` | -| safety | `remote::nvidia` | -| scoring | `inline::basic` | -| telemetry | `inline::meta-reference` | -| tool_runtime | `inline::rag-runtime` | -| vector_io | `inline::faiss` | - - -### Environment Variables - -The following environment variables can be configured: - -- `NVIDIA_API_KEY`: NVIDIA API Key (default: ``) -- `NVIDIA_USER_ID`: NVIDIA User ID (default: `llama-stack-user`) -- `NVIDIA_DATASET_NAMESPACE`: NVIDIA Dataset Namespace (default: `default`) -- `NVIDIA_ACCESS_POLICIES`: NVIDIA Access Policies (default: `{}`) -- `NVIDIA_PROJECT_ID`: NVIDIA Project ID (default: `test-project`) -- `NVIDIA_CUSTOMIZER_URL`: NVIDIA Customizer URL (default: `https://customizer.api.nvidia.com`) -- `NVIDIA_OUTPUT_MODEL_DIR`: NVIDIA Output Model Directory (default: `test-example-model@v1`) -- `GUARDRAILS_SERVICE_URL`: URL for the NeMo Guardrails Service (default: `http://0.0.0.0:7331`) -- `INFERENCE_MODEL`: Inference model (default: `Llama3.1-8B-Instruct`) -- `SAFETY_MODEL`: Name of the model to use for safety (default: `meta/llama-3.1-8b-instruct`) - -### Models - -The following models are available by default: - -- `meta/llama3-8b-instruct (aliases: meta-llama/Llama-3-8B-Instruct)` -- `meta/llama3-70b-instruct (aliases: meta-llama/Llama-3-70B-Instruct)` -- `meta/llama-3.1-8b-instruct (aliases: meta-llama/Llama-3.1-8B-Instruct)` -- `meta/llama-3.1-70b-instruct (aliases: meta-llama/Llama-3.1-70B-Instruct)` -- `meta/llama-3.1-405b-instruct (aliases: meta-llama/Llama-3.1-405B-Instruct-FP8)` -- `meta/llama-3.2-1b-instruct (aliases: meta-llama/Llama-3.2-1B-Instruct)` -- `meta/llama-3.2-3b-instruct (aliases: meta-llama/Llama-3.2-3B-Instruct)` -- `meta/llama-3.2-11b-vision-instruct (aliases: meta-llama/Llama-3.2-11B-Vision-Instruct)` -- `meta/llama-3.2-90b-vision-instruct (aliases: meta-llama/Llama-3.2-90B-Vision-Instruct)` -- `nvidia/llama-3.2-nv-embedqa-1b-v2 ` -- `nvidia/nv-embedqa-e5-v5 ` -- `nvidia/nv-embedqa-mistral-7b-v2 ` -- `snowflake/arctic-embed-l ` - - -### Prerequisite: API Keys - -Make sure you have access to a NVIDIA API Key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). - - -## Running Llama Stack with NVIDIA - -You can do this via Conda (build code) or Docker which has a pre-built image. - -### Via Docker - -This method allows you to get started quickly without having to build the distribution code. - -```bash -LLAMA_STACK_PORT=8321 -docker run \ - -it \ - --pull always \ - -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ - -v ./run.yaml:/root/my-run.yaml \ - llamastack/distribution-nvidia \ - --yaml-config /root/my-run.yaml \ - --port $LLAMA_STACK_PORT \ - --env NVIDIA_API_KEY=$NVIDIA_API_KEY -``` - -### Via Conda - -```bash -llama stack build --template nvidia --image-type conda -llama stack run ./run.yaml \ - --port 8321 \ - --env NVIDIA_API_KEY=$NVIDIA_API_KEY - --env INFERENCE_MODEL=$INFERENCE_MODEL -``` diff --git a/docs/source/distributions/self_hosted_distro/nvidia.md b/docs/source/distributions/self_hosted_distro/nvidia.md index 58731392d..563fdf4e5 100644 --- a/docs/source/distributions/self_hosted_distro/nvidia.md +++ b/docs/source/distributions/self_hosted_distro/nvidia.md @@ -51,14 +51,84 @@ The following models are available by default: - `snowflake/arctic-embed-l ` -### Prerequisite: API Keys +## Prerequisites +### NVIDIA API Keys -Make sure you have access to a NVIDIA API Key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). +Make sure you have access to a NVIDIA API Key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). Use this key for the `NVIDIA_API_KEY` environment variable. +### Deploy NeMo Microservices Platform +The NVIDIA NeMo microservices platform supports end-to-end microservice deployment of a complete AI flywheel on your Kubernetes cluster through the NeMo Microservices Helm Chart. Please reference the [NVIDIA NeMo Microservices documentation](https://docs.nvidia.com/nemo/microservices/documentation/latest/nemo-microservices/latest-early_access/set-up/deploy-as-platform/index.html) for platform prerequisites and instructions to install and deploy the platform. + +## Supported Services +Each Llama Stack API corresponds to a specific NeMo microservice. The core microservices (Customizer, Evaluator, Guardrails) are exposed by the same endpoint. The platform components (Data Store) are each exposed by separate endpoints. + +### Inference: NVIDIA NIM +NVIDIA NIM is used for running inference with registered models. There are two ways to access NVIDIA NIMs: + 1. Hosted (default): Preview APIs hosted at https://integrate.api.nvidia.com (Requires an API key) + 2. Self-hosted: NVIDIA NIMs that run on your own infrastructure. + +The deployed platform includes the NIM Proxy microservice, which is the service that provides to access your NIMs (for example, to run inference on a model). Set the `NVIDIA_BASE_URL` environment variable to use your NVIDIA NIM Proxy deployment. + +### Datasetio API: NeMo Data Store +The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposts APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use the client to interact with Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint. + +See the [NVIDIA Datasetio docs](/llama_stack/providers/remote/datasetio/nvidia/README.md) for supported features and example usage. + +### Eval API: NeMo Evaluator +The NeMo Evaluator microservice supports evaluation of LLMs. Launching an Evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint. + +See the [NVIDIA Eval docs](/llama_stack/providers/remote/eval/nvidia/README.md) for supported features and example usage. + +### Post-Training API: NeMo Customizer +The NeMo Customizer microservice supports fine-tuning models. You can reference [this list of supported models](/llama_stack/providers/remote/post_training/nvidia/models.py) that can be fine-tuned using Llama Stack. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint. + +See the [NVIDIA Post-Training docs](/llama_stack/providers/remote/post_training/nvidia/README.md) for supported features and example usage. + +### Safety API: NeMo Guardrails +The NeMo Guardrails microservice sits between your application and the LLM, and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint. + +See the NVIDIA Safety docs for supported features and example usage. + +## Deploying models +In order to use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`. + +Note: For improved inference speeds, we need to use NIM with `fast_outlines` guided decoding system (specified in the request body). This is the default if you deployed the platform with the NeMo Microservices Helm Chart. +```sh +# URL to NeMo NIM Proxy service +export NEMO_URL="http://nemo.test" + +curl --location "$NEMO_URL/v1/deployment/model-deployments" \ + -H 'accept: application/json' \ + -H 'Content-Type: application/json' \ + -d '{ + "name": "llama-3.2-1b-instruct", + "namespace": "meta", + "config": { + "model": "meta/llama-3.2-1b-instruct", + "nim_deployment": { + "image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct", + "image_tag": "1.8.3", + "pvc_size": "25Gi", + "gpu": 1, + "additional_envs": { + "NIM_GUIDED_DECODING_BACKEND": "fast_outlines" + } + } + } + }' +``` +This NIM deployment should take approximately 10 minutes to go live. [See the docs](https://docs.nvidia.com/nemo/microservices/documentation/latest/nemo-microservices/latest-early_access/get-started/tutorials/deploy-nims.html#) for more information on how to deploy a NIM and verify it's available for inference. + +You can also remove a deployed NIM to free up GPU resources, if needed. +```sh +export NEMO_URL="http://nemo.test" + +curl -X DELETE "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct" +``` ## Running Llama Stack with NVIDIA -You can do this via Conda (build code) or Docker which has a pre-built image. +You can do this via Conda or venv (build code), or Docker which has a pre-built image. ### Via Docker @@ -80,9 +150,27 @@ docker run \ ### Via Conda ```bash +INFERENCE_MODEL=meta-llama/Llama-3.1-8b-Instruct llama stack build --template nvidia --image-type conda llama stack run ./run.yaml \ --port 8321 \ - --env NVIDIA_API_KEY=$NVIDIA_API_KEY + --env NVIDIA_API_KEY=$NVIDIA_API_KEY \ --env INFERENCE_MODEL=$INFERENCE_MODEL ``` + +### Via venv + +If you've set up your local development environment, you can also build the image using your local virtual environment. + +```bash +INFERENCE_MODEL=meta-llama/Llama-3.1-8b-Instruct +llama stack build --template nvidia --image-type venv +llama stack run ./run.yaml \ + --port 8321 \ + --env NVIDIA_API_KEY=$NVIDIA_API_KEY \ + --env INFERENCE_MODEL=$INFERENCE_MODEL +``` + +### Example Notebooks +You can reference the Jupyter notebooks in `docs/notebooks/nvidia/` for example usage of these APIs. +- [Llama_Stack_NVIDIA_E2E_Flow.ipynb](/docs/notebooks/nvidia/Llama_Stack_NVIDIA_E2E_Flow.ipynb) contains an end-to-end workflow for running inference, customizing, and evaluating models using your deployed NeMo Microservices platform. diff --git a/llama_stack/providers/remote/inference/nvidia/NVIDIA.md b/llama_stack/providers/remote/inference/nvidia/NVIDIA.md new file mode 100644 index 000000000..a353c67f5 --- /dev/null +++ b/llama_stack/providers/remote/inference/nvidia/NVIDIA.md @@ -0,0 +1,85 @@ +# NVIDIA Inference Provider for LlamaStack + +This provider enables running inference using NVIDIA NIM. + +## Features +- Endpoints for completions, chat completions, and embeddings for registered models + +## Getting Started + +### Prerequisites + +- LlamaStack with NVIDIA configuration +- Access to NVIDIA NIM deployment +- NIM for model to use for inference is deployed + +### Setup + +Build the NVIDIA environment: + +```bash +llama stack build --template nvidia --image-type conda +``` + +### Basic Usage using the LlamaStack Python Client + +#### Initialize the client + +```python +import os + +os.environ["NVIDIA_API_KEY"] = ( + "" # Required if using hosted NIM endpoint. If self-hosted, not required. +) +os.environ["NVIDIA_BASE_URL"] = "http://nim.test" # NIM URL + +from llama_stack.distribution.library_client import LlamaStackAsLibraryClient + +client = LlamaStackAsLibraryClient("nvidia") +client.initialize() +``` + +### Create Completion + +```python +response = client.completion( + model_id="meta-llama/Llama-3.1-8b-Instruct", + content="Complete the sentence using one word: Roses are red, violets are :", + stream=False, + sampling_params={ + "max_tokens": 50, + }, +) +print(f"Response: {response.content}") +``` + +### Create Chat Completion + +```python +response = client.chat_completion( + model_id="meta-llama/Llama-3.1-8b-Instruct", + messages=[ + { + "role": "system", + "content": "You must respond to each message with only one word", + }, + { + "role": "user", + "content": "Complete the sentence using one word: Roses are red, violets are:", + }, + ], + stream=False, + sampling_params={ + "max_tokens": 50, + }, +) +print(f"Response: {response.completion_message.content}") +``` + +### Create Embeddings +```python +response = client.embeddings( + model_id="meta-llama/Llama-3.1-8b-Instruct", contents=["foo", "bar", "baz"] +) +print(f"Embeddings: {response.embeddings}") +``` diff --git a/llama_stack/providers/remote/safety/nvidia/README.md b/llama_stack/providers/remote/safety/nvidia/README.md new file mode 100644 index 000000000..434db32fb --- /dev/null +++ b/llama_stack/providers/remote/safety/nvidia/README.md @@ -0,0 +1,77 @@ +# NVIDIA Safety Provider for LlamaStack + +This provider enables safety checks and guardrails for LLM interactions using NVIDIA's NeMo Guardrails service. + +## Features + +- Run safety checks for messages + +## Getting Started + +### Prerequisites + +- LlamaStack with NVIDIA configuration +- Access to NVIDIA NeMo Guardrails service +- NIM for model to use for safety check is deployed + +### Setup + +Build the NVIDIA environment: + +```bash +llama stack build --template nvidia --image-type conda +``` + +### Basic Usage using the LlamaStack Python Client + +#### Initialize the client + +```python +import os + +os.environ["NVIDIA_API_KEY"] = "your-api-key" +os.environ["NVIDIA_GUARDRAILS_URL"] = "http://guardrails.test" + +from llama_stack.distribution.library_client import LlamaStackAsLibraryClient + +client = LlamaStackAsLibraryClient("nvidia") +client.initialize() +``` + +#### Create a safety shield + +```python +from llama_stack.apis.safety import Shield +from llama_stack.apis.inference import Message + +# Create a safety shield +shield = Shield( + shield_id="your-shield-id", + provider_resource_id="safety-model-id", # The model to use for safety checks + description="Safety checks for content moderation", +) + +# Register the shield +await client.safety.register_shield(shield) +``` + +#### Run safety checks + +```python +# Messages to check +messages = [Message(role="user", content="Your message to check")] + +# Run safety check +response = await client.safety.run_shield( + shield_id="your-shield-id", + messages=messages, +) + +# Check for violations +if response.violation: + print(f"Safety violation detected: {response.violation.user_message}") + print(f"Violation level: {response.violation.violation_level}") + print(f"Metadata: {response.violation.metadata}") +else: + print("No safety violations detected") +``` diff --git a/llama_stack/templates/nvidia/doc_template.md b/llama_stack/templates/nvidia/doc_template.md index da95227d8..8818e55c1 100644 --- a/llama_stack/templates/nvidia/doc_template.md +++ b/llama_stack/templates/nvidia/doc_template.md @@ -25,14 +25,84 @@ The following models are available by default: {% endif %} -### Prerequisite: API Keys +## Prerequisites +### NVIDIA API Keys -Make sure you have access to a NVIDIA API Key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). +Make sure you have access to a NVIDIA API Key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). Use this key for the `NVIDIA_API_KEY` environment variable. +### Deploy NeMo Microservices Platform +The NVIDIA NeMo microservices platform supports end-to-end microservice deployment of a complete AI flywheel on your Kubernetes cluster through the NeMo Microservices Helm Chart. Please reference the [NVIDIA NeMo Microservices documentation](https://docs.nvidia.com/nemo/microservices/documentation/latest/nemo-microservices/latest-early_access/set-up/deploy-as-platform/index.html) for platform prerequisites and instructions to install and deploy the platform. + +## Supported Services +Each Llama Stack API corresponds to a specific NeMo microservice. The core microservices (Customizer, Evaluator, Guardrails) are exposed by the same endpoint. The platform components (Data Store) are each exposed by separate endpoints. + +### Inference: NVIDIA NIM +NVIDIA NIM is used for running inference with registered models. There are two ways to access NVIDIA NIMs: + 1. Hosted (default): Preview APIs hosted at https://integrate.api.nvidia.com (Requires an API key) + 2. Self-hosted: NVIDIA NIMs that run on your own infrastructure. + +The deployed platform includes the NIM Proxy microservice, which is the service that provides to access your NIMs (for example, to run inference on a model). Set the `NVIDIA_BASE_URL` environment variable to use your NVIDIA NIM Proxy deployment. + +### Datasetio API: NeMo Data Store +The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposts APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use the client to interact with Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint. + +See the [NVIDIA Datasetio docs](/llama_stack/providers/remote/datasetio/nvidia/README.md) for supported features and example usage. + +### Eval API: NeMo Evaluator +The NeMo Evaluator microservice supports evaluation of LLMs. Launching an Evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint. + +See the [NVIDIA Eval docs](/llama_stack/providers/remote/eval/nvidia/README.md) for supported features and example usage. + +### Post-Training API: NeMo Customizer +The NeMo Customizer microservice supports fine-tuning models. You can reference [this list of supported models](/llama_stack/providers/remote/post_training/nvidia/models.py) that can be fine-tuned using Llama Stack. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint. + +See the [NVIDIA Post-Training docs](/llama_stack/providers/remote/post_training/nvidia/README.md) for supported features and example usage. + +### Safety API: NeMo Guardrails +The NeMo Guardrails microservice sits between your application and the LLM, and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint. + +See the NVIDIA Safety docs for supported features and example usage. + +## Deploying models +In order to use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`. + +Note: For improved inference speeds, we need to use NIM with `fast_outlines` guided decoding system (specified in the request body). This is the default if you deployed the platform with the NeMo Microservices Helm Chart. +```sh +# URL to NeMo NIM Proxy service +export NEMO_URL="http://nemo.test" + +curl --location "$NEMO_URL/v1/deployment/model-deployments" \ + -H 'accept: application/json' \ + -H 'Content-Type: application/json' \ + -d '{ + "name": "llama-3.2-1b-instruct", + "namespace": "meta", + "config": { + "model": "meta/llama-3.2-1b-instruct", + "nim_deployment": { + "image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct", + "image_tag": "1.8.3", + "pvc_size": "25Gi", + "gpu": 1, + "additional_envs": { + "NIM_GUIDED_DECODING_BACKEND": "fast_outlines" + } + } + } + }' +``` +This NIM deployment should take approximately 10 minutes to go live. [See the docs](https://docs.nvidia.com/nemo/microservices/documentation/latest/nemo-microservices/latest-early_access/get-started/tutorials/deploy-nims.html#) for more information on how to deploy a NIM and verify it's available for inference. + +You can also remove a deployed NIM to free up GPU resources, if needed. +```sh +export NEMO_URL="http://nemo.test" + +curl -X DELETE "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct" +``` ## Running Llama Stack with NVIDIA -You can do this via Conda (build code) or Docker which has a pre-built image. +You can do this via Conda or venv (build code), or Docker which has a pre-built image. ### Via Docker @@ -54,9 +124,27 @@ docker run \ ### Via Conda ```bash +INFERENCE_MODEL=meta-llama/Llama-3.1-8b-Instruct llama stack build --template nvidia --image-type conda llama stack run ./run.yaml \ --port 8321 \ - --env NVIDIA_API_KEY=$NVIDIA_API_KEY + --env NVIDIA_API_KEY=$NVIDIA_API_KEY \ --env INFERENCE_MODEL=$INFERENCE_MODEL ``` + +### Via venv + +If you've set up your local development environment, you can also build the image using your local virtual environment. + +```bash +INFERENCE_MODEL=meta-llama/Llama-3.1-8b-Instruct +llama stack build --template nvidia --image-type venv +llama stack run ./run.yaml \ + --port 8321 \ + --env NVIDIA_API_KEY=$NVIDIA_API_KEY \ + --env INFERENCE_MODEL=$INFERENCE_MODEL +``` + +### Example Notebooks +You can reference the Jupyter notebooks in `docs/notebooks/nvidia/` for example usage of these APIs. +- [Llama_Stack_NVIDIA_E2E_Flow.ipynb](/docs/notebooks/nvidia/Llama_Stack_NVIDIA_E2E_Flow.ipynb) contains an end-to-end workflow for running inference, customizing, and evaluating models using your deployed NeMo Microservices platform. diff --git a/llama_stack/templates/nvidia/nvidia.py b/llama_stack/templates/nvidia/nvidia.py index 3b0cbe1e5..a0cefba52 100644 --- a/llama_stack/templates/nvidia/nvidia.py +++ b/llama_stack/templates/nvidia/nvidia.py @@ -59,7 +59,7 @@ def get_distribution_template() -> DistributionTemplate: default_models = get_model_registry(available_models) return DistributionTemplate( name="nvidia", - distro_type="remote_hosted", + distro_type="self_hosted", description="Use NVIDIA NIM for running LLM inference and safety", container_image=None, template_path=Path(__file__).parent / "doc_template.md",