From 92aac3ec4da1281604e28e5881506f1ad9b82ab6 Mon Sep 17 00:00:00 2001 From: Jash Gulabrai Date: Thu, 17 Apr 2025 14:01:05 -0400 Subject: [PATCH] Add back file that was removed --- .../remote_hosted_distro/nvidia.md | 178 ++++++++++++++++++ 1 file changed, 178 insertions(+) create mode 100644 docs/source/distributions/remote_hosted_distro/nvidia.md diff --git a/docs/source/distributions/remote_hosted_distro/nvidia.md b/docs/source/distributions/remote_hosted_distro/nvidia.md new file mode 100644 index 000000000..4a5e535e8 --- /dev/null +++ b/docs/source/distributions/remote_hosted_distro/nvidia.md @@ -0,0 +1,178 @@ + +# NVIDIA Distribution + +The `llamastack/distribution-nvidia` distribution consists of the following provider configurations. + +| API | Provider(s) | +|-----|-------------| +| agents | `inline::meta-reference` | +| datasetio | `inline::localfs` | +| eval | `remote::nvidia` | +| inference | `remote::nvidia` | +| post_training | `remote::nvidia` | +| safety | `remote::nvidia` | +| scoring | `inline::basic` | +| telemetry | `inline::meta-reference` | +| tool_runtime | `inline::rag-runtime` | +| vector_io | `inline::faiss` | + + +### Environment Variables + +The following environment variables can be configured: + +- `NVIDIA_API_KEY`: NVIDIA API Key (default: ``) +- `NVIDIA_USER_ID`: NVIDIA User ID (default: `llama-stack-user`) +- `NVIDIA_DATASET_NAMESPACE`: NVIDIA Dataset Namespace (default: `default`) +- `NVIDIA_ACCESS_POLICIES`: NVIDIA Access Policies (default: `{}`) +- `NVIDIA_PROJECT_ID`: NVIDIA Project ID (default: `test-project`) +- `NVIDIA_CUSTOMIZER_URL`: NVIDIA Customizer URL (default: `https://customizer.api.nvidia.com`) +- `NVIDIA_OUTPUT_MODEL_DIR`: NVIDIA Output Model Directory (default: `test-example-model@v1`) +- `GUARDRAILS_SERVICE_URL`: URL for the NeMo Guardrails Service (default: `http://0.0.0.0:7331`) +- `NVIDIA_EVALUATOR_URL`: URL for the NeMo Evaluator Service (default: `http://0.0.0.0:7331`) +- `INFERENCE_MODEL`: Inference model (default: `Llama3.1-8B-Instruct`) +- `SAFETY_MODEL`: Name of the model to use for safety (default: `meta/llama-3.1-8b-instruct`) + +### Models + +The following models are available by default: + +- `meta/llama3-8b-instruct (aliases: meta-llama/Llama-3-8B-Instruct)` +- `meta/llama3-70b-instruct (aliases: meta-llama/Llama-3-70B-Instruct)` +- `meta/llama-3.1-8b-instruct (aliases: meta-llama/Llama-3.1-8B-Instruct)` +- `meta/llama-3.1-70b-instruct (aliases: meta-llama/Llama-3.1-70B-Instruct)` +- `meta/llama-3.1-405b-instruct (aliases: meta-llama/Llama-3.1-405B-Instruct-FP8)` +- `meta/llama-3.2-1b-instruct (aliases: meta-llama/Llama-3.2-1B-Instruct)` +- `meta/llama-3.2-3b-instruct (aliases: meta-llama/Llama-3.2-3B-Instruct)` +- `meta/llama-3.2-11b-vision-instruct (aliases: meta-llama/Llama-3.2-11B-Vision-Instruct)` +- `meta/llama-3.2-90b-vision-instruct (aliases: meta-llama/Llama-3.2-90B-Vision-Instruct)` +- `meta/llama-3.3-70b-instruct (aliases: meta-llama/Llama-3.3-70B-Instruct)` +- `nvidia/llama-3.2-nv-embedqa-1b-v2 ` +- `nvidia/nv-embedqa-e5-v5 ` +- `nvidia/nv-embedqa-mistral-7b-v2 ` +- `snowflake/arctic-embed-l ` + + +## Prerequisites +### NVIDIA API Keys + +Make sure you have access to a NVIDIA API Key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). Use this key for the `NVIDIA_API_KEY` environment variable. + +### Deploy NeMo Microservices Platform +The NVIDIA NeMo microservices platform supports end-to-end microservice deployment of a complete AI flywheel on your Kubernetes cluster through the NeMo Microservices Helm Chart. Please reference the [NVIDIA NeMo Microservices documentation](https://docs.nvidia.com/nemo/microservices/documentation/latest/nemo-microservices/latest-early_access/set-up/deploy-as-platform/index.html) for platform prerequisites and instructions to install and deploy the platform. + +## Supported Services +Each Llama Stack API corresponds to a specific NeMo microservice. The core microservices (Customizer, Evaluator, Guardrails) are exposed by the same endpoint. The platform components (Data Store) are each exposed by separate endpoints. + +### Inference: NVIDIA NIM +NVIDIA NIM is used for running inference with registered models. There are two ways to access NVIDIA NIMs: + 1. Hosted (default): Preview APIs hosted at https://integrate.api.nvidia.com (Requires an API key) + 2. Self-hosted: NVIDIA NIMs that run on your own infrastructure. + +The deployed platform includes the NIM Proxy microservice, which is the service that provides to access your NIMs (for example, to run inference on a model). Set the `NVIDIA_BASE_URL` environment variable to use your NVIDIA NIM Proxy deployment. + +### Datasetio API: NeMo Data Store +The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposts APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use the client to interact with Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint. + +See the [NVIDIA Datasetio docs](/llama_stack/providers/remote/datasetio/nvidia/README.md) for supported features and example usage. + +### Eval API: NeMo Evaluator +The NeMo Evaluator microservice supports evaluation of LLMs. Launching an Evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint. + +See the [NVIDIA Eval docs](/llama_stack/providers/remote/eval/nvidia/README.md) for supported features and example usage. + +### Post-Training API: NeMo Customizer +The NeMo Customizer microservice supports fine-tuning models. You can reference [this list of supported models](/llama_stack/providers/remote/post_training/nvidia/models.py) that can be fine-tuned using Llama Stack. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint. + +See the [NVIDIA Post-Training docs](/llama_stack/providers/remote/post_training/nvidia/README.md) for supported features and example usage. + +### Safety API: NeMo Guardrails +The NeMo Guardrails microservice sits between your application and the LLM, and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint. + +See the NVIDIA Safety docs for supported features and example usage. + +## Deploying models +In order to use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`. + +Note: For improved inference speeds, we need to use NIM with `fast_outlines` guided decoding system (specified in the request body). This is the default if you deployed the platform with the NeMo Microservices Helm Chart. +```sh +# URL to NeMo NIM Proxy service +export NEMO_URL="http://nemo.test" + +curl --location "$NEMO_URL/v1/deployment/model-deployments" \ + -H 'accept: application/json' \ + -H 'Content-Type: application/json' \ + -d '{ + "name": "llama-3.2-1b-instruct", + "namespace": "meta", + "config": { + "model": "meta/llama-3.2-1b-instruct", + "nim_deployment": { + "image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct", + "image_tag": "1.8.3", + "pvc_size": "25Gi", + "gpu": 1, + "additional_envs": { + "NIM_GUIDED_DECODING_BACKEND": "fast_outlines" + } + } + } + }' +``` +This NIM deployment should take approximately 10 minutes to go live. [See the docs](https://docs.nvidia.com/nemo/microservices/documentation/latest/nemo-microservices/latest-early_access/get-started/tutorials/deploy-nims.html#) for more information on how to deploy a NIM and verify it's available for inference. + +You can also remove a deployed NIM to free up GPU resources, if needed. +```sh +export NEMO_URL="http://nemo.test" + +curl -X DELETE "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct" +``` + +## Running Llama Stack with NVIDIA + +You can do this via Conda or venv (build code), or Docker which has a pre-built image. + +### Via Docker + +This method allows you to get started quickly without having to build the distribution code. + +```bash +LLAMA_STACK_PORT=8321 +docker run \ + -it \ + --pull always \ + -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ + -v ./run.yaml:/root/my-run.yaml \ + llamastack/distribution-nvidia \ + --yaml-config /root/my-run.yaml \ + --port $LLAMA_STACK_PORT \ + --env NVIDIA_API_KEY=$NVIDIA_API_KEY +``` + +### Via Conda + +```bash +INFERENCE_MODEL=meta-llama/Llama-3.1-8b-Instruct +llama stack build --template nvidia --image-type conda +llama stack run ./run.yaml \ + --port 8321 \ + --env NVIDIA_API_KEY=$NVIDIA_API_KEY \ + --env INFERENCE_MODEL=$INFERENCE_MODEL +``` + +### Via venv + +If you've set up your local development environment, you can also build the image using your local virtual environment. + +```bash +INFERENCE_MODEL=meta-llama/Llama-3.1-8b-Instruct +llama stack build --template nvidia --image-type venv +llama stack run ./run.yaml \ + --port 8321 \ + --env NVIDIA_API_KEY=$NVIDIA_API_KEY \ + --env INFERENCE_MODEL=$INFERENCE_MODEL +``` + +### Example Notebooks +You can reference the Jupyter notebooks in `docs/notebooks/nvidia/` for example usage of these APIs. +- [Llama_Stack_NVIDIA_E2E_Flow.ipynb](/docs/notebooks/nvidia/Llama_Stack_NVIDIA_E2E_Flow.ipynb) contains an end-to-end workflow for running inference, customizing, and evaluating models using your deployed NeMo Microservices platform.