# NVIDIA Distribution

The `llamastack/distribution-nvidia` distribution consists of the following provider configurations.

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `remote::nvidia` |
| post_training | `remote::nvidia` |
| safety | `remote::nvidia` |
| scoring | `inline::basic` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `inline::rag-runtime` |
| vector_io | `inline::faiss` |

### Environment Variables

The following environment variables can be configured:

- `NVIDIA_API_KEY`: NVIDIA API Key (default: ``)
- `NVIDIA_USER_ID`: NVIDIA User ID (default: `llama-stack-user`)
- `NVIDIA_DATASET_NAMESPACE`: NVIDIA Dataset Namespace (default: `default`)
- `NVIDIA_ACCESS_POLICIES`: NVIDIA Access Policies (default: `{}`)
- `NVIDIA_PROJECT_ID`: NVIDIA Project ID (default: `test-project`)
- `NVIDIA_CUSTOMIZER_URL`: NVIDIA Customizer URL (default: `https://customizer.api.nvidia.com`)
- `NVIDIA_OUTPUT_MODEL_DIR`: NVIDIA Output Model Directory (default: `test-example-model@v1`)
- `GUARDRAILS_SERVICE_URL`: URL for the NeMo Guardrails Service (default: `http://0.0.0.0:7331`)
- `INFERENCE_MODEL`: Inference model (default: `Llama3.1-8B-Instruct`)
- `SAFETY_MODEL`: Name of the model to use for safety (default: `meta/llama-3.1-8b-instruct`)

### Models

The following models are available by default:

- `meta/llama3-8b-instruct` (aliases: `meta-llama/Llama-3-8B-Instruct`)
- `meta/llama3-70b-instruct` (aliases: `meta-llama/Llama-3-70B-Instruct`)
- `meta/llama-3.1-8b-instruct` (aliases: `meta-llama/Llama-3.1-8B-Instruct`)
- `meta/llama-3.1-70b-instruct` (aliases: `meta-llama/Llama-3.1-70B-Instruct`)
- `meta/llama-3.1-405b-instruct` (aliases: `meta-llama/Llama-3.1-405B-Instruct-FP8`)
- `meta/llama-3.2-1b-instruct` (aliases: `meta-llama/Llama-3.2-1B-Instruct`)
- `meta/llama-3.2-3b-instruct` (aliases: `meta-llama/Llama-3.2-3B-Instruct`)
- `meta/llama-3.2-11b-vision-instruct` (aliases: `meta-llama/Llama-3.2-11B-Vision-Instruct`)
- `meta/llama-3.2-90b-vision-instruct` (aliases: `meta-llama/Llama-3.2-90B-Vision-Instruct`)
- `nvidia/llama-3.2-nv-embedqa-1b-v2`
- `nvidia/nv-embedqa-e5-v5`
- `nvidia/nv-embedqa-mistral-7b-v2`
- `snowflake/arctic-embed-l`

## Prerequisites

### NVIDIA API Keys

Make sure you have access to an NVIDIA API Key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). Use this key for the `NVIDIA_API_KEY` environment variable.
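To confirm the key works before setting anything else up, you can send a minimal request to the hosted NIM preview API. This is a quick sketch assuming the hosted endpoint at `https://integrate.api.nvidia.com` accepts OpenAI-compatible chat completion requests with bearer-token authentication; the `nvapi-...` value is a placeholder for your own key:

```sh
# Sanity-check the API key against the hosted NIM preview API
# (assumes an OpenAI-compatible /v1/chat/completions route).
export NVIDIA_API_KEY="nvapi-..."  # placeholder: your key from build.nvidia.com

curl https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32
  }'
```

A `200` response containing a completion confirms the key is valid; a `401` means the key is missing or incorrect.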
### Deploy NeMo Microservices Platform

The NVIDIA NeMo microservices platform supports end-to-end microservice deployment of a complete AI flywheel on your Kubernetes cluster through the NeMo Microservices Helm Chart. Refer to the [NVIDIA NeMo Microservices documentation](https://aire.gitlab-master-pages.nvidia.com/microservices/documentation/latest/nemo-microservices/latest-early_access/set-up/deploy-as-platform/index.html) for platform prerequisites and instructions to install and deploy the platform.

## Supported Services

Each Llama Stack API corresponds to a specific NeMo microservice. The core microservices (Customizer, Evaluator, Guardrails) are all exposed through the same endpoint, while the platform components (Data Store) are each exposed through separate endpoints.

### Inference: NVIDIA NIM

NVIDIA NIM is used for running inference with registered models. There are two ways to access NVIDIA NIMs:

1. Hosted (default): Preview APIs hosted at https://integrate.api.nvidia.com (requires an API key)
2. Self-hosted: NVIDIA NIMs that run on your own infrastructure

The deployed platform includes the NIM Proxy microservice, which provides access to your NIMs (for example, to run inference on a model). Set the `NVIDIA_BASE_URL` environment variable to point to your NVIDIA NIM Proxy deployment.

### Datasetio API: NeMo Data Store

The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposes APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use that client to interact with the Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint.

See the [NVIDIA Datasetio docs](/llama_stack/providers/remote/datasetio/nvidia/README.md) for supported features and example usage.

### Eval API: NeMo Evaluator

The NeMo Evaluator microservice supports evaluation of LLMs. Launching an evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint.

See the [NVIDIA Eval docs](/llama_stack/providers/remote/eval/nvidia/README.md) for supported features and example usage.

### Post-Training API: NeMo Customizer

The NeMo Customizer microservice supports fine-tuning models. You can reference [this list of supported models](/llama_stack/providers/remote/post_training/nvidia/models.py) that can be fine-tuned using Llama Stack. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint.

See the [NVIDIA Post-Training docs](/llama_stack/providers/remote/post_training/nvidia/README.md) for supported features and example usage.

### Safety API: NeMo Guardrails

The NeMo Guardrails microservice sits between your application and the LLM and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint.

See the NVIDIA Safety docs for supported features and example usage.

## Deploying models

To use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`.

Note: For improved inference speeds, use NIM with the `fast_outlines` guided decoding system (specified in the request body). This is the default if you deployed the platform with the NeMo Microservices Helm Chart.

```sh
# URL of the NeMo NIM Proxy service
export NEMO_URL="http://nemo.test"

curl --location "$NEMO_URL/v1/deployment/model-deployments" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "llama-3.2-1b-instruct",
    "namespace": "meta",
    "config": {
      "model": "meta/llama-3.2-1b-instruct",
      "nim_deployment": {
        "image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct",
        "image_tag": "1.8.3",
        "pvc_size": "25Gi",
        "gpu": 1,
        "additional_envs": {
          "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
        }
      }
    }
  }'
```

This NIM deployment should take approximately 10 minutes to go live. [See the docs](https://aire.gitlab-master-pages.nvidia.com/microservices/documentation/latest/nemo-microservices/latest-early_access/get-started/tutorials/deploy-nims.html#) for more information on how to deploy a NIM and verify it's available for inference.
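While you wait, you can inspect the deployment's status. The sketch below assumes the deployment-management API supports a `GET` on the same `model-deployments` resource that the examples in this section create and delete; that read path is an inference from the POST/DELETE routes shown here, not a documented guarantee, so adjust it to match your platform version:

```sh
export NEMO_URL="http://nemo.test"

# Inspect the deployment created above (assumed GET route); re-run
# until the returned status indicates the NIM is ready for inference.
curl "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.2-1b-instruct" \
  -H 'accept: application/json'
```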
You can also remove a deployed NIM to free up GPU resources, if needed.

```sh
export NEMO_URL="http://nemo.test"

curl -X DELETE "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.2-1b-instruct"
```

## Running Llama Stack with NVIDIA

You can run the distribution via Docker, which uses a pre-built image, or via Conda or venv, which build the distribution code locally.

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
LLAMA_STACK_PORT=8321
docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ./run.yaml:/root/my-run.yaml \
  llamastack/distribution-nvidia \
  --yaml-config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env NVIDIA_API_KEY=$NVIDIA_API_KEY
```

### Via Conda

```bash
INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
llama stack build --template nvidia --image-type conda
llama stack run ./run.yaml \
  --port 8321 \
  --env NVIDIA_API_KEY=$NVIDIA_API_KEY \
  --env INFERENCE_MODEL=$INFERENCE_MODEL
```

### Via venv

If you've set up your local development environment, you can also build the image using your local virtual environment.

```bash
INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
llama stack build --template nvidia --image-type venv
llama stack run ./run.yaml \
  --port 8321 \
  --env NVIDIA_API_KEY=$NVIDIA_API_KEY \
  --env INFERENCE_MODEL=$INFERENCE_MODEL
```

### Example Notebooks

You can reference the Jupyter notebooks in `docs/notebooks/nvidia/` for example usage of these APIs.

- [Llama_Stack_NVIDIA_E2E_Flow.ipynb](/docs/notebooks/nvidia/Llama_Stack_NVIDIA_E2E_Flow.ipynb) contains an end-to-end workflow for running inference, customizing, and evaluating models using your deployed NeMo Microservices platform.
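Once the server is running via any of the methods above, a quick sanity check is to list the models it has registered. This is a minimal sketch assuming the server is listening on `localhost:8321` and exposes the Llama Stack `GET /v1/models` route:

```sh
# List the models registered with the running Llama Stack server
# (assumes the default port of 8321 used in the examples above).
curl -s http://localhost:8321/v1/models \
  -H 'accept: application/json'
```

The response should include the default models listed earlier in this page; if it does not, check that `NVIDIA_API_KEY` (and, for self-hosted NIMs, `NVIDIA_BASE_URL`) were passed through to the server correctly.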