From 5bb3817c49649133f279c5d5cb449aa86429f81b Mon Sep 17 00:00:00 2001
From: Ben Browning
Date: Mon, 7 Jul 2025 18:50:05 -0400
Subject: [PATCH] fix: Restore the nvidia distro (#2639)

# What does this PR do?

The `nvidia` distro was previously collapsed into the `starter` distro.
However, the `nvidia` distro was set up specifically to use NVIDIA NeMo
microservices as providers for all APIs and not just inference, which means it
was doing quite a bit more than what the `starter` distro covers today.

We should work with our friends at NVIDIA to determine the best place to
maintain this distro long-term, but for now this restores the `nvidia` distro
and its docs back to where they were so that things continue to work for their
users.

## Test Plan

I ensured the `nvidia` distro could build and run at least to the point of
complaining that I didn't provide the necessary API keys.

```
uv run llama stack build --template nvidia --image-type venv
uv run llama stack run llama_stack/templates/nvidia/run.yaml
```

I also made sure the docs website builds and looks reasonable, with the
`nvidia` distro docs at the same URL as before (because it has incoming links
from official NVIDIA NeMo docs, among other places).

```
uv run --group docs sphinx-autobuild docs/source docs/build/html --write-all
```

Signed-off-by: Ben Browning
---
 .../distributions/list_of_distributions.md   |   7 +
 .../self_hosted_distro/nvidia.md             | 177 ++++++++++++++
 llama_stack/templates/nvidia/__init__.py     |   7 +
 llama_stack/templates/nvidia/build.yaml      |  29 +++
 llama_stack/templates/nvidia/doc_template.md | 149 ++++++++++++
 llama_stack/templates/nvidia/nvidia.py       | 150 ++++++++++++
 .../templates/nvidia/run-with-safety.yaml    | 119 +++++++++
 llama_stack/templates/nvidia/run.yaml        | 226 ++++++++++++++++++
 8 files changed, 864 insertions(+)
 create mode 100644 docs/source/distributions/self_hosted_distro/nvidia.md
 create mode 100644 llama_stack/templates/nvidia/__init__.py
 create mode 100644 llama_stack/templates/nvidia/build.yaml
 create mode 100644 llama_stack/templates/nvidia/doc_template.md
 create mode 100644 llama_stack/templates/nvidia/nvidia.py
 create mode 100644 llama_stack/templates/nvidia/run-with-safety.yaml
 create mode 100644 llama_stack/templates/nvidia/run.yaml

diff --git a/docs/source/distributions/list_of_distributions.md b/docs/source/distributions/list_of_distributions.md
index e468c3afa..ee01c92c4 100644
--- a/docs/source/distributions/list_of_distributions.md
+++ b/docs/source/distributions/list_of_distributions.md
@@ -39,6 +39,13 @@ docker pull llama-stack/distribution-meta-reference-gpu
 
 **Guides:** [Meta Reference GPU Guide](self_hosted_distro/meta-reference-gpu)
 
+### 🖥️ Self-Hosted with NVIDIA NeMo Microservices
+
+**Use `nvidia` if you:**
+- Want to use Llama Stack with NVIDIA NeMo Microservices
+
+**Guides:** [NVIDIA Distribution Guide](self_hosted_distro/nvidia)
+
 ### ☁️ Managed Hosting
 
 **Use remote-hosted endpoints if you:**
diff --git a/docs/source/distributions/self_hosted_distro/nvidia.md b/docs/source/distributions/self_hosted_distro/nvidia.md
new file mode 100644
index 000000000..47e38f73d
--- /dev/null
+++ b/docs/source/distributions/self_hosted_distro/nvidia.md
@@ -0,0 +1,177 @@
+
+# NVIDIA Distribution
+
+The `llamastack/distribution-nvidia` distribution consists of the following provider configurations.
+
+| API | Provider(s) |
+|-----|-------------|
+| agents | `inline::meta-reference` |
+| datasetio | `inline::localfs`, `remote::nvidia` |
+| eval | `remote::nvidia` |
+| inference | `remote::nvidia` |
+| post_training | `remote::nvidia` |
+| safety | `remote::nvidia` |
+| scoring | `inline::basic` |
+| telemetry | `inline::meta-reference` |
+| tool_runtime | `inline::rag-runtime` |
+| vector_io | `inline::faiss` |
+
+
+### Environment Variables
+
+The following environment variables can be configured:
+
+- `NVIDIA_API_KEY`: NVIDIA API Key (default: ``)
+- `NVIDIA_APPEND_API_VERSION`: Whether to append the API version to the base_url (default: `True`)
+- `NVIDIA_DATASET_NAMESPACE`: NVIDIA Dataset Namespace (default: `default`)
+- `NVIDIA_PROJECT_ID`: NVIDIA Project ID (default: `test-project`)
+- `NVIDIA_CUSTOMIZER_URL`: NVIDIA Customizer URL (default: `https://customizer.api.nvidia.com`)
+- `NVIDIA_OUTPUT_MODEL_DIR`: NVIDIA Output Model Directory (default: `test-example-model@v1`)
+- `GUARDRAILS_SERVICE_URL`: URL for the NeMo Guardrails Service (default: `http://0.0.0.0:7331`)
+- `NVIDIA_GUARDRAILS_CONFIG_ID`: NVIDIA Guardrail Configuration ID (default: `self-check`)
+- `NVIDIA_EVALUATOR_URL`: URL for the NeMo Evaluator Service (default: `http://0.0.0.0:7331`)
+- `INFERENCE_MODEL`: Inference model (default: `Llama3.1-8B-Instruct`)
+- `SAFETY_MODEL`: Name of the model to use for safety (default: `meta/llama-3.1-8b-instruct`)
+
+### Models
+
+The following models are available by default:
+
+- `meta/llama3-8b-instruct (aliases: meta-llama/Llama-3-8B-Instruct)`
+- `meta/llama3-70b-instruct (aliases: meta-llama/Llama-3-70B-Instruct)`
+- `meta/llama-3.1-8b-instruct (aliases: meta-llama/Llama-3.1-8B-Instruct)`
+- `meta/llama-3.1-70b-instruct (aliases: meta-llama/Llama-3.1-70B-Instruct)`
+- `meta/llama-3.1-405b-instruct (aliases: meta-llama/Llama-3.1-405B-Instruct-FP8)`
+- `meta/llama-3.2-1b-instruct (aliases: meta-llama/Llama-3.2-1B-Instruct)`
+- `meta/llama-3.2-3b-instruct (aliases: meta-llama/Llama-3.2-3B-Instruct)`
+- `meta/llama-3.2-11b-vision-instruct (aliases: meta-llama/Llama-3.2-11B-Vision-Instruct)`
+- `meta/llama-3.2-90b-vision-instruct (aliases: meta-llama/Llama-3.2-90B-Vision-Instruct)`
+- `meta/llama-3.3-70b-instruct (aliases: meta-llama/Llama-3.3-70B-Instruct)`
+- `nvidia/llama-3.2-nv-embedqa-1b-v2 `
+- `nvidia/nv-embedqa-e5-v5 `
+- `nvidia/nv-embedqa-mistral-7b-v2 `
+- `snowflake/arctic-embed-l `
+
+
+## Prerequisites
+### NVIDIA API Keys
+
+Make sure you have access to an NVIDIA API Key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). Use this key for the `NVIDIA_API_KEY` environment variable.
+
+### Deploy NeMo Microservices Platform
+The NVIDIA NeMo microservices platform supports end-to-end microservice deployment of a complete AI flywheel on your Kubernetes cluster through the NeMo Microservices Helm Chart. Please reference the [NVIDIA NeMo Microservices documentation](https://docs.nvidia.com/nemo/microservices/latest/about/index.html) for platform prerequisites and instructions to install and deploy the platform.
+
+## Supported Services
+Each Llama Stack API corresponds to a specific NeMo microservice. The core microservices (Customizer, Evaluator, Guardrails) are exposed by the same endpoint. The platform components (Data Store) are each exposed by separate endpoints.
+
+### Inference: NVIDIA NIM
+NVIDIA NIM is used for running inference with registered models. There are two ways to access NVIDIA NIMs:
+ 1. Hosted (default): Preview APIs hosted at https://integrate.api.nvidia.com (requires an API key)
+ 2. Self-hosted: NVIDIA NIMs that run on your own infrastructure.
+
+The deployed platform includes the NIM Proxy microservice, which is the service that provides access to your NIMs (for example, to run inference on a model). Set the `NVIDIA_BASE_URL` environment variable to use your NVIDIA NIM Proxy deployment.
+
+### Datasetio API: NeMo Data Store
+The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposes APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use the client to interact with Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint.
+
+See the {repopath}`NVIDIA Datasetio docs::llama_stack/providers/remote/datasetio/nvidia/README.md` for supported features and example usage.
+
+### Eval API: NeMo Evaluator
+The NeMo Evaluator microservice supports evaluation of LLMs. Launching an Evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint.
+
+See the {repopath}`NVIDIA Eval docs::llama_stack/providers/remote/eval/nvidia/README.md` for supported features and example usage.
+
+### Post-Training API: NeMo Customizer
+The NeMo Customizer microservice supports fine-tuning models. You can reference {repopath}`this list of supported models::llama_stack/providers/remote/post_training/nvidia/models.py` that can be fine-tuned using Llama Stack. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint.
+
+See the {repopath}`NVIDIA Post-Training docs::llama_stack/providers/remote/post_training/nvidia/README.md` for supported features and example usage.
+
+### Safety API: NeMo Guardrails
+The NeMo Guardrails microservice sits between your application and the LLM, and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint.
+
+See the {repopath}`NVIDIA Safety docs::llama_stack/providers/remote/safety/nvidia/README.md` for supported features and example usage.
+
+## Deploying models
+In order to use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`.
+
+Note: For improved inference speeds, we need to use NIM with the `fast_outlines` guided decoding system (specified in the request body). This is the default if you deployed the platform with the NeMo Microservices Helm Chart.
+```sh
+# URL to NeMo NIM Proxy service
+export NEMO_URL="http://nemo.test"
+
+curl --location "$NEMO_URL/v1/deployment/model-deployments" \
+  -H 'accept: application/json' \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "name": "llama-3.2-1b-instruct",
+    "namespace": "meta",
+    "config": {
+      "model": "meta/llama-3.2-1b-instruct",
+      "nim_deployment": {
+        "image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct",
+        "image_tag": "1.8.3",
+        "pvc_size": "25Gi",
+        "gpu": 1,
+        "additional_envs": {
+          "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
+        }
+      }
+    }
+  }'
+```
+This NIM deployment should take approximately 10 minutes to go live. [See the docs](https://docs.nvidia.com/nemo/microservices/latest/get-started/tutorials/deploy-nims.html) for more information on how to deploy a NIM and verify it's available for inference.
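+
+If you want to script this check instead of watching the console, the sketch below polls the same deployment endpoint. It is illustrative only: it assumes the deployment management API accepts a `GET` on the per-deployment path used by the `DELETE` example below, and the response schema is platform-specific, so it simply prints whatever comes back.
+```python
+# Illustrative sketch: check on the NIM deployment created above.
+# Assumes GET is supported on the same per-deployment path as the DELETE example below.
+import os
+
+import requests
+
+nemo_url = os.environ.get("NEMO_URL", "http://nemo.test")
+deployment = "meta/llama-3.2-1b-instruct"  # namespace/name used above
+
+resp = requests.get(f"{nemo_url}/v1/deployment/model-deployments/{deployment}")
+resp.raise_for_status()
+print(resp.json())  # inspect the reported status before sending inference requests
+```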
+
+You can also remove a deployed NIM to free up GPU resources, if needed.
+```sh
+export NEMO_URL="http://nemo.test"
+
+curl -X DELETE "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct"
+```
+
+## Running Llama Stack with NVIDIA
+
+You can do this via Conda or venv (by building the code), or via Docker, which has a pre-built image.
+
+### Via Docker
+
+This method allows you to get started quickly without having to build the distribution code.
+
+```bash
+LLAMA_STACK_PORT=8321
+docker run \
+  -it \
+  --pull always \
+  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
+  -v ./run.yaml:/root/my-run.yaml \
+  llamastack/distribution-nvidia \
+  --config /root/my-run.yaml \
+  --port $LLAMA_STACK_PORT \
+  --env NVIDIA_API_KEY=$NVIDIA_API_KEY
+```
+
+### Via Conda
+
+```bash
+INFERENCE_MODEL=meta-llama/Llama-3.1-8b-Instruct
+llama stack build --template nvidia --image-type conda
+llama stack run ./run.yaml \
+  --port 8321 \
+  --env NVIDIA_API_KEY=$NVIDIA_API_KEY \
+  --env INFERENCE_MODEL=$INFERENCE_MODEL
+```
+
+### Via venv
+
+If you've set up your local development environment, you can also build the image using your local virtual environment.
+
+```bash
+INFERENCE_MODEL=meta-llama/Llama-3.1-8b-Instruct
+llama stack build --template nvidia --image-type venv
+llama stack run ./run.yaml \
+  --port 8321 \
+  --env NVIDIA_API_KEY=$NVIDIA_API_KEY \
+  --env INFERENCE_MODEL=$INFERENCE_MODEL
+```
+
+## Example Notebooks
+For examples of how to use the NVIDIA Distribution to run inference, fine-tune, evaluate, and run safety checks on your LLMs, you can reference the example notebooks in {repopath}`docs/notebooks/nvidia`.
diff --git a/llama_stack/templates/nvidia/__init__.py b/llama_stack/templates/nvidia/__init__.py
new file mode 100644
index 000000000..24e2fbd21
--- /dev/null
+++ b/llama_stack/templates/nvidia/__init__.py
@@ -0,0 +1,7 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the terms described in the LICENSE file in
+# the root directory of this source tree.
+
+from .nvidia import get_distribution_template  # noqa: F401
diff --git a/llama_stack/templates/nvidia/build.yaml b/llama_stack/templates/nvidia/build.yaml
new file mode 100644
index 000000000..51685b2e3
--- /dev/null
+++ b/llama_stack/templates/nvidia/build.yaml
@@ -0,0 +1,29 @@
+version: 2
+distribution_spec:
+  description: Use NVIDIA NIM for running LLM inference, evaluation and safety
+  providers:
+    inference:
+    - remote::nvidia
+    vector_io:
+    - inline::faiss
+    safety:
+    - remote::nvidia
+    agents:
+    - inline::meta-reference
+    telemetry:
+    - inline::meta-reference
+    eval:
+    - remote::nvidia
+    post_training:
+    - remote::nvidia
+    datasetio:
+    - inline::localfs
+    - remote::nvidia
+    scoring:
+    - inline::basic
+    tool_runtime:
+    - inline::rag-runtime
+image_type: conda
+additional_pip_packages:
+- aiosqlite
+- sqlalchemy[asyncio]
diff --git a/llama_stack/templates/nvidia/doc_template.md b/llama_stack/templates/nvidia/doc_template.md
new file mode 100644
index 000000000..3cb8245df
--- /dev/null
+++ b/llama_stack/templates/nvidia/doc_template.md
@@ -0,0 +1,149 @@
+# NVIDIA Distribution
+
+The `llamastack/distribution-{{ name }}` distribution consists of the following provider configurations.
+
+{{ providers_table }}
+
+{% if run_config_env_vars %}
+### Environment Variables
+
+The following environment variables can be configured:
+
+{% for var, (default_value, description) in run_config_env_vars.items() %}
+- `{{ var }}`: {{ description }} (default: `{{ default_value }}`)
+{% endfor %}
+{% endif %}
+
+{% if default_models %}
+### Models
+
+The following models are available by default:
+
+{% for model in default_models %}
+- `{{ model.model_id }} {{ model.doc_string }}`
+{% endfor %}
+{% endif %}
+
+
+## Prerequisites
+### NVIDIA API Keys
+
+Make sure you have access to an NVIDIA API Key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). Use this key for the `NVIDIA_API_KEY` environment variable.
+
+### Deploy NeMo Microservices Platform
+The NVIDIA NeMo microservices platform supports end-to-end microservice deployment of a complete AI flywheel on your Kubernetes cluster through the NeMo Microservices Helm Chart. Please reference the [NVIDIA NeMo Microservices documentation](https://docs.nvidia.com/nemo/microservices/latest/about/index.html) for platform prerequisites and instructions to install and deploy the platform.
+
+## Supported Services
+Each Llama Stack API corresponds to a specific NeMo microservice. The core microservices (Customizer, Evaluator, Guardrails) are exposed by the same endpoint. The platform components (Data Store) are each exposed by separate endpoints.
+
+### Inference: NVIDIA NIM
+NVIDIA NIM is used for running inference with registered models. There are two ways to access NVIDIA NIMs:
+ 1. Hosted (default): Preview APIs hosted at https://integrate.api.nvidia.com (requires an API key)
+ 2. Self-hosted: NVIDIA NIMs that run on your own infrastructure.
+
+The deployed platform includes the NIM Proxy microservice, which is the service that provides access to your NIMs (for example, to run inference on a model). Set the `NVIDIA_BASE_URL` environment variable to use your NVIDIA NIM Proxy deployment.
+
+### Datasetio API: NeMo Data Store
+The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposes APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use the client to interact with Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint.
+
+See the {repopath}`NVIDIA Datasetio docs::llama_stack/providers/remote/datasetio/nvidia/README.md` for supported features and example usage.
+
+### Eval API: NeMo Evaluator
+The NeMo Evaluator microservice supports evaluation of LLMs. Launching an Evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint.
+
+See the {repopath}`NVIDIA Eval docs::llama_stack/providers/remote/eval/nvidia/README.md` for supported features and example usage.
+
+### Post-Training API: NeMo Customizer
+The NeMo Customizer microservice supports fine-tuning models. You can reference {repopath}`this list of supported models::llama_stack/providers/remote/post_training/nvidia/models.py` that can be fine-tuned using Llama Stack. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint.
+
+See the {repopath}`NVIDIA Post-Training docs::llama_stack/providers/remote/post_training/nvidia/README.md` for supported features and example usage.
+
+### Safety API: NeMo Guardrails
+The NeMo Guardrails microservice sits between your application and the LLM, and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint.
+
+See the {repopath}`NVIDIA Safety docs::llama_stack/providers/remote/safety/nvidia/README.md` for supported features and example usage.
+
+## Deploying models
+In order to use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`.
+
+Note: For improved inference speeds, we need to use NIM with the `fast_outlines` guided decoding system (specified in the request body). This is the default if you deployed the platform with the NeMo Microservices Helm Chart.
+```sh
+# URL to NeMo NIM Proxy service
+export NEMO_URL="http://nemo.test"
+
+curl --location "$NEMO_URL/v1/deployment/model-deployments" \
+  -H 'accept: application/json' \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "name": "llama-3.2-1b-instruct",
+    "namespace": "meta",
+    "config": {
+      "model": "meta/llama-3.2-1b-instruct",
+      "nim_deployment": {
+        "image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct",
+        "image_tag": "1.8.3",
+        "pvc_size": "25Gi",
+        "gpu": 1,
+        "additional_envs": {
+          "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
+        }
+      }
+    }
+  }'
+```
+This NIM deployment should take approximately 10 minutes to go live. [See the docs](https://docs.nvidia.com/nemo/microservices/latest/get-started/tutorials/deploy-nims.html) for more information on how to deploy a NIM and verify it's available for inference.
+
+You can also remove a deployed NIM to free up GPU resources, if needed.
+```sh
+export NEMO_URL="http://nemo.test"
+
+curl -X DELETE "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct"
+```
+
+## Running Llama Stack with NVIDIA
+
+You can do this via Conda or venv (by building the code), or via Docker, which has a pre-built image.
+
+### Via Docker
+
+This method allows you to get started quickly without having to build the distribution code.
+
+```bash
+LLAMA_STACK_PORT=8321
+docker run \
+  -it \
+  --pull always \
+  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
+  -v ./run.yaml:/root/my-run.yaml \
+  llamastack/distribution-{{ name }} \
+  --config /root/my-run.yaml \
+  --port $LLAMA_STACK_PORT \
+  --env NVIDIA_API_KEY=$NVIDIA_API_KEY
+```
+
+### Via Conda
+
+```bash
+INFERENCE_MODEL=meta-llama/Llama-3.1-8b-Instruct
+llama stack build --template nvidia --image-type conda
+llama stack run ./run.yaml \
+  --port 8321 \
+  --env NVIDIA_API_KEY=$NVIDIA_API_KEY \
+  --env INFERENCE_MODEL=$INFERENCE_MODEL
+```
+
+### Via venv
+
+If you've set up your local development environment, you can also build the image using your local virtual environment.
+
+```bash
+INFERENCE_MODEL=meta-llama/Llama-3.1-8b-Instruct
+llama stack build --template nvidia --image-type venv
+llama stack run ./run.yaml \
+  --port 8321 \
+  --env NVIDIA_API_KEY=$NVIDIA_API_KEY \
+  --env INFERENCE_MODEL=$INFERENCE_MODEL
+```
+
+## Example Notebooks
+For examples of how to use the NVIDIA Distribution to run inference, fine-tune, evaluate, and run safety checks on your LLMs, you can reference the example notebooks in {repopath}`docs/notebooks/nvidia`.
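+
+As a companion to the Datasetio section above, here is a minimal sketch of talking to the NeMo Data Store through the Hugging Face Hub client. It is illustrative only: the namespace and file names are placeholders, and depending on your deployment the Hugging Face-compatible API may be served under a sub-path of `NVIDIA_DATASETS_URL`, so check the Datasetio README linked above first.
+```python
+# Illustrative sketch: push a training file to the NeMo Data Store via HfApi.
+import os
+
+from huggingface_hub import HfApi
+
+# Point the client at your Data Store; adjust the path if your deployment
+# serves the Hugging Face-compatible API under a sub-path.
+api = HfApi(endpoint=os.environ.get("NVIDIA_DATASETS_URL", "http://nemo.test"))
+
+repo_id = "default/sample-dataset"  # <namespace>/<dataset name>, placeholder values
+api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
+api.upload_file(
+    path_or_fileobj="train.jsonl",  # local file to upload
+    path_in_repo="training/train.jsonl",
+    repo_id=repo_id,
+    repo_type="dataset",
+)
+```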
diff --git a/llama_stack/templates/nvidia/nvidia.py b/llama_stack/templates/nvidia/nvidia.py new file mode 100644 index 000000000..4eccfb25c --- /dev/null +++ b/llama_stack/templates/nvidia/nvidia.py @@ -0,0 +1,150 @@ +# Copyright (c) Meta Platforms, Inc. and affiliates. +# All rights reserved. +# +# This source code is licensed under the terms described in the LICENSE file in +# the root directory of this source tree. + +from pathlib import Path + +from llama_stack.distribution.datatypes import ModelInput, Provider, ShieldInput, ToolGroupInput +from llama_stack.providers.remote.datasetio.nvidia import NvidiaDatasetIOConfig +from llama_stack.providers.remote.eval.nvidia import NVIDIAEvalConfig +from llama_stack.providers.remote.inference.nvidia import NVIDIAConfig +from llama_stack.providers.remote.inference.nvidia.models import MODEL_ENTRIES +from llama_stack.providers.remote.safety.nvidia import NVIDIASafetyConfig +from llama_stack.templates.template import DistributionTemplate, RunConfigSettings, get_model_registry + + +def get_distribution_template() -> DistributionTemplate: + providers = { + "inference": ["remote::nvidia"], + "vector_io": ["inline::faiss"], + "safety": ["remote::nvidia"], + "agents": ["inline::meta-reference"], + "telemetry": ["inline::meta-reference"], + "eval": ["remote::nvidia"], + "post_training": ["remote::nvidia"], + "datasetio": ["inline::localfs", "remote::nvidia"], + "scoring": ["inline::basic"], + "tool_runtime": ["inline::rag-runtime"], + } + + inference_provider = Provider( + provider_id="nvidia", + provider_type="remote::nvidia", + config=NVIDIAConfig.sample_run_config(), + ) + safety_provider = Provider( + provider_id="nvidia", + provider_type="remote::nvidia", + config=NVIDIASafetyConfig.sample_run_config(), + ) + datasetio_provider = Provider( + provider_id="nvidia", + provider_type="remote::nvidia", + config=NvidiaDatasetIOConfig.sample_run_config(), + ) + eval_provider = Provider( + provider_id="nvidia", + provider_type="remote::nvidia", + config=NVIDIAEvalConfig.sample_run_config(), + ) + inference_model = ModelInput( + model_id="${env.INFERENCE_MODEL}", + provider_id="nvidia", + ) + safety_model = ModelInput( + model_id="${env.SAFETY_MODEL}", + provider_id="nvidia", + ) + + available_models = { + "nvidia": MODEL_ENTRIES, + } + default_tool_groups = [ + ToolGroupInput( + toolgroup_id="builtin::rag", + provider_id="rag-runtime", + ), + ] + + default_models = get_model_registry(available_models) + return DistributionTemplate( + name="nvidia", + distro_type="self_hosted", + description="Use NVIDIA NIM for running LLM inference, evaluation and safety", + container_image=None, + template_path=Path(__file__).parent / "doc_template.md", + providers=providers, + available_models_by_provider=available_models, + run_configs={ + "run.yaml": RunConfigSettings( + provider_overrides={ + "inference": [inference_provider], + "datasetio": [datasetio_provider], + "eval": [eval_provider], + }, + default_models=default_models, + default_tool_groups=default_tool_groups, + ), + "run-with-safety.yaml": RunConfigSettings( + provider_overrides={ + "inference": [ + inference_provider, + safety_provider, + ], + "eval": [eval_provider], + }, + default_models=[inference_model, safety_model], + default_shields=[ShieldInput(shield_id="${env.SAFETY_MODEL}", provider_id="nvidia")], + default_tool_groups=default_tool_groups, + ), + }, + run_config_env_vars={ + "NVIDIA_API_KEY": ( + "", + "NVIDIA API Key", + ), + "NVIDIA_APPEND_API_VERSION": ( + "True", + "Whether to append the API 
version to the base_url", + ), + ## Nemo Customizer related variables + "NVIDIA_DATASET_NAMESPACE": ( + "default", + "NVIDIA Dataset Namespace", + ), + "NVIDIA_PROJECT_ID": ( + "test-project", + "NVIDIA Project ID", + ), + "NVIDIA_CUSTOMIZER_URL": ( + "https://customizer.api.nvidia.com", + "NVIDIA Customizer URL", + ), + "NVIDIA_OUTPUT_MODEL_DIR": ( + "test-example-model@v1", + "NVIDIA Output Model Directory", + ), + "GUARDRAILS_SERVICE_URL": ( + "http://0.0.0.0:7331", + "URL for the NeMo Guardrails Service", + ), + "NVIDIA_GUARDRAILS_CONFIG_ID": ( + "self-check", + "NVIDIA Guardrail Configuration ID", + ), + "NVIDIA_EVALUATOR_URL": ( + "http://0.0.0.0:7331", + "URL for the NeMo Evaluator Service", + ), + "INFERENCE_MODEL": ( + "Llama3.1-8B-Instruct", + "Inference model", + ), + "SAFETY_MODEL": ( + "meta/llama-3.1-8b-instruct", + "Name of the model to use for safety", + ), + }, + ) diff --git a/llama_stack/templates/nvidia/run-with-safety.yaml b/llama_stack/templates/nvidia/run-with-safety.yaml new file mode 100644 index 000000000..7017a5955 --- /dev/null +++ b/llama_stack/templates/nvidia/run-with-safety.yaml @@ -0,0 +1,119 @@ +version: 2 +image_name: nvidia +apis: +- agents +- datasetio +- eval +- inference +- post_training +- safety +- scoring +- telemetry +- tool_runtime +- vector_io +providers: + inference: + - provider_id: nvidia + provider_type: remote::nvidia + config: + url: ${env.NVIDIA_BASE_URL:=https://integrate.api.nvidia.com} + api_key: ${env.NVIDIA_API_KEY:=} + append_api_version: ${env.NVIDIA_APPEND_API_VERSION:=True} + - provider_id: nvidia + provider_type: remote::nvidia + config: + guardrails_service_url: ${env.GUARDRAILS_SERVICE_URL:=http://localhost:7331} + config_id: ${env.NVIDIA_GUARDRAILS_CONFIG_ID:=self-check} + vector_io: + - provider_id: faiss + provider_type: inline::faiss + config: + kvstore: + type: sqlite + db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/nvidia}/faiss_store.db + safety: + - provider_id: nvidia + provider_type: remote::nvidia + config: + guardrails_service_url: ${env.GUARDRAILS_SERVICE_URL:=http://localhost:7331} + config_id: ${env.NVIDIA_GUARDRAILS_CONFIG_ID:=self-check} + agents: + - provider_id: meta-reference + provider_type: inline::meta-reference + config: + persistence_store: + type: sqlite + db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/nvidia}/agents_store.db + responses_store: + type: sqlite + db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/nvidia}/responses_store.db + telemetry: + - provider_id: meta-reference + provider_type: inline::meta-reference + config: + service_name: "${env.OTEL_SERVICE_NAME:=\u200B}" + sinks: ${env.TELEMETRY_SINKS:=console,sqlite} + sqlite_db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/nvidia}/trace_store.db + otel_exporter_otlp_endpoint: ${env.OTEL_EXPORTER_OTLP_ENDPOINT:=} + eval: + - provider_id: nvidia + provider_type: remote::nvidia + config: + evaluator_url: ${env.NVIDIA_EVALUATOR_URL:=http://localhost:7331} + post_training: + - provider_id: nvidia + provider_type: remote::nvidia + config: + api_key: ${env.NVIDIA_API_KEY:=} + dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default} + project_id: ${env.NVIDIA_PROJECT_ID:=test-project} + customizer_url: ${env.NVIDIA_CUSTOMIZER_URL:=http://nemo.test} + datasetio: + - provider_id: localfs + provider_type: inline::localfs + config: + kvstore: + type: sqlite + db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/nvidia}/localfs_datasetio.db + - provider_id: nvidia + provider_type: remote::nvidia + config: + api_key: 
${env.NVIDIA_API_KEY:=} + dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default} + project_id: ${env.NVIDIA_PROJECT_ID:=test-project} + datasets_url: ${env.NVIDIA_DATASETS_URL:=http://nemo.test} + scoring: + - provider_id: basic + provider_type: inline::basic + config: {} + tool_runtime: + - provider_id: rag-runtime + provider_type: inline::rag-runtime + config: {} +metadata_store: + type: sqlite + db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/nvidia}/registry.db +inference_store: + type: sqlite + db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/nvidia}/inference_store.db +models: +- metadata: {} + model_id: ${env.INFERENCE_MODEL} + provider_id: nvidia + model_type: llm +- metadata: {} + model_id: ${env.SAFETY_MODEL} + provider_id: nvidia + model_type: llm +shields: +- shield_id: ${env.SAFETY_MODEL} + provider_id: nvidia +vector_dbs: [] +datasets: [] +scoring_fns: [] +benchmarks: [] +tool_groups: +- toolgroup_id: builtin::rag + provider_id: rag-runtime +server: + port: 8321 diff --git a/llama_stack/templates/nvidia/run.yaml b/llama_stack/templates/nvidia/run.yaml new file mode 100644 index 000000000..ccddf11a2 --- /dev/null +++ b/llama_stack/templates/nvidia/run.yaml @@ -0,0 +1,226 @@ +version: 2 +image_name: nvidia +apis: +- agents +- datasetio +- eval +- inference +- post_training +- safety +- scoring +- telemetry +- tool_runtime +- vector_io +providers: + inference: + - provider_id: nvidia + provider_type: remote::nvidia + config: + url: ${env.NVIDIA_BASE_URL:=https://integrate.api.nvidia.com} + api_key: ${env.NVIDIA_API_KEY:=} + append_api_version: ${env.NVIDIA_APPEND_API_VERSION:=True} + vector_io: + - provider_id: faiss + provider_type: inline::faiss + config: + kvstore: + type: sqlite + db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/nvidia}/faiss_store.db + safety: + - provider_id: nvidia + provider_type: remote::nvidia + config: + guardrails_service_url: ${env.GUARDRAILS_SERVICE_URL:=http://localhost:7331} + config_id: ${env.NVIDIA_GUARDRAILS_CONFIG_ID:=self-check} + agents: + - provider_id: meta-reference + provider_type: inline::meta-reference + config: + persistence_store: + type: sqlite + db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/nvidia}/agents_store.db + responses_store: + type: sqlite + db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/nvidia}/responses_store.db + telemetry: + - provider_id: meta-reference + provider_type: inline::meta-reference + config: + service_name: "${env.OTEL_SERVICE_NAME:=\u200B}" + sinks: ${env.TELEMETRY_SINKS:=console,sqlite} + sqlite_db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/nvidia}/trace_store.db + otel_exporter_otlp_endpoint: ${env.OTEL_EXPORTER_OTLP_ENDPOINT:=} + eval: + - provider_id: nvidia + provider_type: remote::nvidia + config: + evaluator_url: ${env.NVIDIA_EVALUATOR_URL:=http://localhost:7331} + post_training: + - provider_id: nvidia + provider_type: remote::nvidia + config: + api_key: ${env.NVIDIA_API_KEY:=} + dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default} + project_id: ${env.NVIDIA_PROJECT_ID:=test-project} + customizer_url: ${env.NVIDIA_CUSTOMIZER_URL:=http://nemo.test} + datasetio: + - provider_id: nvidia + provider_type: remote::nvidia + config: + api_key: ${env.NVIDIA_API_KEY:=} + dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default} + project_id: ${env.NVIDIA_PROJECT_ID:=test-project} + datasets_url: ${env.NVIDIA_DATASETS_URL:=http://nemo.test} + scoring: + - provider_id: basic + provider_type: inline::basic + config: {} + tool_runtime: + - 
provider_id: rag-runtime + provider_type: inline::rag-runtime + config: {} +metadata_store: + type: sqlite + db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/nvidia}/registry.db +inference_store: + type: sqlite + db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/nvidia}/inference_store.db +models: +- metadata: {} + model_id: meta/llama3-8b-instruct + provider_id: nvidia + provider_model_id: meta/llama3-8b-instruct + model_type: llm +- metadata: {} + model_id: meta-llama/Llama-3-8B-Instruct + provider_id: nvidia + provider_model_id: meta/llama3-8b-instruct + model_type: llm +- metadata: {} + model_id: meta/llama3-70b-instruct + provider_id: nvidia + provider_model_id: meta/llama3-70b-instruct + model_type: llm +- metadata: {} + model_id: meta-llama/Llama-3-70B-Instruct + provider_id: nvidia + provider_model_id: meta/llama3-70b-instruct + model_type: llm +- metadata: {} + model_id: meta/llama-3.1-8b-instruct + provider_id: nvidia + provider_model_id: meta/llama-3.1-8b-instruct + model_type: llm +- metadata: {} + model_id: meta-llama/Llama-3.1-8B-Instruct + provider_id: nvidia + provider_model_id: meta/llama-3.1-8b-instruct + model_type: llm +- metadata: {} + model_id: meta/llama-3.1-70b-instruct + provider_id: nvidia + provider_model_id: meta/llama-3.1-70b-instruct + model_type: llm +- metadata: {} + model_id: meta-llama/Llama-3.1-70B-Instruct + provider_id: nvidia + provider_model_id: meta/llama-3.1-70b-instruct + model_type: llm +- metadata: {} + model_id: meta/llama-3.1-405b-instruct + provider_id: nvidia + provider_model_id: meta/llama-3.1-405b-instruct + model_type: llm +- metadata: {} + model_id: meta-llama/Llama-3.1-405B-Instruct-FP8 + provider_id: nvidia + provider_model_id: meta/llama-3.1-405b-instruct + model_type: llm +- metadata: {} + model_id: meta/llama-3.2-1b-instruct + provider_id: nvidia + provider_model_id: meta/llama-3.2-1b-instruct + model_type: llm +- metadata: {} + model_id: meta-llama/Llama-3.2-1B-Instruct + provider_id: nvidia + provider_model_id: meta/llama-3.2-1b-instruct + model_type: llm +- metadata: {} + model_id: meta/llama-3.2-3b-instruct + provider_id: nvidia + provider_model_id: meta/llama-3.2-3b-instruct + model_type: llm +- metadata: {} + model_id: meta-llama/Llama-3.2-3B-Instruct + provider_id: nvidia + provider_model_id: meta/llama-3.2-3b-instruct + model_type: llm +- metadata: {} + model_id: meta/llama-3.2-11b-vision-instruct + provider_id: nvidia + provider_model_id: meta/llama-3.2-11b-vision-instruct + model_type: llm +- metadata: {} + model_id: meta-llama/Llama-3.2-11B-Vision-Instruct + provider_id: nvidia + provider_model_id: meta/llama-3.2-11b-vision-instruct + model_type: llm +- metadata: {} + model_id: meta/llama-3.2-90b-vision-instruct + provider_id: nvidia + provider_model_id: meta/llama-3.2-90b-vision-instruct + model_type: llm +- metadata: {} + model_id: meta-llama/Llama-3.2-90B-Vision-Instruct + provider_id: nvidia + provider_model_id: meta/llama-3.2-90b-vision-instruct + model_type: llm +- metadata: {} + model_id: meta/llama-3.3-70b-instruct + provider_id: nvidia + provider_model_id: meta/llama-3.3-70b-instruct + model_type: llm +- metadata: {} + model_id: meta-llama/Llama-3.3-70B-Instruct + provider_id: nvidia + provider_model_id: meta/llama-3.3-70b-instruct + model_type: llm +- metadata: + embedding_dimension: 2048 + context_length: 8192 + model_id: nvidia/llama-3.2-nv-embedqa-1b-v2 + provider_id: nvidia + provider_model_id: nvidia/llama-3.2-nv-embedqa-1b-v2 + model_type: embedding +- metadata: + embedding_dimension: 1024 + 
context_length: 512 + model_id: nvidia/nv-embedqa-e5-v5 + provider_id: nvidia + provider_model_id: nvidia/nv-embedqa-e5-v5 + model_type: embedding +- metadata: + embedding_dimension: 4096 + context_length: 512 + model_id: nvidia/nv-embedqa-mistral-7b-v2 + provider_id: nvidia + provider_model_id: nvidia/nv-embedqa-mistral-7b-v2 + model_type: embedding +- metadata: + embedding_dimension: 1024 + context_length: 512 + model_id: snowflake/arctic-embed-l + provider_id: nvidia + provider_model_id: snowflake/arctic-embed-l + model_type: embedding +shields: [] +vector_dbs: [] +datasets: [] +scoring_fns: [] +benchmarks: [] +tool_groups: +- toolgroup_id: builtin::rag + provider_id: rag-runtime +server: + port: 8321
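
A quick way to confirm that one of these run configurations is wired up end to end is to hit the running server with the Python client. This is a sketch rather than part of the distribution: the port and model below match the defaults in run.yaml, but the exact client method names can vary across llama-stack-client versions.

```python
# Illustrative sketch: sanity-check a running `nvidia` distro with the Python client.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# The models registered by run.yaml should show up here.
for model in client.models.list():
    print(model.identifier)

# One chat completion through the remote::nvidia inference provider; the model
# must have a NIM deployed (or be reachable via the hosted preview APIs).
response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response.completion_message.content)
```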