---
orphan: true
---

# NVIDIA Distribution

The `llamastack/distribution-nvidia` distribution consists of the following provider configurations.

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `inline::localfs`, `remote::nvidia` |
| eval | `remote::nvidia` |
| files | `inline::localfs` |
| inference | `remote::nvidia` |
| post_training | `remote::nvidia` |
| safety | `remote::nvidia` |
| scoring | `inline::basic` |
| tool_runtime | `inline::rag-runtime` |
| vector_io | `inline::faiss` |

## Environment Variables

The following environment variables can be configured:

- `NVIDIA_API_KEY`: NVIDIA API Key (default: ``)
- `NVIDIA_APPEND_API_VERSION`: Whether to append the API version to the base_url (default: `True`)
- `NVIDIA_DATASET_NAMESPACE`: NVIDIA Dataset Namespace (default: `default`)
- `NVIDIA_PROJECT_ID`: NVIDIA Project ID (default: `test-project`)
- `NVIDIA_CUSTOMIZER_URL`: NVIDIA Customizer URL (default: `https://customizer.api.nvidia.com`)
- `NVIDIA_OUTPUT_MODEL_DIR`: NVIDIA Output Model Directory (default: `test-example-model@v1`)
- `GUARDRAILS_SERVICE_URL`: URL for the NeMo Guardrails Service (default: `http://0.0.0.0:7331`)
- `NVIDIA_GUARDRAILS_CONFIG_ID`: NVIDIA Guardrail Configuration ID (default: `self-check`)
- `NVIDIA_EVALUATOR_URL`: URL for the NeMo Evaluator Service (default: `http://0.0.0.0:7331`)
- `INFERENCE_MODEL`: Inference model (default: `Llama3.1-8B-Instruct`)
- `SAFETY_MODEL`: Name of the model to use for safety (default: `meta/llama-3.1-8b-instruct`)
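
For example, you might export the variables you plan to override before starting the server; every value below is a placeholder taken from the defaults above or the examples later on this page, not a recommendation:

```bash
# Placeholder values -- substitute your own key and service endpoints
export NVIDIA_API_KEY="nvapi-..."                 # key from https://build.nvidia.com/
export NVIDIA_EVALUATOR_URL="http://nemo.test"    # NeMo Evaluator endpoint
export NVIDIA_CUSTOMIZER_URL="http://nemo.test"   # NeMo Customizer endpoint
export GUARDRAILS_SERVICE_URL="http://nemo.test"  # NeMo Guardrails endpoint
export INFERENCE_MODEL="Llama3.1-8B-Instruct"
```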

## Prerequisites

### NVIDIA API Keys

Make sure you have access to an NVIDIA API key. You can get one by visiting https://build.nvidia.com/. Use this key for the `NVIDIA_API_KEY` environment variable.

### Deploy NeMo Microservices Platform

The NVIDIA NeMo microservices platform supports end-to-end microservice deployment of a complete AI flywheel on your Kubernetes cluster through the NeMo Microservices Helm Chart. Please reference the NVIDIA NeMo Microservices documentation for platform prerequisites and instructions to install and deploy the platform.
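
If you install via Helm, the command has the usual chart-install shape; the chart reference, namespace, and values file below are placeholders, and the authoritative instructions live in the NeMo Microservices documentation:

```bash
# Placeholder chart reference -- take the actual chart location, version, and
# values file from the NVIDIA NeMo Microservices documentation for your release
NEMO_CHART="<chart-reference-from-nemo-docs>"

helm upgrade --install nemo "$NEMO_CHART" \
  --namespace nemo --create-namespace \
  -f values.yaml
```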

## Supported Services

Each Llama Stack API corresponds to a specific NeMo microservice. The core microservices (Customizer, Evaluator, Guardrails) are exposed by the same endpoint. The platform components (Data Store) are each exposed by separate endpoints.

### Inference: NVIDIA NIM

NVIDIA NIM is used for running inference with registered models. There are two ways to access NVIDIA NIMs:

1. Hosted (default): preview APIs hosted at https://integrate.api.nvidia.com (requires an API key).
2. Self-hosted: NVIDIA NIMs that run on your own infrastructure.

The deployed platform includes the NIM Proxy microservice, which is the service that provides access to your NIMs (for example, to run inference on a model). Set the `NVIDIA_BASE_URL` environment variable to point to your NIM Proxy deployment.
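
For example, the base URL differs depending on which option you choose; the self-hosted value below is a placeholder for your own NIM Proxy endpoint:

```bash
# Option 1: hosted preview APIs (default) -- requires NVIDIA_API_KEY
export NVIDIA_BASE_URL="https://integrate.api.nvidia.com"

# Option 2: self-hosted NIM Proxy (placeholder URL -- substitute your own deployment)
export NVIDIA_BASE_URL="http://nim.test"
```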

### Datasetio API: NeMo Data Store

The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposes APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use that client to interact with the Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint.
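
Because the Data Store speaks the Hub API, one way to aim Hugging Face tooling at it is the `HF_ENDPOINT` environment variable honoured by `huggingface_hub`; the endpoint value and the `/v1/hf` path prefix below are assumptions about a typical deployment, so check your own Data Store's documentation:

```bash
# Placeholder endpoint for your NeMo Data Store deployment
export NVIDIA_DATASETS_URL="http://nemo-datastore.test"

# huggingface_hub reads HF_ENDPOINT to decide which Hub-compatible server to talk to;
# the /v1/hf prefix is an assumption about where the Hub routes are mounted
export HF_ENDPOINT="$NVIDIA_DATASETS_URL/v1/hf"
```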

See the NVIDIA Datasetio docs for supported features and example usage.

### Eval API: NeMo Evaluator

The NeMo Evaluator microservice supports evaluation of LLMs. Launching an evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint.
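
If you want to confirm the Evaluator is reachable before registering a Benchmark, a check along these lines can help; the `/v1/evaluation/jobs` path is an assumption about a typical NeMo Evaluator deployment, so verify it against your API reference:

```bash
# Placeholder endpoint -- substitute your NeMo Evaluator URL
export NVIDIA_EVALUATOR_URL="http://nemo.test"

# Assumed job-listing route
curl -s "$NVIDIA_EVALUATOR_URL/v1/evaluation/jobs" -H 'accept: application/json'
```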

See the NVIDIA Eval docs for supported features and example usage.

### Post-Training API: NeMo Customizer

The NeMo Customizer microservice supports fine-tuning models. You can reference this list of supported models that can be fine-tuned using Llama Stack. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint.
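
To see which customization configurations (and therefore which fine-tunable models) your deployment exposes, a request like the one below is often enough; the `/v1/customization/configs` path is an assumption about a typical NeMo Customizer deployment rather than something documented here:

```bash
# Placeholder endpoint -- substitute your NeMo Customizer URL
export NVIDIA_CUSTOMIZER_URL="http://nemo.test"

# Assumed config-listing route
curl -s "$NVIDIA_CUSTOMIZER_URL/v1/customization/configs" -H 'accept: application/json'
```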

See the NVIDIA Post-Training docs for supported features and example usage.

### Safety API: NeMo Guardrails

The NeMo Guardrails microservice sits between your application and the LLM, and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint.
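
To verify the Guardrails service is reachable and that a configuration such as the default `self-check` exists, something like this can be used; the `/v1/guardrail/configs` path is an assumption about a typical NeMo Guardrails deployment:

```bash
# Placeholder endpoint -- substitute your NeMo Guardrails URL
export GUARDRAILS_SERVICE_URL="http://nemo.test"

# Assumed config-listing route
curl -s "$GUARDRAILS_SERVICE_URL/v1/guardrail/configs" -H 'accept: application/json'
```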

See the NVIDIA Safety docs for supported features and example usage.

## Deploying models

In order to use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`.

Note: For improved inference speeds, use NIM with the `fast_outlines` guided decoding backend (specified in the request body). This is the default if you deployed the platform with the NeMo Microservices Helm Chart.

```bash
# URL to NeMo NIM Proxy service
export NEMO_URL="http://nemo.test"

curl --location "$NEMO_URL/v1/deployment/model-deployments" \
   -H 'accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
      "name": "llama-3.2-1b-instruct",
      "namespace": "meta",
      "config": {
         "model": "meta/llama-3.2-1b-instruct",
         "nim_deployment": {
            "image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct",
            "image_tag": "1.8.3",
            "pvc_size": "25Gi",
            "gpu": 1,
            "additional_envs": {
               "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
            }
         }
      }
   }'
```

This NIM deployment should take approximately 10 minutes to go live. See the docs for more information on how to deploy a NIM and verify it's available for inference.
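
Before pointing Llama Stack at the model, it can help to confirm the deployment has gone live; assuming the deployment API follows the same resource path as the create and delete calls on this page, a status check might look like this:

```bash
export NEMO_URL="http://nemo.test"

# Assumed status route, mirroring the create/delete paths used elsewhere on this page
curl -s "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.2-1b-instruct" \
   -H 'accept: application/json'
```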

You can also remove a deployed NIM to free up GPU resources, if needed.

```bash
export NEMO_URL="http://nemo.test"

curl -X DELETE "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct"
```

## Running Llama Stack with NVIDIA

You can do this via venv (building from source) or Docker (which has a pre-built image).

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
LLAMA_STACK_PORT=8321
docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  -e NVIDIA_API_KEY=$NVIDIA_API_KEY \
  llamastack/distribution-nvidia \
  --port $LLAMA_STACK_PORT
```
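
Once the container is up, a quick smoke test from the host confirms the server is answering on the published port; this assumes Llama Stack's model-listing route at `/v1/models`, so adjust if your version differs:

```bash
# Expect a JSON listing of the models registered with the stack
curl -s "http://localhost:$LLAMA_STACK_PORT/v1/models" -H 'accept: application/json'
```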

### Via Docker with Custom Run Configuration

You can also run the Docker container with a custom run configuration file by mounting it into the container:

```bash
# Set the path to your custom run.yaml file
CUSTOM_RUN_CONFIG=/path/to/your/custom-run.yaml
LLAMA_STACK_PORT=8321

docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  -v $CUSTOM_RUN_CONFIG:/app/custom-run.yaml \
  -e RUN_CONFIG_PATH=/app/custom-run.yaml \
  -e NVIDIA_API_KEY=$NVIDIA_API_KEY \
  llamastack/distribution-nvidia \
  --port $LLAMA_STACK_PORT
```

Note: The run configuration must be mounted into the container before it can be used. The `-v` flag mounts your local file into the container, and the `RUN_CONFIG_PATH` environment variable tells the entrypoint script which configuration to use.

Available run configurations for this distribution:

- `run.yaml`
- `run-with-safety.yaml`

### Via venv

If you've set up your local development environment, you can also install the distribution dependencies using your local virtual environment.

```bash
INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
llama stack list-deps nvidia | xargs -L1 uv pip install
NVIDIA_API_KEY=$NVIDIA_API_KEY \
INFERENCE_MODEL=$INFERENCE_MODEL \
llama stack run ./run.yaml \
  --port 8321
```
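
If you also want the safety provider active, the same flow works with the safety configuration listed in the Docker section above; the `SAFETY_MODEL` value below is the default from the environment variable table:

```bash
NVIDIA_API_KEY=$NVIDIA_API_KEY \
INFERENCE_MODEL=$INFERENCE_MODEL \
SAFETY_MODEL=meta/llama-3.1-8b-instruct \
llama stack run ./run-with-safety.yaml \
  --port 8321
```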

## Example Notebooks

For examples of how to use the NVIDIA Distribution to run inference, fine-tune, evaluate, and run safety checks on your LLMs, you can reference the example notebooks in `docs/notebooks/nvidia`.