mirror of https://github.com/meta-llama/llama-stack.git, synced 2025-10-26 01:12:59 +00:00
			
		
		
		
	
		
		
			
				
	
	
		
		
	
	
	
	
	
---
orphan: true
---
# NVIDIA Distribution

The `llamastack/distribution-{{ name }}` distribution consists of the following provider configurations.

{{ providers_table }}

{% if run_config_env_vars %}
### Environment Variables

The following environment variables can be configured:

{% for var, (default_value, description) in run_config_env_vars.items() %}
- `{{ var }}`: {{ description }} (default: `{{ default_value }}`)
{% endfor %}
{% endif %}

{% if default_models %}
### Models

The following models are available by default:

{% for model in default_models %}
- `{{ model.model_id }} {{ model.doc_string }}`
{% endfor %}
{% endif %}

## Prerequisites
### NVIDIA API Keys

Make sure you have access to an NVIDIA API key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). Use this key for the `NVIDIA_API_KEY` environment variable.

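A small pre-flight check can catch a missing key before the stack starts. The sketch below is only an illustration: `require_env` is a hypothetical helper, not part of Llama Stack or the NVIDIA tooling, and it only verifies that the named variables are non-empty.

```shell
# Hypothetical helper (not part of Llama Stack): fail early when a required
# environment variable is unset or empty.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirectly read the variable called $name; works in POSIX sh and bash.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage before starting the server:
#   require_env NVIDIA_API_KEY || exit 1
```

The function returns a non-zero status if any listed variable is missing, so it can guard a launch script with `||`.
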
### Deploy NeMo Microservices Platform
The NVIDIA NeMo microservices platform supports end-to-end microservice deployment of a complete AI flywheel on your Kubernetes cluster through the NeMo Microservices Helm Chart. See the [NVIDIA NeMo Microservices documentation](https://docs.nvidia.com/nemo/microservices/latest/about/index.html) for platform prerequisites and for instructions on installing and deploying the platform.

## Supported Services
Each Llama Stack API corresponds to a specific NeMo microservice. The core microservices (Customizer, Evaluator, Guardrails) share a single endpoint, while the platform components (Data Store) are each exposed by a separate endpoint.

### Inference: NVIDIA NIM
NVIDIA NIM is used for running inference with registered models. There are two ways to access NVIDIA NIMs:
  1. Hosted (default): Preview APIs hosted at https://integrate.api.nvidia.com (requires an API key).
  2. Self-hosted: NVIDIA NIMs that run on your own infrastructure.

The deployed platform includes the NIM Proxy microservice, which provides access to your NIMs (for example, to run inference on a model). Set the `NVIDIA_BASE_URL` environment variable to use your NVIDIA NIM Proxy deployment.

### Datasetio API: NeMo Data Store
The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposes APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use the client to interact with Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint.

See the [NVIDIA Datasetio docs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/datasetio/nvidia/README.md) for supported features and example usage.

### Eval API: NeMo Evaluator
The NeMo Evaluator microservice supports evaluation of LLMs. Launching an Evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint.

See the [NVIDIA Eval docs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/eval/nvidia/README.md) for supported features and example usage.

### Post-Training API: NeMo Customizer
The NeMo Customizer microservice supports fine-tuning models. You can reference [this list of supported models](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/post_training/nvidia/models.py) that can be fine-tuned using Llama Stack. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint.

See the [NVIDIA Post-Training docs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/post_training/nvidia/README.md) for supported features and example usage.

### Safety API: NeMo Guardrails
The NeMo Guardrails microservice sits between your application and the LLM, and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint.

See the [NVIDIA Safety docs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/safety/nvidia/README.md) for supported features and example usage.

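Taken together, the service variables named in this section might be set as follows. This is a sketch only: every hostname is a placeholder (`nemo.test` is borrowed from the deployment example in this document, while `data-store.test` and `nim.test` are hypothetical), and the single shared URL for Customizer, Evaluator, and Guardrails assumes the shared-endpoint layout described under Supported Services.

```shell
# Placeholder endpoints; replace with the URLs of your own deployment.
NEMO_URL="http://nemo.test"              # core microservices (assumed shared endpoint)
DATA_STORE_URL="http://data-store.test"  # hypothetical Data Store endpoint
NIM_PROXY_URL="http://nim.test"          # hypothetical NIM Proxy endpoint

# Variables read by the NVIDIA distribution, as named in this document.
export NVIDIA_BASE_URL="$NIM_PROXY_URL"
export NVIDIA_DATASETS_URL="$DATA_STORE_URL"
export NVIDIA_CUSTOMIZER_URL="$NEMO_URL"
export NVIDIA_EVALUATOR_URL="$NEMO_URL"
export GUARDRAILS_SERVICE_URL="$NEMO_URL"
```
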
## Deploying models
To use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`.

Note: For improved inference speeds, use NIM with the `fast_outlines` guided decoding system (specified in the request body). This is the default if you deployed the platform with the NeMo Microservices Helm Chart.
```sh
# URL to NeMo NIM Proxy service
export NEMO_URL="http://nemo.test"

curl --location "$NEMO_URL/v1/deployment/model-deployments" \
   -H 'accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
      "name": "llama-3.2-1b-instruct",
      "namespace": "meta",
      "config": {
         "model": "meta/llama-3.2-1b-instruct",
         "nim_deployment": {
            "image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct",
            "image_tag": "1.8.3",
            "pvc_size": "25Gi",
            "gpu": 1,
            "additional_envs": {
               "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
            }
         }
      }
   }'
```
This NIM deployment should take approximately 10 minutes to go live. [See the docs](https://docs.nvidia.com/nemo/microservices/latest/get-started/tutorials/deploy-nims.html) for more information on how to deploy a NIM and verify it's available for inference.

You can also remove a deployed NIM to free up GPU resources, if needed.
```sh
export NEMO_URL="http://nemo.test"

curl -X DELETE "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct"
```

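The `DELETE` call in this section addresses a deployment by namespace and name. A minimal sketch of a helper that builds that URL, so the same path can be reused across calls, is shown here; `deployment_url` is a hypothetical function, and the path layout is inferred from the `DELETE` example rather than taken from the NeMo API reference.

```shell
# Hypothetical helper: build the URL of one model deployment from its
# namespace and name, tolerating a trailing slash on NEMO_URL.
deployment_url() {
  printf '%s/v1/deployment/model-deployments/%s/%s\n' "${NEMO_URL%/}" "$1" "$2"
}

# Example usage (removes the deployment created earlier in this section):
#   export NEMO_URL="http://nemo.test"
#   curl -X DELETE "$(deployment_url meta llama-3.2-1b-instruct)"
```
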
## Running Llama Stack with NVIDIA

You can run Llama Stack either in a venv, which builds the distribution code locally, or with Docker, which provides a pre-built image.

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
LLAMA_STACK_PORT=8321
docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ./run.yaml:/root/my-run.yaml \
  -e NVIDIA_API_KEY=$NVIDIA_API_KEY \
  llamastack/distribution-{{ name }} \
  --config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT
```

### Via venv

If you've set up your local development environment, you can also build the image using your local virtual environment.

```bash
INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
llama stack build --distro nvidia --image-type venv
NVIDIA_API_KEY=$NVIDIA_API_KEY \
INFERENCE_MODEL=$INFERENCE_MODEL \
llama stack run ./run.yaml \
  --port 8321
```

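Once the server is running by either method, a quick request can confirm that it is reachable. The sketch below assumes the server listens on `localhost` and exposes a `/v1/health` route; both are assumptions to check against your Llama Stack version, and `llama_stack_url` and `check_health` are hypothetical helper names.

```shell
# Hypothetical helpers for a reachability check.
# Build the base URL from LLAMA_STACK_PORT (default 8321, the port used in
# the run commands in this section).
llama_stack_url() {
  printf 'http://localhost:%s\n' "${LLAMA_STACK_PORT:-8321}"
}

# Assumption: the server answers GET /v1/health when it is up.
# --fail makes curl return a non-zero status on HTTP errors.
check_health() {
  curl --fail --silent --show-error "$(llama_stack_url)/v1/health"
}

# Example usage:
#   check_health && echo "Llama Stack is reachable"
```
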
## Example Notebooks
For examples of how to use the NVIDIA Distribution to run inference, fine-tune, evaluate, and run safety checks on your LLMs, see the example notebooks in [docs/notebooks/nvidia](https://github.com/meta-llama/llama-stack/tree/main/docs/notebooks/nvidia).