# What does this PR do?

Adds custom model registration functionality to `NVIDIAInferenceAdapter`, which lets inference run against:

- post-training models
- non-Llama models in the API Catalogue (behind https://integrate.api.nvidia.com, with endpoints compatible with `AsyncOpenAI`)

## Example Usage:

```python
from llama_stack.apis.models import Model, ModelType
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient("nvidia")
_ = client.initialize()

client.models.register(
    model_id=model_name,
    model_type=ModelType.llm,
    provider_id="nvidia",
)

response = client.inference.chat_completion(
    model_id=model_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a limerick about the wonders of GPU computing."},
    ],
)
```

## Test Plan

```bash
pytest tests/unit/providers/nvidia/test_supervised_fine_tuning.py
========================================================== test session starts ===========================================================
platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0
collected 6 items

tests/unit/providers/nvidia/test_supervised_fine_tuning.py ......                                                                 [100%]

============================================================ warnings summary ============================================================
../miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076
  /home/ubuntu/miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076: PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'contentEncoding'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
    warn(
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================== 6 passed, 1 warning in 1.51s ======================================================
```

[//]: # (## Documentation)

Updated Readme.md

cc: @dglogo, @sumitb, @mattf
<!-- This file was auto-generated by distro_codegen.py, please edit source -->

# NVIDIA Distribution

The `llamastack/distribution-nvidia` distribution consists of the following provider configurations.

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `inline::localfs` |
| eval | `remote::nvidia` |
| inference | `remote::nvidia` |
| post_training | `remote::nvidia` |
| safety | `remote::nvidia` |
| scoring | `inline::basic` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `inline::rag-runtime` |
| vector_io | `inline::faiss` |

### Environment Variables

The following environment variables can be configured:

- `NVIDIA_API_KEY`: NVIDIA API Key (default: ``)
- `NVIDIA_APPEND_API_VERSION`: Whether to append the API version to the base_url (default: `True`)
- `NVIDIA_DATASET_NAMESPACE`: NVIDIA Dataset Namespace (default: `default`)
- `NVIDIA_PROJECT_ID`: NVIDIA Project ID (default: `test-project`)
- `NVIDIA_CUSTOMIZER_URL`: NVIDIA Customizer URL (default: `https://customizer.api.nvidia.com`)
- `NVIDIA_OUTPUT_MODEL_DIR`: NVIDIA Output Model Directory (default: `test-example-model@v1`)
- `GUARDRAILS_SERVICE_URL`: URL for the NeMo Guardrails Service (default: `http://0.0.0.0:7331`)
- `NVIDIA_EVALUATOR_URL`: URL for the NeMo Evaluator Service (default: `http://0.0.0.0:7331`)
- `INFERENCE_MODEL`: Inference model (default: `Llama3.1-8B-Instruct`)
- `SAFETY_MODEL`: Name of the model to use for safety (default: `meta/llama-3.1-8b-instruct`)
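
For example, a minimal sketch (the values below are illustrative placeholders, not required settings) of configuring these variables before initializing a library client:

```python
import os

# Illustrative placeholder values -- substitute your own deployment details.
os.environ["NVIDIA_API_KEY"] = "nvapi-..."  # needed for hosted NIMs
os.environ["NVIDIA_DATASET_NAMESPACE"] = "default"
os.environ["NVIDIA_PROJECT_ID"] = "test-project"

# Set environment variables before initializing, so the provider picks them up.
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient("nvidia")
_ = client.initialize()
```
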
### Models

The following models are available by default:

- `meta/llama3-8b-instruct (aliases: meta-llama/Llama-3-8B-Instruct)`
- `meta/llama3-70b-instruct (aliases: meta-llama/Llama-3-70B-Instruct)`
- `meta/llama-3.1-8b-instruct (aliases: meta-llama/Llama-3.1-8B-Instruct)`
- `meta/llama-3.1-70b-instruct (aliases: meta-llama/Llama-3.1-70B-Instruct)`
- `meta/llama-3.1-405b-instruct (aliases: meta-llama/Llama-3.1-405B-Instruct-FP8)`
- `meta/llama-3.2-1b-instruct (aliases: meta-llama/Llama-3.2-1B-Instruct)`
- `meta/llama-3.2-3b-instruct (aliases: meta-llama/Llama-3.2-3B-Instruct)`
- `meta/llama-3.2-11b-vision-instruct (aliases: meta-llama/Llama-3.2-11B-Vision-Instruct)`
- `meta/llama-3.2-90b-vision-instruct (aliases: meta-llama/Llama-3.2-90B-Vision-Instruct)`
- `meta/llama-3.3-70b-instruct (aliases: meta-llama/Llama-3.3-70B-Instruct)`
- `nvidia/llama-3.2-nv-embedqa-1b-v2`
- `nvidia/nv-embedqa-e5-v5`
- `nvidia/nv-embedqa-mistral-7b-v2`
- `snowflake/arctic-embed-l`

## Prerequisites

### NVIDIA API Keys

Make sure you have access to an NVIDIA API Key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). Use this key for the `NVIDIA_API_KEY` environment variable.

### Deploy NeMo Microservices Platform

The NVIDIA NeMo microservices platform supports end-to-end microservice deployment of a complete AI flywheel on your Kubernetes cluster through the NeMo Microservices Helm Chart. Please reference the [NVIDIA NeMo Microservices documentation](https://docs.nvidia.com/nemo/microservices/latest/about/index.html) for platform prerequisites and instructions to install and deploy the platform.

## Supported Services

Each Llama Stack API corresponds to a specific NeMo microservice. The core microservices (Customizer, Evaluator, Guardrails) are exposed by the same endpoint. The platform components (Data Store) are each exposed by separate endpoints.

### Inference: NVIDIA NIM

NVIDIA NIM is used for running inference with registered models. There are two ways to access NVIDIA NIMs:

1. Hosted (default): Preview APIs hosted at https://integrate.api.nvidia.com (requires an API key)
2. Self-hosted: NVIDIA NIMs that run on your own infrastructure.

The deployed platform includes the NIM Proxy microservice, which is the service that provides access to your NIMs (for example, to run inference on a model). Set the `NVIDIA_BASE_URL` environment variable to use your NVIDIA NIM Proxy deployment.

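As a sketch (assuming `client` is the initialized `LlamaStackAsLibraryClient` from the example above, and that `meta/llama-3.2-1b-instruct` is deployed behind your NIM Proxy):

```python
# Assumes NVIDIA_BASE_URL points at your NIM Proxy deployment.
response = client.inference.chat_completion(
    model_id="meta/llama-3.2-1b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
)
print(response.completion_message.content)
```
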
### Datasetio API: NeMo Data Store

The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposes APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use the client to interact with Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint.

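For instance, a minimal sketch (the endpoint wiring is illustrative) of pointing the Hugging Face client at a Data Store deployment:

```python
import os

from huggingface_hub import HfApi

# Illustrative: reuse the NVIDIA_DATASETS_URL value as the HF-compatible endpoint.
api = HfApi(endpoint=os.environ["NVIDIA_DATASETS_URL"])
for dataset in api.list_datasets():
    print(dataset.id)
```
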
See the [NVIDIA Datasetio docs](/llama_stack/providers/remote/datasetio/nvidia/README.md) for supported features and example usage.

### Eval API: NeMo Evaluator

The NeMo Evaluator microservice supports evaluation of LLMs. Launching an Evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint.

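As a hedged sketch of that mapping (the identifiers and scoring function below are hypothetical):

```python
# Hypothetical ids; registering this Benchmark creates the matching
# Evaluation Config in NeMo Evaluator.
client.benchmarks.register(
    benchmark_id="my-eval-config",
    dataset_id="my-dataset",
    scoring_functions=["basic::equality"],
)
```
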
See the [NVIDIA Eval docs](/llama_stack/providers/remote/eval/nvidia/README.md) for supported features and example usage.

### Post-Training API: NeMo Customizer

The NeMo Customizer microservice supports fine-tuning models. You can reference [this list of supported models](/llama_stack/providers/remote/post_training/nvidia/models.py) that can be fine-tuned using Llama Stack. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint.

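Once a customization job completes, the resulting model can be registered for inference. A sketch using the default `NVIDIA_OUTPUT_MODEL_DIR` value from above (your output model name will differ):

```python
from llama_stack.apis.models import ModelType

# Illustrative: register the customized model produced by NeMo Customizer.
client.models.register(
    model_id="test-example-model@v1",
    model_type=ModelType.llm,
    provider_id="nvidia",
)
```
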
See the [NVIDIA Post-Training docs](/llama_stack/providers/remote/post_training/nvidia/README.md) for supported features and example usage.

### Safety API: NeMo Guardrails

The NeMo Guardrails microservice sits between your application and the LLM, and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint.

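A sketch of invoking a guardrail (the shield id is hypothetical; use whichever Guardrails configuration you have deployed):

```python
# Hypothetical shield id mapped to a NeMo Guardrails configuration.
client.shields.register(shield_id="self-check", provider_id="nvidia")

result = client.safety.run_shield(
    shield_id="self-check",
    messages=[{"role": "user", "content": "Tell me how to hotwire a car."}],
    params={},
)
print(result.violation)
```
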
See the NVIDIA Safety docs for supported features and example usage.

## Deploying models

To use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`.

Note: For improved inference speeds, use NIM with the `fast_outlines` guided decoding system (specified in the request body). This is the default if you deployed the platform with the NeMo Microservices Helm Chart.

```sh
# URL to NeMo NIM Proxy service
export NEMO_URL="http://nemo.test"

curl --location "$NEMO_URL/v1/deployment/model-deployments" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "llama-3.2-1b-instruct",
    "namespace": "meta",
    "config": {
      "model": "meta/llama-3.2-1b-instruct",
      "nim_deployment": {
        "image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct",
        "image_tag": "1.8.3",
        "pvc_size": "25Gi",
        "gpu": 1,
        "additional_envs": {
          "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
        }
      }
    }
  }'
```

This NIM deployment should take approximately 10 minutes to go live. [See the docs](https://docs.nvidia.com/nemo/microservices/latest/get-started/tutorials/deploy-nims.html) for more information on how to deploy a NIM and verify it's available for inference.

You can also remove a deployed NIM to free up GPU resources, if needed.

```sh
export NEMO_URL="http://nemo.test"

curl -X DELETE "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct"
```

## Running Llama Stack with NVIDIA

You can do this via Conda or venv (which build the distribution code), or Docker (which uses a pre-built image).

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
LLAMA_STACK_PORT=8321
docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ./run.yaml:/root/my-run.yaml \
  llamastack/distribution-nvidia \
  --yaml-config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env NVIDIA_API_KEY=$NVIDIA_API_KEY
```

### Via Conda

```bash
INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
llama stack build --template nvidia --image-type conda
llama stack run ./run.yaml \
  --port 8321 \
  --env NVIDIA_API_KEY=$NVIDIA_API_KEY \
  --env INFERENCE_MODEL=$INFERENCE_MODEL
```

|
|
|
|
### Via venv
|
|
|
|
If you've set up your local development environment, you can also build the image using your local virtual environment.
|
|
|
|
```bash
INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
llama stack build --template nvidia --image-type venv
llama stack run ./run.yaml \
  --port 8321 \
  --env NVIDIA_API_KEY=$NVIDIA_API_KEY \
  --env INFERENCE_MODEL=$INFERENCE_MODEL
```

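Once the server is up (via any of the methods above), a quick way to verify it is to list the available models. This sketch assumes the `llama-stack-client` package is installed and the server is on the default port:

```python
from llama_stack_client import LlamaStackClient

# Point at the locally running distribution.
client = LlamaStackClient(base_url="http://localhost:8321")
for model in client.models.list():
    print(model.identifier)
```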