llama-stack-mirror/llama_stack/providers/remote/inference/nvidia/NVIDIA.md
refactor: use extra_body to pass in input_type params for asymmetric embedding models for NVIDIA Inference Provider (#3804)
# What does this PR do?
Previously, the NVIDIA inference provider implemented a custom
`openai_embeddings` method with a hardcoded `input_type="query"`
parameter, which is required by NVIDIA asymmetric embedding models
([https://github.com/llamastack/llama-stack/pull/3205](https://github.com/llamastack/llama-stack/pull/3205)).
Recently, an `extra_body` parameter was added to the embeddings API
([https://github.com/llamastack/llama-stack/pull/3794](https://github.com/llamastack/llama-stack/pull/3794)).
This PR therefore updates the NVIDIA inference provider to use the base
`OpenAIMixin.openai_embeddings` method instead and to pass `input_type`
through the `extra_body` parameter for asymmetric embedding models.


## Test Plan
Run the following command for each `embedding_model`:
`nvidia/llama-3.2-nv-embedqa-1b-v2`, `nvidia/nv-embedqa-e5-v5`,
`nvidia/nv-embedqa-mistral-7b-v2`, and `snowflake/arctic-embed-l`.
```
pytest -s -v tests/integration/inference/test_openai_embeddings.py --stack-config="inference=nvidia" --embedding-model={embedding_model} --env NVIDIA_API_KEY={nvidia_api_key} --env NVIDIA_BASE_URL="https://integrate.api.nvidia.com" --inference-mode=record
```

# NVIDIA Inference Provider for LlamaStack

This provider enables running inference using NVIDIA NIM.

## Features

- Endpoints for completions, chat completions, and embeddings for registered models

## Getting Started

### Prerequisites

- LlamaStack with NVIDIA configuration
- Access to an NVIDIA NIM deployment
- A deployed NIM for the model you want to use for inference

### Setup

Build the NVIDIA environment:

```bash
llama stack build --distro nvidia --image-type venv
```

## Basic Usage using the LlamaStack Python Client

### Initialize the client

```python
import os

os.environ["NVIDIA_API_KEY"] = (
    ""  # Required if using hosted NIM endpoint. If self-hosted, not required.
)
os.environ["NVIDIA_BASE_URL"] = "http://nim.test"  # NIM URL

from llama_stack.core.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient("nvidia")
client.initialize()
```
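
Optionally, you can check that the provider is wired up by listing the models registered with the stack. This is a minimal sketch assuming the standard `client.models.list()` call; the exact fields on each model object may vary by client version.

```python
# Optional sanity check: list the models served through the NVIDIA provider.
for model in client.models.list():
    print(model.identifier, model.model_type)
```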

### Create Chat Completion

The following example shows how to create a chat completion for an NVIDIA NIM.

```python
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You must respond to each message with only one word",
        },
        {
            "role": "user",
            "content": "Complete the sentence using one word: Roses are red, violets are:",
        },
    ],
    stream=False,
    max_tokens=50,
)
print(f"Response: {response.choices[0].message.content}")
```

### Tool Calling Example

The following example shows how to do tool calling for an NVIDIA NIM.

```python
from llama_stack.models.llama.datatypes import ToolDefinition, ToolParamDefinition

tool_definition = ToolDefinition(
    tool_name="get_weather",
    description="Get current weather information for a location",
    parameters={
        "location": ToolParamDefinition(
            param_type="string",
            description="The city and state, e.g. San Francisco, CA",
            required=True,
        ),
        "unit": ToolParamDefinition(
            param_type="string",
            description="Temperature unit (celsius or fahrenheit)",
            required=False,
            default="celsius",
        ),
    },
)

tool_response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
    tools=[tool_definition],
)

print(f"Tool Response: {tool_response.choices[0].message.content}")
if tool_response.choices[0].message.tool_calls:
    for tool_call in tool_response.choices[0].message.tool_calls:
        print(f"Tool Called: {tool_call.tool_name}")
        print(f"Arguments: {tool_call.arguments}")
```

### Structured Output Example

The following example shows how to do structured output for an NVIDIA NIM.

```python
from llama_stack.apis.inference import JsonSchemaResponseFormat, ResponseFormatType

person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "occupation": {"type": "string"},
    },
    "required": ["name", "age", "occupation"],
}

response_format = JsonSchemaResponseFormat(
    type=ResponseFormatType.json_schema, json_schema=person_schema
)

structured_response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Create a profile for a fictional person named Alice who is 30 years old and is a software engineer.",
        }
    ],
    response_format=response_format,
)

print(f"Structured Response: {structured_response.choices[0].message.content}")
```

### Create Embeddings

The following example shows how to create embeddings for an NVIDIA NIM.

```python
response = client.embeddings.create(
    model="nvidia/llama-3.2-nv-embedqa-1b-v2",
    input=["What is the capital of France?"],
    extra_body={"input_type": "query"},
)
print(f"Embeddings: {response.data}")
```

### Vision Language Models Example

The following example shows how to run vision inference by using an NVIDIA NIM.

```python
import base64


def load_image_as_base64(image_path):
    with open(image_path, "rb") as image_file:
        img_bytes = image_file.read()
        return base64.b64encode(img_bytes).decode("utf-8")


image_path = "{path_to_the_image}"  # replace with the path to your image
demo_image_b64 = load_image_as_base64(image_path)

vlm_response = client.chat.completions.create(
    model="nvidia/vila",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": {
                        "data": demo_image_b64,
                    },
                },
                {
                    "type": "text",
                    "text": "Please describe what you see in this image in detail.",
                },
            ],
        }
    ],
)

print(f"VLM Response: {vlm_response.choices[0].message.content}")
```