
docs: Add NVIDIA platform distro docs (#1971)

# What does this PR do?
Add NVIDIA platform docs that serve as a starting point for Llama Stack
users and explain all supported microservices.


Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>

2025-04-17 05:54:30 -07:00


NVIDIA Inference Provider for LlamaStack

This provider enables running inference using NVIDIA NIM.

Features

Endpoints for completions, chat completions, and embeddings for registered models

Getting Started

Prerequisites

A LlamaStack installation with the NVIDIA configuration
Access to an NVIDIA NIM deployment
A NIM deployed for the model you want to use for inference
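
One way to check the last two prerequisites is to ask the NIM for its model list before going further. The sketch below is illustrative only: it assumes the NIM exposes an OpenAI-compatible `/v1/models` route and that no API key is needed (the self-hosted case). The helper names `models_url` and `nim_is_reachable` are hypothetical and not part of LlamaStack.

```python
import urllib.error
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL for a NIM base URL (assumed path: /v1/models)."""
    return base_url.rstrip("/") + "/v1/models"


def nim_is_reachable(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the assumed model-listing endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection failures, HTTP errors, and malformed URLs all count as "not reachable".
        return False
```

Calling `nim_is_reachable("http://nim.test")` with your own NIM URL gives a quick yes/no answer; a `False` result usually means the URL is wrong or the NIM is not up yet.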

Setup

Build the NVIDIA environment:

llama stack build --template nvidia --image-type conda

Basic Usage with the LlamaStack Python Client

Initialize the client

import os

# Required when using a hosted NIM endpoint; not required when self-hosted.
os.environ["NVIDIA_API_KEY"] = ""
os.environ["NVIDIA_BASE_URL"] = "http://nim.test"  # URL of your NIM deployment

from llama_stack.distribution.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient("nvidia")
client.initialize()
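
Before calling `client.initialize()`, it can help to sanity-check the two environment variables. The following is a minimal sketch, not part of the provider: `check_nvidia_env` and `HOSTED_NIM_HOST` are hypothetical names, and the hosted host name is an assumption that you should adjust if your hosted endpoint differs.

```python
from urllib.parse import urlparse

# Assumption: host name of NVIDIA's hosted endpoint; change it if yours differs.
HOSTED_NIM_HOST = "integrate.api.nvidia.com"


def check_nvidia_env(env):
    """Return a list of human-readable problems found in an environment mapping."""
    problems = []
    base_url = env.get("NVIDIA_BASE_URL", "")
    if not base_url:
        problems.append("NVIDIA_BASE_URL is not set")
        return problems
    parsed = urlparse(base_url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        problems.append(f"NVIDIA_BASE_URL is not an http(s) URL: {base_url!r}")
    elif parsed.hostname == HOSTED_NIM_HOST and not env.get("NVIDIA_API_KEY"):
        # The API key is only needed for the hosted endpoint, not for self-hosted NIMs.
        problems.append("NVIDIA_API_KEY is required for the hosted endpoint")
    return problems
```

Passing `os.environ` to `check_nvidia_env` returns an empty list when the configuration looks usable.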

Create Completion

response = client.completion(
    model_id="meta-llama/Llama-3.1-8b-Instruct",
    content="Complete the sentence using one word: Roses are red, violets are:",
    stream=False,
    sampling_params={
        "max_tokens": 50,
    },
)
print(f"Response: {response.content}")

Create Chat Completion

response = client.chat_completion(
    model_id="meta-llama/Llama-3.1-8b-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You must respond to each message with only one word",
        },
        {
            "role": "user",
            "content": "Complete the sentence using one word: Roses are red, violets are:",
        },
    ],
    stream=False,
    sampling_params={
        "max_tokens": 50,
    },
)
print(f"Response: {response.completion_message.content}")
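
The `messages` list above is a plain list of role/content dictionaries, so it can be assembled with a small helper when prompts are built dynamically. This is a sketch only; `build_messages` is a hypothetical name, and it covers just the system and user roles shown in the example.

```python
def build_messages(user_prompt, system_prompt=None):
    """Build a messages list in the same shape as the chat completion example."""
    messages = []
    if system_prompt:
        # The system message, when present, goes first.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages
```

The result can be passed as the `messages` argument of `client.chat_completion`.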

Create Embeddings

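# Note: model_id must refer to a model that serves embeddings in your deployment;
# replace the ID below if yours does not.
response = client.embeddings(
    model_id="meta-llama/Llama-3.1-8b-Instruct", contents=["foo", "bar", "baz"]
)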
)
print(f"Embeddings: {response.embeddings}")
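
A common next step with embeddings is to compare them. The sketch below computes cosine similarity in plain Python, assuming `response.embeddings` is a list of equal-length lists of floats. The `cosine_similarity` helper and the stand-in vectors are illustrative and not part of the provider.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Similarity is undefined for a zero vector; treat it as 0.0 here.
        return 0.0
    return dot / (norm_a * norm_b)


# Stand-in vectors; in practice these would come from response.embeddings.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
similarity = cosine_similarity(vectors[0], vectors[2])
```

With real output, `cosine_similarity(response.embeddings[0], response.embeddings[1])` would compare the vectors for "foo" and "bar".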
