# NVIDIA Inference Provider for LlamaStack

This provider enables running inference using NVIDIA NIM.

## Features

- Endpoints for completions, chat completions, and embeddings for registered models

## Getting Started

### Prerequisites

- LlamaStack with NVIDIA configuration
- Access to an NVIDIA NIM deployment
- A NIM deployed for the model you want to use for inference (a quick readiness check is sketched below)

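If your NIM is self-hosted, a quick way to confirm it is reachable and ready is to probe the standard NIM health endpoint at the URL you will later pass as `NVIDIA_BASE_URL`; a minimal sketch (`http://nim.test` is a placeholder):

```bash
# Probe the NIM readiness endpoint; a 200 response means the microservice is up
curl http://nim.test/v1/health/ready
```
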
### Setup

Build the NVIDIA environment:

```bash
llama stack build --template nvidia --image-type conda
```
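
Once the build finishes, you can either use the stack as a library (as shown below) or serve it as a standalone server. A minimal sketch of the latter; the `--port` flag and its default vary across Llama Stack versions:

```bash
# Launch the NVIDIA distribution as a local server (port 8321 is an assumption)
llama stack run nvidia --port 8321
```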

### Basic Usage with the LlamaStack Python Client

#### Initialize the client

```python
import os

os.environ["NVIDIA_API_KEY"] = (
    ""  # Required if using hosted NIM endpoint. If self-hosted, not required.
)
os.environ["NVIDIA_BASE_URL"] = "http://nim.test"  # NIM URL

from llama_stack.distribution.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient("nvidia")
client.initialize()
```
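
Alternatively, if you are connecting to a running Llama Stack server (for example one started with `llama stack run`), you can use the HTTP client from the `llama-stack-client` package instead; a minimal sketch, where the URL and port are assumptions to match to your deployment:

```python
from llama_stack_client import LlamaStackClient

# Connect to a running Llama Stack server over HTTP
# (base_url is an assumption; point it at your own server)
client = LlamaStackClient(base_url="http://localhost:8321")
```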

### Create Completion

```python
response = client.completion(
    model_id="meta-llama/Llama-3.1-8b-Instruct",
    content="Complete the sentence using one word: Roses are red, violets are :",
    stream=False,
    sampling_params={
        "max_tokens": 50,
    },
)
print(f"Response: {response.content}")
```
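
To stream the completion token by token, set `stream=True` and iterate over the returned chunks; a minimal sketch (the `delta` field on each chunk is an assumption and may differ across Llama Stack versions):

```python
response = client.completion(
    model_id="meta-llama/Llama-3.1-8b-Instruct",
    content="Complete the sentence using one word: Roses are red, violets are :",
    stream=True,
    sampling_params={
        "max_tokens": 50,
    },
)
# Each chunk carries an incremental piece of the generated text
for chunk in response:
    print(chunk.delta, end="")
```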

### Create Chat Completion

```python
response = client.chat_completion(
    model_id="meta-llama/Llama-3.1-8b-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You must respond to each message with only one word",
        },
        {
            "role": "user",
            "content": "Complete the sentence using one word: Roses are red, violets are:",
        },
    ],
    stream=False,
    sampling_params={
        "max_tokens": 50,
    },
)
print(f"Response: {response.completion_message.content}")
```
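
Chat completions can also be streamed with `stream=True`; a minimal sketch, noting that the exact chunk shape is version dependent (`chunk.event.delta.text` here is an assumption; some Llama Stack versions expose the delta as a plain string):

```python
response = client.chat_completion(
    model_id="meta-llama/Llama-3.1-8b-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Complete the sentence using one word: Roses are red, violets are:",
        },
    ],
    stream=True,
    sampling_params={
        "max_tokens": 50,
    },
)
# Each chunk wraps an incremental piece of the reply in an event payload
for chunk in response:
    print(chunk.event.delta.text, end="")
```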

### Create Embeddings

```python
response = client.embeddings(
    model_id="meta-llama/Llama-3.1-8b-Instruct", contents=["foo", "bar", "baz"]
)
print(f"Embeddings: {response.embeddings}")
```
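
The response holds one embedding vector per input string. A common next step is to compare vectors, for example by cosine similarity; a minimal sketch using numpy (not part of the provider API):

```python
import numpy as np

# response.embeddings is a list of vectors, one per input string
vectors = np.array(response.embeddings)

# Cosine similarity between the embeddings of "foo" and "bar"
a, b = vectors[0], vectors[1]
similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Cosine similarity(foo, bar): {similarity:.3f}")
```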