phoenix-oss/llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-10-04 04:04:14 +00:00

Jiayi Ni 55e9959f62

fix: fix `openai_embeddings` for asymmetric embedding NIMs (#3205)

# What does this PR do?
NVIDIA asymmetric embedding models (e.g.,
`nvidia/llama-3.2-nv-embedqa-1b-v2`) require an `input_type` parameter
that is not present in the standard OpenAI embeddings API. This PR sets
`input_type="query"` as the default and updates the documentation to suggest
using the `embedding` API for passage embeddings.

Resolves #2892 

## Test Plan
```
pytest -s -v tests/integration/inference/test_openai_embeddings.py   --stack-config="inference=nvidia"   --embedding-model="nvidia/llama-3.2-nv-embedqa-1b-v2"   --env NVIDIA_API_KEY={nvidia_api_key}   --env NVIDIA_BASE_URL="https://integrate.api.nvidia.com"
```

2025-08-20 08:06:25 -04:00

NVIDIA Inference Provider for LlamaStack

This provider enables running inference using NVIDIA NIM.

Features

Endpoints for completions, chat completions, and embeddings for registered models

Getting Started

Prerequisites

A LlamaStack installation with the NVIDIA configuration
Access to an NVIDIA NIM deployment
A NIM deployed for the model you want to use for inference

Setup

Build the NVIDIA environment:

llama stack build --distro nvidia --image-type venv

Basic Usage with the LlamaStack Python Client

Initialize the client

import os

# Required for a hosted NIM endpoint; can stay empty for a self-hosted NIM.
os.environ["NVIDIA_API_KEY"] = ""
os.environ["NVIDIA_BASE_URL"] = "http://nim.test"  # URL of the NIM deployment

from llama_stack.core.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient("nvidia")
client.initialize()
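
The two environment variables above differ between hosted and self-hosted deployments. As a minimal sketch, the helper below sets both before the client is created. `configure_nim_env` is a hypothetical name and not part of LlamaStack; the hosted URL is the one used in the integration test command for this provider, and `http://nim.test` is a placeholder.

```python
import os


def configure_nim_env(base_url, api_key=None):
    """Set the environment variables used in the snippet above.

    Hypothetical helper for illustration; not part of LlamaStack.
    """
    os.environ["NVIDIA_BASE_URL"] = base_url
    # An API key is only required for hosted NIM endpoints.
    os.environ["NVIDIA_API_KEY"] = api_key or ""
    return {
        "NVIDIA_BASE_URL": os.environ["NVIDIA_BASE_URL"],
        "NVIDIA_API_KEY_set": bool(os.environ["NVIDIA_API_KEY"]),
    }


# Self-hosted NIM: no API key needed (placeholder URL).
print(configure_nim_env("http://nim.test"))

# Hosted NIM: pass the key explicitly, for example:
# configure_nim_env("https://integrate.api.nvidia.com", api_key="<your key>")
```

Call the helper before constructing `LlamaStackAsLibraryClient("nvidia")` so the provider sees the values at initialization.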

Create Completion

response = client.inference.completion(
    model_id="meta-llama/Llama-3.1-8B-Instruct",
    content="Complete the sentence using one word: Roses are red, violets are:",
    stream=False,
    sampling_params={
        "max_tokens": 50,
    },
)
print(f"Response: {response.content}")

Create Chat Completion

response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You must respond to each message with only one word",
        },
        {
            "role": "user",
            "content": "Complete the sentence using one word: Roses are red, violets are:",
        },
    ],
    stream=False,
    sampling_params={
        "max_tokens": 50,
    },
)
print(f"Response: {response.completion_message.content}")

Create Embeddings

Note on OpenAI embeddings compatibility

NVIDIA asymmetric embedding models (e.g., `nvidia/llama-3.2-nv-embedqa-1b-v2`) require an `input_type` parameter that is not present in the standard OpenAI embeddings API. The NVIDIA Inference Adapter automatically sets `input_type="query"` when using the OpenAI-compatible embeddings endpoint for NVIDIA. For passage embeddings, use the embeddings API with `task_type="document"`.

response = client.inference.embeddings(
    model_id="nvidia/llama-3.2-nv-embedqa-1b-v2",
    contents=["What is the capital of France?"],
    task_type="query",
)
print(f"Embeddings: {response.embeddings}")
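
Because the model is asymmetric, queries and passages are embedded with different `task_type` values and then compared. The sketch below assumes a `client` initialized as in the earlier steps and that `response.embeddings` is a list of float vectors, one per input. `embed` and `rank_passages` are hypothetical helper names, and cosine similarity is a common way to compare embeddings rather than something this provider mandates.

```python
import math


def embed(client, texts, task_type, model_id="nvidia/llama-3.2-nv-embedqa-1b-v2"):
    """Embed texts with the given task_type ("query" or "document")."""
    response = client.inference.embeddings(
        model_id=model_id,
        contents=texts,
        task_type=task_type,
    )
    return response.embeddings


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(client, query, passages):
    """Return (score, passage) pairs sorted from most to least similar."""
    # The query side uses task_type="query" ...
    query_vec = embed(client, [query], task_type="query")[0]
    # ... while the passages use task_type="document".
    passage_vecs = embed(client, passages, task_type="document")
    scored = [
        (cosine_similarity(query_vec, vec), passage)
        for vec, passage in zip(passage_vecs, passages)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

With a working client, `rank_passages(client, "What is the capital of France?", ["Paris is the capital of France.", "Berlin is in Germany."])` would return the passages ordered by similarity to the query.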
