llama-stack-mirror/llama_stack/providers/remote/inference
Justin 509ac4a659
feat: enable Runpod inference adapter (#3707)
# What does this PR do?
Sorry @mattf, I thought I could close the other PR and reopen it, but I no
longer have the option to reopen it. I just didn't want it to keep notifying
maintainers while I made further commits for testing.

Continuation of: https://github.com/llamastack/llama-stack/pull/3641

This PR fixes the RunPod adapter:
https://github.com/llamastack/llama-stack/issues/3517

## What I fixed from before:

1. Migrated the adapter entirely to the OpenAI-compatible code path.
2. Updated the adapter class for the recent `OpenAIMixin` changes (it is now a
`pydantic.BaseModel`).
3. Tested that models are discovered dynamically and that the resulting
identifier can be used to make requests:
```bash
curl -X GET \
  -H "Content-Type: application/json" \
  "http://localhost:8321/v1/models"
```

## Test Plan

# RunPod Provider Quick Start

## Prerequisites
- Python 3.10+
- Git
- RunPod API token

## Setup for Development

```bash
# 1. Clone and enter the repository
git clone https://github.com/llamastack/llama-stack.git
cd llama-stack

# 2. Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate

# 3. Remove any existing llama-stack installation
pip uninstall llama-stack llama-stack-client -y

# 4. Install llama-stack in development mode
pip install -e .

# 5. Build using local development code
# (found through the project Discord)
LLAMA_STACK_DIR=. llama stack build

# When prompted during build:
# - Name: runpod-dev
# - Image type: venv
# - Inference provider: remote::runpod
# - Safety provider: llama-guard
# - Other providers: accept the defaults
```
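
To confirm the editable install is the one in use, here is a quick stdlib-only
sanity check (a sketch; it only assumes the package name `llama-stack` from the
install step above):

```python
# Print the installed version and where llama-stack resolves from; an
# editable install should point into the repository checkout, not site-packages.
from importlib.metadata import version

import llama_stack

print(version("llama-stack"), llama_stack.__file__)
```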

## Configure the Stack

The RunPod adapter automatically discovers models from your endpoint via the `/v1/models` API.
No manual model configuration is required; just set the `RUNPOD_URL` and `RUNPOD_API_TOKEN` environment variables.
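
Before starting the server, a small sketch to fail fast if those variables are
missing (names taken from the export commands below):

```python
# Check that the RunPod environment variables are set before launching.
import os

for var in ("RUNPOD_URL", "RUNPOD_API_TOKEN"):
    if not os.environ.get(var):
        raise SystemExit(f"{var} is not set")
print("RunPod environment looks good")
```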

## Run the Server

### Important: Use the Build-Created Virtual Environment

```bash
# Exit the development venv if you're in it
deactivate

# Activate the build-created venv (NOT .venv)
cd llama-stack
source llamastack-runpod-dev/bin/activate
```

### For Qwen3-32B-AWQ Public Endpoint (Recommended)

```bash
# Set environment variables
export RUNPOD_URL="https://api.runpod.ai/v2/qwen3-32b-awq/openai/v1"
export RUNPOD_API_TOKEN="your_runpod_api_key"

# Start server
llama stack run \
  ~/.llama/distributions/llamastack-runpod-dev/llamastack-runpod-dev-run.yaml
```
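
If you are scripting this test plan, a small poll loop can block until the
server is ready; this sketch uses only the Python standard library and the
`/v1/models` route exercised below:

```python
# Readiness check: poll /v1/models until the server starts answering.
import time
import urllib.request

URL = "http://localhost:8321/v1/models"

while True:
    try:
        urllib.request.urlopen(URL, timeout=2)
        break
    except OSError:  # connection refused or timeout while starting up
        time.sleep(1)
print("Server is up")
```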

## Quick Test

### 1. List Available Models (Dynamic Discovery)

First, check which models are available on your RunPod endpoint:

```bash
curl -X GET \
  -H "Content-Type: application/json" \
  "http://localhost:8321/v1/models"
```

**Example Response:**
```json
{
  "data": [
    {
      "identifier": "qwen3-32b-awq",
      "provider_resource_id": "Qwen/Qwen3-32B-AWQ",
      "provider_id": "runpod",
      "type": "model",
      "metadata": {},
      "model_type": "llm"
    }
  ]
}
```

**Note:** Use the `identifier` value from the response above in your requests below.
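
The same check from Python, using the `llama-stack-client` package installed
during setup; the field names match the JSON above:

```python
# Print each model's identifier (for requests) and provider resource id.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
for model in client.models.list():
    print(model.identifier, "->", model.provider_resource_id)
```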

### 2. Chat Completion (Non-streaming)

Replace `qwen3-32b-awq` with your model identifier from step 1:

```bash
curl -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b-awq",
    "messages": [{"role": "user", "content": "Hello, count to 3"}],
    "stream": false
  }'
```
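
The same request from Python; this sketch assumes a recent
`llama-stack-client` that exposes the OpenAI-compatible `chat.completions`
surface:

```python
# Non-streaming chat completion against the local stack.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
response = client.chat.completions.create(
    model="qwen3-32b-awq",  # substitute your identifier from step 1
    messages=[{"role": "user", "content": "Hello, count to 3"}],
    stream=False,
)
print(response.choices[0].message.content)
```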

### 3. Chat Completion (Streaming)

```bash
curl -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b-awq",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'
```

**Clean streaming output:**
```bash
curl -N -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-32b-awq", "messages": [{"role": "user", "content": "Count to 5"}], "stream": true}' \
  2>/dev/null | while read -r line; do
    echo "$line" | grep "^data: " | sed 's/^data: //' \
      | jq -r '.choices[0].delta.content // empty' 2>/dev/null
  done
```

**Expected Output:**
```
1
2
3
4
5
```
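
For completeness, a streaming sketch in Python that consumes the same delta
format the shell pipeline above parses (again assuming the OpenAI-compatible
`chat.completions` surface):

```python
# Streaming chat completion: print content deltas as they arrive.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
stream = client.chat.completions.create(
    model="qwen3-32b-awq",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```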
| Entry | Last commit | Date |
|-------|-------------|------|
| anthropic | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| azure | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| bedrock | chore: remove deprecated inference.chat_completion implementations (#3654) | 2025-10-03 07:55:34 -04:00 |
| cerebras | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| databricks | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| fireworks | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| gemini | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| groq | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| llama_openai_compat | chore: disable openai_embeddings on inference=remote::llama-openai-compat (#3704) | 2025-10-06 13:27:40 -04:00 |
| nvidia | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| ollama | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| openai | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| passthrough | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| runpod | feat: enable Runpod inference adapter (#3707) | 2025-10-07 12:24:50 +02:00 |
| sambanova | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| tgi | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| together | chore: remove together inference adapter's custom check_model_availability (#3702) | 2025-10-06 13:28:36 -04:00 |
| vertexai | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| vllm | chore: remove vLLM inference adapter's custom list_models (#3703) | 2025-10-06 13:27:30 -04:00 |
| watsonx | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| `__init__.py` | impls -> inline, adapters -> remote (#381) | 2024-11-06 14:54:05 -08:00 |