llama-stack-mirror/llama_stack/providers/remote/inference
Justin 509ac4a659
feat: enable Runpod inference adapter (#3707)
# What does this PR do?
Sorry @mattf, I thought I could close the other PR and reopen it, but I no
longer have the option to reopen it. I just didn't want it to keep notifying
maintainers while I made further commits for testing.

Continuation of: https://github.com/llamastack/llama-stack/pull/3641

This PR fixes the RunPod adapter:
https://github.com/llamastack/llama-stack/issues/3517

## What I fixed from before:

1. Migrated the adapter entirely to the OpenAI-compatible code path.
2. Updated the adapter class for the recent `OpenAIMixin` changes (it is now a
`pydantic.BaseModel`).
3. Tested that models are discovered dynamically and that the resulting
identifier can be used to make requests:
```bash
curl -X GET \
  -H "Content-Type: application/json" \
  "http://localhost:8321/v1/models"
```

## Test Plan

# RunPod Provider Quick Start

## Prerequisites
- Python 3.10+
- Git
- RunPod API token

## Setup for Development

```bash
# 1. Clone and enter the repository
git clone https://github.com/llamastack/llama-stack.git
cd llama-stack

# 2. Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate

# 3. Remove any existing llama-stack installation
pip uninstall llama-stack llama-stack-client -y

# 4. Install llama-stack in development mode
pip install -e .

# 5. Build using local development code
# (found through the project Discord)
LLAMA_STACK_DIR=. llama stack build

# When prompted during build:
# - Name: runpod-dev
# - Image type: venv
# - Inference provider: remote::runpod
# - Safety provider: llama-guard
# - Other providers: accept the defaults
```
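
To confirm the editable install is the one in use, here is a quick stdlib-only
sanity check (a sketch; it only assumes the package name `llama-stack` from the
install step above):

```python
# Print the installed version and where llama-stack resolves from; an
# editable install should point into the repository checkout, not site-packages.
from importlib.metadata import version

import llama_stack

print(version("llama-stack"), llama_stack.__file__)
```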

## Configure the Stack

The RunPod adapter automatically discovers models from your endpoint via the `/v1/models` API.
No manual model configuration is required; just set the `RUNPOD_URL` and `RUNPOD_API_TOKEN` environment variables.
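
Before starting the server, a small sketch to fail fast if those variables are
missing (names taken from the export commands below):

```python
# Check that the RunPod environment variables are set before launching.
import os

for var in ("RUNPOD_URL", "RUNPOD_API_TOKEN"):
    if not os.environ.get(var):
        raise SystemExit(f"{var} is not set")
print("RunPod environment looks good")
```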

## Run the Server

### Important: Use the Build-Created Virtual Environment

```bash
# Exit the development venv if you're in it
deactivate

# Activate the build-created venv (NOT .venv)
cd llama-stack
source llamastack-runpod-dev/bin/activate
```

### For Qwen3-32B-AWQ Public Endpoint (Recommended)

```bash
# Set environment variables
export RUNPOD_URL="https://api.runpod.ai/v2/qwen3-32b-awq/openai/v1"
export RUNPOD_API_TOKEN="your_runpod_api_key"

# Start server
llama stack run \
  ~/.llama/distributions/llamastack-runpod-dev/llamastack-runpod-dev-run.yaml
```
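
If you are scripting this test plan, a small poll loop can block until the
server is ready; this sketch uses only the Python standard library and the
`/v1/models` route exercised below:

```python
# Readiness check: poll /v1/models until the server starts answering.
import time
import urllib.request

URL = "http://localhost:8321/v1/models"

while True:
    try:
        urllib.request.urlopen(URL, timeout=2)
        break
    except OSError:  # connection refused or timeout while starting up
        time.sleep(1)
print("Server is up")
```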

## Quick Test

### 1. List Available Models (Dynamic Discovery)

First, check which models are available on your RunPod endpoint:

```bash
curl -X GET \
  -H "Content-Type: application/json" \
  "http://localhost:8321/v1/models"
```

**Example Response:**
```json
{
  "data": [
    {
      "identifier": "qwen3-32b-awq",
      "provider_resource_id": "Qwen/Qwen3-32B-AWQ",
      "provider_id": "runpod",
      "type": "model",
      "metadata": {},
      "model_type": "llm"
    }
  ]
}
```

**Note:** Use the `identifier` value from the response above in your requests below.
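
The same check from Python, using the `llama-stack-client` package installed
during setup; the field names match the JSON above:

```python
# Print each model's identifier (for requests) and provider resource id.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
for model in client.models.list():
    print(model.identifier, "->", model.provider_resource_id)
```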

### 2. Chat Completion (Non-streaming)

Replace `qwen3-32b-awq` with your model identifier from step 1:

```bash
curl -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b-awq",
    "messages": [{"role": "user", "content": "Hello, count to 3"}],
    "stream": false
  }'
```
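
The same request from Python; this sketch assumes a recent
`llama-stack-client` that exposes the OpenAI-compatible `chat.completions`
surface:

```python
# Non-streaming chat completion against the local stack.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
response = client.chat.completions.create(
    model="qwen3-32b-awq",  # substitute your identifier from step 1
    messages=[{"role": "user", "content": "Hello, count to 3"}],
    stream=False,
)
print(response.choices[0].message.content)
```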

### 3. Chat Completion (Streaming)

```bash
curl -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b-awq",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'
```

**Clean streaming output:**
```bash
curl -N -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-32b-awq", "messages": [{"role": "user", "content": "Count to 5"}], "stream": true}' \
  2>/dev/null | while read -r line; do
    echo "$line" | grep "^data: " | sed 's/^data: //' \
      | jq -r '.choices[0].delta.content // empty' 2>/dev/null
  done
```

**Expected Output:**
```
1
2
3
4
5
```
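
For completeness, a streaming sketch in Python that consumes the same delta
format the shell pipeline above parses (again assuming the OpenAI-compatible
`chat.completions` surface):

```python
# Streaming chat completion: print content deltas as they arrive.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
stream = client.chat.completions.create(
    model="qwen3-32b-awq",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```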
| Entry | Last commit | Date |
|-------|-------------|------|
| anthropic | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| azure | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| bedrock | chore: remove deprecated inference.chat_completion implementations (#3654) | 2025-10-03 07:55:34 -04:00 |
| cerebras | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| databricks | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| fireworks | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| gemini | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| groq | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| llama_openai_compat | chore: disable openai_embeddings on inference=remote::llama-openai-compat (#3704) | 2025-10-06 13:27:40 -04:00 |
| nvidia | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| ollama | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| openai | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| passthrough | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| runpod | feat: enable Runpod inference adapter (#3707) | 2025-10-07 12:24:50 +02:00 |
| sambanova | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| tgi | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| together | chore: remove together inference adapter's custom check_model_availability (#3702) | 2025-10-06 13:28:36 -04:00 |
| vertexai | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| vllm | chore: remove vLLM inference adapter's custom list_models (#3703) | 2025-10-06 13:27:30 -04:00 |
| watsonx | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) | 2025-10-06 11:33:19 -04:00 |
| `__init__.py` | impls -> inline, adapters -> remote (#381) | 2024-11-06 14:54:05 -08:00 |