mirror of
https://github.com/meta-llama/llama-stack.git
synced 2025-12-03 09:53:45 +00:00
Implements AWS Bedrock inference provider using OpenAI-compatible endpoint for Llama models available through Bedrock.

Closes: #3410

## What does this PR do?

Adds AWS Bedrock as an inference provider using the OpenAI-compatible endpoint. This lets us use Bedrock models (GPT-OSS, Llama) through the standard llama-stack inference API.

The implementation uses LiteLLM's OpenAI client under the hood, so it gets all the OpenAI compatibility features. The provider handles per-request API key overrides via headers.

## Test Plan

**Tested the following scenarios:**

- Non-streaming completion - basic request/response flow
- Streaming completion - SSE streaming with chunked responses
- Multi-turn conversations - context retention across turns
- Tool calling - function calling with proper tool_calls format

# Bedrock OpenAI-Compatible Provider - Test Results

**Model:** `bedrock-inference/openai.gpt-oss-20b-1:0`

---

## Test 1: Model Listing

**Request:**

```http
GET /v1/models HTTP/1.1
```

**Response:**

```http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "data": [
    {"identifier": "bedrock-inference/openai.gpt-oss-20b-1:0", ...},
    {"identifier": "bedrock-inference/openai.gpt-oss-40b-1:0", ...}
  ]
}
```

---

## Test 2: Non-Streaming Completion

**Request:**

```http
POST /v1/chat/completions HTTP/1.1
Content-Type: application/json

{
  "model": "bedrock-inference/openai.gpt-oss-20b-1:0",
  "messages": [{"role": "user", "content": "Say 'Hello from Bedrock' and nothing else"}],
  "stream": false
}
```

**Response:**

```http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "choices": [{
    "finish_reason": "stop",
    "message": {"content": "...Hello from Bedrock"}
  }],
  "usage": {"prompt_tokens": 79, "completion_tokens": 50, "total_tokens": 129}
}
```

---

## Test 3: Streaming Completion

**Request:**

```http
POST /v1/chat/completions HTTP/1.1
Content-Type: application/json

{
  "model": "bedrock-inference/openai.gpt-oss-20b-1:0",
  "messages": [{"role": "user", "content": "Count from 1 to 5"}],
  "stream": true
}
```

**Response:**

```http
HTTP/1.1 200 OK
Content-Type: text/event-stream

[6 SSE chunks received]
Final content: "1, 2, 3, 4, 5"
```

---

## Test 4: Error Handling - Invalid Model

**Request:**

```http
POST /v1/chat/completions HTTP/1.1
Content-Type: application/json

{
  "model": "invalid-model-id",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": false
}
```

**Response:**

```http
HTTP/1.1 404 Not Found
Content-Type: application/json

{
  "detail": "Model 'invalid-model-id' not found. Use 'client.models.list()' to list available Models."
}
```

---

## Test 5: Multi-Turn Conversation

**Request 1:**

```http
POST /v1/chat/completions HTTP/1.1

{
  "messages": [{"role": "user", "content": "My name is Alice"}]
}
```

**Response 1:**

```http
HTTP/1.1 200 OK

{
  "choices": [{
    "message": {"content": "...Nice to meet you, Alice! How can I help you today?"}
  }]
}
```

**Request 2 (with history):**

```http
POST /v1/chat/completions HTTP/1.1

{
  "messages": [
    {"role": "user", "content": "My name is Alice"},
    {"role": "assistant", "content": "...Nice to meet you, Alice!..."},
    {"role": "user", "content": "What is my name?"}
  ]
}
```

**Response 2:**

```http
HTTP/1.1 200 OK

{
  "choices": [{
    "message": {"content": "...Your name is Alice."}
  }],
  "usage": {"prompt_tokens": 183, "completion_tokens": 42}
}
```

**Context retained across turns**

---

## Test 6: System Messages

**Request:**

```http
POST /v1/chat/completions HTTP/1.1

{
  "messages": [
    {"role": "system", "content": "You are Shakespeare. Respond only in Shakespearean English."},
    {"role": "user", "content": "Tell me about the weather"}
  ]
}
```

**Response:**

```http
HTTP/1.1 200 OK

{
  "choices": [{
    "message": {"content": "Lo! I heed thy request..."}
  }],
  "usage": {"completion_tokens": 813}
}
```

---

## Test 7: Tool Calling

**Request:**

```http
POST /v1/chat/completions HTTP/1.1

{
  "messages": [{"role": "user", "content": "What's the weather in San Francisco?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "parameters": {"type": "object", "properties": {"location": {"type": "string"}}}
    }
  }]
}
```

**Response:**

```http
HTTP/1.1 200 OK

{
  "choices": [{
    "finish_reason": "tool_calls",
    "message": {
      "tool_calls": [{
        "function": {"name": "get_weather", "arguments": "{\"location\":\"San Francisco\"}"}
      }]
    }
  }]
}
```

---

## Test 8: Sampling Parameters

**Request:**

```http
POST /v1/chat/completions HTTP/1.1

{
  "messages": [{"role": "user", "content": "Say hello"}],
  "temperature": 0.7,
  "top_p": 0.9
}
```

**Response:**

```http
HTTP/1.1 200 OK

{
  "choices": [{
    "message": {"content": "...Hello! 👋 How can I help you today?"}
  }]
}
```

---

## Test 9: Authentication Error Handling

### Subtest A: Invalid API Key

**Request:**

```http
POST /v1/chat/completions HTTP/1.1
x-llamastack-provider-data: {"aws_bedrock_api_key": "invalid-fake-key-12345"}

{"model": "bedrock-inference/openai.gpt-oss-20b-1:0", ...}
```

**Response:**

```http
HTTP/1.1 400 Bad Request

{
  "detail": "Invalid value: Authentication failed: Error code: 401 - {'error': {'message': 'Invalid API Key format: Must start with pre-defined prefix', ...}}"
}
```

---

### Subtest B: Empty API Key (Fallback to Config)

**Request:**

```http
POST /v1/chat/completions HTTP/1.1
x-llamastack-provider-data: {"aws_bedrock_api_key": ""}

{"model": "bedrock-inference/openai.gpt-oss-20b-1:0", ...}
```

**Response:**

```http
HTTP/1.1 200 OK

{
  "choices": [{
    "message": {"content": "...Hello! How can I assist you today?"}
  }]
}
```

**Fell back to config key**

---

### Subtest C: Malformed Token

**Request:**

```http
POST /v1/chat/completions HTTP/1.1
x-llamastack-provider-data: {"aws_bedrock_api_key": "not-a-valid-bedrock-token-format"}

{"model": "bedrock-inference/openai.gpt-oss-20b-1:0", ...}
```

**Response:**

```http
HTTP/1.1 400 Bad Request

{
  "detail": "Invalid value: Authentication failed: Error code: 401 - {'error': {'message': 'Invalid API Key format: Must start with pre-defined prefix', ...}}"
}
```
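
The per-request key override exercised in Test 9 could be driven from Python roughly as follows. This is a minimal sketch using only the standard library. `BASE_URL` and the helper name `build_chat_request` are assumptions made for illustration and are not part of this PR; the header name `x-llamastack-provider-data` and the `aws_bedrock_api_key` field are taken from the requests shown above. The request is built but not sent.

```python
import json
import urllib.request
from typing import Optional

# Assumption: a llama-stack server is reachable here; adjust for your deployment.
BASE_URL = "http://localhost:8321"


def build_chat_request(
    model: str,
    prompt: str,
    bedrock_api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a non-streaming chat completion request."""
    headers = {"Content-Type": "application/json"}
    if bedrock_api_key:
        # Per-request override. When the header is omitted, the server is
        # expected to use the key from its own provider config.
        headers["x-llamastack-provider-data"] = json.dumps(
            {"aws_bedrock_api_key": bedrock_api_key}
        )
    body = json.dumps(
        {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        }
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers=headers,
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it, which requires a running server and a valid key. Note that `urllib` normalizes header names when storing them, so the header is read back as `X-llamastack-provider-data`; HTTP header names are case-insensitive, so this does not change what the server sees.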
302 lines
15 KiB
Python
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.


from llama_stack.providers.datatypes import (
    Api,
    InlineProviderSpec,
    ProviderSpec,
    RemoteProviderSpec,
)

META_REFERENCE_DEPS = [
    "accelerate",
    "fairscale",
    "torch",
    "torchvision",
    "transformers",
    "zmq",
    "lm-format-enforcer",
    "sentence-transformers",
    "torchao==0.8.0",
    "fbgemm-gpu-genai==1.1.2",
]


def available_providers() -> list[ProviderSpec]:
    return [
        InlineProviderSpec(
            api=Api.inference,
            provider_type="inline::meta-reference",
            pip_packages=META_REFERENCE_DEPS,
            module="llama_stack.providers.inline.inference.meta_reference",
            config_class="llama_stack.providers.inline.inference.meta_reference.MetaReferenceInferenceConfig",
            description="Meta's reference implementation of inference with support for various model formats and optimization techniques.",
        ),
        InlineProviderSpec(
            api=Api.inference,
            provider_type="inline::sentence-transformers",
            # CrossEncoder depends on torchao.quantization
            pip_packages=[
                "torch torchvision torchao>=0.12.0 --extra-index-url https://download.pytorch.org/whl/cpu",
                "sentence-transformers --no-deps",
                # required by some SentenceTransformers architectures for tensor rearrange/merge ops
                "einops",
                # fast HF tokenization backend used by SentenceTransformers models
                "tokenizers",
                # safe and fast file format for storing and loading tensors
                "safetensors",
            ],
            module="llama_stack.providers.inline.inference.sentence_transformers",
            config_class="llama_stack.providers.inline.inference.sentence_transformers.config.SentenceTransformersInferenceConfig",
            description="Sentence Transformers inference provider for text embeddings and similarity search.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="cerebras",
            provider_type="remote::cerebras",
            pip_packages=[],
            module="llama_stack.providers.remote.inference.cerebras",
            config_class="llama_stack.providers.remote.inference.cerebras.CerebrasImplConfig",
            provider_data_validator="llama_stack.providers.remote.inference.cerebras.config.CerebrasProviderDataValidator",
            description="Cerebras inference provider for running models on Cerebras Cloud platform.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="ollama",
            provider_type="remote::ollama",
            pip_packages=["ollama", "aiohttp", "h11>=0.16.0"],
            config_class="llama_stack.providers.remote.inference.ollama.OllamaImplConfig",
            module="llama_stack.providers.remote.inference.ollama",
            description="Ollama inference provider for running local models through the Ollama runtime.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="vllm",
            provider_type="remote::vllm",
            pip_packages=[],
            module="llama_stack.providers.remote.inference.vllm",
            config_class="llama_stack.providers.remote.inference.vllm.VLLMInferenceAdapterConfig",
            provider_data_validator="llama_stack.providers.remote.inference.vllm.VLLMProviderDataValidator",
            description="Remote vLLM inference provider for connecting to vLLM servers.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="tgi",
            provider_type="remote::tgi",
            pip_packages=["huggingface_hub", "aiohttp"],
            module="llama_stack.providers.remote.inference.tgi",
            config_class="llama_stack.providers.remote.inference.tgi.TGIImplConfig",
            description="Text Generation Inference (TGI) provider for HuggingFace model serving.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="hf::serverless",
            provider_type="remote::hf::serverless",
            pip_packages=["huggingface_hub", "aiohttp"],
            module="llama_stack.providers.remote.inference.tgi",
            config_class="llama_stack.providers.remote.inference.tgi.InferenceAPIImplConfig",
            description="HuggingFace Inference API serverless provider for on-demand model inference.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            provider_type="remote::hf::endpoint",
            adapter_type="hf::endpoint",
            pip_packages=["huggingface_hub", "aiohttp"],
            module="llama_stack.providers.remote.inference.tgi",
            config_class="llama_stack.providers.remote.inference.tgi.InferenceEndpointImplConfig",
            description="HuggingFace Inference Endpoints provider for dedicated model serving.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="fireworks",
            provider_type="remote::fireworks",
            pip_packages=[
                "fireworks-ai<=0.17.16",
            ],
            module="llama_stack.providers.remote.inference.fireworks",
            config_class="llama_stack.providers.remote.inference.fireworks.FireworksImplConfig",
            provider_data_validator="llama_stack.providers.remote.inference.fireworks.FireworksProviderDataValidator",
            description="Fireworks AI inference provider for Llama models and other AI models on the Fireworks platform.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="together",
            provider_type="remote::together",
            pip_packages=[
                "together",
            ],
            module="llama_stack.providers.remote.inference.together",
            config_class="llama_stack.providers.remote.inference.together.TogetherImplConfig",
            provider_data_validator="llama_stack.providers.remote.inference.together.TogetherProviderDataValidator",
            description="Together AI inference provider for open-source models and collaborative AI development.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="bedrock",
            provider_type="remote::bedrock",
            pip_packages=[],
            module="llama_stack.providers.remote.inference.bedrock",
            config_class="llama_stack.providers.remote.inference.bedrock.BedrockConfig",
            provider_data_validator="llama_stack.providers.remote.inference.bedrock.config.BedrockProviderDataValidator",
            description="AWS Bedrock inference provider using OpenAI compatible endpoint.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="databricks",
            provider_type="remote::databricks",
            pip_packages=["databricks-sdk"],
            module="llama_stack.providers.remote.inference.databricks",
            config_class="llama_stack.providers.remote.inference.databricks.DatabricksImplConfig",
            provider_data_validator="llama_stack.providers.remote.inference.databricks.config.DatabricksProviderDataValidator",
            description="Databricks inference provider for running models on Databricks' unified analytics platform.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="nvidia",
            provider_type="remote::nvidia",
            pip_packages=[],
            module="llama_stack.providers.remote.inference.nvidia",
            config_class="llama_stack.providers.remote.inference.nvidia.NVIDIAConfig",
            provider_data_validator="llama_stack.providers.remote.inference.nvidia.config.NVIDIAProviderDataValidator",
            description="NVIDIA inference provider for accessing NVIDIA NIM models and AI services.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="runpod",
            provider_type="remote::runpod",
            pip_packages=[],
            module="llama_stack.providers.remote.inference.runpod",
            config_class="llama_stack.providers.remote.inference.runpod.RunpodImplConfig",
            provider_data_validator="llama_stack.providers.remote.inference.runpod.config.RunpodProviderDataValidator",
            description="RunPod inference provider for running models on RunPod's cloud GPU platform.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="openai",
            provider_type="remote::openai",
            pip_packages=[],
            module="llama_stack.providers.remote.inference.openai",
            config_class="llama_stack.providers.remote.inference.openai.OpenAIConfig",
            provider_data_validator="llama_stack.providers.remote.inference.openai.config.OpenAIProviderDataValidator",
            description="OpenAI inference provider for accessing GPT models and other OpenAI services.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="anthropic",
            provider_type="remote::anthropic",
            pip_packages=["anthropic"],
            module="llama_stack.providers.remote.inference.anthropic",
            config_class="llama_stack.providers.remote.inference.anthropic.AnthropicConfig",
            provider_data_validator="llama_stack.providers.remote.inference.anthropic.config.AnthropicProviderDataValidator",
            description="Anthropic inference provider for accessing Claude models and Anthropic's AI services.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="gemini",
            provider_type="remote::gemini",
            pip_packages=[],
            module="llama_stack.providers.remote.inference.gemini",
            config_class="llama_stack.providers.remote.inference.gemini.GeminiConfig",
            provider_data_validator="llama_stack.providers.remote.inference.gemini.config.GeminiProviderDataValidator",
            description="Google Gemini inference provider for accessing Gemini models and Google's AI services.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="vertexai",
            provider_type="remote::vertexai",
            pip_packages=[
                "google-cloud-aiplatform",
            ],
            module="llama_stack.providers.remote.inference.vertexai",
            config_class="llama_stack.providers.remote.inference.vertexai.VertexAIConfig",
            provider_data_validator="llama_stack.providers.remote.inference.vertexai.config.VertexAIProviderDataValidator",
            description="""Google Vertex AI inference provider enables you to use Google's Gemini models through Google Cloud's Vertex AI platform, providing several advantages:

• Enterprise-grade security: Uses Google Cloud's security controls and IAM
• Better integration: Seamless integration with other Google Cloud services
• Advanced features: Access to additional Vertex AI features like model tuning and monitoring
• Authentication: Uses Google Cloud Application Default Credentials (ADC) instead of API keys

Configuration:
- Set VERTEX_AI_PROJECT environment variable (required)
- Set VERTEX_AI_LOCATION environment variable (optional, defaults to us-central1)
- Use Google Cloud Application Default Credentials or service account key

Authentication Setup:
Option 1 (Recommended): gcloud auth application-default login
Option 2: Set GOOGLE_APPLICATION_CREDENTIALS to service account key path

Available Models:
- vertex_ai/gemini-2.0-flash
- vertex_ai/gemini-2.5-flash
- vertex_ai/gemini-2.5-pro""",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="groq",
            provider_type="remote::groq",
            pip_packages=[],
            module="llama_stack.providers.remote.inference.groq",
            config_class="llama_stack.providers.remote.inference.groq.GroqConfig",
            provider_data_validator="llama_stack.providers.remote.inference.groq.config.GroqProviderDataValidator",
            description="Groq inference provider for ultra-fast inference using Groq's LPU technology.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="llama-openai-compat",
            provider_type="remote::llama-openai-compat",
            pip_packages=[],
            module="llama_stack.providers.remote.inference.llama_openai_compat",
            config_class="llama_stack.providers.remote.inference.llama_openai_compat.config.LlamaCompatConfig",
            provider_data_validator="llama_stack.providers.remote.inference.llama_openai_compat.config.LlamaProviderDataValidator",
            description="Llama OpenAI-compatible provider for using Llama models with OpenAI API format.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="sambanova",
            provider_type="remote::sambanova",
            pip_packages=[],
            module="llama_stack.providers.remote.inference.sambanova",
            config_class="llama_stack.providers.remote.inference.sambanova.SambaNovaImplConfig",
            provider_data_validator="llama_stack.providers.remote.inference.sambanova.config.SambaNovaProviderDataValidator",
            description="SambaNova inference provider for running models on SambaNova's dataflow architecture.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="passthrough",
            provider_type="remote::passthrough",
            pip_packages=[],
            module="llama_stack.providers.remote.inference.passthrough",
            config_class="llama_stack.providers.remote.inference.passthrough.PassthroughImplConfig",
            provider_data_validator="llama_stack.providers.remote.inference.passthrough.PassthroughProviderDataValidator",
            description="Passthrough inference provider for connecting to any external inference service not directly supported.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            adapter_type="watsonx",
            provider_type="remote::watsonx",
            pip_packages=["litellm"],
            module="llama_stack.providers.remote.inference.watsonx",
            config_class="llama_stack.providers.remote.inference.watsonx.WatsonXConfig",
            provider_data_validator="llama_stack.providers.remote.inference.watsonx.config.WatsonXProviderDataValidator",
            description="IBM WatsonX inference provider for accessing AI models on IBM's WatsonX platform.",
        ),
        RemoteProviderSpec(
            api=Api.inference,
            provider_type="remote::azure",
            adapter_type="azure",
            pip_packages=[],
            module="llama_stack.providers.remote.inference.azure",
            config_class="llama_stack.providers.remote.inference.azure.AzureConfig",
            provider_data_validator="llama_stack.providers.remote.inference.azure.config.AzureProviderDataValidator",
            description="""
Azure OpenAI inference provider for accessing GPT models and other Azure services.
Provider documentation
https://learn.microsoft.com/en-us/azure/ai-foundry/openai/overview
""",
        ),
    ]