# Inference
## Overview
The Llama Stack Inference API generates completions, chat completions, and embeddings.
This API provides the raw interface to the underlying models. Two kinds of models are supported:
- LLM models: these models generate "raw" and "chat" (conversational) completions.
- Embedding models: these models generate embeddings to be used for semantic search.
This section contains documentation for all available providers for the **inference** API.
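
How a provider from this list is actually used depends on your stack's run configuration. As a hedged sketch (the `provider_id` name and the choice of Ollama are arbitrary illustrations, not a recommendation), an inference provider entry in a `run.yaml` could look like:
```yaml
# Hypothetical run.yaml fragment selecting one inference provider.
providers:
  inference:
  - provider_id: my-inference          # user-chosen name (assumption)
    provider_type: remote::ollama      # any provider documented below
    config:                            # provider-specific fields; see its page
      url: ${env.OLLAMA_URL:=http://localhost:11434}
```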
## Providers
```{toctree}
:maxdepth: 1
inline_meta-reference
inline_sentence-transformers
remote_anthropic
remote_azure
remote_bedrock
remote_cerebras
remote_databricks
remote_fireworks
remote_gemini
remote_groq
remote_hf_endpoint
remote_hf_serverless
remote_llama-openai-compat
remote_nvidia
remote_ollama
remote_openai
remote_passthrough
remote_runpod
remote_sambanova
remote_sambanova-openai-compat
remote_tgi
remote_together
remote_vertexai
remote_vllm
remote_watsonx
```

# inline::meta-reference
## Description
Meta's reference implementation of inference with support for various model formats and optimization techniques.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `model` | `str \| None` | No | | |
| `torch_seed` | `int \| None` | No | | |
| `max_seq_len` | `int` | No | 4096 | |
| `max_batch_size` | `int` | No | 1 | |
| `model_parallel_size` | `int \| None` | No | | |
| `create_distributed_process_group` | `bool` | No | True | |
| `checkpoint_dir` | `str \| None` | No | | |
| `quantization` | `Bf16QuantizationConfig \| Fp8QuantizationConfig \| Int4QuantizationConfig \| None` | No | | |
## Sample Configuration
```yaml
model: Llama3.2-3B-Instruct
checkpoint_dir: ${env.CHECKPOINT_DIR:=null}
quantization:
type: ${env.QUANTIZATION_TYPE:=bf16}
model_parallel_size: ${env.MODEL_PARALLEL_SIZE:=0}
max_batch_size: ${env.MAX_BATCH_SIZE:=1}
max_seq_len: ${env.MAX_SEQ_LEN:=4096}
```
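
Values of the form `${env.VAR:=default}` are substituted from environment variables at startup, with the text after `:=` used as the fallback when the variable is unset. With no overrides set, the sample above should therefore resolve to roughly:
```yaml
# Approximate resolved configuration when no environment variables are set.
model: Llama3.2-3B-Instruct
checkpoint_dir: null
quantization:
  type: bf16
model_parallel_size: 0
max_batch_size: 1
max_seq_len: 4096
```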

# inline::sentence-transformers
## Description
Sentence Transformers inference provider for text embeddings and similarity search.
## Sample Configuration
```yaml
{}
```

# remote::anthropic
## Description
Anthropic inference provider for accessing Claude models and Anthropic's AI services.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | API key for Anthropic models |
## Sample Configuration
```yaml
api_key: ${env.ANTHROPIC_API_KEY:=}
```

# remote::azure
## Description
Azure OpenAI inference provider for accessing GPT models and other Azure services.
Provider documentation: https://learn.microsoft.com/en-us/azure/ai-foundry/openai/overview
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `SecretStr` | No | | Azure API key |
| `api_base` | `HttpUrl` | No | | Azure API base URL (e.g., https://your-resource-name.openai.azure.com) |
| `api_version` | `str \| None` | No | | Azure API version (e.g., 2024-12-01-preview) |
| `api_type` | `str \| None` | No | azure | Azure API type (e.g., azure) |
## Sample Configuration
```yaml
api_key: ${env.AZURE_API_KEY:=}
api_base: ${env.AZURE_API_BASE:=}
api_version: ${env.AZURE_API_VERSION:=}
api_type: ${env.AZURE_API_TYPE:=}
```
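
For a concrete picture, a filled-in configuration might look like the sketch below; the resource name and version strings are the placeholder examples from the field descriptions above, not working values:
```yaml
# Illustrative Azure configuration with placeholder values.
api_key: <your-azure-api-key>     # prefer ${env.AZURE_API_KEY:=} in real configs
api_base: https://your-resource-name.openai.azure.com
api_version: 2024-12-01-preview
api_type: azure
```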

# remote::bedrock
## Description
AWS Bedrock inference provider for accessing various AI models through AWS's managed service.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `aws_access_key_id` | `str \| None` | No | | The AWS access key to use. Defaults to the AWS_ACCESS_KEY_ID environment variable. |
| `aws_secret_access_key` | `str \| None` | No | | The AWS secret access key to use. Defaults to the AWS_SECRET_ACCESS_KEY environment variable. |
| `aws_session_token` | `str \| None` | No | | The AWS session token to use. Defaults to the AWS_SESSION_TOKEN environment variable. |
| `region_name` | `str \| None` | No | | The default AWS Region to use, for example, us-west-1 or us-west-2. Defaults to the AWS_DEFAULT_REGION environment variable. |
| `profile_name` | `str \| None` | No | | The profile name that contains credentials to use. Defaults to the AWS_PROFILE environment variable. |
| `total_max_attempts` | `int \| None` | No | | The maximum number of attempts that will be made for a single request, including the initial attempt. Defaults to the AWS_MAX_ATTEMPTS environment variable. |
| `retry_mode` | `str \| None` | No | | The type of retries Boto3 will perform. Defaults to the AWS_RETRY_MODE environment variable. |
| `connect_timeout` | `float \| None` | No | 60.0 | The time in seconds until a timeout exception is thrown when attempting to make a connection. The default is 60 seconds. |
| `read_timeout` | `float \| None` | No | 60.0 | The time in seconds until a timeout exception is thrown when attempting to read from a connection. The default is 60 seconds. |
| `session_ttl` | `int \| None` | No | 3600 | The time in seconds until a session expires. The default is 3600 seconds (1 hour). |
## Sample Configuration
```yaml
{}
```
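
The empty sample `{}` works because every field above falls back to the standard AWS environment variables or its listed default. An explicit variant is sketched below; the region and retry values are arbitrary examples:
```yaml
# Explicit Bedrock configuration; the values shown are illustrative.
region_name: us-west-2        # otherwise read from AWS_DEFAULT_REGION
total_max_attempts: 3         # otherwise read from AWS_MAX_ATTEMPTS
retry_mode: standard          # a Boto3 retry mode; otherwise read from AWS_RETRY_MODE
connect_timeout: 60.0
read_timeout: 60.0
```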

# remote::cerebras
## Description
Cerebras inference provider for running models on Cerebras Cloud platform.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `base_url` | `str` | No | https://api.cerebras.ai | Base URL for the Cerebras API |
| `api_key` | `SecretStr \| None` | No | | Cerebras API Key |
## Sample Configuration
```yaml
base_url: https://api.cerebras.ai
api_key: ${env.CEREBRAS_API_KEY:=}
```

# remote::databricks
## Description
Databricks inference provider for running models on Databricks' unified analytics platform.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `str` | No | | The URL for the Databricks model serving endpoint |
| `api_token` | `str` | No | | The Databricks API token |
## Sample Configuration
```yaml
url: ${env.DATABRICKS_URL:=}
api_token: ${env.DATABRICKS_API_TOKEN:=}
```

# remote::fireworks
## Description
Fireworks AI inference provider for Llama models and other AI models on the Fireworks platform.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `allowed_models` | `list[str] \| None` | No | | List of models that should be registered with the model registry. If None, all models are allowed. |
| `url` | `str` | No | https://api.fireworks.ai/inference/v1 | The URL for the Fireworks server |
| `api_key` | `SecretStr \| None` | No | | The Fireworks.ai API Key |
## Sample Configuration
```yaml
url: https://api.fireworks.ai/inference/v1
api_key: ${env.FIREWORKS_API_KEY:=}
```

# remote::gemini
## Description
Google Gemini inference provider for accessing Gemini models and Google's AI services.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | API key for Gemini models |
## Sample Configuration
```yaml
api_key: ${env.GEMINI_API_KEY:=}
```

# remote::groq
## Description
Groq inference provider for ultra-fast inference using Groq's LPU technology.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | The Groq API key |
| `url` | `str` | No | https://api.groq.com | The URL for the Groq AI server |
## Sample Configuration
```yaml
url: https://api.groq.com
api_key: ${env.GROQ_API_KEY:=}
```

# remote::hf::endpoint
## Description
HuggingFace Inference Endpoints provider for dedicated model serving.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `endpoint_name` | `str` | No | | The name of the Hugging Face Inference Endpoint in the format of '{namespace}/{endpoint_name}' (e.g. 'my-cool-org/meta-llama-3-1-8b-instruct-rce'). Namespace is optional and will default to the user account if not provided. |
| `api_token` | `SecretStr \| None` | No | | Your Hugging Face user access token (will default to locally saved token if not provided) |
## Sample Configuration
```yaml
endpoint_name: ${env.INFERENCE_ENDPOINT_NAME}
api_token: ${env.HF_API_TOKEN}
```
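
Resolved, the configuration could look like the sketch below; the endpoint name is the example from the field description and the token is a placeholder:
```yaml
# Illustrative resolved configuration (placeholder values).
endpoint_name: my-cool-org/meta-llama-3-1-8b-instruct-rce   # '{namespace}/{endpoint_name}'
api_token: hf_xxxxxxxxxxxx   # omit to fall back to the locally saved token
```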

# remote::hf::serverless
## Description
HuggingFace Inference API serverless provider for on-demand model inference.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `huggingface_repo` | `str` | No | | The model ID of the model on the Hugging Face Hub (e.g. 'meta-llama/Meta-Llama-3.1-70B-Instruct') |
| `api_token` | `SecretStr \| None` | No | | Your Hugging Face user access token (will default to locally saved token if not provided) |
## Sample Configuration
```yaml
huggingface_repo: ${env.INFERENCE_MODEL}
api_token: ${env.HF_API_TOKEN}
```

# remote::llama-openai-compat
## Description
Llama OpenAI-compatible provider for using Llama models with OpenAI API format.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | The Llama API key |
| `openai_compat_api_base` | `str` | No | https://api.llama.com/compat/v1/ | The URL for the Llama API server |
## Sample Configuration
```yaml
openai_compat_api_base: https://api.llama.com/compat/v1/
api_key: ${env.LLAMA_API_KEY}
```

# remote::nvidia
## Description
NVIDIA inference provider for accessing NVIDIA NIM models and AI services.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `str` | No | https://integrate.api.nvidia.com | The base URL for accessing the NVIDIA NIM |
| `api_key` | `SecretStr \| None` | No | | The NVIDIA API key, only needed if using the hosted service |
| `timeout` | `int` | No | 60 | Timeout for the HTTP requests |
| `append_api_version` | `bool` | No | True | When set to false, the API version will not be appended to the base_url. Defaults to true. |
## Sample Configuration
```yaml
url: ${env.NVIDIA_BASE_URL:=https://integrate.api.nvidia.com}
api_key: ${env.NVIDIA_API_KEY:=}
append_api_version: ${env.NVIDIA_APPEND_API_VERSION:=True}
```
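
Because `url` is configurable, the same provider can target a self-hosted NIM rather than the hosted service; the sketch below assumes a NIM listening on localhost port 8000, which is purely illustrative:
```yaml
# Sketch: pointing the provider at a self-hosted NIM (endpoint is an assumption).
url: http://localhost:8000
# api_key omitted: per the field description it is only needed for the hosted service
```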

# remote::ollama
## Description
Ollama inference provider for running local models through the Ollama runtime.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `str` | No | http://localhost:11434 | |
| `refresh_models` | `bool` | No | False | Whether to refresh models periodically |
## Sample Configuration
```yaml
url: ${env.OLLAMA_URL:=http://localhost:11434}
```
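
To also enable the periodic model refresh described in the table, the sample can be extended as in this sketch:
```yaml
# Sketch: Ollama with periodic model-list refresh enabled.
url: ${env.OLLAMA_URL:=http://localhost:11434}
refresh_models: true
```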

# remote::openai
## Description
OpenAI inference provider for accessing GPT models and other OpenAI services.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | API key for OpenAI models |
| `base_url` | `str` | No | https://api.openai.com/v1 | Base URL for OpenAI API |
## Sample Configuration
```yaml
api_key: ${env.OPENAI_API_KEY:=}
base_url: ${env.OPENAI_BASE_URL:=https://api.openai.com/v1}
```

# remote::passthrough
## Description
Passthrough inference provider for connecting to any external inference service not directly supported.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `str` | No | | The URL for the passthrough endpoint |
| `api_key` | `SecretStr \| None` | No | | API Key for the passthrough endpoint |
## Sample Configuration
```yaml
url: ${env.PASSTHROUGH_URL}
api_key: ${env.PASSTHROUGH_API_KEY}
```

# remote::runpod
## Description
RunPod inference provider for running models on RunPod's cloud GPU platform.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `str \| None` | No | | The URL for the Runpod model serving endpoint |
| `api_token` | `str \| None` | No | | The API token |
## Sample Configuration
```yaml
url: ${env.RUNPOD_URL:=}
api_token: ${env.RUNPOD_API_TOKEN}
```

# remote::sambanova-openai-compat
## Description
SambaNova OpenAI-compatible provider for using SambaNova models with OpenAI API format.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | The SambaNova API key |
| `openai_compat_api_base` | `str` | No | https://api.sambanova.ai/v1 | The URL for the SambaNova API server |
## Sample Configuration
```yaml
openai_compat_api_base: https://api.sambanova.ai/v1
api_key: ${env.SAMBANOVA_API_KEY:=}
```

# remote::sambanova
## Description
SambaNova inference provider for running models on SambaNova's dataflow architecture.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `str` | No | https://api.sambanova.ai/v1 | The URL for the SambaNova AI server |
| `api_key` | `SecretStr \| None` | No | | The SambaNova cloud API Key |
## Sample Configuration
```yaml
url: https://api.sambanova.ai/v1
api_key: ${env.SAMBANOVA_API_KEY:=}
```

# remote::tgi
## Description
Text Generation Inference (TGI) provider for HuggingFace model serving.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `str` | No | | The URL for the TGI serving endpoint |
## Sample Configuration
```yaml
url: ${env.TGI_URL:=}
```

# remote::together
## Description
Together AI inference provider for open-source models and collaborative AI development.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `allowed_models` | `list[str] \| None` | No | | List of models that should be registered with the model registry. If None, all models are allowed. |
| `url` | `str` | No | https://api.together.xyz/v1 | The URL for the Together AI server |
| `api_key` | `SecretStr \| None` | No | | The Together AI API Key |
## Sample Configuration
```yaml
url: https://api.together.xyz/v1
api_key: ${env.TOGETHER_API_KEY:=}
```

# remote::vertexai
## Description
Google Vertex AI inference provider enables you to use Google's Gemini models through Google Cloud's Vertex AI platform, providing several advantages:
- Enterprise-grade security: Uses Google Cloud's security controls and IAM
- Better integration: Seamless integration with other Google Cloud services
- Advanced features: Access to additional Vertex AI features like model tuning and monitoring
- Authentication: Uses Google Cloud Application Default Credentials (ADC) instead of API keys

Configuration:
- Set the VERTEX_AI_PROJECT environment variable (required)
- Set the VERTEX_AI_LOCATION environment variable (optional, defaults to us-central1)
- Use Google Cloud Application Default Credentials or a service account key

Authentication Setup:
- Option 1 (Recommended): `gcloud auth application-default login`
- Option 2: Set GOOGLE_APPLICATION_CREDENTIALS to the service account key path

Available Models:
- vertex_ai/gemini-2.0-flash
- vertex_ai/gemini-2.5-flash
- vertex_ai/gemini-2.5-pro
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `project` | `str` | No | | Google Cloud project ID for Vertex AI |
| `location` | `str` | No | us-central1 | Google Cloud location for Vertex AI |
## Sample Configuration
```yaml
project: ${env.VERTEX_AI_PROJECT:=}
location: ${env.VERTEX_AI_LOCATION:=us-central1}
```
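
Assuming ADC is already set up as described above, a resolved configuration for this provider might look like the following; the project ID is a placeholder:
```yaml
# Illustrative resolved configuration (placeholder project ID).
project: my-gcp-project    # your Google Cloud project
location: us-central1      # default Vertex AI region
```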

# remote::vllm
## Description
Remote vLLM inference provider for connecting to vLLM servers.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `str \| None` | No | | The URL for the vLLM model serving endpoint |
| `max_tokens` | `int` | No | 4096 | Maximum number of tokens to generate. |
| `api_token` | `str \| None` | No | fake | The API token |
| `tls_verify` | `bool \| str` | No | True | Whether to verify TLS certificates. Can be a boolean or a path to a CA certificate file. |
| `refresh_models` | `bool` | No | False | Whether to refresh models periodically |
## Sample Configuration
```yaml
url: ${env.VLLM_URL:=}
max_tokens: ${env.VLLM_MAX_TOKENS:=4096}
api_token: ${env.VLLM_API_TOKEN:=fake}
tls_verify: ${env.VLLM_TLS_VERIFY:=true}
```
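
Per the `tls_verify` description, the field accepts a CA bundle path as well as a boolean; the sketch below, with an assumed internal host and certificate path, shows that form:
```yaml
# Sketch: vLLM server with a custom CA bundle (host and path are assumptions).
url: https://vllm.internal.example:8000/v1
api_token: ${env.VLLM_API_TOKEN:=fake}
tls_verify: /etc/ssl/certs/internal-ca.pem   # path form instead of true/false
```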

# remote::watsonx
## Description
IBM WatsonX inference provider for accessing AI models on IBM's WatsonX platform.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `str` | No | https://us-south.ml.cloud.ibm.com | The base URL for accessing watsonx.ai |
| `api_key` | `SecretStr \| None` | No | | The watsonx.ai API key |
| `project_id` | `str \| None` | No | | The watsonx.ai project ID |
| `timeout` | `int` | No | 60 | Timeout for the HTTP requests |
## Sample Configuration
```yaml
url: ${env.WATSONX_BASE_URL:=https://us-south.ml.cloud.ibm.com}
api_key: ${env.WATSONX_API_KEY:=}
project_id: ${env.WATSONX_PROJECT_ID:=}
```