docs: provider and distro codegen migration (#3531)

# What does this PR do?

- Updates provider and distro codegen to handle the new format
- Migrates provider and distro files to the new format

## Test Plan

- Manual testing
2025-12-04 18:13:44 +00:00 · 2025-09-24 14:01:29 -07:00 · 2025-09-24 14:01:29 -07:00 · d23865757f
commit d23865757f
parent 45da31801c
103 changed files with 1796 additions and 423 deletions
--- a/docs/docs/distributions/self_hosted_distro/meta-reference-gpu.md
+++ b/docs/docs/distributions/self_hosted_distro/meta-reference-gpu.md
@ -0,0 +1,125 @@
+---
+orphan: true
+---
+<!-- This file was auto-generated by distro_codegen.py, please edit source -->
+# Meta Reference GPU Distribution
+
+```{toctree}
+:maxdepth: 2
+:hidden:
+
+self
+```
+
+The `llamastack/distribution-meta-reference-gpu` distribution consists of the following provider configurations:
+
+| API | Provider(s) |
+|-----|-------------|
+| agents | `inline::meta-reference` |
+| datasetio | `remote::huggingface`, `inline::localfs` |
+| eval | `inline::meta-reference` |
+| inference | `inline::meta-reference` |
+| safety | `inline::llama-guard` |
+| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
+| telemetry | `inline::meta-reference` |
+| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
+| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
+
+
+Note that you need access to NVIDIA GPUs to run this distribution. It is not compatible with CPU-only machines or machines with AMD GPUs.
+
+### Environment Variables
+
+The following environment variables can be configured:
+
+- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
+- `INFERENCE_MODEL`: Inference model loaded into the Meta Reference server (default: `meta-llama/Llama-3.2-3B-Instruct`)
+- `INFERENCE_CHECKPOINT_DIR`: Directory containing the Meta Reference model checkpoint (default: `null`)
+- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
+- `SAFETY_CHECKPOINT_DIR`: Directory containing the Llama-Guard model checkpoint (default: `null`)
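As an illustration, the defaults listed above could be resolved in Python roughly as follows. This is a minimal sketch rather than the distribution's own loader: `DEFAULTS` and `resolve_settings` are hypothetical names, and the documented `null` defaults are modelled as `None`.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults copied from the list above; `null` is modelled as None.
DEFAULTS: Dict[str, Optional[str]] = {
    "LLAMA_STACK_PORT": "8321",
    "INFERENCE_MODEL": "meta-llama/Llama-3.2-3B-Instruct",
    "INFERENCE_CHECKPOINT_DIR": None,
    "SAFETY_MODEL": "meta-llama/Llama-Guard-3-1B",
    "SAFETY_CHECKPOINT_DIR": None,
}


def resolve_settings(environ: Mapping[str, str] = os.environ) -> Dict[str, Optional[str]]:
    """Return each documented variable, falling back to its default when unset."""
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}


if __name__ == "__main__":
    for name, value in resolve_settings().items():
        print(f"{name}={value}")
```

Passing the environment as a mapping keeps the function easy to exercise with a plain dictionary instead of the real process environment.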
+
+
+## Prerequisite: Downloading Models
+
+Use `llama model list --downloaded` to check that you have Llama model checkpoints downloaded in `~/.llama` before proceeding. See the [installation guide](../../references/llama_cli_reference/download_models.md) for download instructions. Run `llama model list` to see the models available for download, and `llama model download` to download the checkpoints.
+
+```
+$ llama model list --downloaded
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
+┃ Model                                   ┃ Size     ┃ Modified Time       ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
+│ Llama3.2-1B-Instruct:int4-qlora-eo8     │ 1.53 GB  │ 2025-02-26 11:22:28 │
+├─────────────────────────────────────────┼──────────┼─────────────────────┤
+│ Llama3.2-1B                             │ 2.31 GB  │ 2025-02-18 21:48:52 │
+├─────────────────────────────────────────┼──────────┼─────────────────────┤
+│ Prompt-Guard-86M                        │ 0.02 GB  │ 2025-02-26 11:29:28 │
+├─────────────────────────────────────────┼──────────┼─────────────────────┤
+│ Llama3.2-3B-Instruct:int4-spinquant-eo8 │ 3.69 GB  │ 2025-02-26 11:37:41 │
+├─────────────────────────────────────────┼──────────┼─────────────────────┤
+│ Llama3.2-3B                             │ 5.99 GB  │ 2025-02-18 21:51:26 │
+├─────────────────────────────────────────┼──────────┼─────────────────────┤
+│ Llama3.1-8B                             │ 14.97 GB │ 2025-02-16 10:36:37 │
+├─────────────────────────────────────────┼──────────┼─────────────────────┤
+│ Llama3.2-1B-Instruct:int4-spinquant-eo8 │ 1.51 GB  │ 2025-02-26 11:35:02 │
+├─────────────────────────────────────────┼──────────┼─────────────────────┤
+│ Llama-Guard-3-1B                        │ 2.80 GB  │ 2025-02-26 11:20:46 │
+├─────────────────────────────────────────┼──────────┼─────────────────────┤
+│ Llama-Guard-3-1B:int4                   │ 0.43 GB  │ 2025-02-26 11:33:33 │
+└─────────────────────────────────────────┴──────────┴─────────────────────┘
+```
+
+## Running the Distribution
+
+You can run the distribution via venv or via Docker, which has a pre-built image.
+
+### Via Docker
+
+This method allows you to get started quickly without having to build the distribution code.
+
+```bash
+LLAMA_STACK_PORT=8321
+docker run \
+  -it \
+  --pull always \
+  --gpus all \
+  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
+  -v ~/.llama:/root/.llama \
+  llamastack/distribution-meta-reference-gpu \
+  --port $LLAMA_STACK_PORT \
+  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
+```
+
+If you are using Llama Stack Safety / Shield APIs, use:
+
+```bash
+docker run \
+  -it \
+  --pull always \
+  --gpus all \
+  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
+  -v ~/.llama:/root/.llama \
+  llamastack/distribution-meta-reference-gpu \
+  --port $LLAMA_STACK_PORT \
+  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
+  --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
+```
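When the container is launched from a script instead of an interactive shell, the same invocation can be assembled as an argument list. The sketch below is one possible way to do that; `docker_run_args` is a hypothetical helper, the flag values mirror the commands above, and it uses Docker's `--gpus` spelling of the GPU flag.

```python
import os
import shlex
from typing import List, Optional

IMAGE = "llamastack/distribution-meta-reference-gpu"


def docker_run_args(
    port: int, inference_model: str, safety_model: Optional[str] = None
) -> List[str]:
    """Build the argv for a `docker run` invocation like the ones above."""
    # No shell is involved when argv is passed to subprocess, so expand "~" here.
    checkpoints = os.path.expanduser("~/.llama")
    args = [
        "docker", "run", "-it",
        "--pull", "always",
        "--gpus", "all",
        "-p", f"{port}:{port}",
        "-v", f"{checkpoints}:/root/.llama",
        IMAGE,
        "--port", str(port),
        "--env", f"INFERENCE_MODEL={inference_model}",
    ]
    if safety_model is not None:
        # Only needed when the Safety / Shield APIs are used.
        args += ["--env", f"SAFETY_MODEL={safety_model}"]
    return args


if __name__ == "__main__":
    print(shlex.join(docker_run_args(8321, "meta-llama/Llama-3.2-3B-Instruct")))
```

The resulting list can be handed to `subprocess.run` as-is; printing it with `shlex.join` gives a copy-pasteable command line for comparison with the examples above.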
+
+### Via venv
+
+Make sure you have run `uv pip install llama-stack` and have the Llama Stack CLI available.
+
+```bash
+llama stack build --distro meta-reference-gpu --image-type venv
+llama stack run distributions/meta-reference-gpu/run.yaml \
+  --port 8321 \
+  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
+```
+
+If you are using Llama Stack Safety / Shield APIs, use:
+
+```bash
+llama stack run distributions/meta-reference-gpu/run-with-safety.yaml \
+  --port 8321 \
+  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
+  --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
+```
--- a/docs/docs/distributions/self_hosted_distro/nvidia.md
+++ b/docs/docs/distributions/self_hosted_distro/nvidia.md
@ -0,0 +1,171 @@
+---
+orphan: true
+---
+<!-- This file was auto-generated by distro_codegen.py, please edit source -->
+# NVIDIA Distribution
+
+The `llamastack/distribution-nvidia` distribution consists of the following provider configurations.
+
+| API | Provider(s) |
+|-----|-------------|
+| agents | `inline::meta-reference` |
+| datasetio | `inline::localfs`, `remote::nvidia` |
+| eval | `remote::nvidia` |
+| files | `inline::localfs` |
+| inference | `remote::nvidia` |
+| post_training | `remote::nvidia` |
+| safety | `remote::nvidia` |
+| scoring | `inline::basic` |
+| telemetry | `inline::meta-reference` |
+| tool_runtime | `inline::rag-runtime` |
+| vector_io | `inline::faiss` |
+
+
+### Environment Variables
+
+The following environment variables can be configured:
+
+- `NVIDIA_API_KEY`: NVIDIA API Key (default: ``)
+- `NVIDIA_APPEND_API_VERSION`: Whether to append the API version to the base_url (default: `True`)
+- `NVIDIA_DATASET_NAMESPACE`: NVIDIA Dataset Namespace (default: `default`)
+- `NVIDIA_PROJECT_ID`: NVIDIA Project ID (default: `test-project`)
+- `NVIDIA_CUSTOMIZER_URL`: NVIDIA Customizer URL (default: `https://customizer.api.nvidia.com`)
+- `NVIDIA_OUTPUT_MODEL_DIR`: NVIDIA Output Model Directory (default: `test-example-model@v1`)
+- `GUARDRAILS_SERVICE_URL`: URL for the NeMo Guardrails Service (default: `http://0.0.0.0:7331`)
+- `NVIDIA_GUARDRAILS_CONFIG_ID`: NVIDIA Guardrail Configuration ID (default: `self-check`)
+- `NVIDIA_EVALUATOR_URL`: URL for the NeMo Evaluator Service (default: `http://0.0.0.0:7331`)
+- `INFERENCE_MODEL`: Inference model (default: `Llama3.1-8B-Instruct`)
+- `SAFETY_MODEL`: Name of the model to use for safety (default: `meta/llama-3.1-8b-instruct`)
+
+### Models
+
+The following models are available by default:
+
+- `meta/llama3-8b-instruct`
+- `meta/llama3-70b-instruct`
+- `meta/llama-3.1-8b-instruct`
+- `meta/llama-3.1-70b-instruct`
+- `meta/llama-3.1-405b-instruct`
+- `meta/llama-3.2-1b-instruct`
+- `meta/llama-3.2-3b-instruct`
+- `meta/llama-3.2-11b-vision-instruct`
+- `meta/llama-3.2-90b-vision-instruct`
+- `meta/llama-3.3-70b-instruct`
+- `nvidia/vila`
+- `nvidia/llama-3.2-nv-embedqa-1b-v2`
+- `nvidia/nv-embedqa-e5-v5`
+- `nvidia/nv-embedqa-mistral-7b-v2`
+- `snowflake/arctic-embed-l`
+
+
+## Prerequisites
+### NVIDIA API Keys
+
+Make sure you have access to an NVIDIA API key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). Use this key for the `NVIDIA_API_KEY` environment variable.
+
+### Deploy NeMo Microservices Platform
+The NVIDIA NeMo microservices platform supports end-to-end microservice deployment of a complete AI flywheel on your Kubernetes cluster through the NeMo Microservices Helm Chart. Please reference the [NVIDIA NeMo Microservices documentation](https://docs.nvidia.com/nemo/microservices/latest/about/index.html) for platform prerequisites and instructions to install and deploy the platform.
+
+## Supported Services
+Each Llama Stack API corresponds to a specific NeMo microservice. The core microservices (Customizer, Evaluator, Guardrails) are exposed by the same endpoint. The platform components (Data Store) are each exposed by separate endpoints.
+
+### Inference: NVIDIA NIM
+NVIDIA NIM is used for running inference with registered models. There are two ways to access NVIDIA NIMs:
+  1. Hosted (default): Preview APIs hosted at https://integrate.api.nvidia.com (Requires an API key)
+  2. Self-hosted: NVIDIA NIMs that run on your own infrastructure.
+
+The deployed platform includes the NIM Proxy microservice, which provides access to your NIMs (for example, to run inference on a model). Set the `NVIDIA_BASE_URL` environment variable to use your NVIDIA NIM Proxy deployment.
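The choice between the hosted and self-hosted endpoints can be pictured as a simple fallback. The following is a sketch only, not the provider's actual logic: `resolve_nim_base_url` is a hypothetical helper, and it assumes that an unset or empty `NVIDIA_BASE_URL` means the hosted endpoint is used.

```python
import os
from typing import Mapping

# Hosted preview endpoint named in the text above.
HOSTED_NIM_URL = "https://integrate.api.nvidia.com"


def resolve_nim_base_url(environ: Mapping[str, str] = os.environ) -> str:
    """Prefer a self-hosted NIM Proxy URL; otherwise use the hosted endpoint."""
    return environ.get("NVIDIA_BASE_URL") or HOSTED_NIM_URL


if __name__ == "__main__":
    print(resolve_nim_base_url())
```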
+
+### Datasetio API: NeMo Data Store
+The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposes APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use the client to interact with Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint.
+
+See the [NVIDIA Datasetio docs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/datasetio/nvidia/README.md) for supported features and example usage.
+
+### Eval API: NeMo Evaluator
+The NeMo Evaluator microservice supports evaluation of LLMs. Launching an Evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint.
+
+See the [NVIDIA Eval docs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/eval/nvidia/README.md) for supported features and example usage.
+
+### Post-Training API: NeMo Customizer
+The NeMo Customizer microservice supports fine-tuning models. You can reference [this list of supported models](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/post_training/nvidia/models.py) that can be fine-tuned using Llama Stack. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint.
+
+See the [NVIDIA Post-Training docs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/post_training/nvidia/README.md) for supported features and example usage.
+
+### Safety API: NeMo Guardrails
+The NeMo Guardrails microservice sits between your application and the LLM, and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint.
+
+See the [NVIDIA Safety docs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/safety/nvidia/README.md) for supported features and example usage.
+
+## Deploying models
+In order to use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`.
+
+Note: For improved inference speeds, use NIM with the `fast_outlines` guided decoding system (specified in the request body). This is the default if you deployed the platform with the NeMo Microservices Helm Chart.
+```sh
+# URL to NeMo NIM Proxy service
+export NEMO_URL="http://nemo.test"
+
+curl --location "$NEMO_URL/v1/deployment/model-deployments" \
+   -H 'accept: application/json' \
+   -H 'Content-Type: application/json' \
+   -d '{
+      "name": "llama-3.2-1b-instruct",
+      "namespace": "meta",
+      "config": {
+         "model": "meta/llama-3.2-1b-instruct",
+         "nim_deployment": {
+            "image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct",
+            "image_tag": "1.8.3",
+            "pvc_size": "25Gi",
+            "gpu": 1,
+            "additional_envs": {
+               "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
+            }
+         }
+      }
+   }'
+```
+This NIM deployment should take approximately 10 minutes to go live. [See the docs](https://docs.nvidia.com/nemo/microservices/latest/get-started/tutorials/deploy-nims.html) for more information on how to deploy a NIM and verify it's available for inference.
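The deployment call can also be prepared from Python using only the standard library. This is a minimal sketch that mirrors the curl example: `build_deployment_request` is a hypothetical helper, and deriving the model and image names from `namespace` and `name` is an assumption generalized from the single example above. The request is built but not sent.

```python
import json
import urllib.request


def build_deployment_request(
    nemo_url: str, namespace: str, name: str, image_tag: str = "1.8.3"
) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example above."""
    payload = {
        "name": name,
        "namespace": namespace,
        "config": {
            "model": f"{namespace}/{name}",
            "nim_deployment": {
                # Assumed naming pattern, taken from the one documented example.
                "image_name": f"nvcr.io/nim/{namespace}/{name}",
                "image_tag": image_tag,
                "pvc_size": "25Gi",
                "gpu": 1,
                "additional_envs": {"NIM_GUIDED_DECODING_BACKEND": "fast_outlines"},
            },
        },
    }
    return urllib.request.Request(
        url=f"{nemo_url.rstrip('/')}/v1/deployment/model-deployments",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_deployment_request("http://nemo.test", "meta", "llama-3.2-1b-instruct")
    print(req.get_method(), req.full_url)
    # Sending is left to the caller, for example urllib.request.urlopen(req).
```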
+
+You can also remove a deployed NIM to free up GPU resources, if needed.
+```sh
+export NEMO_URL="http://nemo.test"
+
+curl -X DELETE "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct"
+```
+
+## Running Llama Stack with NVIDIA
+
+You can do this via venv (building the code yourself) or via Docker, which has a pre-built image.
+
+### Via Docker
+
+This method allows you to get started quickly without having to build the distribution code.
+
+```bash
+LLAMA_STACK_PORT=8321
+docker run \
+  -it \
+  --pull always \
+  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
+  -v ./run.yaml:/root/my-run.yaml \
+  llamastack/distribution-nvidia \
+  --config /root/my-run.yaml \
+  --port $LLAMA_STACK_PORT \
+  --env NVIDIA_API_KEY=$NVIDIA_API_KEY
+```
+
+### Via venv
+
+If you've set up your local development environment, you can also build the image using your local virtual environment.
+
+```bash
+INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
+llama stack build --distro nvidia --image-type venv
+llama stack run ./run.yaml \
+  --port 8321 \
+  --env NVIDIA_API_KEY=$NVIDIA_API_KEY \
+  --env INFERENCE_MODEL=$INFERENCE_MODEL
+```
+
+## Example Notebooks
+For examples of how to use the NVIDIA Distribution to run inference, fine-tune, evaluate, and run safety checks on your LLMs, you can reference the example notebooks in [docs/notebooks/nvidia](https://github.com/meta-llama/llama-stack/tree/main/docs/notebooks/nvidia).
--- a/docs/docs/providers/agents/index.mdx
+++ b/docs/docs/providers/agents/index.mdx
@ -0,0 +1,31 @@
+---
+description: "Agents API for creating and interacting with agentic systems.
+
+    Main functionalities provided by this API:
+    - Create agents with specific instructions and ability to use tools.
+    - Interactions with agents are grouped into sessions (\"threads\"), and each interaction is called a \"turn\".
+    - Agents can be provided with various tools (see the ToolGroups and ToolRuntime APIs for more details).
+    - Agents can be provided with various shields (see the Safety API for more details).
+    - Agents can also use Memory to retrieve information from knowledge bases. See the RAG Tool and Vector IO APIs for more details."
+sidebar_label: Agents
+title: Agents
+---
+
+# Agents
+
+## Overview
+
+Agents API for creating and interacting with agentic systems.
+
+Main functionalities provided by this API:
+- Create agents with specific instructions and ability to use tools.
+- Interactions with agents are grouped into sessions ("threads"), and each interaction is called a "turn".
+- Agents can be provided with various tools (see the ToolGroups and ToolRuntime APIs for more details).
+- Agents can be provided with various shields (see the Safety API for more details).
+- Agents can also use Memory to retrieve information from knowledge bases. See the RAG Tool and Vector IO APIs for more details.
+
+This section contains documentation for all available providers for the **agents** API.
+
+## Providers
+
+- [Meta-Reference](./inline_meta-reference)
--- a/docs/docs/providers/agents/inline_meta-reference.mdx
+++ b/docs/docs/providers/agents/inline_meta-reference.mdx
@ -0,0 +1,29 @@
+---
+description: "Meta's reference implementation of an agent system that can use tools, access vector databases, and perform complex reasoning tasks."
+sidebar_label: Meta-Reference
+title: inline::meta-reference
+---
+
+# inline::meta-reference
+
+## Description
+
+Meta's reference implementation of an agent system that can use tools, access vector databases, and perform complex reasoning tasks.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `persistence_store` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite |  |
+| `responses_store` | `utils.sqlstore.sqlstore.SqliteSqlStoreConfig \| utils.sqlstore.sqlstore.PostgresSqlStoreConfig` | No | sqlite |  |
+
+## Sample Configuration
+
+```yaml
+persistence_store:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/agents_store.db
+responses_store:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/responses_store.db
+```
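The sample configuration relies on `${env.NAME:=default}` placeholders. To make the substitution concrete, here is a rough sketch of how such placeholders could be expanded. It is not Llama Stack's actual resolver: `expand_env_defaults` is a hypothetical function, it handles only the `:=` form shown above, and it assumes the default applies only when the variable is unset.

```python
import re
from typing import Mapping

# Matches ${env.NAME:=default}; the default part may be empty.
_PLACEHOLDER = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*):=([^}]*)\}")


def expand_env_defaults(text: str, environ: Mapping[str, str]) -> str:
    """Replace each placeholder with the variable's value, or its default if unset."""

    def _substitute(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        return environ.get(name, default)

    return _PLACEHOLDER.sub(_substitute, text)


if __name__ == "__main__":
    line = "db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/agents_store.db"
    print(expand_env_defaults(line, {}))
    print(expand_env_defaults(line, {"SQLITE_STORE_DIR": "/data"}))
```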
--- a/docs/docs/providers/batches/index.mdx
+++ b/docs/docs/providers/batches/index.mdx
@ -0,0 +1,35 @@
+---
+description: "The Batches API enables efficient processing of multiple requests in a single operation,
+    particularly useful for processing large datasets, batch evaluation workflows, and
+    cost-effective inference at scale.
+
+    The API is designed to allow use of openai client libraries for seamless integration.
+
+    This API provides the following extensions:
+     - idempotent batch creation
+
+    Note: This API is currently under active development and may undergo changes."
+sidebar_label: Batches
+title: Batches
+---
+
+# Batches
+
+## Overview
+
+The Batches API enables efficient processing of multiple requests in a single operation,
+particularly useful for processing large datasets, batch evaluation workflows, and
+cost-effective inference at scale.
+
+The API is designed to allow use of OpenAI client libraries for seamless integration.
+
+This API provides the following extensions:
+- idempotent batch creation
+
+Note: This API is currently under active development and may undergo changes.
+
+This section contains documentation for all available providers for the **batches** API.
+
+## Providers
+
+- [Reference](./inline_reference)
--- a/docs/docs/providers/batches/inline_reference.mdx
+++ b/docs/docs/providers/batches/inline_reference.mdx
@ -0,0 +1,27 @@
+---
+description: "Reference implementation of batches API with KVStore persistence."
+sidebar_label: Reference
+title: inline::reference
+---
+
+# inline::reference
+
+## Description
+
+Reference implementation of batches API with KVStore persistence.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Configuration for the key-value store backend. |
+| `max_concurrent_batches` | `<class 'int'>` | No | 1 | Maximum number of concurrent batches to process simultaneously. |
+| `max_concurrent_requests_per_batch` | `<class 'int'>` | No | 10 | Maximum number of concurrent requests to process per batch. |
+
+## Sample Configuration
+
+```yaml
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/batches.db
+```
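The `max_concurrent_requests_per_batch` setting describes a cap on in-flight requests. A common way to enforce such a cap in asynchronous Python is a semaphore, sketched below. This illustrates the general technique under that assumption and is not taken from the reference implementation; `process_batch` and `handle_request` are hypothetical names, and the sleep stands in for a real inference call.

```python
import asyncio

MAX_CONCURRENT_REQUESTS_PER_BATCH = 10  # default from the table above


async def process_batch(
    num_requests: int, limit: int = MAX_CONCURRENT_REQUESTS_PER_BATCH
) -> int:
    """Run `num_requests` dummy requests, never more than `limit` at once.

    Returns the peak number of requests that were in flight together.
    """
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def handle_request(index: int) -> None:
        nonlocal in_flight, peak
        async with semaphore:  # blocks while `limit` requests are running
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for a real inference call
            in_flight -= 1

    await asyncio.gather(*(handle_request(i) for i in range(num_requests)))
    return peak


if __name__ == "__main__":
    print(asyncio.run(process_batch(25)))
```

The returned peak never exceeds `limit`, which is the property the configuration field is meant to guarantee.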
--- a/docs/docs/providers/datasetio/index.md
+++ b/docs/docs/providers/datasetio/index.md
@ -0,0 +1,16 @@
+---
+sidebar_label: Datasetio
+title: Datasetio
+---
+
+# Datasetio
+
+## Overview
+
+This section contains documentation for all available providers for the **datasetio** API.
+
+## Providers
+
+- [Localfs](./inline_localfs)
+- [Remote - Huggingface](./remote_huggingface)
+- [Remote - Nvidia](./remote_nvidia)
--- a/docs/docs/providers/datasetio/index.mdx
+++ b/docs/docs/providers/datasetio/index.mdx
@ -0,0 +1,16 @@
+---
+sidebar_label: Datasetio
+title: Datasetio
+---
+
+# Datasetio
+
+## Overview
+
+This section contains documentation for all available providers for the **datasetio** API.
+
+## Providers
+
+- [Localfs](./inline_localfs)
+- [Remote - Huggingface](./remote_huggingface)
+- [Remote - Nvidia](./remote_nvidia)
--- a/docs/docs/providers/datasetio/inline_localfs.mdx
+++ b/docs/docs/providers/datasetio/inline_localfs.mdx
@ -0,0 +1,25 @@
+---
+description: "Local filesystem-based dataset I/O provider for reading and writing datasets to local storage."
+sidebar_label: Localfs
+title: inline::localfs
+---
+
+# inline::localfs
+
+## Description
+
+Local filesystem-based dataset I/O provider for reading and writing datasets to local storage.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite |  |
+
+## Sample Configuration
+
+```yaml
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/localfs_datasetio.db
+```
--- a/docs/docs/providers/datasetio/remote_huggingface.mdx
+++ b/docs/docs/providers/datasetio/remote_huggingface.mdx
@ -0,0 +1,25 @@
+---
+description: "HuggingFace datasets provider for accessing and managing datasets from the HuggingFace Hub."
+sidebar_label: Remote - Huggingface
+title: remote::huggingface
+---
+
+# remote::huggingface
+
+## Description
+
+HuggingFace datasets provider for accessing and managing datasets from the HuggingFace Hub.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite |  |
+
+## Sample Configuration
+
+```yaml
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/huggingface_datasetio.db
+```
--- a/docs/docs/providers/datasetio/remote_nvidia.mdx
+++ b/docs/docs/providers/datasetio/remote_nvidia.mdx
@ -0,0 +1,29 @@
+---
+description: "NVIDIA's dataset I/O provider for accessing datasets from NVIDIA's data platform."
+sidebar_label: Remote - Nvidia
+title: remote::nvidia
+---
+
+# remote::nvidia
+
+## Description
+
+NVIDIA's dataset I/O provider for accessing datasets from NVIDIA's data platform.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `api_key` | `str \| None` | No |  | The NVIDIA API key. |
+| `dataset_namespace` | `str \| None` | No | default | The NVIDIA dataset namespace. |
+| `project_id` | `str \| None` | No | test-project | The NVIDIA project ID. |
+| `datasets_url` | `<class 'str'>` | No | http://nemo.test | Base URL for the NeMo Dataset API |
+
+## Sample Configuration
+
+```yaml
+api_key: ${env.NVIDIA_API_KEY:=}
+dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default}
+project_id: ${env.NVIDIA_PROJECT_ID:=test-project}
+datasets_url: ${env.NVIDIA_DATASETS_URL:=http://nemo.test}
+```
--- a/docs/docs/providers/eval/index.mdx
+++ b/docs/docs/providers/eval/index.mdx
@ -0,0 +1,18 @@
+---
+description: "Llama Stack Evaluation API for running evaluations on model and agent candidates."
+sidebar_label: Eval
+title: Eval
+---
+
+# Eval
+
+## Overview
+
+Llama Stack Evaluation API for running evaluations on model and agent candidates.
+
+This section contains documentation for all available providers for the **eval** API.
+
+## Providers
+
+- [Meta-Reference](./inline_meta-reference)
+- [Remote - Nvidia](./remote_nvidia)
--- a/docs/docs/providers/eval/inline_meta-reference.mdx
+++ b/docs/docs/providers/eval/inline_meta-reference.mdx
@ -0,0 +1,25 @@
+---
+description: "Meta's reference implementation of evaluation tasks with support for multiple languages and evaluation metrics."
+sidebar_label: Meta-Reference
+title: inline::meta-reference
+---
+
+# inline::meta-reference
+
+## Description
+
+Meta's reference implementation of evaluation tasks with support for multiple languages and evaluation metrics.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite |  |
+
+## Sample Configuration
+
+```yaml
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/meta_reference_eval.db
+```
--- a/docs/docs/providers/eval/remote_nvidia.mdx
+++ b/docs/docs/providers/eval/remote_nvidia.mdx
@ -0,0 +1,23 @@
+---
+description: "NVIDIA's evaluation provider for running evaluation tasks on NVIDIA's platform."
+sidebar_label: Remote - Nvidia
+title: remote::nvidia
+---
+
+# remote::nvidia
+
+## Description
+
+NVIDIA's evaluation provider for running evaluation tasks on NVIDIA's platform.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `evaluator_url` | `<class 'str'>` | No | http://0.0.0.0:7331 | The url for accessing the evaluator service |
+
+## Sample Configuration
+
+```yaml
+evaluator_url: ${env.NVIDIA_EVALUATOR_URL:=http://localhost:7331}
+```
--- a/docs/docs/providers/external/external-providers-guide.mdx
+++ b/docs/docs/providers/external/external-providers-guide.mdx
@ -0,0 +1,286 @@
+# Creating External Providers
+
+## Configuration
+
+To enable external providers, you need to add `module` into your build yaml, allowing Llama Stack to install the required package corresponding to the external provider.
+
+An example entry in your build.yaml should look like:
+
+```
+- provider_type: remote::ramalama
+  module: ramalama_stack
+```
+
+Additionally you can configure the `external_providers_dir` in your Llama Stack configuration. This method is in the process of being deprecated in favor of the `module` method. If using this method, the external provider directory should contain your external provider specifications:
+
+```yaml
+external_providers_dir: ~/.llama/providers.d/
+```
+
+## Directory Structure
+
+The external providers directory should follow this structure:
+
+```
+providers.d/
+  remote/
+    inference/
+      custom_ollama.yaml
+      vllm.yaml
+    vector_io/
+      qdrant.yaml
+    safety/
+      llama-guard.yaml
+  inline/
+    inference/
+      custom_ollama.yaml
+      vllm.yaml
+    vector_io/
+      qdrant.yaml
+    safety/
+      llama-guard.yaml
+```
+
+Each YAML file in these directories defines a provider specification for that particular API.
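To show how this layout maps onto provider kind and API, the sketch below walks a `providers.d`-style directory with `pathlib`. It is an illustration of the convention only, not the loader Llama Stack uses; `iter_provider_specs` is a hypothetical helper.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_provider_specs(root: Path) -> Iterator[Tuple[str, str, Path]]:
    """Yield (provider_kind, api, spec_path) for each YAML file under `root`.

    Expects the layout <root>/<remote|inline>/<api>/<name>.yaml.
    """
    for kind in ("remote", "inline"):
        # One directory level for the API, then the spec files themselves.
        for spec_path in sorted((root / kind).glob("*/*.yaml")):
            yield kind, spec_path.parent.name, spec_path


if __name__ == "__main__":
    for kind, api, path in iter_provider_specs(Path("~/.llama/providers.d").expanduser()):
        print(f"{kind}::{path.stem} implements {api}")
```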
+
+## Provider Types
+
+Llama Stack supports two types of external providers:
+
+1. **Remote Providers**: Providers that communicate with external services (e.g., cloud APIs)
+2. **Inline Providers**: Providers that run locally within the Llama Stack process
+
+### Remote Provider Specification
+
+Remote providers are used when you need to communicate with external services. Here's an example for a custom Ollama provider:
+
+```yaml
+adapter:
+  adapter_type: custom_ollama
+  pip_packages:
+  - ollama
+  - aiohttp
+  config_class: llama_stack_ollama_provider.config.OllamaImplConfig
+  module: llama_stack_ollama_provider
+api_dependencies: []
+optional_api_dependencies: []
+```
+
+#### Adapter Configuration
+
+The `adapter` section defines how to load and configure the provider:
+
+- `adapter_type`: A unique identifier for this adapter
+- `pip_packages`: List of Python packages required by the provider
+- `config_class`: The full path to the configuration class
+- `module`: The Python module containing the provider implementation
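A quick sanity check of a parsed remote specification against the four fields above might look like the following. This is a sketch under assumptions: `missing_adapter_fields` is a hypothetical helper, the spec is assumed to be already parsed into a dictionary, and treating all four fields as mandatory is an assumption rather than something the guide states.

```python
from typing import Any, List, Mapping

# The four `adapter` fields described above.
ADAPTER_FIELDS = ("adapter_type", "pip_packages", "config_class", "module")


def missing_adapter_fields(spec: Mapping[str, Any]) -> List[str]:
    """Return the `adapter` fields that are absent from a parsed spec."""
    adapter = spec.get("adapter") or {}
    return [field for field in ADAPTER_FIELDS if field not in adapter]


if __name__ == "__main__":
    spec = {
        "adapter": {
            "adapter_type": "custom_ollama",
            "pip_packages": ["ollama", "aiohttp"],
            "config_class": "llama_stack_ollama_provider.config.OllamaImplConfig",
            "module": "llama_stack_ollama_provider",
        },
        "api_dependencies": [],
        "optional_api_dependencies": [],
    }
    print(missing_adapter_fields(spec))
```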
+
+### Inline Provider Specification
+
+Inline providers run locally within the Llama Stack process. Here's an example for a custom vector store provider:
+
+```yaml
+module: llama_stack_vector_provider
+config_class: llama_stack_vector_provider.config.VectorStoreConfig
+pip_packages:
+  - faiss-cpu
+  - numpy
+api_dependencies:
+  - inference
+optional_api_dependencies:
+  - vector_io
+provider_data_validator: llama_stack_vector_provider.validator.VectorStoreValidator
+container_image: custom-vector-store:latest  # optional
+```
+
+#### Inline Provider Fields
+
+- `module`: The Python module containing the provider implementation
+- `config_class`: The full path to the configuration class
+- `pip_packages`: List of Python packages required by the provider
+- `api_dependencies`: List of Llama Stack APIs that this provider depends on
+- `optional_api_dependencies`: List of optional Llama Stack APIs that this provider can use
+- `provider_data_validator`: Optional validator for provider data
+- `container_image`: Optional container image to use instead of pip packages
+
+## Required Fields
+
+### All Providers
+
+All providers must define a `get_provider_spec` function in their `provider` module. Llama Stack expects this standardized structure and uses it to obtain details such as the config class. The function returns a structure equivalent to the `adapter` section of the YAML specification. An example function may look like:
+
+```python
+from llama_stack.providers.datatypes import (
+    ProviderSpec,
+    Api,
+    AdapterSpec,
+    remote_provider_spec,
+)
+
+
+def get_provider_spec() -> ProviderSpec:
+    return remote_provider_spec(
+        api=Api.inference,
+        adapter=AdapterSpec(
+            adapter_type="ramalama",
+            pip_packages=["ramalama>=0.8.5", "pymilvus"],
+            config_class="ramalama_stack.config.RamalamaImplConfig",
+            module="ramalama_stack",
+        ),
+    )
+```
+
+#### Remote Providers
+
+Remote providers must expose a `get_adapter_impl()` function in their module that takes two arguments:
+1. `config`: An instance of the provider's config class
+2. `deps`: A dictionary of API dependencies
+
+This function must return an instance of the provider's adapter class that implements the required protocol for the API.
+
+Example:
+```python
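+from typing import Any, Dict
+
+
+# `deps` maps each declared API dependency to its implementation.
+async def get_adapter_impl(
+    config: OllamaImplConfig, deps: Dict[Api, Any]
+) -> OllamaInferenceAdapter:
+    return OllamaInferenceAdapter(config)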
+```
+
+#### Inline Providers
+
+Inline providers must expose a `get_provider_impl()` function in their module that takes two arguments:
+1. `config`: An instance of the provider's config class
+2. `deps`: A dictionary of API dependencies
+
+Example:
+```python
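+from typing import Any, Dict
+
+
+async def get_provider_impl(
+    config: VectorStoreConfig, deps: Dict[Api, Any]
+) -> VectorStoreImpl:
+    # Pass the inference API dependency to the implementation.
+    impl = VectorStoreImpl(config, deps[Api.inference])
+    # Finish asynchronous setup before returning the provider.
+    await impl.initialize()
+    return impl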
+```
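The inline contract can be tried out without Llama Stack installed by substituting stand-ins for the real types. Everything in the sketch below except the shape of `get_provider_impl` is hypothetical: `Api` is a stand-in enum, and `VectorStoreConfig` and `VectorStoreImpl` are placeholder classes with invented fields.

```python
import asyncio
from enum import Enum
from typing import Any, Dict


class Api(Enum):  # stand-in for Llama Stack's Api enum
    inference = "inference"


class VectorStoreConfig:  # hypothetical config class
    def __init__(self, dimension: int = 384) -> None:
        self.dimension = dimension


class VectorStoreImpl:  # hypothetical provider implementation
    def __init__(self, config: VectorStoreConfig, inference_api: Any) -> None:
        self.config = config
        self.inference_api = inference_api
        self.initialized = False

    async def initialize(self) -> None:
        # A real provider would open its stores or connections here.
        self.initialized = True


async def get_provider_impl(
    config: VectorStoreConfig, deps: Dict[Api, Any]
) -> VectorStoreImpl:
    impl = VectorStoreImpl(config, deps[Api.inference])
    await impl.initialize()
    return impl


if __name__ == "__main__":
    deps = {Api.inference: object()}  # placeholder for an inference implementation
    impl = asyncio.run(get_provider_impl(VectorStoreConfig(), deps))
    print(impl.initialized)
```

In this sketch a dependency that is missing from `deps` surfaces as a `KeyError`, which is one reason to list every API the provider needs under `api_dependencies`.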
+
+## Dependencies
+
+The provider package must be installed on the system. For example:
+
+```bash
+$ uv pip show llama-stack-ollama-provider
+Name: llama-stack-ollama-provider
+Version: 0.1.0
+Location: /path/to/venv/lib/python3.10/site-packages
+```
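The same check can be made from Python with the standard library's `importlib.metadata`. A minimal sketch, with `installed_version` as a hypothetical helper name:

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    name = "llama-stack-ollama-provider"
    version = installed_version(name)
    print(f"{name}: {version if version is not None else 'not installed'}")
```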
+
+## Best Practices
+
+1. **Package Naming**: Use the prefix `llama-stack-provider-` for your provider packages to make them easily identifiable.
+
+2. **Version Management**: Keep your provider package versioned and compatible with the Llama Stack version you're using.
+
+3. **Dependencies**: Only include the minimum required dependencies in your provider package.
+
+4. **Documentation**: Include clear documentation in your provider package about:
+   - Installation requirements
+   - Configuration options
+   - Usage examples
+   - Any limitations or known issues
+
+5. **Testing**: Include tests in your provider package to ensure it works correctly with Llama Stack.
+   You can refer to the [integration tests
+   guide](https://github.com/meta-llama/llama-stack/blob/main/tests/integration/README.md) for more
+   information. Run the tests for the provider type you are developing.
+
+## Troubleshooting
+
+If your external provider isn't being loaded:
+
+1. If using `module`, check that it points to a published pip package with a top-level `provider` module that includes `get_provider_spec`.
+2. If using `external_providers_dir`, check that the path is correct and accessible.
+3. Verify that the YAML files are properly formatted.
+4. Ensure all required Python packages are installed.
+5. Check the Llama Stack server logs for any error messages. Turn on debug logging to get more
+   information using `LLAMA_STACK_LOGGING=all=debug`.
+6. If using `external_providers_dir`, verify that the provider package is installed in your Python environment.
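+
+The first check in the list above can be scripted. This is a rough sketch using only the standard library; the helper name `check_provider_module` is hypothetical, and it assumes the bare importable module name (without any `==version` suffix) is passed in.
+
+```python
+import importlib
+
+
+def check_provider_module(module_name: str) -> list[str]:
+    """Return a list of problems; an empty list means the basic checks passed."""
+    try:
+        provider = importlib.import_module(f"{module_name}.provider")
+    except ImportError as exc:
+        # Covers both a missing package and a missing `provider` submodule.
+        return [f"cannot import {module_name}.provider: {exc}"]
+    problems = []
+    if not callable(getattr(provider, "get_provider_spec", None)):
+        problems.append(f"{module_name}.provider has no callable get_provider_spec")
+    return problems
+
+
+if __name__ == "__main__":
+    # "ramalama_stack" is the module name used in the examples on this page.
+    for problem in check_provider_module("ramalama_stack"):
+        print(problem)
+```
+
+An empty result only shows that the module is importable and exposes the expected function; it does not validate the spec that the function returns.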
+
+## Examples
+
+### Example using `external_providers_dir`: Custom Ollama Provider
+
+Here's a complete example of creating and using a custom Ollama provider:
+
+1. First, create the provider package:
+
+```bash
+mkdir -p llama-stack-provider-ollama
+cd llama-stack-provider-ollama
+git init
+uv init
+```
+
+2. Edit `pyproject.toml`:
+
+```toml
+[project]
+name = "llama-stack-provider-ollama"
+version = "0.1.0"
+description = "Ollama provider for Llama Stack"
+requires-python = ">=3.12"
+dependencies = ["llama-stack", "pydantic", "ollama", "aiohttp"]
+```
+
+3. Create the provider specification:
+
+```yaml
+# ~/.llama/providers.d/remote/inference/custom_ollama.yaml
+adapter:
+  adapter_type: custom_ollama
+  pip_packages: ["ollama", "aiohttp"]
+  config_class: llama_stack_provider_ollama.config.OllamaImplConfig
+  module: llama_stack_provider_ollama
+api_dependencies: []
+optional_api_dependencies: []
+```
+
+4. Install the provider:
+
+```bash
+uv pip install -e .
+```
+
+5. Configure Llama Stack to use external providers:
+
+```yaml
+external_providers_dir: ~/.llama/providers.d/
+```
+
+The provider will now be available in Llama Stack with the type `remote::custom_ollama`.
+
+
+### Example using `module`: ramalama-stack
+
+[ramalama-stack](https://github.com/containers/ramalama-stack) is a recognized external provider that supports installation via module.
+
+To install Llama Stack with this external provider, a user can provide the following `build.yaml`:
+
+```yaml
+version: 2
+distribution_spec:
+  description: Use (an external) Ramalama server for running LLM inference
+  container_image: null
+  providers:
+    inference:
+    - provider_type: remote::ramalama
+      module: ramalama_stack==0.3.0a0
+image_type: venv
+image_name: null
+external_providers_dir: null
+additional_pip_packages:
+- aiosqlite
+- sqlalchemy[asyncio]
+```
+
+No steps are required beyond `llama stack build` and `llama stack run`. The build process uses `module` to install the provider's dependencies, retrieve the spec, and so on.
+
+The provider will now be available in Llama Stack with the type `remote::ramalama`.
--- a/docs/docs/providers/external/external-providers-list.mdx
+++ b/docs/docs/providers/external/external-providers-list.mdx
@ -0,0 +1,11 @@
+# Known External Providers
+
+Here's a list of known external providers that you can use with Llama Stack:
+
+| Name | Description | API | Type | Repository |
+|------|-------------|-----|------|------------|
+| KubeFlow Training | Train models with KubeFlow | Post Training | Remote | [llama-stack-provider-kft](https://github.com/opendatahub-io/llama-stack-provider-kft) |
+| KubeFlow Pipelines | Train models with KubeFlow Pipelines | Post Training | Inline **and** Remote | [llama-stack-provider-kfp-trainer](https://github.com/opendatahub-io/llama-stack-provider-kfp-trainer) |
+| RamaLama | Inference models with RamaLama | Inference | Remote | [ramalama-stack](https://github.com/containers/ramalama-stack) |
+| TrustyAI LM-Eval | Evaluate models with TrustyAI LM-Eval | Eval | Remote | [llama-stack-provider-lmeval](https://github.com/trustyai-explainability/llama-stack-provider-lmeval) |
+| MongoDB | VectorIO with MongoDB | Vector_IO | Remote | [mongodb-llama-stack](https://github.com/mongodb-partners/mongodb-llama-stack) |
--- a/docs/docs/providers/external/index.mdx
+++ b/docs/docs/providers/external/index.mdx
@ -0,0 +1,11 @@
+# External Providers
+
+Llama Stack supports external providers that live outside of the main codebase. This allows you to:
+- Create and maintain your own providers independently
+- Share providers with others without contributing to the main codebase
+- Keep provider-specific code separate from the core Llama Stack code
+
+## External Provider Documentation
+
+- [Known External Providers](external-providers-list)
+- [Creating External Providers](external-providers-guide)
--- a/docs/docs/providers/files/index.mdx
+++ b/docs/docs/providers/files/index.mdx
@ -0,0 +1,15 @@
+---
+sidebar_label: Files
+title: Files
+---
+
+# Files
+
+## Overview
+
+This section contains documentation for all available providers for the **files** API.
+
+## Providers
+
+- [Localfs](./inline_localfs)
+- [Remote - S3](./remote_s3)
--- a/docs/docs/providers/files/inline_localfs.mdx
+++ b/docs/docs/providers/files/inline_localfs.mdx
@ -0,0 +1,28 @@
+---
+description: "Local filesystem-based file storage provider for managing files and documents locally."
+sidebar_label: Localfs
+title: inline::localfs
+---
+
+# inline::localfs
+
+## Description
+
+Local filesystem-based file storage provider for managing files and documents locally.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `storage_dir` | `<class 'str'>` | No |  | Directory to store uploaded files |
+| `metadata_store` | `utils.sqlstore.sqlstore.SqliteSqlStoreConfig \| utils.sqlstore.sqlstore.PostgresSqlStoreConfig` | No | sqlite | SQL store configuration for file metadata |
+| `ttl_secs` | `<class 'int'>` | No | 31536000 |  |
+
+## Sample Configuration
+
+```yaml
+storage_dir: ${env.FILES_STORAGE_DIR:=~/.llama/dummy/files}
+metadata_store:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/files_metadata.db
+```
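+
+The sample configuration uses `${env.NAME:=default}` placeholders. As a rough illustration only, the sketch below resolves them under the assumption of shell-like semantics (use the environment value if the variable is set, otherwise the default). The function name `resolve_placeholders` is hypothetical, and the actual Llama Stack resolver may behave differently, for example with empty values.
+
+```python
+import os
+import re
+
+# Matches ${env.NAME} and ${env.NAME:=default}; the default may be empty.
+_PLACEHOLDER = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*)(?::=([^}]*))?\}")
+
+
+def resolve_placeholders(text: str, environ=None) -> str:
+    """Substitute placeholders from `environ` (defaults to os.environ)."""
+    env = os.environ if environ is None else environ
+
+    def substitute(match: re.Match) -> str:
+        name, default = match.group(1), match.group(2)
+        if name in env:
+            return env[name]
+        if default is None:
+            # Assumption: a placeholder without a default is mandatory.
+            raise KeyError(f"environment variable {name} is not set and has no default")
+        return default
+
+    return _PLACEHOLDER.sub(substitute, text)
+
+
+# With no variables set, the default directory is used.
+print(resolve_placeholders("${env.SQLITE_STORE_DIR:=~/.llama/dummy}/files_metadata.db", environ={}))
+# prints "~/.llama/dummy/files_metadata.db"
+```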
--- a/docs/docs/providers/files/remote_s3.mdx
+++ b/docs/docs/providers/files/remote_s3.mdx
@ -0,0 +1,37 @@
+---
+description: "AWS S3-based file storage provider for scalable cloud file management with metadata persistence."
+sidebar_label: Remote - S3
+title: remote::s3
+---
+
+# remote::s3
+
+## Description
+
+AWS S3-based file storage provider for scalable cloud file management with metadata persistence.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `bucket_name` | `<class 'str'>` | No |  | S3 bucket name to store files |
+| `region` | `<class 'str'>` | No | us-east-1 | AWS region where the bucket is located |
+| `aws_access_key_id` | `str \| None` | No |  | AWS access key ID (optional if using IAM roles) |
+| `aws_secret_access_key` | `str \| None` | No |  | AWS secret access key (optional if using IAM roles) |
+| `endpoint_url` | `str \| None` | No |  | Custom S3 endpoint URL (for MinIO, LocalStack, etc.) |
+| `auto_create_bucket` | `<class 'bool'>` | No | False | Automatically create the S3 bucket if it doesn't exist |
+| `metadata_store` | `utils.sqlstore.sqlstore.SqliteSqlStoreConfig \| utils.sqlstore.sqlstore.PostgresSqlStoreConfig` | No | sqlite | SQL store configuration for file metadata |
+
+## Sample Configuration
+
+```yaml
+bucket_name: ${env.S3_BUCKET_NAME}
+region: ${env.AWS_REGION:=us-east-1}
+aws_access_key_id: ${env.AWS_ACCESS_KEY_ID:=}
+aws_secret_access_key: ${env.AWS_SECRET_ACCESS_KEY:=}
+endpoint_url: ${env.S3_ENDPOINT_URL:=}
+auto_create_bucket: ${env.S3_AUTO_CREATE_BUCKET:=false}
+metadata_store:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/s3_files_metadata.db
+```
--- a/docs/docs/providers/index.mdx
+++ b/docs/docs/providers/index.mdx
@ -0,0 +1,33 @@
+---
+title: API Providers
+description: Ecosystem of providers for swapping implementations across the same API
+sidebar_label: Overview
+sidebar_position: 1
+---
+
+# API Providers
+
+The goal of Llama Stack is to build an ecosystem where users can easily swap out different implementations for the same API. Examples include:
+- LLM inference providers (e.g., Meta Reference, Ollama, Fireworks, Together, AWS Bedrock, Groq, Cerebras, SambaNova, vLLM, OpenAI, Anthropic, Gemini, WatsonX, etc.),
+- Vector databases (e.g., FAISS, SQLite-Vec, ChromaDB, Weaviate, Qdrant, Milvus, PGVector, etc.),
+- Safety providers (e.g., Meta's Llama Guard, Prompt Guard, Code Scanner, AWS Bedrock Guardrails, etc.),
+- Tool Runtime providers (e.g., RAG Runtime, Brave Search, etc.)
+
+Providers come in two flavors:
+- **Remote**: the provider runs as a separate service external to the Llama Stack codebase. Llama Stack contains a small amount of adapter code.
+- **Inline**: the provider is fully specified and implemented within the Llama Stack codebase. It may be a simple wrapper around an existing library, or a full-fledged implementation within Llama Stack.
+
+Importantly, Llama Stack always strives to provide at least one fully inline provider for each API so you can iterate on a fully featured environment locally.
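+
+The two flavors show up as the prefix of the provider type strings used throughout these pages, such as `inline::localfs` or `remote::hf::endpoint`. As a small sketch, a hypothetical helper (not part of Llama Stack) can split such a string into its flavor and provider name:
+
+```python
+def split_provider_type(provider_type: str) -> tuple[str, str]:
+    """Split 'remote::ollama' into ('remote', 'ollama')."""
+    # Split on the first '::' only, so names like 'hf::endpoint' stay intact.
+    flavor, sep, name = provider_type.partition("::")
+    if not sep or flavor not in ("inline", "remote") or not name:
+        raise ValueError(f"unexpected provider type: {provider_type!r}")
+    return flavor, name
+
+
+print(split_provider_type("inline::localfs"))       # ('inline', 'localfs')
+print(split_provider_type("remote::hf::endpoint"))  # ('remote', 'hf::endpoint')
+```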
+
+## Provider Categories
+
+- **[External Providers](./external/)** - Guide for building and using external providers
+- **[OpenAI Compatibility](./openai)** - OpenAI API compatibility layer
+- **[Inference](./inference/)** - LLM and embedding model providers
+- **[Agents](./agents/)** - Agentic system providers
+- **[DatasetIO](./datasetio/)** - Dataset and data loader providers
+- **[Safety](./safety/)** - Content moderation and safety providers
+- **[Telemetry](./telemetry/)** - Monitoring and observability providers
+- **[Vector IO](./vector-io/)** - Vector database providers
+- **[Tool Runtime](./tool-runtime/)** - Tool and protocol providers
+- **[Files](./files/)** - File system and storage providers
--- a/docs/docs/providers/inference/index.mdx
+++ b/docs/docs/providers/inference/index.mdx
@ -0,0 +1,48 @@
+---
+description: "Llama Stack Inference API for generating completions, chat completions, and embeddings.
+
+    This API provides the raw interface to the underlying models. Two kinds of models are supported:
+    - LLM models: these models generate \"raw\" and \"chat\" (conversational) completions.
+    - Embedding models: these models generate embeddings to be used for semantic search."
+sidebar_label: Inference
+title: Inference
+---
+
+# Inference
+
+## Overview
+
+Llama Stack Inference API for generating completions, chat completions, and embeddings.
+
+This API provides the raw interface to the underlying models. Two kinds of models are supported:
+- LLM models: these models generate "raw" and "chat" (conversational) completions.
+- Embedding models: these models generate embeddings to be used for semantic search.
+
+This section contains documentation for all available providers for the **inference** API.
+
+## Providers
+
+- [Meta-Reference](./inline_meta-reference)
+- [Sentence-Transformers](./inline_sentence-transformers)
+- [Remote - Anthropic](./remote_anthropic)
+- [Remote - Azure](./remote_azure)
+- [Remote - Bedrock](./remote_bedrock)
+- [Remote - Cerebras](./remote_cerebras)
+- [Remote - Databricks](./remote_databricks)
+- [Remote - Fireworks](./remote_fireworks)
+- [Remote - Gemini](./remote_gemini)
+- [Remote - Groq](./remote_groq)
+- [Remote - Hf - Endpoint](./remote_hf_endpoint)
+- [Remote - Hf - Serverless](./remote_hf_serverless)
+- [Remote - Llama-Openai-Compat](./remote_llama-openai-compat)
+- [Remote - Nvidia](./remote_nvidia)
+- [Remote - Ollama](./remote_ollama)
+- [Remote - Openai](./remote_openai)
+- [Remote - Passthrough](./remote_passthrough)
+- [Remote - Runpod](./remote_runpod)
+- [Remote - Sambanova](./remote_sambanova)
+- [Remote - Tgi](./remote_tgi)
+- [Remote - Together](./remote_together)
+- [Remote - Vertexai](./remote_vertexai)
+- [Remote - Vllm](./remote_vllm)
+- [Remote - Watsonx](./remote_watsonx)
--- a/docs/docs/providers/inference/inline_meta-reference.mdx
+++ b/docs/docs/providers/inference/inline_meta-reference.mdx
@ -0,0 +1,36 @@
+---
+description: "Meta's reference implementation of inference with support for various model formats and optimization techniques."
+sidebar_label: Meta-Reference
+title: inline::meta-reference
+---
+
+# inline::meta-reference
+
+## Description
+
+Meta's reference implementation of inference with support for various model formats and optimization techniques.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `model` | `str \| None` | No |  |  |
+| `torch_seed` | `int \| None` | No |  |  |
+| `max_seq_len` | `<class 'int'>` | No | 4096 |  |
+| `max_batch_size` | `<class 'int'>` | No | 1 |  |
+| `model_parallel_size` | `int \| None` | No |  |  |
+| `create_distributed_process_group` | `<class 'bool'>` | No | True |  |
+| `checkpoint_dir` | `str \| None` | No |  |  |
+| `quantization` | `Bf16QuantizationConfig \| Fp8QuantizationConfig \| Int4QuantizationConfig` | No |  | Discriminated by the `type` field |
+
+## Sample Configuration
+
+```yaml
+model: Llama3.2-3B-Instruct
+checkpoint_dir: ${env.CHECKPOINT_DIR:=null}
+quantization:
+  type: ${env.QUANTIZATION_TYPE:=bf16}
+model_parallel_size: ${env.MODEL_PARALLEL_SIZE:=0}
+max_batch_size: ${env.MAX_BATCH_SIZE:=1}
+max_seq_len: ${env.MAX_SEQ_LEN:=4096}
+```
--- a/docs/docs/providers/inference/inline_sentence-transformers.mdx
+++ b/docs/docs/providers/inference/inline_sentence-transformers.mdx
@ -0,0 +1,17 @@
+---
+description: "Sentence Transformers inference provider for text embeddings and similarity search."
+sidebar_label: Sentence-Transformers
+title: inline::sentence-transformers
+---
+
+# inline::sentence-transformers
+
+## Description
+
+Sentence Transformers inference provider for text embeddings and similarity search.
+
+## Sample Configuration
+
+```yaml
+{}
+```
--- a/docs/docs/providers/inference/remote_anthropic.mdx
+++ b/docs/docs/providers/inference/remote_anthropic.mdx
@ -0,0 +1,23 @@
+---
+description: "Anthropic inference provider for accessing Claude models and Anthropic's AI services."
+sidebar_label: Remote - Anthropic
+title: remote::anthropic
+---
+
+# remote::anthropic
+
+## Description
+
+Anthropic inference provider for accessing Claude models and Anthropic's AI services.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `api_key` | `str \| None` | No |  | API key for Anthropic models |
+
+## Sample Configuration
+
+```yaml
+api_key: ${env.ANTHROPIC_API_KEY:=}
+```
--- a/docs/docs/providers/inference/remote_azure.mdx
+++ b/docs/docs/providers/inference/remote_azure.mdx
@ -0,0 +1,36 @@
+---
+description: |
+  Azure OpenAI inference provider for accessing GPT models and other Azure services.
+  Provider documentation
+  https://learn.microsoft.com/en-us/azure/ai-foundry/openai/overview
+sidebar_label: Remote - Azure
+title: remote::azure
+---
+
+# remote::azure
+
+## Description
+
+Azure OpenAI inference provider for accessing GPT models and other Azure services.
+
+Provider documentation: https://learn.microsoft.com/en-us/azure/ai-foundry/openai/overview
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `api_key` | `<class 'pydantic.types.SecretStr'>` | No |  | Azure API key |
+| `api_base` | `<class 'pydantic.networks.HttpUrl'>` | No |  | Azure API base URL (e.g., https://your-resource-name.openai.azure.com) |
+| `api_version` | `str \| None` | No |  | Azure API version (e.g., 2024-12-01-preview) |
+| `api_type` | `str \| None` | No | azure | Azure API type (e.g., azure) |
+
+## Sample Configuration
+
+```yaml
+api_key: ${env.AZURE_API_KEY:=}
+api_base: ${env.AZURE_API_BASE:=}
+api_version: ${env.AZURE_API_VERSION:=}
+api_type: ${env.AZURE_API_TYPE:=}
+```
--- a/docs/docs/providers/inference/remote_bedrock.mdx
+++ b/docs/docs/providers/inference/remote_bedrock.mdx
@ -0,0 +1,32 @@
+---
+description: "AWS Bedrock inference provider for accessing various AI models through AWS's managed service."
+sidebar_label: Remote - Bedrock
+title: remote::bedrock
+---
+
+# remote::bedrock
+
+## Description
+
+AWS Bedrock inference provider for accessing various AI models through AWS's managed service.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `aws_access_key_id` | `str \| None` | No |  | The AWS access key to use. Defaults to the AWS_ACCESS_KEY_ID environment variable. |
+| `aws_secret_access_key` | `str \| None` | No |  | The AWS secret access key to use. Defaults to the AWS_SECRET_ACCESS_KEY environment variable. |
+| `aws_session_token` | `str \| None` | No |  | The AWS session token to use. Defaults to the AWS_SESSION_TOKEN environment variable. |
+| `region_name` | `str \| None` | No |  | The default AWS Region to use, for example, us-west-1 or us-west-2. Defaults to the AWS_DEFAULT_REGION environment variable. |
+| `profile_name` | `str \| None` | No |  | The profile name that contains credentials to use. Defaults to the AWS_PROFILE environment variable. |
+| `total_max_attempts` | `int \| None` | No |  | The maximum number of attempts made for a single request, including the initial attempt. Defaults to the AWS_MAX_ATTEMPTS environment variable. |
+| `retry_mode` | `str \| None` | No |  | The type of retries Boto3 will perform. Defaults to the AWS_RETRY_MODE environment variable. |
+| `connect_timeout` | `float \| None` | No | 60.0 | The time in seconds until a timeout exception is thrown when attempting to make a connection. The default is 60 seconds. |
+| `read_timeout` | `float \| None` | No | 60.0 | The time in seconds until a timeout exception is thrown when attempting to read from a connection. The default is 60 seconds. |
+| `session_ttl` | `int \| None` | No | 3600 | The time in seconds until a session expires. The default is 3600 seconds (1 hour). |
+
+## Sample Configuration
+
+```yaml
+{}
+```
--- a/docs/docs/providers/inference/remote_cerebras.mdx
+++ b/docs/docs/providers/inference/remote_cerebras.mdx
@ -0,0 +1,25 @@
+---
+description: "Cerebras inference provider for running models on Cerebras Cloud platform."
+sidebar_label: Remote - Cerebras
+title: remote::cerebras
+---
+
+# remote::cerebras
+
+## Description
+
+Cerebras inference provider for running models on Cerebras Cloud platform.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `base_url` | `<class 'str'>` | No | https://api.cerebras.ai | Base URL for the Cerebras API |
+| `api_key` | `<class 'pydantic.types.SecretStr'>` | No |  | Cerebras API Key |
+
+## Sample Configuration
+
+```yaml
+base_url: https://api.cerebras.ai
+api_key: ${env.CEREBRAS_API_KEY:=}
+```
--- a/docs/docs/providers/inference/remote_databricks.mdx
+++ b/docs/docs/providers/inference/remote_databricks.mdx
@ -0,0 +1,25 @@
+---
+description: "Databricks inference provider for running models on Databricks' unified analytics platform."
+sidebar_label: Remote - Databricks
+title: remote::databricks
+---
+
+# remote::databricks
+
+## Description
+
+Databricks inference provider for running models on Databricks' unified analytics platform.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `url` | `<class 'str'>` | No |  | The URL for the Databricks model serving endpoint |
+| `api_token` | `<class 'pydantic.types.SecretStr'>` | No |  | The Databricks API token |
+
+## Sample Configuration
+
+```yaml
+url: ${env.DATABRICKS_HOST:=}
+api_token: ${env.DATABRICKS_TOKEN:=}
+```
--- a/docs/docs/providers/inference/remote_fireworks.mdx
+++ b/docs/docs/providers/inference/remote_fireworks.mdx
@ -0,0 +1,26 @@
+---
+description: "Fireworks AI inference provider for Llama models and other AI models on the Fireworks platform."
+sidebar_label: Remote - Fireworks
+title: remote::fireworks
+---
+
+# remote::fireworks
+
+## Description
+
+Fireworks AI inference provider for Llama models and other AI models on the Fireworks platform.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `allowed_models` | `list[str] \| None` | No |  | List of models that should be registered with the model registry. If None, all models are allowed. |
+| `url` | `<class 'str'>` | No | https://api.fireworks.ai/inference/v1 | The URL for the Fireworks server |
+| `api_key` | `pydantic.types.SecretStr \| None` | No |  | The Fireworks.ai API Key |
+
+## Sample Configuration
+
+```yaml
+url: https://api.fireworks.ai/inference/v1
+api_key: ${env.FIREWORKS_API_KEY:=}
+```
--- a/docs/docs/providers/inference/remote_gemini.mdx
+++ b/docs/docs/providers/inference/remote_gemini.mdx
@ -0,0 +1,23 @@
+---
+description: "Google Gemini inference provider for accessing Gemini models and Google's AI services."
+sidebar_label: Remote - Gemini
+title: remote::gemini
+---
+
+# remote::gemini
+
+## Description
+
+Google Gemini inference provider for accessing Gemini models and Google's AI services.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `api_key` | `str \| None` | No |  | API key for Gemini models |
+
+## Sample Configuration
+
+```yaml
+api_key: ${env.GEMINI_API_KEY:=}
+```
--- a/docs/docs/providers/inference/remote_groq.mdx
+++ b/docs/docs/providers/inference/remote_groq.mdx
@ -0,0 +1,25 @@
+---
+description: "Groq inference provider for ultra-fast inference using Groq's LPU technology."
+sidebar_label: Remote - Groq
+title: remote::groq
+---
+
+# remote::groq
+
+## Description
+
+Groq inference provider for ultra-fast inference using Groq's LPU technology.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `api_key` | `str \| None` | No |  | The Groq API key |
+| `url` | `<class 'str'>` | No | https://api.groq.com | The URL for the Groq AI server |
+
+## Sample Configuration
+
+```yaml
+url: https://api.groq.com
+api_key: ${env.GROQ_API_KEY:=}
+```
--- a/docs/docs/providers/inference/remote_hf_endpoint.mdx
+++ b/docs/docs/providers/inference/remote_hf_endpoint.mdx
@ -0,0 +1,25 @@
+---
+description: "HuggingFace Inference Endpoints provider for dedicated model serving."
+sidebar_label: Remote - Hf - Endpoint
+title: remote::hf::endpoint
+---
+
+# remote::hf::endpoint
+
+## Description
+
+HuggingFace Inference Endpoints provider for dedicated model serving.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `endpoint_name` | `<class 'str'>` | No |  | The name of the Hugging Face Inference Endpoint in the format of '&#123;namespace&#125;/&#123;endpoint_name&#125;' (e.g. 'my-cool-org/meta-llama-3-1-8b-instruct-rce'). Namespace is optional and will default to the user account if not provided. |
+| `api_token` | `pydantic.types.SecretStr \| None` | No |  | Your Hugging Face user access token (will default to locally saved token if not provided) |
+
+## Sample Configuration
+
+```yaml
+endpoint_name: ${env.INFERENCE_ENDPOINT_NAME}
+api_token: ${env.HF_API_TOKEN}
+```
--- a/docs/docs/providers/inference/remote_hf_serverless.mdx
+++ b/docs/docs/providers/inference/remote_hf_serverless.mdx
@ -0,0 +1,25 @@
+---
+description: "HuggingFace Inference API serverless provider for on-demand model inference."
+sidebar_label: Remote - Hf - Serverless
+title: remote::hf::serverless
+---
+
+# remote::hf::serverless
+
+## Description
+
+HuggingFace Inference API serverless provider for on-demand model inference.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `huggingface_repo` | `<class 'str'>` | No |  | The model ID of the model on the Hugging Face Hub (e.g. 'meta-llama/Meta-Llama-3.1-70B-Instruct') |
+| `api_token` | `pydantic.types.SecretStr \| None` | No |  | Your Hugging Face user access token (will default to locally saved token if not provided) |
+
+## Sample Configuration
+
+```yaml
+huggingface_repo: ${env.INFERENCE_MODEL}
+api_token: ${env.HF_API_TOKEN}
+```
--- a/docs/docs/providers/inference/remote_llama-openai-compat.mdx
+++ b/docs/docs/providers/inference/remote_llama-openai-compat.mdx
@ -0,0 +1,25 @@
+---
+description: "Llama OpenAI-compatible provider for using Llama models with OpenAI API format."
+sidebar_label: Remote - Llama-Openai-Compat
+title: remote::llama-openai-compat
+---
+
+# remote::llama-openai-compat
+
+## Description
+
+Llama OpenAI-compatible provider for using Llama models with OpenAI API format.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `api_key` | `str \| None` | No |  | The Llama API key |
+| `openai_compat_api_base` | `<class 'str'>` | No | https://api.llama.com/compat/v1/ | The URL for the Llama API server |
+
+## Sample Configuration
+
+```yaml
+openai_compat_api_base: https://api.llama.com/compat/v1/
+api_key: ${env.LLAMA_API_KEY}
+```
--- a/docs/docs/providers/inference/remote_nvidia.mdx
+++ b/docs/docs/providers/inference/remote_nvidia.mdx
@ -0,0 +1,28 @@
+---
+description: "NVIDIA inference provider for accessing NVIDIA NIM models and AI services."
+sidebar_label: Remote - Nvidia
+title: remote::nvidia
+---
+
+# remote::nvidia
+
+## Description
+
+NVIDIA inference provider for accessing NVIDIA NIM models and AI services.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `url` | `<class 'str'>` | No | https://integrate.api.nvidia.com | A base url for accessing the NVIDIA NIM |
+| `api_key` | `pydantic.types.SecretStr \| None` | No |  | The NVIDIA API key, only needed if using the hosted service |
+| `timeout` | `<class 'int'>` | No | 60 | Timeout for the HTTP requests |
+| `append_api_version` | `<class 'bool'>` | No | True | When set to false, the API version is not appended to the base URL. Defaults to true. |
+
+## Sample Configuration
+
+```yaml
+url: ${env.NVIDIA_BASE_URL:=https://integrate.api.nvidia.com}
+api_key: ${env.NVIDIA_API_KEY:=}
+append_api_version: ${env.NVIDIA_APPEND_API_VERSION:=True}
+```
--- a/docs/docs/providers/inference/remote_ollama.mdx
+++ b/docs/docs/providers/inference/remote_ollama.mdx
@ -0,0 +1,24 @@
+---
+description: "Ollama inference provider for running local models through the Ollama runtime."
+sidebar_label: Remote - Ollama
+title: remote::ollama
+---
+
+# remote::ollama
+
+## Description
+
+Ollama inference provider for running local models through the Ollama runtime.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `url` | `<class 'str'>` | No | http://localhost:11434 |  |
+| `refresh_models` | `<class 'bool'>` | No | False | Whether to refresh models periodically |
+
+## Sample Configuration
+
+```yaml
+url: ${env.OLLAMA_URL:=http://localhost:11434}
+```
--- a/docs/docs/providers/inference/remote_openai.mdx
+++ b/docs/docs/providers/inference/remote_openai.mdx
@ -0,0 +1,25 @@
+---
+description: "OpenAI inference provider for accessing GPT models and other OpenAI services."
+sidebar_label: Remote - Openai
+title: remote::openai
+---
+
+# remote::openai
+
+## Description
+
+OpenAI inference provider for accessing GPT models and other OpenAI services.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `api_key` | `str \| None` | No |  | API key for OpenAI models |
+| `base_url` | `<class 'str'>` | No | https://api.openai.com/v1 | Base URL for OpenAI API |
+
+## Sample Configuration
+
+```yaml
+api_key: ${env.OPENAI_API_KEY:=}
+base_url: ${env.OPENAI_BASE_URL:=https://api.openai.com/v1}
+```
--- a/docs/docs/providers/inference/remote_passthrough.mdx
+++ b/docs/docs/providers/inference/remote_passthrough.mdx
@ -0,0 +1,25 @@
+---
+description: "Passthrough inference provider for connecting to any external inference service not directly supported."
+sidebar_label: Remote - Passthrough
+title: remote::passthrough
+---
+
+# remote::passthrough
+
+## Description
+
+Passthrough inference provider for connecting to any external inference service not directly supported.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `url` | `<class 'str'>` | No |  | The URL for the passthrough endpoint |
+| `api_key` | `pydantic.types.SecretStr \| None` | No |  | API key for the passthrough endpoint |
+
+## Sample Configuration
+
+```yaml
+url: ${env.PASSTHROUGH_URL}
+api_key: ${env.PASSTHROUGH_API_KEY}
+```
--- a/docs/docs/providers/inference/remote_runpod.mdx
+++ b/docs/docs/providers/inference/remote_runpod.mdx
@ -0,0 +1,25 @@
+---
+description: "RunPod inference provider for running models on RunPod's cloud GPU platform."
+sidebar_label: Remote - Runpod
+title: remote::runpod
+---
+
+# remote::runpod
+
+## Description
+
+RunPod inference provider for running models on RunPod's cloud GPU platform.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `url` | `str \| None` | No |  | The URL for the Runpod model serving endpoint |
+| `api_token` | `str \| None` | No |  | The API token |
+
+## Sample Configuration
+
+```yaml
+url: ${env.RUNPOD_URL:=}
+api_token: ${env.RUNPOD_API_TOKEN}
+```
--- a/docs/docs/providers/inference/remote_sambanova-openai-compat.mdx
+++ b/docs/docs/providers/inference/remote_sambanova-openai-compat.mdx
@ -0,0 +1,20 @@
+# remote::sambanova-openai-compat
+
+## Description
+
+SambaNova OpenAI-compatible provider for using SambaNova models with OpenAI API format.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `api_key` | `str \| None` | No |  | The SambaNova API key |
+| `openai_compat_api_base` | `<class 'str'>` | No | https://api.sambanova.ai/v1 | The URL for the SambaNova API server |
+
+## Sample Configuration
+
+```yaml
+openai_compat_api_base: https://api.sambanova.ai/v1
+api_key: ${env.SAMBANOVA_API_KEY:=}
+
+```
--- a/docs/docs/providers/inference/remote_sambanova.mdx
+++ b/docs/docs/providers/inference/remote_sambanova.mdx
@ -0,0 +1,25 @@
+---
+description: "SambaNova inference provider for running models on SambaNova's dataflow architecture."
+sidebar_label: Remote - Sambanova
+title: remote::sambanova
+---
+
+# remote::sambanova
+
+## Description
+
+SambaNova inference provider for running models on SambaNova's dataflow architecture.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `url` | `<class 'str'>` | No | https://api.sambanova.ai/v1 | The URL for the SambaNova AI server |
+| `api_key` | `pydantic.types.SecretStr \| None` | No |  | The SambaNova cloud API Key |
+
+## Sample Configuration
+
+```yaml
+url: https://api.sambanova.ai/v1
+api_key: ${env.SAMBANOVA_API_KEY:=}
+```
--- a/docs/docs/providers/inference/remote_tgi.mdx
+++ b/docs/docs/providers/inference/remote_tgi.mdx
@ -0,0 +1,23 @@
+---
+description: "Text Generation Inference (TGI) provider for HuggingFace model serving."
+sidebar_label: Remote - Tgi
+title: remote::tgi
+---
+
+# remote::tgi
+
+## Description
+
+Text Generation Inference (TGI) provider for HuggingFace model serving.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `url` | `<class 'str'>` | No |  | The URL for the TGI serving endpoint |
+
+## Sample Configuration
+
+```yaml
+url: ${env.TGI_URL:=}
+```
--- a/docs/docs/providers/inference/remote_together.mdx
+++ b/docs/docs/providers/inference/remote_together.mdx
@ -0,0 +1,26 @@
+---
+description: "Together AI inference provider for open-source models and collaborative AI development."
+sidebar_label: Remote - Together
+title: remote::together
+---
+
+# remote::together
+
+## Description
+
+Together AI inference provider for open-source models and collaborative AI development.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `allowed_models` | `list[str] \| None` | No |  | List of models that should be registered with the model registry. If None, all models are allowed. |
+| `url` | `<class 'str'>` | No | https://api.together.xyz/v1 | The URL for the Together AI server |
+| `api_key` | `pydantic.types.SecretStr \| None` | No |  | The Together AI API Key |
+
+## Sample Configuration
+
+```yaml
+url: https://api.together.xyz/v1
+api_key: ${env.TOGETHER_API_KEY:=}
+```
--- a/docs/docs/providers/inference/remote_vertexai.mdx
+++ b/docs/docs/providers/inference/remote_vertexai.mdx
@ -0,0 +1,64 @@
+---
+description: |
+  Google Vertex AI inference provider enables you to use Google's Gemini models through Google Cloud's Vertex AI platform, providing several advantages:
+
+  • Enterprise-grade security: Uses Google Cloud's security controls and IAM
+  • Better integration: Seamless integration with other Google Cloud services
+  • Advanced features: Access to additional Vertex AI features like model tuning and monitoring
+  • Authentication: Uses Google Cloud Application Default Credentials (ADC) instead of API keys
+
+  Configuration:
+  - Set VERTEX_AI_PROJECT environment variable (required)
+  - Set VERTEX_AI_LOCATION environment variable (optional, defaults to us-central1)
+  - Use Google Cloud Application Default Credentials or service account key
+
+  Authentication Setup:
+  Option 1 (Recommended): gcloud auth application-default login
+  Option 2: Set GOOGLE_APPLICATION_CREDENTIALS to service account key path
+
+  Available Models:
+  - vertex_ai/gemini-2.0-flash
+  - vertex_ai/gemini-2.5-flash
+  - vertex_ai/gemini-2.5-pro
+sidebar_label: Remote - Vertexai
+title: remote::vertexai
+---
+
+# remote::vertexai
+
+## Description
+
+Google Vertex AI inference provider enables you to use Google's Gemini models through Google Cloud's Vertex AI platform, providing several advantages:
+
+- Enterprise-grade security: Uses Google Cloud's security controls and IAM
+- Better integration: Seamless integration with other Google Cloud services
+- Advanced features: Access to additional Vertex AI features like model tuning and monitoring
+- Authentication: Uses Google Cloud Application Default Credentials (ADC) instead of API keys
+
+Configuration:
+
+- Set VERTEX_AI_PROJECT environment variable (required)
+- Set VERTEX_AI_LOCATION environment variable (optional, defaults to us-central1)
+- Use Google Cloud Application Default Credentials or service account key
+
+Authentication Setup:
+
+- Option 1 (Recommended): `gcloud auth application-default login`
+- Option 2: Set GOOGLE_APPLICATION_CREDENTIALS to service account key path
+
+Available Models:
+- vertex_ai/gemini-2.0-flash
+- vertex_ai/gemini-2.5-flash
+- vertex_ai/gemini-2.5-pro
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `project` | `<class 'str'>` | No |  | Google Cloud project ID for Vertex AI |
+| `location` | `<class 'str'>` | No | us-central1 | Google Cloud location for Vertex AI |
+
+## Sample Configuration
+
+```yaml
+project: ${env.VERTEX_AI_PROJECT:=}
+location: ${env.VERTEX_AI_LOCATION:=us-central1}
+```
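
The `${env.NAME:=default}` placeholders in the sample configuration fall back to the text after `:=` when the variable is not set. The following is a minimal Python sketch of that substitution, offered as an illustration rather than Llama Stack's actual resolver. The function name `resolve_env_placeholders` and the `my-project` value are hypothetical, only the bare and `:=` forms are handled, and treating an empty variable like an unset one is an assumption borrowed from the shell operator of the same spelling.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${env.NAME} and ${env.NAME:=default}; other operators are out of scope here.
_PLACEHOLDER = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*)(?::=([^}]*))?\}")


def resolve_env_placeholders(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Substitute ${env.NAME:=default} placeholders in a string (illustrative only)."""
    env = os.environ if env is None else env

    def substitute(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        if value:  # assumption: an empty variable counts as unset
            return value
        if default is not None:
            return default
        raise KeyError(f"{name} is not set and the placeholder has no default")

    return _PLACEHOLDER.sub(substitute, text)


sample = {
    "project": "${env.VERTEX_AI_PROJECT:=}",
    "location": "${env.VERTEX_AI_LOCATION:=us-central1}",
}
fake_env = {"VERTEX_AI_PROJECT": "my-project"}  # hypothetical project ID
resolved = {key: resolve_env_placeholders(value, fake_env) for key, value in sample.items()}
print(resolved)
```

Passing the environment as a mapping keeps the sketch easy to test without touching the real process environment.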
--- a/docs/docs/providers/inference/remote_vllm.mdx
+++ b/docs/docs/providers/inference/remote_vllm.mdx
@ -0,0 +1,30 @@
+---
+description: "Remote vLLM inference provider for connecting to vLLM servers."
+sidebar_label: Remote - Vllm
+title: remote::vllm
+---
+
+# remote::vllm
+
+## Description
+
+Remote vLLM inference provider for connecting to vLLM servers.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `url` | `str \| None` | No |  | The URL for the vLLM model serving endpoint |
+| `max_tokens` | `<class 'int'>` | No | 4096 | Maximum number of tokens to generate. |
+| `api_token` | `str \| None` | No | fake | The API token |
+| `tls_verify` | `bool \| str` | No | True | Whether to verify TLS certificates. Can be a boolean or a path to a CA certificate file. |
+| `refresh_models` | `<class 'bool'>` | No | False | Whether to refresh models periodically |
+
+## Sample Configuration
+
+```yaml
+url: ${env.VLLM_URL:=}
+max_tokens: ${env.VLLM_MAX_TOKENS:=4096}
+api_token: ${env.VLLM_API_TOKEN:=fake}
+tls_verify: ${env.VLLM_TLS_VERIFY:=true}
+```
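
Because `tls_verify` is typed `bool | str`, a value read from `VLLM_TLS_VERIFY` has to be interpreted either as a boolean or as a path to a CA certificate file. The sketch below shows one way to do that in Python. It is not the provider's own parsing code: `parse_tls_verify` is a hypothetical helper, and the accepted spellings of true and false are assumptions.

```python
from typing import Union


def parse_tls_verify(raw: Union[bool, str]) -> Union[bool, str]:
    """Return a bool for true/false-like input, otherwise the value as a CA file path."""
    if isinstance(raw, bool):
        return raw
    lowered = raw.strip().lower()
    if lowered in ("true", "1", "yes"):  # assumed spellings of "verify"
        return True
    if lowered in ("false", "0", "no"):  # assumed spellings of "do not verify"
        return False
    return raw  # anything else is taken to be a path to a CA certificate file


print(parse_tls_verify("true"))
print(parse_tls_verify("/etc/ssl/certs/my-ca.pem"))  # hypothetical path
```

The result has the same bool-or-path shape that the `requests` library accepts for its `verify` argument.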
--- a/docs/docs/providers/inference/remote_watsonx.mdx
+++ b/docs/docs/providers/inference/remote_watsonx.mdx
@ -0,0 +1,28 @@
+---
+description: "IBM WatsonX inference provider for accessing AI models on IBM's WatsonX platform."
+sidebar_label: Remote - Watsonx
+title: remote::watsonx
+---
+
+# remote::watsonx
+
+## Description
+
+IBM WatsonX inference provider for accessing AI models on IBM's WatsonX platform.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `url` | `<class 'str'>` | No | https://us-south.ml.cloud.ibm.com | The base URL for accessing watsonx.ai |
+| `api_key` | `pydantic.types.SecretStr \| None` | No |  | The watsonx API key |
+| `project_id` | `str \| None` | No |  | The watsonx project ID |
+| `timeout` | `<class 'int'>` | No | 60 | Timeout for the HTTP requests |
+
+## Sample Configuration
+
+```yaml
+url: ${env.WATSONX_BASE_URL:=https://us-south.ml.cloud.ibm.com}
+api_key: ${env.WATSONX_API_KEY:=}
+project_id: ${env.WATSONX_PROJECT_ID:=}
+```
--- a/docs/docs/providers/openai.mdx
+++ b/docs/docs/providers/openai.mdx
@ -0,0 +1,191 @@
+## OpenAI API Compatibility
+
+### Server path
+
+Llama Stack exposes an OpenAI-compatible API endpoint at `/v1/openai/v1`. For a Llama Stack server running locally on port `8321`, the full URL of the OpenAI-compatible API endpoint is `http://localhost:8321/v1/openai/v1`.
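
As a small illustration of the path rule above, the OpenAI-compatible base URL can be derived from the server root. `openai_base_url` is a hypothetical helper, not part of either client library.

```python
OPENAI_COMPAT_PATH = "/v1/openai/v1"  # path stated in the paragraph above


def openai_base_url(server_root: str) -> str:
    """Append the OpenAI-compatible path to a Llama Stack server root URL."""
    # Strip a trailing slash so the result never contains "//v1".
    return server_root.rstrip("/") + OPENAI_COMPAT_PATH


print(openai_base_url("http://localhost:8321"))
```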
+
+### Clients
+
+You should be able to use any client that speaks OpenAI APIs with Llama Stack. We regularly test with the official Llama Stack clients as well as OpenAI's official Python client.
+
+#### Llama Stack Client
+
+When using the Llama Stack client, set the `base_url` to the root of your Llama Stack server. It will automatically route OpenAI-compatible requests to the right server endpoint for you.
+
+```python
+from llama_stack_client import LlamaStackClient
+
+client = LlamaStackClient(base_url="http://localhost:8321")
+```
+
+#### OpenAI Client
+
+When using an OpenAI client, set the `base_url` to the `/v1/openai/v1` path on your Llama Stack server.
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")
+```
+
+Regardless of the client you choose, the following code examples should all work the same.
+
+### APIs implemented
+
+#### Models
+
+Many of the APIs require you to pass in a model parameter. To see the list of models available in your Llama Stack server:
+
+```python
+models = client.models.list()
+```
+
+#### Responses
+
+> **Note:** The Responses API implementation is still in active development. While it is quite usable, there are still unimplemented parts of the API. We'd love feedback on any use-cases you try that do not work to help prioritize the pieces left to implement. Please open issues in the [meta-llama/llama-stack](https://github.com/meta-llama/llama-stack) GitHub repository with details of anything that does not work.
+
+##### Simple inference
+
+Request:
+
+```python
+response = client.responses.create(
+    model="meta-llama/Llama-3.2-3B-Instruct",
+    input="Write a haiku about coding."
+)
+
+print(response.output_text)
+```
+
+Example output:
+
+```text
+Pixels dancing slow
+Syntax whispers secrets sweet
+Code's gentle silence
+```
+
+##### Structured Output
+
+Request:
+
+```python
+response = client.responses.create(
+    model="meta-llama/Llama-3.2-3B-Instruct",
+    input=[
+        {
+            "role": "system",
+            "content": "Extract the participants from the event information.",
+        },
+        {
+            "role": "user",
+            "content": "Alice and Bob are going to a science fair on Friday.",
+        },
+    ],
+    text={
+        "format": {
+            "type": "json_schema",
+            "name": "participants",
+            "schema": {
+                "type": "object",
+                "properties": {
+                    "participants": {"type": "array", "items": {"type": "string"}}
+                },
+                "required": ["participants"],
+            },
+        }
+    },
+)
+print(response.output_text)
+```
+
+Example output:
+
+```text
+{ "participants": ["Alice", "Bob"] }
+```
+
+#### Chat Completions
+
+##### Simple inference
+
+Request:
+
+```python
+chat_completion = client.chat.completions.create(
+    model="meta-llama/Llama-3.2-3B-Instruct",
+    messages=[{"role": "user", "content": "Write a haiku about coding."}],
+)
+
+print(chat_completion.choices[0].message.content)
+```
+
+Example output:
+
+```text
+Lines of code unfold
+Logic flows like a river
+Code's gentle beauty
+```
+
+##### Structured Output
+
+Request:
+
+```python
+chat_completion = client.chat.completions.create(
+    model="meta-llama/Llama-3.2-3B-Instruct",
+    messages=[
+        {
+            "role": "system",
+            "content": "Extract the participants from the event information.",
+        },
+        {
+            "role": "user",
+            "content": "Alice and Bob are going to a science fair on Friday.",
+        },
+    ],
+    response_format={
+        "type": "json_schema",
+        "json_schema": {
+            "name": "participants",
+            "schema": {
+                "type": "object",
+                "properties": {
+                    "participants": {"type": "array", "items": {"type": "string"}}
+                },
+                "required": ["participants"],
+            },
+        },
+    },
+)
+
+print(chat_completion.choices[0].message.content)
+```
+
+Example output:
+
+```text
+{ "participants": ["Alice", "Bob"] }
+```
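
Both structured-output examples return the JSON document as plain text, so the caller still has to parse it. Here is a minimal sketch using only the standard library. The `raw_output` string is copied from the example output above and stands in for `response.output_text` or `chat_completion.choices[0].message.content`. The extra type check is a suggestion, not something the API requires.

```python
import json

# Stand-in for the text returned by the model in the examples above.
raw_output = '{ "participants": ["Alice", "Bob"] }'

data = json.loads(raw_output)
participants = data["participants"]

# Light validation mirroring the request's JSON schema: an array of strings.
if not isinstance(participants, list) or not all(isinstance(p, str) for p in participants):
    raise ValueError("expected 'participants' to be a list of strings")

for name in participants:
    print(name)
```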
+
+#### Completions
+
+##### Simple inference
+
+Request:
+
+```python
+completion = client.completions.create(
+    model="meta-llama/Llama-3.2-3B-Instruct", prompt="Write a haiku about coding."
+)
+
+print(completion.choices[0].text)
+```
+
+Example output:
+
+```text
+Lines of code unfurl
+Logic whispers in the dark
+Art in hidden form
+```
--- a/docs/docs/providers/post_training/index.mdx
+++ b/docs/docs/providers/post_training/index.mdx
@ -0,0 +1,17 @@
+---
+sidebar_label: Post Training
+title: Post_Training
+---
+
+# Post_Training
+
+## Overview
+
+This section contains documentation for all available providers for the **post_training** API.
+
+## Providers
+
+- [Huggingface-Gpu](./inline_huggingface-gpu)
+- [Torchtune-Cpu](./inline_torchtune-cpu)
+- [Torchtune-Gpu](./inline_torchtune-gpu)
+- [Remote - Nvidia](./remote_nvidia)
--- a/docs/docs/providers/post_training/inline_huggingface-cpu.mdx
+++ b/docs/docs/providers/post_training/inline_huggingface-cpu.mdx
@ -0,0 +1,40 @@
+# inline::huggingface-cpu
+
+## Description
+
+HuggingFace-based post-training provider for fine-tuning models using the HuggingFace ecosystem.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `device` | `<class 'str'>` | No | cuda |  |
+| `distributed_backend` | `Literal['fsdp', 'deepspeed']` | No |  |  |
+| `checkpoint_format` | `Literal['full_state', 'huggingface']` | No | huggingface |  |
+| `chat_template` | `<class 'str'>` | No | &lt;\|user\|&gt;&lt;br/&gt;&#123;input&#125;&lt;br/&gt;&lt;\|assistant\|&gt;&lt;br/&gt;&#123;output&#125; |  |
+| `model_specific_config` | `<class 'dict'>` | No | &#123;'trust_remote_code': True, 'attn_implementation': 'sdpa'&#125; |  |
+| `max_seq_length` | `<class 'int'>` | No | 2048 |  |
+| `gradient_checkpointing` | `<class 'bool'>` | No | False |  |
+| `save_total_limit` | `<class 'int'>` | No | 3 |  |
+| `logging_steps` | `<class 'int'>` | No | 10 |  |
+| `warmup_ratio` | `<class 'float'>` | No | 0.1 |  |
+| `weight_decay` | `<class 'float'>` | No | 0.01 |  |
+| `dataloader_num_workers` | `<class 'int'>` | No | 4 |  |
+| `dataloader_pin_memory` | `<class 'bool'>` | No | True |  |
+| `dpo_beta` | `<class 'float'>` | No | 0.1 |  |
+| `use_reference_model` | `<class 'bool'>` | No | True |  |
+| `dpo_loss_type` | `Literal['sigmoid', 'hinge', 'ipo', 'kto_pair']` | No | sigmoid |  |
+| `dpo_output_dir` | `<class 'str'>` | No |  |  |
+
+## Sample Configuration
+
+```yaml
+checkpoint_format: huggingface
+distributed_backend: null
+device: cpu
+dpo_output_dir: ~/.llama/dummy/dpo_output
+
+```
--- a/docs/docs/providers/post_training/inline_huggingface-gpu.mdx
+++ b/docs/docs/providers/post_training/inline_huggingface-gpu.mdx
@ -0,0 +1,42 @@
+---
+description: "HuggingFace-based post-training provider for fine-tuning models using the HuggingFace ecosystem."
+sidebar_label: Huggingface-Gpu
+title: inline::huggingface-gpu
+---
+
+# inline::huggingface-gpu
+
+## Description
+
+HuggingFace-based post-training provider for fine-tuning models using the HuggingFace ecosystem.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `device` | `<class 'str'>` | No | cuda |  |
+| `distributed_backend` | `Literal['fsdp', 'deepspeed']` | No |  |  |
+| `checkpoint_format` | `Literal['full_state', 'huggingface']` | No | huggingface |  |
+| `chat_template` | `<class 'str'>` | No | &lt;\|user\|&gt;&lt;br/&gt;&#123;input&#125;&lt;br/&gt;&lt;\|assistant\|&gt;&lt;br/&gt;&#123;output&#125; |  |
+| `model_specific_config` | `<class 'dict'>` | No | &#123;'trust_remote_code': True, 'attn_implementation': 'sdpa'&#125; |  |
+| `max_seq_length` | `<class 'int'>` | No | 2048 |  |
+| `gradient_checkpointing` | `<class 'bool'>` | No | False |  |
+| `save_total_limit` | `<class 'int'>` | No | 3 |  |
+| `logging_steps` | `<class 'int'>` | No | 10 |  |
+| `warmup_ratio` | `<class 'float'>` | No | 0.1 |  |
+| `weight_decay` | `<class 'float'>` | No | 0.01 |  |
+| `dataloader_num_workers` | `<class 'int'>` | No | 4 |  |
+| `dataloader_pin_memory` | `<class 'bool'>` | No | True |  |
+| `dpo_beta` | `<class 'float'>` | No | 0.1 |  |
+| `use_reference_model` | `<class 'bool'>` | No | True |  |
+| `dpo_loss_type` | `Literal['sigmoid', 'hinge', 'ipo', 'kto_pair']` | No | sigmoid |  |
+| `dpo_output_dir` | `<class 'str'>` | No |  |  |
+
+## Sample Configuration
+
+```yaml
+checkpoint_format: huggingface
+distributed_backend: null
+device: cpu
+dpo_output_dir: ~/.llama/dummy/dpo_output
+```
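
The default `chat_template` in the table above contains `{input}` and `{output}` placeholders separated by line breaks. The sketch below shows how such a template could be filled in Python. It assumes `str.format`-style substitution, which the placeholder syntax suggests but the source does not confirm, and `render_example` is a hypothetical helper.

```python
# Default `chat_template` from the configuration table, with line breaks written as \n.
DEFAULT_CHAT_TEMPLATE = "<|user|>\n{input}\n<|assistant|>\n{output}"


def render_example(template: str, user_input: str, output: str) -> str:
    """Fill the {input} and {output} placeholders of a chat template."""
    return template.format(input=user_input, output=output)


rendered = render_example(DEFAULT_CHAT_TEMPLATE, "What is 2 + 2?", "4")
print(rendered)
```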
--- a/docs/docs/providers/post_training/inline_huggingface.mdx
+++ b/docs/docs/providers/post_training/inline_huggingface.mdx
@ -0,0 +1,40 @@
+# inline::huggingface
+
+## Description
+
+HuggingFace-based post-training provider for fine-tuning models using the HuggingFace ecosystem.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `device` | `<class 'str'>` | No | cuda |  |
+| `distributed_backend` | `Literal['fsdp', 'deepspeed']` | No |  |  |
+| `checkpoint_format` | `Literal['full_state', 'huggingface']` | No | huggingface |  |
+| `chat_template` | `<class 'str'>` | No | &lt;\|user\|&gt;&lt;br/&gt;&#123;input&#125;&lt;br/&gt;&lt;\|assistant\|&gt;&lt;br/&gt;&#123;output&#125; |  |
+| `model_specific_config` | `<class 'dict'>` | No | &#123;'trust_remote_code': True, 'attn_implementation': 'sdpa'&#125; |  |
+| `max_seq_length` | `<class 'int'>` | No | 2048 |  |
+| `gradient_checkpointing` | `<class 'bool'>` | No | False |  |
+| `save_total_limit` | `<class 'int'>` | No | 3 |  |
+| `logging_steps` | `<class 'int'>` | No | 10 |  |
+| `warmup_ratio` | `<class 'float'>` | No | 0.1 |  |
+| `weight_decay` | `<class 'float'>` | No | 0.01 |  |
+| `dataloader_num_workers` | `<class 'int'>` | No | 4 |  |
+| `dataloader_pin_memory` | `<class 'bool'>` | No | True |  |
+| `dpo_beta` | `<class 'float'>` | No | 0.1 |  |
+| `use_reference_model` | `<class 'bool'>` | No | True |  |
+| `dpo_loss_type` | `Literal['sigmoid', 'hinge', 'ipo', 'kto_pair']` | No | sigmoid |  |
+| `dpo_output_dir` | `<class 'str'>` | No |  |  |
+
+## Sample Configuration
+
+```yaml
+checkpoint_format: huggingface
+distributed_backend: null
+device: cpu
+dpo_output_dir: ~/.llama/dummy/dpo_output
+
+```
--- a/docs/docs/providers/post_training/inline_torchtune-cpu.mdx
+++ b/docs/docs/providers/post_training/inline_torchtune-cpu.mdx
@ -0,0 +1,24 @@
+---
+description: "TorchTune-based post-training provider for fine-tuning and optimizing models using Meta's TorchTune framework."
+sidebar_label: Torchtune-Cpu
+title: inline::torchtune-cpu
+---
+
+# inline::torchtune-cpu
+
+## Description
+
+TorchTune-based post-training provider for fine-tuning and optimizing models using Meta's TorchTune framework.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `torch_seed` | `int \| None` | No |  |  |
+| `checkpoint_format` | `Literal['meta', 'huggingface']` | No | meta |  |
+
+## Sample Configuration
+
+```yaml
+checkpoint_format: meta
+```
--- a/docs/docs/providers/post_training/inline_torchtune-gpu.mdx
+++ b/docs/docs/providers/post_training/inline_torchtune-gpu.mdx
@ -0,0 +1,24 @@
+---
+description: "TorchTune-based post-training provider for fine-tuning and optimizing models using Meta's TorchTune framework."
+sidebar_label: Torchtune-Gpu
+title: inline::torchtune-gpu
+---
+
+# inline::torchtune-gpu
+
+## Description
+
+TorchTune-based post-training provider for fine-tuning and optimizing models using Meta's TorchTune framework.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `torch_seed` | `int \| None` | No |  |  |
+| `checkpoint_format` | `Literal['meta', 'huggingface']` | No | meta |  |
+
+## Sample Configuration
+
+```yaml
+checkpoint_format: meta
+```
--- a/docs/docs/providers/post_training/inline_torchtune.md
+++ b/docs/docs/providers/post_training/inline_torchtune.md
@ -0,0 +1,20 @@
+# inline::torchtune
+
+## Description
+
+TorchTune-based post-training provider for fine-tuning and optimizing models using Meta's TorchTune framework.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `torch_seed` | `int \| None` | No |  |  |
+| `checkpoint_format` | `Literal['meta', 'huggingface']` | No | meta |  |
+
+## Sample Configuration
+
+```yaml
+checkpoint_format: meta
+
+```
+
--- a/docs/docs/providers/post_training/remote_nvidia.mdx
+++ b/docs/docs/providers/post_training/remote_nvidia.mdx
@ -0,0 +1,32 @@
+---
+description: "NVIDIA's post-training provider for fine-tuning models on NVIDIA's platform."
+sidebar_label: Remote - Nvidia
+title: remote::nvidia
+---
+
+# remote::nvidia
+
+## Description
+
+NVIDIA's post-training provider for fine-tuning models on NVIDIA's platform.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `api_key` | `str \| None` | No |  | The NVIDIA API key. |
+| `dataset_namespace` | `str \| None` | No | default | The NVIDIA dataset namespace. |
+| `project_id` | `str \| None` | No | test-example-model@v1 | The NVIDIA project ID. |
+| `customizer_url` | `str \| None` | No |  | Base URL for the NeMo Customizer API |
+| `timeout` | `<class 'int'>` | No | 300 | Timeout for the NVIDIA Post Training API |
+| `max_retries` | `<class 'int'>` | No | 3 | Maximum number of retries for the NVIDIA Post Training API |
+| `output_model_dir` | `<class 'str'>` | No | test-example-model@v1 | Directory to save the output model |
+
+## Sample Configuration
+
+```yaml
+api_key: ${env.NVIDIA_API_KEY:=}
+dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default}
+project_id: ${env.NVIDIA_PROJECT_ID:=test-project}
+customizer_url: ${env.NVIDIA_CUSTOMIZER_URL:=http://nemo.test}
+```
--- a/docs/docs/providers/safety/index.mdx
+++ b/docs/docs/providers/safety/index.mdx
@ -0,0 +1,19 @@
+---
+sidebar_label: Safety
+title: Safety
+---
+
+# Safety
+
+## Overview
+
+This section contains documentation for all available providers for the **safety** API.
+
+## Providers
+
+- [Code-Scanner](./inline_code-scanner)
+- [Llama-Guard](./inline_llama-guard)
+- [Prompt-Guard](./inline_prompt-guard)
+- [Remote - Bedrock](./remote_bedrock)
+- [Remote - Nvidia](./remote_nvidia)
+- [Remote - Sambanova](./remote_sambanova)
--- a/docs/docs/providers/safety/inline_code-scanner.mdx
+++ b/docs/docs/providers/safety/inline_code-scanner.mdx
@ -0,0 +1,17 @@
+---
+description: "Code Scanner safety provider for detecting security vulnerabilities and unsafe code patterns."
+sidebar_label: Code-Scanner
+title: inline::code-scanner
+---
+
+# inline::code-scanner
+
+## Description
+
+Code Scanner safety provider for detecting security vulnerabilities and unsafe code patterns.
+
+## Sample Configuration
+
+```yaml
+{}
+```
--- a/docs/docs/providers/safety/inline_llama-guard.mdx
+++ b/docs/docs/providers/safety/inline_llama-guard.mdx
@ -0,0 +1,23 @@
+---
+description: "Llama Guard safety provider for content moderation and safety filtering using Meta's Llama Guard model."
+sidebar_label: Llama-Guard
+title: inline::llama-guard
+---
+
+# inline::llama-guard
+
+## Description
+
+Llama Guard safety provider for content moderation and safety filtering using Meta's Llama Guard model.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `excluded_categories` | `list[str]` | No | [] |  |
+
+## Sample Configuration
+
+```yaml
+excluded_categories: []
+```
--- a/docs/docs/providers/safety/inline_prompt-guard.mdx
+++ b/docs/docs/providers/safety/inline_prompt-guard.mdx
@ -0,0 +1,23 @@
+---
+description: "Prompt Guard safety provider for detecting and filtering unsafe prompts and content."
+sidebar_label: Prompt-Guard
+title: inline::prompt-guard
+---
+
+# inline::prompt-guard
+
+## Description
+
+Prompt Guard safety provider for detecting and filtering unsafe prompts and content.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `guard_type` | `<class 'str'>` | No | injection |  |
+
+## Sample Configuration
+
+```yaml
+guard_type: injection
+```
--- a/docs/docs/providers/safety/remote_bedrock.mdx
+++ b/docs/docs/providers/safety/remote_bedrock.mdx
@ -0,0 +1,32 @@
+---
+description: "AWS Bedrock safety provider for content moderation using AWS's safety services."
+sidebar_label: Remote - Bedrock
+title: remote::bedrock
+---
+
+# remote::bedrock
+
+## Description
+
+AWS Bedrock safety provider for content moderation using AWS's safety services.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `aws_access_key_id` | `str \| None` | No |  | The AWS access key to use. Defaults to the AWS_ACCESS_KEY_ID environment variable. |
+| `aws_secret_access_key` | `str \| None` | No |  | The AWS secret access key to use. Defaults to the AWS_SECRET_ACCESS_KEY environment variable. |
+| `aws_session_token` | `str \| None` | No |  | The AWS session token to use. Defaults to the AWS_SESSION_TOKEN environment variable. |
+| `region_name` | `str \| None` | No |  | The default AWS Region to use, for example, us-west-1 or us-west-2. Defaults to the AWS_DEFAULT_REGION environment variable. |
+| `profile_name` | `str \| None` | No |  | The name of the profile that contains the credentials to use. Defaults to the AWS_PROFILE environment variable. |
+| `total_max_attempts` | `int \| None` | No |  | The maximum number of attempts made for a single request, including the initial attempt. Defaults to the AWS_MAX_ATTEMPTS environment variable. |
+| `retry_mode` | `str \| None` | No |  | The type of retries Boto3 will perform. Defaults to the AWS_RETRY_MODE environment variable. |
+| `connect_timeout` | `float \| None` | No | 60.0 | The time in seconds until a timeout exception is thrown when attempting to make a connection. The default is 60 seconds. |
+| `read_timeout` | `float \| None` | No | 60.0 | The time in seconds until a timeout exception is thrown when attempting to read from a connection. The default is 60 seconds. |
+| `session_ttl` | `int \| None` | No | 3600 | The time in seconds until a session expires. The default is 3600 seconds (1 hour). |
+
+## Sample Configuration
+
+```yaml
+{}
+```
--- a/docs/docs/providers/safety/remote_nvidia.mdx
+++ b/docs/docs/providers/safety/remote_nvidia.mdx
@ -0,0 +1,25 @@
+---
+description: "NVIDIA's safety provider for content moderation and safety filtering."
+sidebar_label: Remote - Nvidia
+title: remote::nvidia
+---
+
+# remote::nvidia
+
+## Description
+
+NVIDIA's safety provider for content moderation and safety filtering.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `guardrails_service_url` | `<class 'str'>` | No | http://0.0.0.0:7331 | The url for accessing the Guardrails service |
+| `config_id` | `str \| None` | No | self-check | Guardrails configuration ID to use from the Guardrails configuration store |
+
+## Sample Configuration
+
+```yaml
+guardrails_service_url: ${env.GUARDRAILS_SERVICE_URL:=http://localhost:7331}
+config_id: ${env.NVIDIA_GUARDRAILS_CONFIG_ID:=self-check}
+```
--- a/docs/docs/providers/safety/remote_sambanova.mdx
+++ b/docs/docs/providers/safety/remote_sambanova.mdx
@ -0,0 +1,25 @@
+---
+description: "SambaNova's safety provider for content moderation and safety filtering."
+sidebar_label: Remote - Sambanova
+title: remote::sambanova
+---
+
+# remote::sambanova
+
+## Description
+
+SambaNova's safety provider for content moderation and safety filtering.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `url` | `<class 'str'>` | No | https://api.sambanova.ai/v1 | The URL for the SambaNova AI server |
+| `api_key` | `pydantic.types.SecretStr \| None` | No |  | The SambaNova cloud API Key |
+
+## Sample Configuration
+
+```yaml
+url: https://api.sambanova.ai/v1
+api_key: ${env.SAMBANOVA_API_KEY:=}
+```
--- a/docs/docs/providers/scoring/index.mdx
+++ b/docs/docs/providers/scoring/index.mdx
@ -0,0 +1,16 @@
+---
+sidebar_label: Scoring
+title: Scoring
+---
+
+# Scoring
+
+## Overview
+
+This section contains documentation for all available providers for the **scoring** API.
+
+## Providers
+
+- [Basic](./inline_basic)
+- [Braintrust](./inline_braintrust)
+- [Llm-As-Judge](./inline_llm-as-judge)
--- a/docs/docs/providers/scoring/inline_basic.mdx
+++ b/docs/docs/providers/scoring/inline_basic.mdx
@ -0,0 +1,17 @@
+---
+description: "Basic scoring provider for simple evaluation metrics and scoring functions."
+sidebar_label: Basic
+title: inline::basic
+---
+
+# inline::basic
+
+## Description
+
+Basic scoring provider for simple evaluation metrics and scoring functions.
+
+## Sample Configuration
+
+```yaml
+{}
+```
--- a/docs/docs/providers/scoring/inline_braintrust.mdx
+++ b/docs/docs/providers/scoring/inline_braintrust.mdx
@ -0,0 +1,23 @@
+---
+description: "Braintrust scoring provider for evaluation and scoring using the Braintrust platform."
+sidebar_label: Braintrust
+title: inline::braintrust
+---
+
+# inline::braintrust
+
+## Description
+
+Braintrust scoring provider for evaluation and scoring using the Braintrust platform.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `openai_api_key` | `str \| None` | No |  | The OpenAI API Key |
+
+## Sample Configuration
+
+```yaml
+openai_api_key: ${env.OPENAI_API_KEY:=}
+```
--- a/docs/docs/providers/scoring/inline_llm-as-judge.mdx
+++ b/docs/docs/providers/scoring/inline_llm-as-judge.mdx
@ -0,0 +1,17 @@
+---
+description: "LLM-as-judge scoring provider that uses language models to evaluate and score responses."
+sidebar_label: Llm-As-Judge
+title: inline::llm-as-judge
+---
+
+# inline::llm-as-judge
+
+## Description
+
+LLM-as-judge scoring provider that uses language models to evaluate and score responses.
+
+## Sample Configuration
+
+```yaml
+{}
+```
--- a/docs/docs/providers/telemetry/index.mdx
+++ b/docs/docs/providers/telemetry/index.mdx
@ -0,0 +1,14 @@
+---
+sidebar_label: Telemetry
+title: Telemetry
+---
+
+# Telemetry
+
+## Overview
+
+This section contains documentation for all available providers for the **telemetry** API.
+
+## Providers
+
+- [Meta-Reference](./inline_meta-reference)
--- a/docs/docs/providers/telemetry/inline_meta-reference.mdx
+++ b/docs/docs/providers/telemetry/inline_meta-reference.mdx
@ -0,0 +1,29 @@
+---
+description: "Meta's reference implementation of telemetry and observability using OpenTelemetry."
+sidebar_label: Meta-Reference
+title: inline::meta-reference
+---
+
+# inline::meta-reference
+
+## Description
+
+Meta's reference implementation of telemetry and observability using OpenTelemetry.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `otel_exporter_otlp_endpoint` | `str \| None` | No |  | The OpenTelemetry collector endpoint URL (base URL for traces, metrics, and logs). If not set, the SDK will use OTEL_EXPORTER_OTLP_ENDPOINT environment variable. |
+| `service_name` | `<class 'str'>` | No |  | The service name to use for telemetry |
+| `sinks` | `list[inline.telemetry.meta_reference.config.TelemetrySink]` | No | [&lt;TelemetrySink.CONSOLE: 'console'&gt;, &lt;TelemetrySink.SQLITE: 'sqlite'&gt;] | List of telemetry sinks to enable (possible values: otel_trace, otel_metric, sqlite, console) |
+| `sqlite_db_path` | `<class 'str'>` | No | ~/.llama/runtime/trace_store.db | The path to the SQLite database to use for storing traces |
+
+## Sample Configuration
+
+```yaml
+service_name: "${env.OTEL_SERVICE_NAME:=\u200B}"
+sinks: ${env.TELEMETRY_SINKS:=console,sqlite}
+sqlite_db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/trace_store.db
+otel_exporter_otlp_endpoint: ${env.OTEL_EXPORTER_OTLP_ENDPOINT:=}
+```
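
The `sinks` setting is given as a comma-separated string such as `console,sqlite`, while the configuration table types it as a list of `TelemetrySink` values. The sketch below shows one way to turn the string into enum members. The `TelemetrySink` class here is a stand-in that mirrors the four values listed in the field description, not an import from Llama Stack, and `parse_sinks` is a hypothetical helper.

```python
from enum import Enum
from typing import List


class TelemetrySink(str, Enum):
    """Stand-in enum; the values come from the `sinks` field description above."""

    OTEL_TRACE = "otel_trace"
    OTEL_METRIC = "otel_metric"
    SQLITE = "sqlite"
    CONSOLE = "console"


def parse_sinks(raw: str) -> List[TelemetrySink]:
    """Split a comma-separated string into sinks; unknown names raise ValueError."""
    return [TelemetrySink(item.strip()) for item in raw.split(",") if item.strip()]


sinks = parse_sinks("console,sqlite")
print([sink.value for sink in sinks])
```

Looking a name up through the enum constructor means a typo such as `sqllite` fails immediately instead of being silently ignored.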
--- a/docs/docs/providers/tool_runtime/index.mdx
+++ b/docs/docs/providers/tool_runtime/index.mdx
@ -0,0 +1,19 @@
+---
+sidebar_label: Tool Runtime
+title: Tool_Runtime
+---
+
+# Tool_Runtime
+
+## Overview
+
+This section contains documentation for all available providers for the **tool_runtime** API.
+
+## Providers
+
+- [Rag-Runtime](./inline_rag-runtime)
+- [Remote - Bing-Search](./remote_bing-search)
+- [Remote - Brave-Search](./remote_brave-search)
+- [Remote - Model-Context-Protocol](./remote_model-context-protocol)
+- [Remote - Tavily-Search](./remote_tavily-search)
+- [Remote - Wolfram-Alpha](./remote_wolfram-alpha)
--- a/docs/docs/providers/tool_runtime/inline_rag-runtime.mdx
+++ b/docs/docs/providers/tool_runtime/inline_rag-runtime.mdx
@ -0,0 +1,17 @@
+---
+description: "RAG (Retrieval-Augmented Generation) tool runtime for document ingestion, chunking, and semantic search."
+sidebar_label: Rag-Runtime
+title: inline::rag-runtime
+---
+
+# inline::rag-runtime
+
+## Description
+
+RAG (Retrieval-Augmented Generation) tool runtime for document ingestion, chunking, and semantic search.
+
+## Sample Configuration
+
+```yaml
+{}
+```
--- a/docs/docs/providers/tool_runtime/remote_bing-search.mdx
+++ b/docs/docs/providers/tool_runtime/remote_bing-search.mdx
@ -0,0 +1,24 @@
+---
+description: "Bing Search tool for web search capabilities using Microsoft's search engine."
+sidebar_label: Remote - Bing-Search
+title: remote::bing-search
+---
+
+# remote::bing-search
+
+## Description
+
+Bing Search tool for web search capabilities using Microsoft's search engine.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `api_key` | `str \| None` | No |  |  |
+| `top_k` | `<class 'int'>` | No | 3 |  |
+
+## Sample Configuration
+
+```yaml
+api_key: ${env.BING_API_KEY:=}
+```
--- a/docs/docs/providers/tool_runtime/remote_brave-search.mdx
+++ b/docs/docs/providers/tool_runtime/remote_brave-search.mdx
@ -0,0 +1,25 @@
+---
+description: "Brave Search tool for web search capabilities with privacy-focused results."
+sidebar_label: Remote - Brave-Search
+title: remote::brave-search
+---
+
+# remote::brave-search
+
+## Description
+
+Brave Search tool for web search capabilities with privacy-focused results.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `api_key` | `str \| None` | No |  | The Brave Search API Key |
+| `max_results` | `<class 'int'>` | No | 3 | The maximum number of results to return |
+
+## Sample Configuration
+
+```yaml
+api_key: ${env.BRAVE_SEARCH_API_KEY:=}
+max_results: 3
+```
--- a/docs/docs/providers/tool_runtime/remote_model-context-protocol.mdx
+++ b/docs/docs/providers/tool_runtime/remote_model-context-protocol.mdx
@ -0,0 +1,17 @@
+---
+description: "Model Context Protocol (MCP) tool for standardized tool calling and context management."
+sidebar_label: Remote - Model-Context-Protocol
+title: remote::model-context-protocol
+---
+
+# remote::model-context-protocol
+
+## Description
+
+Model Context Protocol (MCP) tool for standardized tool calling and context management.
+
+## Sample Configuration
+
+```yaml
+{}
+```
--- a/docs/docs/providers/tool_runtime/remote_tavily-search.mdx
+++ b/docs/docs/providers/tool_runtime/remote_tavily-search.mdx
@ -0,0 +1,25 @@
+---
+description: "Tavily Search tool for AI-optimized web search with structured results."
+sidebar_label: Remote - Tavily-Search
+title: remote::tavily-search
+---
+
+# remote::tavily-search
+
+## Description
+
+Tavily Search tool for AI-optimized web search with structured results.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `api_key` | `str \| None` | No |  | The Tavily Search API Key |
+| `max_results` | `<class 'int'>` | No | 3 | The maximum number of results to return |
+
+## Sample Configuration
+
+```yaml
+api_key: ${env.TAVILY_SEARCH_API_KEY:=}
+max_results: 3
+```
--- a/docs/docs/providers/tool_runtime/remote_wolfram-alpha.mdx
+++ b/docs/docs/providers/tool_runtime/remote_wolfram-alpha.mdx
@ -0,0 +1,23 @@
+---
+description: "Wolfram Alpha tool for computational knowledge and mathematical calculations."
+sidebar_label: Remote - Wolfram-Alpha
+title: remote::wolfram-alpha
+---
+
+# remote::wolfram-alpha
+
+## Description
+
+Wolfram Alpha tool for computational knowledge and mathematical calculations.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `api_key` | `str \| None` | No |  |  |
+
+## Sample Configuration
+
+```yaml
+api_key: ${env.WOLFRAM_ALPHA_API_KEY:=}
+```
--- a/docs/docs/providers/vector_io/index.mdx
+++ b/docs/docs/providers/vector_io/index.mdx
@ -0,0 +1,25 @@
+---
+sidebar_label: Vector Io
+title: Vector_Io
+---
+
+# Vector_Io
+
+## Overview
+
+This section contains documentation for all available providers for the **vector_io** API.
+
+## Providers
+
+- [Chromadb](./inline_chromadb)
+- [Faiss](./inline_faiss)
+- [Meta-Reference](./inline_meta-reference)
+- [Milvus](./inline_milvus)
+- [Qdrant](./inline_qdrant)
+- [Sqlite-Vec](./inline_sqlite-vec)
+- [Sqlite Vec](./inline_sqlite_vec)
+- [Remote - Chromadb](./remote_chromadb)
+- [Remote - Milvus](./remote_milvus)
+- [Remote - Pgvector](./remote_pgvector)
+- [Remote - Qdrant](./remote_qdrant)
+- [Remote - Weaviate](./remote_weaviate)
--- a/docs/docs/providers/vector_io/inline_chromadb.mdx
+++ b/docs/docs/providers/vector_io/inline_chromadb.mdx
@ -0,0 +1,91 @@
+---
+description: |
+  [Chroma](https://www.trychroma.com/) is an inline and remote vector
+  database provider for Llama Stack. It allows you to store and query vectors directly within a Chroma database.
+  That means you're not limited to storing vectors in memory or in a separate service.
+
+  ## Features
+  Chroma supports:
+  - Store embeddings and their metadata
+  - Vector search
+  - Full-text search
+  - Document storage
+  - Metadata filtering
+  - Multi-modal retrieval
+
+  ## Usage
+
+  To use Chroma in your Llama Stack project, follow these steps:
+
+  1. Install the necessary dependencies.
+  2. Configure your Llama Stack project to use Chroma.
+  3. Start storing and querying vectors.
+
+  ## Installation
+
+  You can install chroma using pip:
+
+  ```bash
+  pip install chromadb
+  ```
+
+  ## Documentation
+  See [Chroma's documentation](https://docs.trychroma.com/docs/overview/introduction) for more details about Chroma in general.
+sidebar_label: Chromadb
+title: inline::chromadb
+---
+
+# inline::chromadb
+
+## Description
+
+
+[Chroma](https://www.trychroma.com/) is an inline and remote vector
+database provider for Llama Stack. It allows you to store and query vectors directly within a Chroma database.
+That means you're not limited to storing vectors in memory or in a separate service.
+
+## Features
+Chroma supports:
+- Store embeddings and their metadata
+- Vector search
+- Full-text search
+- Document storage
+- Metadata filtering
+- Multi-modal retrieval
+
+## Usage
+
+To use Chroma in your Llama Stack project, follow these steps:
+
+1. Install the necessary dependencies.
+2. Configure your Llama Stack project to use Chroma.
+3. Start storing and querying vectors.
+
+## Installation
+
+You can install chroma using pip:
+
+```bash
+pip install chromadb
+```
+
+## Documentation
+See [Chroma's documentation](https://docs.trychroma.com/docs/overview/introduction) for more details about Chroma in general.
+
+
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `db_path` | `<class 'str'>` | No |  |  |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend |
+
+## Sample Configuration
+
+```yaml
+db_path: ${env.CHROMADB_PATH}
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/chroma_inline_registry.db
+```
--- a/docs/docs/providers/vector_io/inline_faiss.mdx
+++ b/docs/docs/providers/vector_io/inline_faiss.mdx
@ -0,0 +1,106 @@
+---
+description: |
+  [Faiss](https://github.com/facebookresearch/faiss) is an inline vector database provider for Llama Stack. It
+  allows you to store and query vectors directly in memory.
+  That means you'll get fast and efficient vector retrieval.
+
+  ## Features
+
+  - Lightweight and easy to use
+  - Fully integrated with Llama Stack
+  - GPU support
+  - **Vector search** - FAISS supports pure vector similarity search using embeddings
+
+  ## Search Modes
+
+  **Supported:**
+  - **Vector Search** (`mode="vector"`): Performs vector similarity search using embeddings
+
+  **Not Supported:**
+  - **Keyword Search** (`mode="keyword"`): Not supported by FAISS
+  - **Hybrid Search** (`mode="hybrid"`): Not supported by FAISS
+
+  > **Note**: FAISS is designed as a pure vector similarity search library. See the [FAISS GitHub repository](https://github.com/facebookresearch/faiss) for more details about FAISS's core functionality.
+
+  ## Usage
+
+  To use Faiss in your Llama Stack project, follow these steps:
+
+  1. Install the necessary dependencies.
+  2. Configure your Llama Stack project to use Faiss.
+  3. Start storing and querying vectors.
+
+  ## Installation
+
+  You can install Faiss using pip:
+
+  ```bash
+  pip install faiss-cpu
+  ```
+  ## Documentation
+  See [Faiss' documentation](https://faiss.ai/) or the [Faiss Wiki](https://github.com/facebookresearch/faiss/wiki) for
+  more details about Faiss in general.
+sidebar_label: Faiss
+title: inline::faiss
+---
+
+# inline::faiss
+
+## Description
+
+
+[Faiss](https://github.com/facebookresearch/faiss) is an inline vector database provider for Llama Stack. It
+allows you to store and query vectors directly in memory.
+That means you'll get fast and efficient vector retrieval.
+
+## Features
+
+- Lightweight and easy to use
+- Fully integrated with Llama Stack
+- GPU support
+- **Vector search** - FAISS supports pure vector similarity search using embeddings
+
+## Search Modes
+
+**Supported:**
+- **Vector Search** (`mode="vector"`): Performs vector similarity search using embeddings
+
+**Not Supported:**
+- **Keyword Search** (`mode="keyword"`): Not supported by FAISS
+- **Hybrid Search** (`mode="hybrid"`): Not supported by FAISS
+
+> **Note**: FAISS is designed as a pure vector similarity search library. See the [FAISS GitHub repository](https://github.com/facebookresearch/faiss) for more details about FAISS's core functionality.
+
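To make "pure vector similarity search" concrete, here is a brute-force sketch in plain Python. It does not use the FAISS API; cosine similarity and the `top_k` helper are assumptions chosen for illustration, and FAISS itself uses optimized index structures and its own distance metrics.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, embeddings, k=3):
    """Return the k (chunk_id, score) pairs most similar to the query, best first.

    `embeddings` maps a hypothetical chunk id to its embedding vector.
    """
    scored = [(cid, cosine_similarity(query, vec)) for cid, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Keyword and hybrid modes need a text index in addition to the embeddings, which is why a vector-only library cannot serve them on its own.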
+## Usage
+
+To use Faiss in your Llama Stack project, follow these steps:
+
+1. Install the necessary dependencies.
+2. Configure your Llama Stack project to use Faiss.
+3. Start storing and querying vectors.
+
+## Installation
+
+You can install Faiss using pip:
+
+```bash
+pip install faiss-cpu
+```
+## Documentation
+See [Faiss' documentation](https://faiss.ai/) or the [Faiss Wiki](https://github.com/facebookresearch/faiss/wiki) for
+more details about Faiss in general.
+
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite |  |
+
+## Sample Configuration
+
+```yaml
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/faiss_store.db
+```
--- a/docs/docs/providers/vector_io/inline_meta-reference.mdx
+++ b/docs/docs/providers/vector_io/inline_meta-reference.mdx
@ -0,0 +1,30 @@
+---
+description: "Meta's reference implementation of a vector database."
+sidebar_label: Meta-Reference
+title: inline::meta-reference
+---
+
+# inline::meta-reference
+
+## Description
+
+Meta's reference implementation of a vector database.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite |  |
+
+## Sample Configuration
+
+```yaml
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/faiss_store.db
+```
+## Deprecation Notice
+
+:::warning
+Please use the `inline::faiss` provider instead.
+:::
--- a/docs/docs/providers/vector_io/inline_milvus.mdx
+++ b/docs/docs/providers/vector_io/inline_milvus.mdx
@ -0,0 +1,30 @@
+---
+description: "Please refer to the remote provider documentation."
+sidebar_label: Milvus
+title: inline::milvus
+---
+
+# inline::milvus
+
+## Description
+
+
+Please refer to the remote provider documentation.
+
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `db_path` | `<class 'str'>` | No |  |  |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend (SQLite only for now) |
+| `consistency_level` | `<class 'str'>` | No | Strong | The consistency level of the Milvus server |
+
+## Sample Configuration
+
+```yaml
+db_path: ${env.MILVUS_DB_PATH:=~/.llama/dummy}/milvus.db
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/milvus_registry.db
+```
--- a/docs/docs/providers/vector_io/inline_qdrant.mdx
+++ b/docs/docs/providers/vector_io/inline_qdrant.mdx
@ -0,0 +1,110 @@
+---
+description: |
+  [Qdrant](https://qdrant.tech/documentation/) is an inline and remote vector database provider for Llama Stack. It
+  allows you to store and query vectors directly in memory.
+  That means you'll get fast and efficient vector retrieval.
+
+  > By default, Qdrant stores vectors in RAM, delivering incredibly fast access for datasets that fit comfortably in
+  > memory. But when your dataset exceeds RAM capacity, Qdrant offers Memmap as an alternative.
+  >
+  > \[[An Introduction to Vector Databases](https://qdrant.tech/articles/what-is-a-vector-database/)\]
+
+
+
+  ## Features
+
+  - Lightweight and easy to use
+  - Fully integrated with Llama Stack
+  - Apache 2.0 license terms
+  - Store embeddings and their metadata
+  - Supports search by
+    [Keyword](https://qdrant.tech/articles/qdrant-introduces-full-text-filters-and-indexes/)
+    and [Hybrid](https://qdrant.tech/articles/hybrid-search/#building-a-hybrid-search-system-in-qdrant) search
+  - [Multilingual and Multimodal retrieval](https://qdrant.tech/documentation/multimodal-search/)
+  - [Metadata filtering](https://qdrant.tech/articles/vector-search-filtering/)
+  - [GPU support](https://qdrant.tech/documentation/guides/running-with-gpu/)
+
+  ## Usage
+
+  To use Qdrant in your Llama Stack project, follow these steps:
+
+  1. Install the necessary dependencies.
+  2. Configure your Llama Stack project to use Qdrant.
+  3. Start storing and querying vectors.
+
+  ## Installation
+
+  You can install Qdrant using docker:
+
+  ```bash
+  docker pull qdrant/qdrant
+  ```
+  ## Documentation
+  See the [Qdrant documentation](https://qdrant.tech/documentation/) for more details about Qdrant in general.
+sidebar_label: Qdrant
+title: inline::qdrant
+---
+
+# inline::qdrant
+
+## Description
+
+
+[Qdrant](https://qdrant.tech/documentation/) is an inline and remote vector database provider for Llama Stack. It
+allows you to store and query vectors directly in memory.
+That means you'll get fast and efficient vector retrieval.
+
+> By default, Qdrant stores vectors in RAM, delivering incredibly fast access for datasets that fit comfortably in
+> memory. But when your dataset exceeds RAM capacity, Qdrant offers Memmap as an alternative.
+>
+> \[[An Introduction to Vector Databases](https://qdrant.tech/articles/what-is-a-vector-database/)\]
+
+
+
+## Features
+
+- Lightweight and easy to use
+- Fully integrated with Llama Stack
+- Apache 2.0 license terms
+- Store embeddings and their metadata
+- Supports search by
+  [Keyword](https://qdrant.tech/articles/qdrant-introduces-full-text-filters-and-indexes/)
+  and [Hybrid](https://qdrant.tech/articles/hybrid-search/#building-a-hybrid-search-system-in-qdrant) search
+- [Multilingual and Multimodal retrieval](https://qdrant.tech/documentation/multimodal-search/)
+- [Metadata filtering](https://qdrant.tech/articles/vector-search-filtering/)
+- [GPU support](https://qdrant.tech/documentation/guides/running-with-gpu/)
+
+## Usage
+
+To use Qdrant in your Llama Stack project, follow these steps:
+
+1. Install the necessary dependencies.
+2. Configure your Llama Stack project to use Qdrant.
+3. Start storing and querying vectors.
+
+## Installation
+
+You can install Qdrant using docker:
+
+```bash
+docker pull qdrant/qdrant
+```
+## Documentation
+See the [Qdrant documentation](https://qdrant.tech/documentation/) for more details about Qdrant in general.
+
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `path` | `<class 'str'>` | No |  |  |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite |  |
+
+## Sample Configuration
+
+```yaml
+path: ${env.QDRANT_PATH:=~/.llama/dummy}/qdrant.db
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/qdrant_registry.db
+```
--- a/docs/docs/providers/vector_io/inline_sqlite-vec.mdx
+++ b/docs/docs/providers/vector_io/inline_sqlite-vec.mdx
@ -0,0 +1,420 @@
+---
+description: |
+  [SQLite-Vec](https://github.com/asg017/sqlite-vec) is an inline vector database provider for Llama Stack. It
+  allows you to store and query vectors directly within an SQLite database.
+  That means you're not limited to storing vectors in memory or in a separate service.
+
+  ## Features
+
+  - Lightweight and easy to use
+  - Fully integrated with Llama Stack
+  - Uses disk-based storage for persistence, allowing for larger vector storage
+
+  ### Comparison to Faiss
+
+  The choice between Faiss and sqlite-vec should be made based on the needs of your application,
+  as they have different strengths.
+
+  #### Choosing the Right Provider
+
+  Scenario | Recommended Tool | Reason
+  -- |-----------------| --
+  Online Analytical Processing (OLAP) | Faiss           | Fast, in-memory searches
+  Online Transaction Processing (OLTP) | sqlite-vec      | Frequent writes and reads
+  Frequent writes | sqlite-vec      | Efficient disk-based storage and incremental indexing
+  Large datasets | sqlite-vec      | Disk-based storage for larger vector storage
+  Datasets that can fit in memory, frequent reads | Faiss | Optimized for speed, indexing, and GPU acceleration
+
+  #### Empirical Example
+
+  Consider the histogram below in which 10,000 randomly generated strings were inserted
+  in batches of 100 into both Faiss and sqlite-vec using `client.tool_runtime.rag_tool.insert()`.
+
+  ```{image} ../../../../_static/providers/vector_io/write_time_comparison_sqlite-vec-faiss.png
+  :alt: Comparison of SQLite-Vec and Faiss write times
+  :width: 400px
+  ```
+
+  You will notice that the average write time for `sqlite-vec` was 788ms, compared to
+  47,640ms for Faiss. While the number is jarring, if you look at the distribution, you can see that it is rather
+  uniformly spread across the [1500, 100000] interval.
+
+  Looking at each individual write in the order that the documents are inserted, you'll see the write time
+  increase as Faiss reindexes the vectors after each write.
+  ```{image} ../../../../_static/providers/vector_io/write_time_sequence_sqlite-vec-faiss.png
+  :alt: Comparison of SQLite-Vec and Faiss write times
+  :width: 400px
+  ```
+
+  In comparison, read times for Faiss were on average 10% faster than for sqlite-vec.
+  The modes of the two distributions highlight the difference further: Faiss
+  will likely yield faster read performance.
+
+  ```{image} ../../../../_static/providers/vector_io/read_time_comparison_sqlite-vec-faiss.png
+  :alt: Comparison of SQLite-Vec and Faiss read times
+  :width: 400px
+  ```
+
+  ## Usage
+
+  To use sqlite-vec in your Llama Stack project, follow these steps:
+
+  1. Install the necessary dependencies.
+  2. Configure your Llama Stack project to use SQLite-Vec.
+  3. Start storing and querying vectors.
+
+  The SQLite-vec provider supports three search modes:
+
+  1. **Vector Search** (`mode="vector"`): Performs pure vector similarity search using the embeddings.
+  2. **Keyword Search** (`mode="keyword"`): Performs full-text search using SQLite's FTS5.
+  3. **Hybrid Search** (`mode="hybrid"`): Combines both vector and keyword search for better results. First performs keyword search to get candidate matches, then applies vector similarity search on those candidates.
+
+  Example with hybrid search:
+  ```python
+  response = await vector_io.query_chunks(
+      vector_db_id="my_db",
+      query="your query here",
+      params={"mode": "hybrid", "max_chunks": 3, "score_threshold": 0.7},
+  )
+
+  # Using RRF ranker
+  response = await vector_io.query_chunks(
+      vector_db_id="my_db",
+      query="your query here",
+      params={
+          "mode": "hybrid",
+          "max_chunks": 3,
+          "score_threshold": 0.7,
+          "ranker": {"type": "rrf", "impact_factor": 60.0},
+      },
+  )
+
+  # Using weighted ranker
+  response = await vector_io.query_chunks(
+      vector_db_id="my_db",
+      query="your query here",
+      params={
+          "mode": "hybrid",
+          "max_chunks": 3,
+          "score_threshold": 0.7,
+          "ranker": {"type": "weighted", "alpha": 0.7},  # 70% vector, 30% keyword
+      },
+  )
+  ```
+
+  Example with explicit vector search:
+  ```python
+  response = await vector_io.query_chunks(
+      vector_db_id="my_db",
+      query="your query here",
+      params={"mode": "vector", "max_chunks": 3, "score_threshold": 0.7},
+  )
+  ```
+
+  Example with keyword search:
+  ```python
+  response = await vector_io.query_chunks(
+      vector_db_id="my_db",
+      query="your query here",
+      params={"mode": "keyword", "max_chunks": 3, "score_threshold": 0.7},
+  )
+  ```
+
+  ## Supported Search Modes
+
+  The SQLite vector store supports three search modes:
+
+  1. **Vector Search** (`mode="vector"`): Uses vector similarity to find relevant chunks
+  2. **Keyword Search** (`mode="keyword"`): Uses keyword matching to find relevant chunks
+  3. **Hybrid Search** (`mode="hybrid"`): Combines both vector and keyword scores using a ranker
+
+  ### Hybrid Search
+
+  Hybrid search combines the strengths of both vector and keyword search by:
+  - Computing vector similarity scores
+  - Computing keyword match scores
+  - Using a ranker to combine these scores
+
+  Two ranker types are supported:
+
+  1. **RRF (Reciprocal Rank Fusion)**:
+     - Combines ranks from both vector and keyword results
+     - Uses an impact factor (default: 60.0) to control the weight of higher-ranked results
+     - Good for balancing between vector and keyword results
+     - The default impact factor of 60.0 comes from the original RRF paper by Cormack et al. (2009) [^1], which found this value to provide optimal performance across various retrieval tasks
+
+  2. **Weighted**:
+     - Linearly combines normalized vector and keyword scores
+     - Uses an alpha parameter (0-1) to control the blend:
+       - alpha=0: Only use keyword scores
+       - alpha=1: Only use vector scores
+       - alpha=0.5: Equal weight to both (default)
+
+  Example using RAGQueryConfig with different search modes:
+
+  ```python
+  from llama_stack.apis.tools import RAGQueryConfig, RRFRanker, WeightedRanker
+
+  # Vector search
+  config = RAGQueryConfig(mode="vector", max_chunks=5)
+
+  # Keyword search
+  config = RAGQueryConfig(mode="keyword", max_chunks=5)
+
+  # Hybrid search with custom RRF ranker
+  config = RAGQueryConfig(
+      mode="hybrid",
+      max_chunks=5,
+      ranker=RRFRanker(impact_factor=50.0),  # Custom impact factor
+  )
+
+  # Hybrid search with weighted ranker
+  config = RAGQueryConfig(
+      mode="hybrid",
+      max_chunks=5,
+      ranker=WeightedRanker(alpha=0.7),  # 70% vector, 30% keyword
+  )
+
+  # Hybrid search with default RRF ranker
+  config = RAGQueryConfig(
+      mode="hybrid", max_chunks=5
+  )  # Will use RRF with impact_factor=60.0
+  ```
+
+  Note: The ranker configuration is only used in hybrid mode. For vector or keyword modes, the ranker parameter is ignored.
+
+  ## Installation
+
+  You can install SQLite-Vec using pip:
+
+  ```bash
+  pip install sqlite-vec
+  ```
+
+  ## Documentation
+
+  See [sqlite-vec's GitHub repo](https://github.com/asg017/sqlite-vec/tree/main) for more details about sqlite-vec in general.
+
+  [^1]: Cormack, G. V., Clarke, C. L., & Buettcher, S. (2009). [Reciprocal rank fusion outperforms condorcet and individual rank learning methods](https://dl.acm.org/doi/10.1145/1571941.1572114). In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (pp. 758-759).
+sidebar_label: Sqlite-Vec
+title: inline::sqlite-vec
+---
+
+# inline::sqlite-vec
+
+## Description
+
+
+[SQLite-Vec](https://github.com/asg017/sqlite-vec) is an inline vector database provider for Llama Stack. It
+allows you to store and query vectors directly within an SQLite database.
+That means you're not limited to storing vectors in memory or in a separate service.
+
+## Features
+
+- Lightweight and easy to use
+- Fully integrated with Llama Stack
+- Uses disk-based storage for persistence, allowing for larger vector storage
+
+### Comparison to Faiss
+
+The choice between Faiss and sqlite-vec should be made based on the needs of your application,
+as they have different strengths.
+
+#### Choosing the Right Provider
+
+Scenario | Recommended Tool | Reason
+-- |-----------------| --
+Online Analytical Processing (OLAP) | Faiss           | Fast, in-memory searches
+Online Transaction Processing (OLTP) | sqlite-vec      | Frequent writes and reads
+Frequent writes | sqlite-vec      | Efficient disk-based storage and incremental indexing
+Large datasets | sqlite-vec      | Disk-based storage for larger vector storage
+Datasets that can fit in memory, frequent reads | Faiss | Optimized for speed, indexing, and GPU acceleration
+
+#### Empirical Example
+
+Consider the histogram below in which 10,000 randomly generated strings were inserted
+in batches of 100 into both Faiss and sqlite-vec using `client.tool_runtime.rag_tool.insert()`.
+
+```{image} ../../../../_static/providers/vector_io/write_time_comparison_sqlite-vec-faiss.png
+:alt: Comparison of SQLite-Vec and Faiss write times
+:width: 400px
+```
+
+You will notice that the average write time for `sqlite-vec` was 788ms, compared to
+47,640ms for Faiss. While the number is jarring, if you look at the distribution, you can see that it is rather
+uniformly spread across the [1500, 100000] interval.
+
+Looking at each individual write in the order that the documents are inserted, you'll see the write time
+increase as Faiss reindexes the vectors after each write.
+```{image} ../../../../_static/providers/vector_io/write_time_sequence_sqlite-vec-faiss.png
+:alt: Comparison of SQLite-Vec and Faiss write times
+:width: 400px
+```
+
+In comparison, read times for Faiss were on average 10% faster than for sqlite-vec.
+The modes of the two distributions highlight the difference further: Faiss
+will likely yield faster read performance.
+
+```{image} ../../../../_static/providers/vector_io/read_time_comparison_sqlite-vec-faiss.png
+:alt: Comparison of SQLite-Vec and Faiss read times
+:width: 400px
+```
+
+## Usage
+
+To use sqlite-vec in your Llama Stack project, follow these steps:
+
+1. Install the necessary dependencies.
+2. Configure your Llama Stack project to use SQLite-Vec.
+3. Start storing and querying vectors.
+
+The SQLite-vec provider supports three search modes:
+
+1. **Vector Search** (`mode="vector"`): Performs pure vector similarity search using the embeddings.
+2. **Keyword Search** (`mode="keyword"`): Performs full-text search using SQLite's FTS5.
+3. **Hybrid Search** (`mode="hybrid"`): Combines both vector and keyword search for better results. First performs keyword search to get candidate matches, then applies vector similarity search on those candidates.
+
+Example with hybrid search:
+```python
+response = await vector_io.query_chunks(
+    vector_db_id="my_db",
+    query="your query here",
+    params={"mode": "hybrid", "max_chunks": 3, "score_threshold": 0.7},
+)
+
+# Using RRF ranker
+response = await vector_io.query_chunks(
+    vector_db_id="my_db",
+    query="your query here",
+    params={
+        "mode": "hybrid",
+        "max_chunks": 3,
+        "score_threshold": 0.7,
+        "ranker": {"type": "rrf", "impact_factor": 60.0},
+    },
+)
+
+# Using weighted ranker
+response = await vector_io.query_chunks(
+    vector_db_id="my_db",
+    query="your query here",
+    params={
+        "mode": "hybrid",
+        "max_chunks": 3,
+        "score_threshold": 0.7,
+        "ranker": {"type": "weighted", "alpha": 0.7},  # 70% vector, 30% keyword
+    },
+)
+```
+
+Example with explicit vector search:
+```python
+response = await vector_io.query_chunks(
+    vector_db_id="my_db",
+    query="your query here",
+    params={"mode": "vector", "max_chunks": 3, "score_threshold": 0.7},
+)
+```
+
+Example with keyword search:
+```python
+response = await vector_io.query_chunks(
+    vector_db_id="my_db",
+    query="your query here",
+    params={"mode": "keyword", "max_chunks": 3, "score_threshold": 0.7},
+)
+```
+
+## Supported Search Modes
+
+The SQLite vector store supports three search modes:
+
+1. **Vector Search** (`mode="vector"`): Uses vector similarity to find relevant chunks
+2. **Keyword Search** (`mode="keyword"`): Uses keyword matching to find relevant chunks
+3. **Hybrid Search** (`mode="hybrid"`): Combines both vector and keyword scores using a ranker
+
+### Hybrid Search
+
+Hybrid search combines the strengths of both vector and keyword search by:
+- Computing vector similarity scores
+- Computing keyword match scores
+- Using a ranker to combine these scores
+
+Two ranker types are supported:
+
+1. **RRF (Reciprocal Rank Fusion)**:
+   - Combines ranks from both vector and keyword results
+   - Uses an impact factor (default: 60.0) to control the weight of higher-ranked results
+   - Good for balancing between vector and keyword results
+   - The default impact factor of 60.0 comes from the original RRF paper by Cormack et al. (2009) [^1], which found this value to provide optimal performance across various retrieval tasks
+
+2. **Weighted**:
+   - Linearly combines normalized vector and keyword scores
+   - Uses an alpha parameter (0-1) to control the blend:
+     - alpha=0: Only use keyword scores
+     - alpha=1: Only use vector scores
+     - alpha=0.5: Equal weight to both (default)
+
+Example using RAGQueryConfig with different search modes:
+
+```python
+from llama_stack.apis.tools import RAGQueryConfig, RRFRanker, WeightedRanker
+
+# Vector search
+config = RAGQueryConfig(mode="vector", max_chunks=5)
+
+# Keyword search
+config = RAGQueryConfig(mode="keyword", max_chunks=5)
+
+# Hybrid search with custom RRF ranker
+config = RAGQueryConfig(
+    mode="hybrid",
+    max_chunks=5,
+    ranker=RRFRanker(impact_factor=50.0),  # Custom impact factor
+)
+
+# Hybrid search with weighted ranker
+config = RAGQueryConfig(
+    mode="hybrid",
+    max_chunks=5,
+    ranker=WeightedRanker(alpha=0.7),  # 70% vector, 30% keyword
+)
+
+# Hybrid search with default RRF ranker
+config = RAGQueryConfig(
+    mode="hybrid", max_chunks=5
+)  # Will use RRF with impact_factor=60.0
+```
+
+Note: The ranker configuration is only used in hybrid mode. For vector or keyword modes, the ranker parameter is ignored.
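The two rankers described above can be sketched in a few lines of Python. This is an illustrative sketch, not the provider's implementation: the function names are hypothetical, RRF is written in its usual form (each result list contributes `1 / (impact_factor + rank)`), and the weighted blend assumes its input scores are already normalized to the range 0 to 1.

```python
def rrf_scores(vector_ranked, keyword_ranked, impact_factor=60.0):
    """Reciprocal Rank Fusion over two ranked lists of chunk ids (best first).

    Each list contributes 1 / (impact_factor + rank), with ranks starting at 1.
    """
    scores = {}
    for ranked in (vector_ranked, keyword_ranked):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (impact_factor + rank)
    return scores


def weighted_scores(vector_scores, keyword_scores, alpha=0.5):
    """Linear blend: alpha * vector + (1 - alpha) * keyword.

    Assumes both score dicts are already normalized to [0, 1];
    a chunk missing from one dict counts as 0.0 there.
    """
    ids = set(vector_scores) | set(keyword_scores)
    return {
        cid: alpha * vector_scores.get(cid, 0.0)
        + (1.0 - alpha) * keyword_scores.get(cid, 0.0)
        for cid in ids
    }
```

A chunk that appears in both result lists accumulates two RRF terms, so it tends to outrank a chunk that appears near the top of only one list.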
+
+## Installation
+
+You can install SQLite-Vec using pip:
+
+```bash
+pip install sqlite-vec
+```
+
+## Documentation
+
+See [sqlite-vec's GitHub repo](https://github.com/asg017/sqlite-vec/tree/main) for more details about sqlite-vec in general.
+
+[^1]: Cormack, G. V., Clarke, C. L., & Buettcher, S. (2009). [Reciprocal rank fusion outperforms condorcet and individual rank learning methods](https://dl.acm.org/doi/10.1145/1571941.1572114). In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (pp. 758-759).
+
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `db_path` | `str` | No |  | Path to the SQLite database file |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend (SQLite only for now) |
+
+## Sample Configuration
+
+```yaml
+db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/sqlite_vec.db
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/sqlite_vec_registry.db
+```
--- a/docs/docs/providers/vector_io/inline_sqlite_vec.mdx
+++ b/docs/docs/providers/vector_io/inline_sqlite_vec.mdx
@ -0,0 +1,34 @@
+---
+description: "Please refer to the sqlite-vec provider documentation."
+sidebar_label: Sqlite Vec
+title: inline::sqlite_vec
+---
+
+# inline::sqlite_vec
+
+## Description
+
+
+Please refer to the sqlite-vec provider documentation.
+
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `db_path` | `str` | No |  | Path to the SQLite database file |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend (SQLite only for now) |
+
+## Sample Configuration
+
+```yaml
+db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/sqlite_vec.db
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/sqlite_vec_registry.db
+```
+## Deprecation Notice
+
+:::warning
+Please use the `inline::sqlite-vec` provider instead (note the hyphen rather than the underscore).
+:::
--- a/docs/docs/providers/vector_io/remote_chromadb.mdx
+++ b/docs/docs/providers/vector_io/remote_chromadb.mdx
@ -0,0 +1,90 @@
+---
+description: |
+  [Chroma](https://www.trychroma.com/) is an inline and remote vector
+  database provider for Llama Stack. It allows you to store and query vectors directly within a Chroma database.
+  That means you're not limited to storing vectors in memory or in a separate service.
+
+  ## Features
+  Chroma supports:
+  - Store embeddings and their metadata
+  - Vector search
+  - Full-text search
+  - Document storage
+  - Metadata filtering
+  - Multi-modal retrieval
+
+  ## Usage
+
+  To use Chroma in your Llama Stack project, follow these steps:
+
+  1. Install the necessary dependencies.
+  2. Configure your Llama Stack project to use Chroma.
+  3. Start storing and querying vectors.
+
+  ## Installation
+
+  You can install Chroma using pip:
+
+  ```bash
+  pip install chromadb
+  ```
+
+  ## Documentation
+  See [Chroma's documentation](https://docs.trychroma.com/docs/overview/introduction) for more details about Chroma in general.
+sidebar_label: Remote - Chromadb
+title: remote::chromadb
+---
+
+# remote::chromadb
+
+## Description
+
+
+[Chroma](https://www.trychroma.com/) is an inline and remote vector
+database provider for Llama Stack. It allows you to store and query vectors directly within a Chroma database.
+That means you're not limited to storing vectors in memory or in a separate service.
+
+## Features
+Chroma supports:
+- Store embeddings and their metadata
+- Vector search
+- Full-text search
+- Document storage
+- Metadata filtering
+- Multi-modal retrieval
+
+## Usage
+
+To use Chroma in your Llama Stack project, follow these steps:
+
+1. Install the necessary dependencies.
+2. Configure your Llama Stack project to use Chroma.
+3. Start storing and querying vectors.
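+
+As a minimal sketch of step 2, a `vector_io` entry might look like the following. The `provider_id` value is an arbitrary label chosen for illustration, and the `config` fields simply mirror the sample configuration for this provider; adjust both to your deployment.
+
+```yaml
+vector_io:
+  - provider_id: chromadb  # illustrative label, any unique ID works
+    provider_type: remote::chromadb
+    config:
+      url: ${env.CHROMADB_URL}  # expected to point at a reachable Chroma server
+      kvstore:
+        type: sqlite
+        db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/chroma_remote_registry.db
+```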
+
+## Installation
+
+You can install Chroma using pip:
+
+```bash
+pip install chromadb
+```
+
+## Documentation
+See [Chroma's documentation](https://docs.trychroma.com/docs/overview/introduction) for more details about Chroma in general.
+
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `url` | `str \| None` | No |  |  |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend |
+
+## Sample Configuration
+
+```yaml
+url: ${env.CHROMADB_URL}
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/chroma_remote_registry.db
+```
--- a/docs/docs/providers/vector_io/remote_milvus.mdx
+++ b/docs/docs/providers/vector_io/remote_milvus.mdx
@ -0,0 +1,426 @@
+---
+description: |
+  [Milvus](https://milvus.io/) is an inline and remote vector database provider for Llama Stack. It
+  allows you to store and query vectors directly within a Milvus database.
+  That means you're not limited to storing vectors in memory or in a separate service.
+
+  ## Features
+
+  - Easy to use
+  - Fully integrated with Llama Stack
+  - Supports all search modes: vector, keyword, and hybrid search (both inline and remote configurations)
+
+  ## Usage
+
+  To use Milvus in your Llama Stack project, follow these steps:
+
+  1. Install the necessary dependencies.
+  2. Configure your Llama Stack project to use Milvus.
+  3. Start storing and querying vectors.
+
+  ## Installation
+
+  If you want to use inline Milvus, you can install:
+
+  ```bash
+  pip install "pymilvus[milvus-lite]"
+  ```
+
+  If you want to use remote Milvus, you can install:
+
+  ```bash
+  pip install pymilvus
+  ```
+
+  ## Configuration
+
+  In Llama Stack, Milvus can be configured in two ways:
+  - **Inline (Local) Configuration** - Uses Milvus-Lite for local storage
+  - **Remote Configuration** - Connects to a remote Milvus server
+
+  ### Inline (Local) Configuration
+
+  The simplest method is local configuration, which requires setting `db_path`, a path for locally storing Milvus-Lite files:
+
+  ```yaml
+  vector_io:
+    - provider_id: milvus
+      provider_type: inline::milvus
+      config:
+        db_path: ~/.llama/distributions/together/milvus_store.db
+  ```
+
+  ### Remote Configuration
+
+  Remote configuration is suitable for larger data storage requirements:
+
+  #### Standard Remote Connection
+
+  ```yaml
+  vector_io:
+    - provider_id: milvus
+      provider_type: remote::milvus
+      config:
+        uri: "http://<host>:<port>"
+        token: "<user>:<password>"
+  ```
+
+  #### TLS-Enabled Remote Connection (One-way TLS)
+
+  For connections to Milvus instances with one-way TLS enabled:
+
+  ```yaml
+  vector_io:
+    - provider_id: milvus
+      provider_type: remote::milvus
+      config:
+        uri: "https://<host>:<port>"
+        token: "<user>:<password>"
+        secure: True
+        server_pem_path: "/path/to/server.pem"
+  ```
+
+  #### Mutual TLS (mTLS) Remote Connection
+
+  For connections to Milvus instances with mutual TLS (mTLS) enabled:
+
+  ```yaml
+  vector_io:
+    - provider_id: milvus
+      provider_type: remote::milvus
+      config:
+        uri: "https://<host>:<port>"
+        token: "<user>:<password>"
+        secure: True
+        ca_pem_path: "/path/to/ca.pem"
+        client_pem_path: "/path/to/client.pem"
+        client_key_path: "/path/to/client.key"
+  ```
+
+  #### Key Parameters for TLS Configuration
+
+  - **`secure`**: Enables TLS encryption when set to `true`. Defaults to `false`.
+  - **`server_pem_path`**: Path to the **server certificate** for verifying the server's identity (used in one-way TLS).
+  - **`ca_pem_path`**: Path to the **Certificate Authority (CA) certificate** for validating the server certificate (required in mTLS).
+  - **`client_pem_path`**: Path to the **client certificate** file (required for mTLS).
+  - **`client_key_path`**: Path to the **client private key** file (required for mTLS).
+
+  ## Search Modes
+
+  Milvus supports three different search modes for both inline and remote configurations:
+
+  ### Vector Search
+  Vector search uses semantic similarity to find the most relevant chunks based on embedding vectors. This is the default search mode and works well for finding conceptually similar content.
+
+  ```python
+  # Vector search example
+  search_response = client.vector_stores.search(
+      vector_store_id=vector_store.id,
+      query="What is machine learning?",
+      search_mode="vector",
+      max_num_results=5,
+  )
+  ```
+
+  ### Keyword Search
+  Keyword search uses traditional text-based matching to find chunks containing specific terms or phrases. This is useful when you need exact term matches.
+
+  ```python
+  # Keyword search example
+  search_response = client.vector_stores.search(
+      vector_store_id=vector_store.id,
+      query="Python programming language",
+      search_mode="keyword",
+      max_num_results=5,
+  )
+  ```
+
+  ### Hybrid Search
+  Hybrid search combines both vector and keyword search methods to provide more comprehensive results. It leverages the strengths of both semantic similarity and exact term matching.
+
+  #### Basic Hybrid Search
+  ```python
+  # Basic hybrid search example (uses RRF ranker with default impact_factor=60.0)
+  search_response = client.vector_stores.search(
+      vector_store_id=vector_store.id,
+      query="neural networks in Python",
+      search_mode="hybrid",
+      max_num_results=5,
+  )
+  ```
+
+  **Note**: The default `impact_factor` value of 60.0 comes from the original RRF research paper, which reported that it performed well across a range of retrieval tasks: ["Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods"](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) (Cormack et al., 2009).
+
+  #### Hybrid Search with RRF (Reciprocal Rank Fusion) Ranker
+  RRF combines rankings from vector and keyword search by using reciprocal ranks. The impact factor controls how much weight is given to higher-ranked results.
+
+  ```python
+  # Hybrid search with custom RRF parameters
+  search_response = client.vector_stores.search(
+      vector_store_id=vector_store.id,
+      query="neural networks in Python",
+      search_mode="hybrid",
+      max_num_results=5,
+      ranking_options={
+          "ranker": {
+              "type": "rrf",
+              "impact_factor": 100.0,  # Higher values flatten the score gap between ranks
+          }
+      },
+  )
+  ```
+
+  #### Hybrid Search with Weighted Ranker
+  Weighted ranker linearly combines normalized scores from vector and keyword search. The alpha parameter controls the balance between the two search methods.
+
+  ```python
+  # Hybrid search with weighted ranker
+  search_response = client.vector_stores.search(
+      vector_store_id=vector_store.id,
+      query="neural networks in Python",
+      search_mode="hybrid",
+      max_num_results=5,
+      ranking_options={
+          "ranker": {
+              "type": "weighted",
+              "alpha": 0.7,  # 70% vector search, 30% keyword search
+          }
+      },
+  )
+  ```
+
+  For detailed documentation on RRF and Weighted rankers, please refer to the [Milvus Reranking Guide](https://milvus.io/docs/reranking.md).
+
+  ## Documentation
+  See the [Milvus documentation](https://milvus.io/docs/install-overview.md) for more details about Milvus in general.
+
+  For more details on TLS configuration, refer to the [TLS setup guide](https://milvus.io/docs/tls.md).
+sidebar_label: Remote - Milvus
+title: remote::milvus
+---
+
+# remote::milvus
+
+## Description
+
+
+[Milvus](https://milvus.io/) is an inline and remote vector database provider for Llama Stack. It
+allows you to store and query vectors directly within a Milvus database.
+That means you're not limited to storing vectors in memory or in a separate service.
+
+## Features
+
+- Easy to use
+- Fully integrated with Llama Stack
+- Supports all search modes: vector, keyword, and hybrid search (both inline and remote configurations)
+
+## Usage
+
+To use Milvus in your Llama Stack project, follow these steps:
+
+1. Install the necessary dependencies.
+2. Configure your Llama Stack project to use Milvus.
+3. Start storing and querying vectors.
+
+## Installation
+
+If you want to use inline Milvus, you can install:
+
+```bash
+pip install "pymilvus[milvus-lite]"
+```
+
+If you want to use remote Milvus, you can install:
+
+```bash
+pip install pymilvus
+```
+
+## Configuration
+
+In Llama Stack, Milvus can be configured in two ways:
+- **Inline (Local) Configuration** - Uses Milvus-Lite for local storage
+- **Remote Configuration** - Connects to a remote Milvus server
+
+### Inline (Local) Configuration
+
+The simplest method is local configuration, which requires setting `db_path`, a path for locally storing Milvus-Lite files:
+
+```yaml
+vector_io:
+  - provider_id: milvus
+    provider_type: inline::milvus
+    config:
+      db_path: ~/.llama/distributions/together/milvus_store.db
+```
+
+### Remote Configuration
+
+Remote configuration is suitable for larger data storage requirements:
+
+#### Standard Remote Connection
+
+```yaml
+vector_io:
+  - provider_id: milvus
+    provider_type: remote::milvus
+    config:
+      uri: "http://<host>:<port>"
+      token: "<user>:<password>"
+```
+
+#### TLS-Enabled Remote Connection (One-way TLS)
+
+For connections to Milvus instances with one-way TLS enabled:
+
+```yaml
+vector_io:
+  - provider_id: milvus
+    provider_type: remote::milvus
+    config:
+      uri: "https://<host>:<port>"
+      token: "<user>:<password>"
+      secure: True
+      server_pem_path: "/path/to/server.pem"
+```
+
+#### Mutual TLS (mTLS) Remote Connection
+
+For connections to Milvus instances with mutual TLS (mTLS) enabled:
+
+```yaml
+vector_io:
+  - provider_id: milvus
+    provider_type: remote::milvus
+    config:
+      uri: "https://<host>:<port>"
+      token: "<user>:<password>"
+      secure: True
+      ca_pem_path: "/path/to/ca.pem"
+      client_pem_path: "/path/to/client.pem"
+      client_key_path: "/path/to/client.key"
+```
+
+#### Key Parameters for TLS Configuration
+
+- **`secure`**: Enables TLS encryption when set to `true`. Defaults to `false`.
+- **`server_pem_path`**: Path to the **server certificate** for verifying the server's identity (used in one-way TLS).
+- **`ca_pem_path`**: Path to the **Certificate Authority (CA) certificate** for validating the server certificate (required in mTLS).
+- **`client_pem_path`**: Path to the **client certificate** file (required for mTLS).
+- **`client_key_path`**: Path to the **client private key** file (required for mTLS).
+
+## Search Modes
+
+Milvus supports three different search modes for both inline and remote configurations:
+
+### Vector Search
+Vector search uses semantic similarity to find the most relevant chunks based on embedding vectors. This is the default search mode and works well for finding conceptually similar content.
+
+```python
+# Vector search example
+search_response = client.vector_stores.search(
+    vector_store_id=vector_store.id,
+    query="What is machine learning?",
+    search_mode="vector",
+    max_num_results=5,
+)
+```
+
+### Keyword Search
+Keyword search uses traditional text-based matching to find chunks containing specific terms or phrases. This is useful when you need exact term matches.
+
+```python
+# Keyword search example
+search_response = client.vector_stores.search(
+    vector_store_id=vector_store.id,
+    query="Python programming language",
+    search_mode="keyword",
+    max_num_results=5,
+)
+```
+
+### Hybrid Search
+Hybrid search combines both vector and keyword search methods to provide more comprehensive results. It leverages the strengths of both semantic similarity and exact term matching.
+
+#### Basic Hybrid Search
+```python
+# Basic hybrid search example (uses RRF ranker with default impact_factor=60.0)
+search_response = client.vector_stores.search(
+    vector_store_id=vector_store.id,
+    query="neural networks in Python",
+    search_mode="hybrid",
+    max_num_results=5,
+)
+```
+
+**Note**: The default `impact_factor` value of 60.0 comes from the original RRF research paper, which reported that it performed well across a range of retrieval tasks: ["Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods"](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) (Cormack et al., 2009).
+
+#### Hybrid Search with RRF (Reciprocal Rank Fusion) Ranker
+RRF combines rankings from vector and keyword search by using reciprocal ranks. The impact factor controls how much weight is given to higher-ranked results.
+
+```python
+# Hybrid search with custom RRF parameters
+search_response = client.vector_stores.search(
+    vector_store_id=vector_store.id,
+    query="neural networks in Python",
+    search_mode="hybrid",
+    max_num_results=5,
+    ranking_options={
+        "ranker": {
+            "type": "rrf",
+            "impact_factor": 100.0,  # Higher values flatten the score gap between ranks
+        }
+    },
+)
+```
+
+#### Hybrid Search with Weighted Ranker
+Weighted ranker linearly combines normalized scores from vector and keyword search. The alpha parameter controls the balance between the two search methods.
+
+```python
+# Hybrid search with weighted ranker
+search_response = client.vector_stores.search(
+    vector_store_id=vector_store.id,
+    query="neural networks in Python",
+    search_mode="hybrid",
+    max_num_results=5,
+    ranking_options={
+        "ranker": {
+            "type": "weighted",
+            "alpha": 0.7,  # 70% vector search, 30% keyword search
+        }
+    },
+)
+```
+
+For detailed documentation on RRF and Weighted rankers, please refer to the [Milvus Reranking Guide](https://milvus.io/docs/reranking.md).
+
+## Documentation
+See the [Milvus documentation](https://milvus.io/docs/install-overview.md) for more details about Milvus in general.
+
+For more details on TLS configuration, refer to the [TLS setup guide](https://milvus.io/docs/tls.md).
+
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `uri` | `str` | No |  | The URI of the Milvus server |
+| `token` | `str \| None` | No |  | The token of the Milvus server |
+| `consistency_level` | `str` | No | Strong | The consistency level of the Milvus server |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend |
+| `config` | `dict` | No | &#123;&#125; | This configuration allows additional fields to be passed through to the underlying Milvus client. See the [Milvus](https://milvus.io/docs/install-overview.md) documentation for more details about Milvus in general. |
+
+:::note
+This configuration class accepts additional fields beyond those listed above. You can pass any additional configuration options that will be forwarded to the underlying provider.
+:::
+
+## Sample Configuration
+
+```yaml
+uri: ${env.MILVUS_ENDPOINT}
+token: ${env.MILVUS_TOKEN}
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/milvus_remote_registry.db
+```
--- a/docs/docs/providers/vector_io/remote_pgvector.mdx
+++ b/docs/docs/providers/vector_io/remote_pgvector.mdx
@ -0,0 +1,234 @@
+---
+description: |
+  [PGVector](https://github.com/pgvector/pgvector) is a remote vector database provider for Llama Stack. It
+  allows you to store and query vectors directly within a PostgreSQL database.
+  That means your vectors are persisted in PostgreSQL rather than held only in memory.
+
+  ## Features
+
+  - Easy to use
+  - Fully integrated with Llama Stack
+
+  There are three implementations of search available for PGVectorIndex:
+
+  1. Vector Search
+  - How it works:
+    - Uses PostgreSQL's vector extension (pgvector) to perform similarity search
+    - Compares query embeddings against stored embeddings using cosine distance or other distance metrics
+    - E.g. SQL query: SELECT document, embedding &lt;=&gt; %s::vector AS distance FROM table ORDER BY distance
+
+  - Characteristics:
+    - Semantic understanding - finds documents similar in meaning even if they don't share keywords
+    - Works with high-dimensional vector embeddings (typically 768, 1024, or higher dimensions)
+    - Best for: Finding conceptually related content, handling synonyms, cross-language search
+
+  2. Keyword Search
+  - How it works:
+    - Uses PostgreSQL's full-text search capabilities with tsvector and ts_rank
+    - Converts text to searchable tokens using to_tsvector('english', text). Default language is English.
+    - Eg. SQL query: SELECT document, ts_rank(tokenized_content, plainto_tsquery('english', %s)) AS score
+
+  - Characteristics:
+    - Lexical matching - finds exact keyword matches and variations
+    - Uses GIN (Generalized Inverted Index) for fast text search performance
+    - Scoring: Uses PostgreSQL's ts_rank function for relevance scoring
+    - Best for: Exact term matching, proper names, technical terms, Boolean-style queries
+
+  3. Hybrid Search
+  - How it works:
+    - Combines both vector and keyword search results
+    - Runs both searches independently, then merges results using configurable reranking
+
+  - Two reranking strategies available:
+    - Reciprocal Rank Fusion (RRF), default impact factor: 60.0
+    - Weighted Average, default alpha: 0.5
+
+  - Characteristics:
+    - Best of both worlds: semantic understanding + exact matching
+    - Documents appearing in both searches get boosted scores
+    - Configurable balance between semantic and lexical matching
+    - Best for: General-purpose search where you want both precision and recall
+
+  4. Database Schema
+
+  The PGVector implementation stores data in a layout that serves all three search types:
+
+  ```sql
+  CREATE TABLE vector_store_xxx (
+      id TEXT PRIMARY KEY,
+      document JSONB,                     -- Original document
+      embedding vector(dimension),        -- For vector search
+      content_text TEXT,                  -- Raw text content
+      tokenized_content TSVECTOR          -- For keyword search
+  );
+
+  -- Indexes for performance
+  CREATE INDEX content_gin_idx ON table USING GIN(tokenized_content);  -- Keyword search
+  -- Vector index created automatically by pgvector
+  ```
+
+  ## Usage
+
+  To use PGVector in your Llama Stack project, follow these steps:
+
+  1. Install the necessary dependencies.
+  2. Configure your Llama Stack project to use pgvector (e.g. `remote::pgvector`).
+  3. Start storing and querying vectors.
+
+  ## Example: Setting Up Your Environment for PGVector
+
+  1. Export env vars:
+  ```bash
+  export ENABLE_PGVECTOR=true
+  export PGVECTOR_HOST=localhost
+  export PGVECTOR_PORT=5432
+  export PGVECTOR_DB=llamastack
+  export PGVECTOR_USER=llamastack
+  export PGVECTOR_PASSWORD=llamastack
+  ```
+
+  2. Create DB:
+  ```bash
+  psql -h localhost -U postgres -c "CREATE ROLE llamastack LOGIN PASSWORD 'llamastack';"
+  psql -h localhost -U postgres -c "CREATE DATABASE llamastack OWNER llamastack;"
+  psql -h localhost -U llamastack -d llamastack -c "CREATE EXTENSION IF NOT EXISTS vector;"
+  ```
+
+  ## Installation
+
+  You can install PGVector using Docker:
+
+  ```bash
+  docker pull pgvector/pgvector:pg17
+  ```
+  ## Documentation
+  See [PGVector's documentation](https://github.com/pgvector/pgvector) for more details about PGVector in general.
+sidebar_label: Remote - Pgvector
+title: remote::pgvector
+---
+
+# remote::pgvector
+
+## Description
+
+
+[PGVector](https://github.com/pgvector/pgvector) is a remote vector database provider for Llama Stack. It
+allows you to store and query vectors directly within a PostgreSQL database.
+That means your vectors are persisted in PostgreSQL rather than held only in memory.
+
+## Features
+
+- Easy to use
+- Fully integrated with Llama Stack
+
+There are three implementations of search available for PGVectorIndex:
+
+1. Vector Search
+- How it works:
+  - Uses PostgreSQL's vector extension (pgvector) to perform similarity search
+  - Compares query embeddings against stored embeddings using cosine distance or other distance metrics
+  - E.g. SQL query: SELECT document, embedding &lt;=&gt; %s::vector AS distance FROM table ORDER BY distance
+
+- Characteristics:
+  - Semantic understanding - finds documents similar in meaning even if they don't share keywords
+  - Works with high-dimensional vector embeddings (typically 768, 1024, or higher dimensions)
+  - Best for: Finding conceptually related content, handling synonyms, cross-language search
+
+2. Keyword Search
+- How it works:
+  - Uses PostgreSQL's full-text search capabilities with tsvector and ts_rank
+  - Converts text to searchable tokens using to_tsvector('english', text). Default language is English.
+  - Eg. SQL query: SELECT document, ts_rank(tokenized_content, plainto_tsquery('english', %s)) AS score
+
+- Characteristics:
+  - Lexical matching - finds exact keyword matches and variations
+  - Uses GIN (Generalized Inverted Index) for fast text search performance
+  - Scoring: Uses PostgreSQL's ts_rank function for relevance scoring
+  - Best for: Exact term matching, proper names, technical terms, Boolean-style queries
+
+3. Hybrid Search
+- How it works:
+  - Combines both vector and keyword search results
+  - Runs both searches independently, then merges results using configurable reranking
+
+- Two reranking strategies available:
+  - Reciprocal Rank Fusion (RRF), default impact factor: 60.0
+  - Weighted Average, default alpha: 0.5
+
+- Characteristics:
+  - Best of both worlds: semantic understanding + exact matching
+  - Documents appearing in both searches get boosted scores
+  - Configurable balance between semantic and lexical matching
+  - Best for: General-purpose search where you want both precision and recall
+
+4. Database Schema
+
+The PGVector implementation stores data in a layout that serves all three search types:
+
+```sql
+CREATE TABLE vector_store_xxx (
+    id TEXT PRIMARY KEY,
+    document JSONB,                     -- Original document
+    embedding vector(dimension),        -- For vector search
+    content_text TEXT,                  -- Raw text content
+    tokenized_content TSVECTOR          -- For keyword search
+);
+
+-- Indexes for performance
+CREATE INDEX content_gin_idx ON table USING GIN(tokenized_content);  -- Keyword search
+-- Vector index created automatically by pgvector
+```
+
+## Usage
+
+To use PGVector in your Llama Stack project, follow these steps:
+
+1. Install the necessary dependencies.
+2. Configure your Llama Stack project to use pgvector (e.g. `remote::pgvector`).
+3. Start storing and querying vectors.
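+
+As a minimal sketch of step 2, a `vector_io` entry for this provider might look like the following. The `provider_id` value is an arbitrary label chosen for illustration, and the `config` fields mirror the sample configuration for this provider; the environment variables are the ones exported in the setup example.
+
+```yaml
+vector_io:
+  - provider_id: pgvector  # illustrative label, any unique ID works
+    provider_type: remote::pgvector
+    config:
+      host: ${env.PGVECTOR_HOST:=localhost}
+      port: ${env.PGVECTOR_PORT:=5432}
+      db: ${env.PGVECTOR_DB}
+      user: ${env.PGVECTOR_USER}
+      password: ${env.PGVECTOR_PASSWORD}
+      kvstore:
+        type: sqlite
+        db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/pgvector_registry.db
+```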
+
+## Example: Setting Up Your Environment for PGVector
+
+1. Export env vars:
+```bash
+export ENABLE_PGVECTOR=true
+export PGVECTOR_HOST=localhost
+export PGVECTOR_PORT=5432
+export PGVECTOR_DB=llamastack
+export PGVECTOR_USER=llamastack
+export PGVECTOR_PASSWORD=llamastack
+```
+
+2. Create DB:
+```bash
+psql -h localhost -U postgres -c "CREATE ROLE llamastack LOGIN PASSWORD 'llamastack';"
+psql -h localhost -U postgres -c "CREATE DATABASE llamastack OWNER llamastack;"
+psql -h localhost -U llamastack -d llamastack -c "CREATE EXTENSION IF NOT EXISTS vector;"
+```
+
+## Installation
+
+You can install PGVector using Docker:
+
+```bash
+docker pull pgvector/pgvector:pg17
+```
+## Documentation
+See [PGVector's documentation](https://github.com/pgvector/pgvector) for more details about PGVector in general.
+
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `host` | `str \| None` | No | localhost |  |
+| `port` | `int \| None` | No | 5432 |  |
+| `db` | `str \| None` | No | postgres |  |
+| `user` | `str \| None` | No | postgres |  |
+| `password` | `str \| None` | No | mysecretpassword |  |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend (SQLite only for now) |
+
+## Sample Configuration
+
+```yaml
+host: ${env.PGVECTOR_HOST:=localhost}
+port: ${env.PGVECTOR_PORT:=5432}
+db: ${env.PGVECTOR_DB}
+user: ${env.PGVECTOR_USER}
+password: ${env.PGVECTOR_PASSWORD}
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/pgvector_registry.db
+```
--- a/docs/docs/providers/vector_io/remote_qdrant.mdx
+++ b/docs/docs/providers/vector_io/remote_qdrant.mdx
@ -0,0 +1,38 @@
+---
+description: "Please refer to the inline provider documentation."
+sidebar_label: Remote - Qdrant
+title: remote::qdrant
+---
+
+# remote::qdrant
+
+## Description
+
+
+Please refer to the inline provider documentation.
+
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `location` | `str \| None` | No |  |  |
+| `url` | `str \| None` | No |  |  |
+| `port` | `int \| None` | No | 6333 |  |
+| `grpc_port` | `int` | No | 6334 |  |
+| `prefer_grpc` | `bool` | No | False |  |
+| `https` | `bool \| None` | No |  |  |
+| `api_key` | `str \| None` | No |  |  |
+| `prefix` | `str \| None` | No |  |  |
+| `timeout` | `int \| None` | No |  |  |
+| `host` | `str \| None` | No |  |  |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite |  |
+
+## Sample Configuration
+
+```yaml
+api_key: ${env.QDRANT_API_KEY:=}
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/qdrant_registry.db
+```
--- a/docs/docs/providers/vector_io/remote_weaviate.mdx
+++ b/docs/docs/providers/vector_io/remote_weaviate.mdx
@ -0,0 +1,88 @@
+---
+description: |
+  [Weaviate](https://weaviate.io/) is a vector database provider for Llama Stack.
+  It allows you to store and query vectors directly within a Weaviate database.
+  That means you're not limited to storing vectors in memory or in a separate service.
+
+  ## Features
+  Weaviate supports:
+  - Store embeddings and their metadata
+  - Vector search
+  - Full-text search
+  - Hybrid search
+  - Document storage
+  - Metadata filtering
+  - Multi-modal retrieval
+
+
+  ## Usage
+
+  To use Weaviate in your Llama Stack project, follow these steps:
+
+  1. Install the necessary dependencies.
+  2. Configure your Llama Stack project to use Weaviate.
+  3. Start storing and querying vectors.
+
+  ## Installation
+
+  To install Weaviate see the [Weaviate quickstart documentation](https://weaviate.io/developers/weaviate/quickstart).
+
+  ## Documentation
+  See [Weaviate's documentation](https://weaviate.io/developers/weaviate) for more details about Weaviate in general.
+sidebar_label: Remote - Weaviate
+title: remote::weaviate
+---
+
+# remote::weaviate
+
+## Description
+
+
+[Weaviate](https://weaviate.io/) is a vector database provider for Llama Stack.
+It allows you to store and query vectors directly within a Weaviate database.
+That means you're not limited to storing vectors in memory or in a separate service.
+
+## Features
+Weaviate supports:
+- Store embeddings and their metadata
+- Vector search
+- Full-text search
+- Hybrid search
+- Document storage
+- Metadata filtering
+- Multi-modal retrieval
+
+
+## Usage
+
+To use Weaviate in your Llama Stack project, follow these steps:
+
+1. Install the necessary dependencies.
+2. Configure your Llama Stack project to use Weaviate.
+3. Start storing and querying vectors.
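+
+As a minimal sketch of step 2, a `vector_io` entry for this provider might look like the following. The `provider_id` value is an arbitrary label chosen for illustration, and the `config` fields mirror the sample configuration for this provider.
+
+```yaml
+vector_io:
+  - provider_id: weaviate  # illustrative label, any unique ID works
+    provider_type: remote::weaviate
+    config:
+      weaviate_api_key: null  # set this if your Weaviate instance requires an API key
+      weaviate_cluster_url: ${env.WEAVIATE_CLUSTER_URL:=localhost:8080}
+      kvstore:
+        type: sqlite
+        db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/weaviate_registry.db
+```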
+
+## Installation
+
+To install Weaviate see the [Weaviate quickstart documentation](https://weaviate.io/developers/weaviate/quickstart).
+
+## Documentation
+See [Weaviate's documentation](https://weaviate.io/developers/weaviate) for more details about Weaviate in general.
+
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `weaviate_api_key` | `str \| None` | No |  | The API key for the Weaviate instance |
+| `weaviate_cluster_url` | `str \| None` | No | localhost:8080 | The URL of the Weaviate cluster |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend (SQLite only for now) |
+
+## Sample Configuration
+
+```yaml
+weaviate_api_key: null
+weaviate_cluster_url: ${env.WEAVIATE_CLUSTER_URL:=localhost:8080}
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/weaviate_registry.db
+```