docs: provider and distro codegen migration
parent 45da31801c · commit 785e578b6c
103 changed files with 1799 additions and 426 deletions

docs/docs/distributions/self_hosted_distro/meta-reference-gpu.md (new file, 125 lines)

@@ -0,0 +1,125 @@
---
orphan: true
---
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
# Meta Reference GPU Distribution

```{toctree}
:maxdepth: 2
:hidden:

self
```
The `llamastack/distribution-meta-reference-gpu` distribution consists of the following provider configurations:

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `inline::meta-reference` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
Note that you need access to NVIDIA GPUs to run this distribution; it is not compatible with CPU-only machines or machines with AMD GPUs.

### Environment Variables

The following environment variables can be configured:

- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
- `INFERENCE_MODEL`: Inference model loaded into the Meta Reference server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `INFERENCE_CHECKPOINT_DIR`: Directory containing the Meta Reference model checkpoint (default: `null`)
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
- `SAFETY_CHECKPOINT_DIR`: Directory containing the Llama-Guard model checkpoint (default: `null`)

## Prerequisite: Downloading Models

Run `llama model list --downloaded` to check that you have Llama model checkpoints downloaded in `~/.llama` before proceeding. See the [installation guide](../../references/llama_cli_reference/download_models.md) for how to download the models: `llama model list` shows the models available for download, and `llama model download` downloads the checkpoints.

```
$ llama model list --downloaded
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ Model                                   ┃ Size     ┃ Modified Time       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ Llama3.2-1B-Instruct:int4-qlora-eo8     │ 1.53 GB  │ 2025-02-26 11:22:28 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.2-1B                             │ 2.31 GB  │ 2025-02-18 21:48:52 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Prompt-Guard-86M                        │ 0.02 GB  │ 2025-02-26 11:29:28 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.2-3B-Instruct:int4-spinquant-eo8 │ 3.69 GB  │ 2025-02-26 11:37:41 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.2-3B                             │ 5.99 GB  │ 2025-02-18 21:51:26 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.1-8B                             │ 14.97 GB │ 2025-02-16 10:36:37 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.2-1B-Instruct:int4-spinquant-eo8 │ 1.51 GB  │ 2025-02-26 11:35:02 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama-Guard-3-1B                        │ 2.80 GB  │ 2025-02-26 11:20:46 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama-Guard-3-1B:int4                   │ 0.43 GB  │ 2025-02-26 11:33:33 │
└─────────────────────────────────────────┴──────────┴─────────────────────┘
```

## Running the Distribution

You can run the distribution via venv, or via Docker, which has a pre-built image.

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
LLAMA_STACK_PORT=8321
docker run \
  -it \
  --pull always \
  --gpus all \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  llamastack/distribution-meta-reference-gpu \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
```

If you are using Llama Stack Safety / Shield APIs, use:

```bash
docker run \
  -it \
  --pull always \
  --gpus all \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  llamastack/distribution-meta-reference-gpu \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
  --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
```

### Via venv

Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.

```bash
llama stack build --distro meta-reference-gpu --image-type venv
llama stack run distributions/meta-reference-gpu/run.yaml \
  --port 8321 \
  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
```

If you are using Llama Stack Safety / Shield APIs, use:

```bash
llama stack run distributions/meta-reference-gpu/run-with-safety.yaml \
  --port 8321 \
  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
  --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
```

docs/docs/distributions/self_hosted_distro/nvidia.md (new file, 171 lines)

@@ -0,0 +1,171 @@
---
orphan: true
---
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
# NVIDIA Distribution

The `llamastack/distribution-nvidia` distribution consists of the following provider configurations.

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `inline::localfs`, `remote::nvidia` |
| eval | `remote::nvidia` |
| files | `inline::localfs` |
| inference | `remote::nvidia` |
| post_training | `remote::nvidia` |
| safety | `remote::nvidia` |
| scoring | `inline::basic` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `inline::rag-runtime` |
| vector_io | `inline::faiss` |

### Environment Variables

The following environment variables can be configured:

- `NVIDIA_API_KEY`: NVIDIA API Key (default: ``)
- `NVIDIA_APPEND_API_VERSION`: Whether to append the API version to the base_url (default: `True`)
- `NVIDIA_DATASET_NAMESPACE`: NVIDIA Dataset Namespace (default: `default`)
- `NVIDIA_PROJECT_ID`: NVIDIA Project ID (default: `test-project`)
- `NVIDIA_CUSTOMIZER_URL`: NVIDIA Customizer URL (default: `https://customizer.api.nvidia.com`)
- `NVIDIA_OUTPUT_MODEL_DIR`: NVIDIA Output Model Directory (default: `test-example-model@v1`)
- `GUARDRAILS_SERVICE_URL`: URL for the NeMo Guardrails Service (default: `http://0.0.0.0:7331`)
- `NVIDIA_GUARDRAILS_CONFIG_ID`: NVIDIA Guardrail Configuration ID (default: `self-check`)
- `NVIDIA_EVALUATOR_URL`: URL for the NeMo Evaluator Service (default: `http://0.0.0.0:7331`)
- `INFERENCE_MODEL`: Inference model (default: `Llama3.1-8B-Instruct`)
- `SAFETY_MODEL`: Name of the model to use for safety (default: `meta/llama-3.1-8b-instruct`)

### Models

The following models are available by default:

- `meta/llama3-8b-instruct`
- `meta/llama3-70b-instruct`
- `meta/llama-3.1-8b-instruct`
- `meta/llama-3.1-70b-instruct`
- `meta/llama-3.1-405b-instruct`
- `meta/llama-3.2-1b-instruct`
- `meta/llama-3.2-3b-instruct`
- `meta/llama-3.2-11b-vision-instruct`
- `meta/llama-3.2-90b-vision-instruct`
- `meta/llama-3.3-70b-instruct`
- `nvidia/vila`
- `nvidia/llama-3.2-nv-embedqa-1b-v2`
- `nvidia/nv-embedqa-e5-v5`
- `nvidia/nv-embedqa-mistral-7b-v2`
- `snowflake/arctic-embed-l`

## Prerequisites

### NVIDIA API Keys

Make sure you have access to an NVIDIA API key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). Use this key for the `NVIDIA_API_KEY` environment variable.

### Deploy NeMo Microservices Platform

The NVIDIA NeMo microservices platform supports end-to-end microservice deployment of a complete AI flywheel on your Kubernetes cluster through the NeMo Microservices Helm Chart. Please refer to the [NVIDIA NeMo Microservices documentation](https://docs.nvidia.com/nemo/microservices/latest/about/index.html) for platform prerequisites and instructions to install and deploy the platform.

## Supported Services

Each Llama Stack API corresponds to a specific NeMo microservice. The core microservices (Customizer, Evaluator, Guardrails) are exposed by the same endpoint. The platform components (Data Store) are each exposed by separate endpoints.

### Inference: NVIDIA NIM

NVIDIA NIM is used for running inference with registered models. There are two ways to access NVIDIA NIMs:

1. Hosted (default): preview APIs hosted at https://integrate.api.nvidia.com (requires an API key)
2. Self-hosted: NVIDIA NIMs that run on your own infrastructure.

The deployed platform includes the NIM Proxy microservice, which is the service that provides access to your NIMs (for example, to run inference on a model). Set the `NVIDIA_BASE_URL` environment variable to use your NVIDIA NIM Proxy deployment.

### Datasetio API: NeMo Data Store

The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposes APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use that client to interact with the Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint.
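
Because the Data Store speaks the Hub API, a stock `HfApi` client can be pointed directly at it. A minimal sketch, where the endpoint URL and repo name are illustrative placeholders rather than values from this distribution:

```python
from huggingface_hub import HfApi

# Point the Hub client at the NeMo Data Store instead of huggingface.co;
# "http://data-store.test" stands in for your NVIDIA_DATASETS_URL value.
hf_api = HfApi(endpoint="http://data-store.test")

# Upload a training file into a (hypothetical) dataset repo in the Data Store.
hf_api.upload_file(
    path_or_fileobj="./train.jsonl",
    path_in_repo="training/train.jsonl",
    repo_id="default/sample-dataset",  # {namespace}/{dataset-name}, assumed
    repo_type="dataset",
)
```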

See the [NVIDIA Datasetio docs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/datasetio/nvidia/README.md) for supported features and example usage.

### Eval API: NeMo Evaluator

The NeMo Evaluator microservice supports evaluation of LLMs. Launching an evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint.
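
In other words, the registration call on the Llama Stack side is what materializes the config. A sketch with the Python client, where the benchmark id, dataset id, and scoring function are illustrative assumptions:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Registering a Benchmark creates a corresponding Evaluation Config
# in NeMo Evaluator behind the scenes.
client.benchmarks.register(
    benchmark_id="my-eval-config",          # hypothetical id
    dataset_id="my-dataset",                # hypothetical dataset registered earlier
    scoring_functions=["basic::equality"],  # assumed scoring function
)
```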

See the [NVIDIA Eval docs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/eval/nvidia/README.md) for supported features and example usage.

### Post-Training API: NeMo Customizer

The NeMo Customizer microservice supports fine-tuning models. See [this list of supported models](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/post_training/nvidia/models.py) for the models that can be fine-tuned using Llama Stack. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint.
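
As a rough sketch of launching a fine-tuning job through this API: the model, dataset id, and LoRA settings below are assumptions, and the exact argument shapes may differ; consult the Post-Training docs linked below for authoritative usage.

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Submit a supervised fine-tuning job; NeMo Customizer executes it server-side.
job = client.post_training.supervised_fine_tune(
    job_uuid="sft-demo",                     # hypothetical job id
    model="meta/llama-3.2-1b-instruct",      # must be a Customizer-supported model
    training_config={
        "n_epochs": 1,
        "data_config": {"dataset_id": "my-dataset", "batch_size": 8},
    },
    algorithm_config={"type": "LoRA", "rank": 8, "alpha": 16},  # assumed LoRA settings
    hyperparam_search_config={},
    logger_config={},
    checkpoint_dir="",
)
print(job)
```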

See the [NVIDIA Post-Training docs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/post_training/nvidia/README.md) for supported features and example usage.

### Safety API: NeMo Guardrails

The NeMo Guardrails microservice sits between your application and the LLM, and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint.
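
Once the guardrails configuration is registered as a shield, it can be exercised through the Safety API. A minimal sketch; the shield id mirrors the default `NVIDIA_GUARDRAILS_CONFIG_ID`, and the message is illustrative:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Run a user message through the NeMo Guardrails check
# (the "self-check" shield must be registered first).
response = client.safety.run_shield(
    shield_id="self-check",
    messages=[{"role": "user", "content": "How do I pick a lock?"}],
    params={},
)
print(response.violation)  # None if the message passes the checks
```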

See the [NVIDIA Safety docs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/safety/nvidia/README.md) for supported features and example usage.

## Deploying models

To use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`.

Note: For improved inference speeds, use NIM with the `fast_outlines` guided decoding system (specified in the request body). This is the default if you deployed the platform with the NeMo Microservices Helm Chart.

```sh
# URL to NeMo NIM Proxy service
export NEMO_URL="http://nemo.test"

curl --location "$NEMO_URL/v1/deployment/model-deployments" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "llama-3.2-1b-instruct",
    "namespace": "meta",
    "config": {
      "model": "meta/llama-3.2-1b-instruct",
      "nim_deployment": {
        "image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct",
        "image_tag": "1.8.3",
        "pvc_size": "25Gi",
        "gpu": 1,
        "additional_envs": {
          "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
        }
      }
    }
  }'
```

This NIM deployment should take approximately 10 minutes to go live. [See the docs](https://docs.nvidia.com/nemo/microservices/latest/get-started/tutorials/deploy-nims.html) for more information on how to deploy a NIM and verify it's available for inference.

You can also remove a deployed NIM to free up GPU resources, if needed.

```sh
export NEMO_URL="http://nemo.test"

curl -X DELETE "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct"
```

## Running Llama Stack with NVIDIA

You can do this via venv (building the code), or via Docker, which has a pre-built image.

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
LLAMA_STACK_PORT=8321
docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ./run.yaml:/root/my-run.yaml \
  llamastack/distribution-nvidia \
  --config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env NVIDIA_API_KEY=$NVIDIA_API_KEY
```

### Via venv

If you've set up your local development environment, you can also build the image using your local virtual environment.

```bash
INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
llama stack build --distro nvidia --image-type venv
llama stack run ./run.yaml \
  --port 8321 \
  --env NVIDIA_API_KEY=$NVIDIA_API_KEY \
  --env INFERENCE_MODEL=$INFERENCE_MODEL
```

## Example Notebooks

For examples of how to use the NVIDIA Distribution to run inference, fine-tune, evaluate, and run safety checks on your LLMs, see the example notebooks in [docs/notebooks/nvidia](https://github.com/meta-llama/llama-stack/tree/main/docs/notebooks/nvidia).

@@ -1,3 +1,16 @@
+---
+description: "Agents API for creating and interacting with agentic systems.
+
+  Main functionalities provided by this API:
+  - Create agents with specific instructions and ability to use tools.
+  - Interactions with agents are grouped into sessions (\"threads\"), and each interaction is called a \"turn\".
+  - Agents can be provided with various tools (see the ToolGroups and ToolRuntime APIs for more details).
+  - Agents can be provided with various shields (see the Safety API for more details).
+  - Agents can also use Memory to retrieve information from knowledge bases. See the RAG Tool and Vector IO APIs for more details."
+sidebar_label: Agents
+title: Agents
+---
+
 # Agents
 
 ## Overview

@@ -15,8 +28,4 @@ This section contains documentation for all available providers for the **agents** API.
 
 ## Providers
 
-```{toctree}
-:maxdepth: 1
-
-inline_meta-reference
-```
+- [Meta-Reference](./inline_meta-reference)
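
The frontmatter above describes the session/turn model of the Agents API. A minimal sketch of that flow using the Python client's `Agent` wrapper follows; the model id is an assumption and must be registered on your server:

```python
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent

client = LlamaStackClient(base_url="http://localhost:8321")

# An agent pairs a model with instructions; interactions are grouped into
# sessions ("threads"), and each interaction is a "turn".
agent = Agent(
    client,
    model="meta-llama/Llama-3.2-3B-Instruct",  # assumed model id
    instructions="You are a helpful assistant.",
)
session_id = agent.create_session("demo-session")

turn = agent.create_turn(
    session_id=session_id,
    messages=[{"role": "user", "content": "Hello! What can you do?"}],
    stream=False,
)
print(turn.output_message.content)
```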

@@ -1,3 +1,9 @@
+---
+description: "Meta's reference implementation of an agent system that can use tools, access vector databases, and perform complex reasoning tasks."
+sidebar_label: Meta-Reference
+title: inline::meta-reference
+---
+
 # inline::meta-reference
 
 ## Description

@@ -20,6 +26,4 @@ persistence_store:
 responses_store:
   type: sqlite
   db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/responses_store.db
-
 ```
-

@@ -1,3 +1,18 @@
+---
+description: "The Batches API enables efficient processing of multiple requests in a single operation,
+  particularly useful for processing large datasets, batch evaluation workflows, and
+  cost-effective inference at scale.
+
+  The API is designed to allow use of openai client libraries for seamless integration.
+
+  This API provides the following extensions:
+  - idempotent batch creation
+
+  Note: This API is currently under active development and may undergo changes."
+sidebar_label: Batches
+title: Batches
+---
+
 # Batches
 
 ## Overview

@@ -17,8 +32,4 @@ This section contains documentation for all available providers for the **batches** API.
 
 ## Providers
 
-```{toctree}
-:maxdepth: 1
-
-inline_reference
-```
+- [Reference](./inline_reference)
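
Since the description notes the API is designed to work with OpenAI client libraries, here is a sketch driving it with the stock `openai` package; the base URL and file id are assumptions for illustration:

```python
from openai import OpenAI

# Assumes a local Llama Stack server exposing its OpenAI-compatible surface.
client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

# Create a batch over a previously uploaded .jsonl file of requests.
batch = client.batches.create(
    input_file_id="file-abc123",          # hypothetical uploaded file id
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={"purpose": "nightly-eval"},
)
print(client.batches.retrieve(batch.id).status)
```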

@@ -1,3 +1,9 @@
+---
+description: "Reference implementation of batches API with KVStore persistence."
+sidebar_label: Reference
+title: inline::reference
+---
+
 # inline::reference
 
 ## Description

@@ -18,6 +24,4 @@ Reference implementation of batches API with KVStore persistence.
 kvstore:
   type: sqlite
   db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/batches.db
-
 ```
-

docs/docs/providers/datasetio/index.md (new file, 16 lines)

@@ -0,0 +1,16 @@
---
sidebar_label: Datasetio
title: Datasetio
---

# Datasetio

## Overview

This section contains documentation for all available providers for the **datasetio** API.

## Providers

- [Localfs](./inline_localfs)
- [Remote - Huggingface](./remote_huggingface)
- [Remote - Nvidia](./remote_nvidia)

docs/docs/providers/datasetio/index.mdx (new file, 16 lines)

@@ -0,0 +1,16 @@
---
sidebar_label: Datasetio
title: Datasetio
---

# Datasetio

## Overview

This section contains documentation for all available providers for the **datasetio** API.

## Providers

- [Localfs](./inline_localfs)
- [Remote - Huggingface](./remote_huggingface)
- [Remote - Nvidia](./remote_nvidia)

@@ -1,3 +1,9 @@
+---
+description: "Local filesystem-based dataset I/O provider for reading and writing datasets to local storage."
+sidebar_label: Localfs
+title: inline::localfs
+---
+
 # inline::localfs
 
 ## Description

@@ -16,6 +22,4 @@ Local filesystem-based dataset I/O provider for reading and writing datasets to local storage.
 kvstore:
   type: sqlite
   db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/localfs_datasetio.db
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "HuggingFace datasets provider for accessing and managing datasets from the HuggingFace Hub."
+sidebar_label: Remote - Huggingface
+title: remote::huggingface
+---
+
 # remote::huggingface
 
 ## Description

@@ -16,6 +22,4 @@ HuggingFace datasets provider for accessing and managing datasets from the HuggingFace Hub.
 kvstore:
   type: sqlite
   db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/huggingface_datasetio.db
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "NVIDIA's dataset I/O provider for accessing datasets from NVIDIA's data platform."
+sidebar_label: Remote - Nvidia
+title: remote::nvidia
+---
+
 # remote::nvidia
 
 ## Description

@@ -20,6 +26,4 @@ api_key: ${env.NVIDIA_API_KEY:=}
 dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default}
 project_id: ${env.NVIDIA_PROJECT_ID:=test-project}
 datasets_url: ${env.NVIDIA_DATASETS_URL:=http://nemo.test}
-
 ```
-

docs/docs/providers/eval/index.mdx (new file, 18 lines)

@@ -0,0 +1,18 @@
---
description: "Llama Stack Evaluation API for running evaluations on model and agent candidates."
sidebar_label: Eval
title: Eval
---

# Eval

## Overview

Llama Stack Evaluation API for running evaluations on model and agent candidates.

This section contains documentation for all available providers for the **eval** API.

## Providers

- [Meta-Reference](./inline_meta-reference)
- [Remote - Nvidia](./remote_nvidia)

@@ -1,3 +1,9 @@
+---
+description: "Meta's reference implementation of evaluation tasks with support for multiple languages and evaluation metrics."
+sidebar_label: Meta-Reference
+title: inline::meta-reference
+---
+
 # inline::meta-reference
 
 ## Description

@@ -16,6 +22,4 @@ Meta's reference implementation of evaluation tasks with support for multiple languages and evaluation metrics.
 kvstore:
   type: sqlite
   db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/meta_reference_eval.db
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "NVIDIA's evaluation provider for running evaluation tasks on NVIDIA's platform."
+sidebar_label: Remote - Nvidia
+title: remote::nvidia
+---
+
 # remote::nvidia
 
 ## Description

@@ -14,6 +20,4 @@ NVIDIA's evaluation provider for running evaluation tasks on NVIDIA's platform.
 
 ```yaml
 evaluator_url: ${env.NVIDIA_EVALUATOR_URL:=http://localhost:7331}
-
 ```
-

@@ -283,4 +283,4 @@ additional_pip_packages:
 
 No other steps are required beyond `llama stack build` and `llama stack run`. The build process will use `module` to install all of the provider dependencies, retrieve the spec, etc.
 
-The provider will now be available in Llama Stack with the type `remote::ramalama`.
+The provider will now be available in Llama Stack with the type `remote::ramalama`.

@@ -5,9 +5,7 @@ Llama Stack supports external providers that live outside of the main codebase.
 - Share providers with others without contributing to the main codebase
 - Keep provider-specific code separate from the core Llama Stack code
 
-```{toctree}
-:maxdepth: 1
+## External Provider Documentation
 
-external-providers-list
-external-providers-guide
-```
+- [Known External Providers](external-providers-list)
+- [Creating External Providers](external-providers-guide)

@@ -1,3 +1,8 @@
+---
+sidebar_label: Files
+title: Files
+---
+
 # Files
 
 ## Overview

@@ -6,9 +11,5 @@ This section contains documentation for all available providers for the **files** API.
 
 ## Providers
 
-```{toctree}
-:maxdepth: 1
-
-inline_localfs
-remote_s3
-```
+- [Localfs](./inline_localfs)
+- [Remote - S3](./remote_s3)

@@ -1,3 +1,9 @@
+---
+description: "Local filesystem-based file storage provider for managing files and documents locally."
+sidebar_label: Localfs
+title: inline::localfs
+---
+
 # inline::localfs
 
 ## Description

@@ -19,6 +25,4 @@ storage_dir: ${env.FILES_STORAGE_DIR:=~/.llama/dummy/files}
 metadata_store:
   type: sqlite
   db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/files_metadata.db
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "AWS S3-based file storage provider for scalable cloud file management with metadata persistence."
+sidebar_label: Remote - S3
+title: remote::s3
+---
+
 # remote::s3
 
 ## Description

@@ -28,6 +34,4 @@ auto_create_bucket: ${env.S3_AUTO_CREATE_BUCKET:=false}
 metadata_store:
   type: sqlite
   db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/s3_files_metadata.db
-
 ```
-

@@ -1,3 +1,10 @@
+---
+title: API Providers
+description: Ecosystem of providers for swapping implementations across the same API
+sidebar_label: Overview
+sidebar_position: 1
+---
+
 # API Providers
 
 The goal of Llama Stack is to build an ecosystem where users can easily swap out different implementations for the same API. Examples for these include:

@@ -12,17 +19,15 @@ Providers come in two flavors:
 
 Importantly, Llama Stack always strives to provide at least one fully inline provider for each API so you can iterate on a fully featured environment locally.
 
-```{toctree}
-:maxdepth: 1
+## Provider Categories
 
-external/index
-openai
-inference/index
-agents/index
-datasetio/index
-safety/index
-telemetry/index
-vector_io/index
-tool_runtime/index
-files/index
-```
+- **[External Providers](./external/)** - Guide for building and using external providers
+- **[OpenAI Compatibility](./openai)** - OpenAI API compatibility layer
+- **[Inference](./inference/)** - LLM and embedding model providers
+- **[Agents](./agents/)** - Agentic system providers
+- **[DatasetIO](./datasetio/)** - Dataset and data loader providers
+- **[Safety](./safety/)** - Content moderation and safety providers
+- **[Telemetry](./telemetry/)** - Monitoring and observability providers
+- **[Vector IO](./vector-io/)** - Vector database providers
+- **[Tool Runtime](./tool-runtime/)** - Tool and protocol providers
+- **[Files](./files/)** - File system and storage providers

docs/docs/providers/inference/index.mdx (new file, 48 lines)

@@ -0,0 +1,48 @@
---
description: "Llama Stack Inference API for generating completions, chat completions, and embeddings.

  This API provides the raw interface to the underlying models. Two kinds of models are supported:
  - LLM models: these models generate \"raw\" and \"chat\" (conversational) completions.
  - Embedding models: these models generate embeddings to be used for semantic search."
sidebar_label: Inference
title: Inference
---

# Inference

## Overview

Llama Stack Inference API for generating completions, chat completions, and embeddings.

This API provides the raw interface to the underlying models. Two kinds of models are supported:
- LLM models: these models generate "raw" and "chat" (conversational) completions.
- Embedding models: these models generate embeddings to be used for semantic search.

This section contains documentation for all available providers for the **inference** API.

## Providers

- [Meta-Reference](./inline_meta-reference)
- [Sentence-Transformers](./inline_sentence-transformers)
- [Remote - Anthropic](./remote_anthropic)
- [Remote - Azure](./remote_azure)
- [Remote - Bedrock](./remote_bedrock)
- [Remote - Cerebras](./remote_cerebras)
- [Remote - Databricks](./remote_databricks)
- [Remote - Fireworks](./remote_fireworks)
- [Remote - Gemini](./remote_gemini)
- [Remote - Groq](./remote_groq)
- [Remote - Hf - Endpoint](./remote_hf_endpoint)
- [Remote - Hf - Serverless](./remote_hf_serverless)
- [Remote - Llama-Openai-Compat](./remote_llama-openai-compat)
- [Remote - Nvidia](./remote_nvidia)
- [Remote - Ollama](./remote_ollama)
- [Remote - Openai](./remote_openai)
- [Remote - Passthrough](./remote_passthrough)
- [Remote - Runpod](./remote_runpod)
- [Remote - Sambanova](./remote_sambanova)
- [Remote - Tgi](./remote_tgi)
- [Remote - Together](./remote_together)
- [Remote - Vertexai](./remote_vertexai)
- [Remote - Vllm](./remote_vllm)
- [Remote - Watsonx](./remote_watsonx)
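
To make the two model kinds in the overview concrete, here is a minimal sketch against a Llama Stack server's OpenAI-compatible endpoint; the base URL and model ids are assumptions and must match models registered on your server:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

# LLM model: a chat (conversational) completion.
chat = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(chat.choices[0].message.content)

# Embedding model: vectors to be used for semantic search.
emb = client.embeddings.create(
    model="nvidia/llama-3.2-nv-embedqa-1b-v2",
    input=["Llama Stack supports embedding models."],
)
print(len(emb.data[0].embedding))
```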

@@ -1,3 +1,9 @@
+---
+description: "Meta's reference implementation of inference with support for various model formats and optimization techniques."
+sidebar_label: Meta-Reference
+title: inline::meta-reference
+---
+
 # inline::meta-reference
 
 ## Description

@@ -27,6 +33,4 @@ quantization:
 model_parallel_size: ${env.MODEL_PARALLEL_SIZE:=0}
 max_batch_size: ${env.MAX_BATCH_SIZE:=1}
 max_seq_len: ${env.MAX_SEQ_LEN:=4096}
-
 ```
-

@@ -0,0 +1,17 @@
+---
+description: "Sentence Transformers inference provider for text embeddings and similarity search."
+sidebar_label: Sentence-Transformers
+title: inline::sentence-transformers
+---
+
+# inline::sentence-transformers
+
+## Description
+
+Sentence Transformers inference provider for text embeddings and similarity search.
+
+## Sample Configuration
+
+```yaml
+{}
+```

@@ -1,3 +1,9 @@
+---
+description: "Anthropic inference provider for accessing Claude models and Anthropic's AI services."
+sidebar_label: Remote - Anthropic
+title: remote::anthropic
+---
+
 # remote::anthropic
 
 ## Description

@@ -14,6 +20,4 @@ Anthropic inference provider for accessing Claude models and Anthropic's AI services.
 
 ```yaml
 api_key: ${env.ANTHROPIC_API_KEY:=}
-
 ```
-

@@ -1,3 +1,12 @@
+---
+description: |
+  Azure OpenAI inference provider for accessing GPT models and other Azure services.
+  Provider documentation
+  https://learn.microsoft.com/en-us/azure/ai-foundry/openai/overview
+sidebar_label: Remote - Azure
+title: remote::azure
+---
+
 # remote::azure
 
 ## Description

@@ -24,6 +33,4 @@ api_key: ${env.AZURE_API_KEY:=}
 api_base: ${env.AZURE_API_BASE:=}
 api_version: ${env.AZURE_API_VERSION:=}
 api_type: ${env.AZURE_API_TYPE:=}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "AWS Bedrock inference provider for accessing various AI models through AWS's managed service."
+sidebar_label: Remote - Bedrock
+title: remote::bedrock
+---
+
 # remote::bedrock
 
 ## Description

@@ -23,6 +29,4 @@ AWS Bedrock inference provider for accessing various AI models through AWS's managed service.
 
 ```yaml
 {}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "Cerebras inference provider for running models on Cerebras Cloud platform."
+sidebar_label: Remote - Cerebras
+title: remote::cerebras
+---
+
 # remote::cerebras
 
 ## Description

@@ -16,6 +22,4 @@ Cerebras inference provider for running models on Cerebras Cloud platform.
 ```yaml
 base_url: https://api.cerebras.ai
 api_key: ${env.CEREBRAS_API_KEY:=}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "Databricks inference provider for running models on Databricks' unified analytics platform."
+sidebar_label: Remote - Databricks
+title: remote::databricks
+---
+
 # remote::databricks
 
 ## Description

@@ -9,13 +15,11 @@ Databricks inference provider for running models on Databricks' unified analytics platform.
 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
 | `url` | `<class 'str'>` | No | | The URL for the Databricks model serving endpoint |
-| `api_token` | `<class 'pydantic.types.SecretStr'>` | No | | The Databricks API token |
+| `api_token` | `<class 'str'>` | No | | The Databricks API token |
 
 ## Sample Configuration
 
 ```yaml
-url: ${env.DATABRICKS_HOST:=}
-api_token: ${env.DATABRICKS_TOKEN:=}
-
+url: ${env.DATABRICKS_URL:=}
+api_token: ${env.DATABRICKS_API_TOKEN:=}
 ```

@@ -1,3 +1,9 @@
+---
+description: "Fireworks AI inference provider for Llama models and other AI models on the Fireworks platform."
+sidebar_label: Remote - Fireworks
+title: remote::fireworks
+---
+
 # remote::fireworks
 
 ## Description

@@ -17,6 +23,4 @@ Fireworks AI inference provider for Llama models and other AI models on the Fireworks platform.
 ```yaml
 url: https://api.fireworks.ai/inference/v1
 api_key: ${env.FIREWORKS_API_KEY:=}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "Google Gemini inference provider for accessing Gemini models and Google's AI services."
+sidebar_label: Remote - Gemini
+title: remote::gemini
+---
+
 # remote::gemini
 
 ## Description

@@ -14,6 +20,4 @@ Google Gemini inference provider for accessing Gemini models and Google's AI services.
 
 ```yaml
 api_key: ${env.GEMINI_API_KEY:=}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "Groq inference provider for ultra-fast inference using Groq's LPU technology."
+sidebar_label: Remote - Groq
+title: remote::groq
+---
+
 # remote::groq
 
 ## Description

@@ -16,6 +22,4 @@ Groq inference provider for ultra-fast inference using Groq's LPU technology.
 ```yaml
 url: https://api.groq.com
 api_key: ${env.GROQ_API_KEY:=}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "HuggingFace Inference Endpoints provider for dedicated model serving."
+sidebar_label: Remote - Hf - Endpoint
+title: remote::hf::endpoint
+---
+
 # remote::hf::endpoint
 
 ## Description

@@ -8,7 +14,7 @@ HuggingFace Inference Endpoints provider for dedicated model serving.
 
 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
 | `endpoint_name` | `<class 'str'>` | No | | The name of the Hugging Face Inference Endpoint in the format of '{namespace}/{endpoint_name}' (e.g. 'my-cool-org/meta-llama-3-1-8b-instruct-rce'). Namespace is optional and will default to the user account if not provided. |
 | `api_token` | `pydantic.types.SecretStr \| None` | No | | Your Hugging Face user access token (will default to locally saved token if not provided) |
 
 ## Sample Configuration

@@ -16,6 +22,4 @@ HuggingFace Inference Endpoints provider for dedicated model serving.
 ```yaml
 endpoint_name: ${env.INFERENCE_ENDPOINT_NAME}
 api_token: ${env.HF_API_TOKEN}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "HuggingFace Inference API serverless provider for on-demand model inference."
+sidebar_label: Remote - Hf - Serverless
+title: remote::hf::serverless
+---
+
 # remote::hf::serverless
 
 ## Description

@@ -16,6 +22,4 @@ HuggingFace Inference API serverless provider for on-demand model inference.
 ```yaml
 huggingface_repo: ${env.INFERENCE_MODEL}
 api_token: ${env.HF_API_TOKEN}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "Llama OpenAI-compatible provider for using Llama models with OpenAI API format."
+sidebar_label: Remote - Llama-Openai-Compat
+title: remote::llama-openai-compat
+---
+
 # remote::llama-openai-compat
 
 ## Description

@@ -16,6 +22,4 @@ Llama OpenAI-compatible provider for using Llama models with OpenAI API format.
 ```yaml
 openai_compat_api_base: https://api.llama.com/compat/v1/
 api_key: ${env.LLAMA_API_KEY}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "NVIDIA inference provider for accessing NVIDIA NIM models and AI services."
+sidebar_label: Remote - Nvidia
+title: remote::nvidia
+---
+
 # remote::nvidia
 
 ## Description

@@ -19,6 +25,4 @@ NVIDIA inference provider for accessing NVIDIA NIM models and AI services.
 url: ${env.NVIDIA_BASE_URL:=https://integrate.api.nvidia.com}
 api_key: ${env.NVIDIA_API_KEY:=}
 append_api_version: ${env.NVIDIA_APPEND_API_VERSION:=True}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "Ollama inference provider for running local models through the Ollama runtime."
+sidebar_label: Remote - Ollama
+title: remote::ollama
+---
+
 # remote::ollama
 
 ## Description

@@ -15,6 +21,4 @@ Ollama inference provider for running local models through the Ollama runtime.
 
 ```yaml
 url: ${env.OLLAMA_URL:=http://localhost:11434}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "OpenAI inference provider for accessing GPT models and other OpenAI services."
+sidebar_label: Remote - Openai
+title: remote::openai
+---
+
 # remote::openai
 
 ## Description

@@ -16,6 +22,4 @@ OpenAI inference provider for accessing GPT models and other OpenAI services.
 ```yaml
 api_key: ${env.OPENAI_API_KEY:=}
 base_url: ${env.OPENAI_BASE_URL:=https://api.openai.com/v1}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "Passthrough inference provider for connecting to any external inference service not directly supported."
+sidebar_label: Remote - Passthrough
+title: remote::passthrough
+---
+
 # remote::passthrough
 
 ## Description

@@ -16,6 +22,4 @@ Passthrough inference provider for connecting to any external inference service not directly supported.
 ```yaml
 url: ${env.PASSTHROUGH_URL}
 api_key: ${env.PASSTHROUGH_API_KEY}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "RunPod inference provider for running models on RunPod's cloud GPU platform."
+sidebar_label: Remote - Runpod
+title: remote::runpod
+---
+
 # remote::runpod
 
 ## Description

@@ -16,6 +22,4 @@ RunPod inference provider for running models on RunPod's cloud GPU platform.
 ```yaml
 url: ${env.RUNPOD_URL:=}
 api_token: ${env.RUNPOD_API_TOKEN}
-
 ```
-

@@ -18,4 +18,3 @@ openai_compat_api_base: https://api.sambanova.ai/v1
 api_key: ${env.SAMBANOVA_API_KEY:=}
-
 ```

@@ -1,3 +1,9 @@
+---
+description: "SambaNova inference provider for running models on SambaNova's dataflow architecture."
+sidebar_label: Remote - Sambanova
+title: remote::sambanova
+---
+
 # remote::sambanova
 
 ## Description

@@ -16,6 +22,4 @@ SambaNova inference provider for running models on SambaNova's dataflow architecture.
 ```yaml
 url: https://api.sambanova.ai/v1
 api_key: ${env.SAMBANOVA_API_KEY:=}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "Text Generation Inference (TGI) provider for HuggingFace model serving."
+sidebar_label: Remote - Tgi
+title: remote::tgi
+---
+
 # remote::tgi
 
 ## Description

@@ -14,6 +20,4 @@ Text Generation Inference (TGI) provider for HuggingFace model serving.
 
 ```yaml
 url: ${env.TGI_URL:=}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "Together AI inference provider for open-source models and collaborative AI development."
+sidebar_label: Remote - Together
+title: remote::together
+---
+
 # remote::together
 
 ## Description

@@ -17,6 +23,4 @@ Together AI inference provider for open-source models and collaborative AI development.
 ```yaml
 url: https://api.together.xyz/v1
 api_key: ${env.TOGETHER_API_KEY:=}
-
 ```
-

@@ -1,3 +1,29 @@
+---
+description: |
+  Google Vertex AI inference provider enables you to use Google's Gemini models through Google Cloud's Vertex AI platform, providing several advantages:
+
+  • Enterprise-grade security: Uses Google Cloud's security controls and IAM
+  • Better integration: Seamless integration with other Google Cloud services
+  • Advanced features: Access to additional Vertex AI features like model tuning and monitoring
+  • Authentication: Uses Google Cloud Application Default Credentials (ADC) instead of API keys
+
+  Configuration:
+  - Set VERTEX_AI_PROJECT environment variable (required)
+  - Set VERTEX_AI_LOCATION environment variable (optional, defaults to us-central1)
+  - Use Google Cloud Application Default Credentials or service account key
+
+  Authentication Setup:
+  Option 1 (Recommended): gcloud auth application-default login
+  Option 2: Set GOOGLE_APPLICATION_CREDENTIALS to service account key path
+
+  Available Models:
+  - vertex_ai/gemini-2.0-flash
+  - vertex_ai/gemini-2.5-flash
+  - vertex_ai/gemini-2.5-pro
+sidebar_label: Remote - Vertexai
+title: remote::vertexai
+---
+
 # remote::vertexai
 
 ## Description

@@ -35,6 +61,4 @@ Available Models:
 ```yaml
 project: ${env.VERTEX_AI_PROJECT:=}
 location: ${env.VERTEX_AI_LOCATION:=us-central1}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "Remote vLLM inference provider for connecting to vLLM servers."
+sidebar_label: Remote - Vllm
+title: remote::vllm
+---
+
 # remote::vllm
 
 ## Description

@@ -21,6 +27,4 @@ url: ${env.VLLM_URL:=}
 max_tokens: ${env.VLLM_MAX_TOKENS:=4096}
 api_token: ${env.VLLM_API_TOKEN:=fake}
 tls_verify: ${env.VLLM_TLS_VERIFY:=true}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "IBM WatsonX inference provider for accessing AI models on IBM's WatsonX platform."
+sidebar_label: Remote - Watsonx
+title: remote::watsonx
+---
+
 # remote::watsonx
 
 ## Description

@@ -19,6 +25,4 @@ IBM WatsonX inference provider for accessing AI models on IBM's WatsonX platform.
 url: ${env.WATSONX_BASE_URL:=https://us-south.ml.cloud.ibm.com}
 api_key: ${env.WATSONX_API_KEY:=}
 project_id: ${env.WATSONX_PROJECT_ID:=}
-
 ```
-

@@ -42,9 +42,7 @@ models = client.models.list()
 
 #### Responses
 
-:::{note}
-The Responses API implementation is still in active development. While it is quite usable, there are still unimplemented parts of the API. We'd love feedback on any use-cases you try that do not work to help prioritize the pieces left to implement. Please open issues in the [meta-llama/llama-stack](https://github.com/meta-llama/llama-stack) GitHub repository with details of anything that does not work.
-:::
+> **Note:** The Responses API implementation is still in active development. While it is quite usable, there are still unimplemented parts of the API. We'd love feedback on any use-cases you try that do not work to help prioritize the pieces left to implement. Please open issues in the [meta-llama/llama-stack](https://github.com/meta-llama/llama-stack) GitHub repository with details of anything that does not work.
 
 ##### Simple inference

docs/docs/providers/post_training/index.mdx (new file, 17 lines)

@@ -0,0 +1,17 @@
---
sidebar_label: Post Training
title: Post_Training
---

# Post_Training

## Overview

This section contains documentation for all available providers for the **post_training** API.

## Providers

- [Huggingface-Gpu](./inline_huggingface-gpu)
- [Torchtune-Cpu](./inline_torchtune-cpu)
- [Torchtune-Gpu](./inline_torchtune-gpu)
- [Remote - Nvidia](./remote_nvidia)

@@ -38,4 +38,3 @@ device: cpu
 dpo_output_dir: ~/.llama/dummy/dpo_output
-
 ```

@@ -1,3 +1,9 @@
+---
+description: "HuggingFace-based post-training provider for fine-tuning models using the HuggingFace ecosystem."
+sidebar_label: Huggingface-Gpu
+title: inline::huggingface-gpu
+---
+
 # inline::huggingface-gpu
 
 ## Description

@@ -11,11 +17,8 @@ HuggingFace-based post-training provider for fine-tuning models using the HuggingFace ecosystem.
 | `device` | `<class 'str'>` | No | cuda | |
 | `distributed_backend` | `Literal['fsdp', 'deepspeed'` | No | | |
 | `checkpoint_format` | `Literal['full_state', 'huggingface'` | No | huggingface | |
-| `chat_template` | `<class 'str'>` | No | <|user|>
-{input}
-<|assistant|>
-{output} | |
+| `chat_template` | `<class 'str'>` | No | <|user|><br/>{input}<br/><|assistant|><br/>{output} | |
 | `model_specific_config` | `<class 'dict'>` | No | {'trust_remote_code': True, 'attn_implementation': 'sdpa'} | |
 | `max_seq_length` | `<class 'int'>` | No | 2048 | |
 | `gradient_checkpointing` | `<class 'bool'>` | No | False | |
 | `save_total_limit` | `<class 'int'>` | No | 3 | |

@@ -36,6 +39,4 @@ checkpoint_format: huggingface
 distributed_backend: null
 device: cpu
 dpo_output_dir: ~/.llama/dummy/dpo_output
-
 ```
-

@@ -38,4 +38,3 @@ device: cpu
 dpo_output_dir: ~/.llama/dummy/dpo_output
-
 ```

@@ -1,3 +1,9 @@
+---
+description: "TorchTune-based post-training provider for fine-tuning and optimizing models using Meta's TorchTune framework."
+sidebar_label: Torchtune-Cpu
+title: inline::torchtune-cpu
+---
+
 # inline::torchtune-cpu
 
 ## Description

@@ -15,6 +21,4 @@ TorchTune-based post-training provider for fine-tuning and optimizing models using Meta's TorchTune framework.
 
 ```yaml
 checkpoint_format: meta
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "TorchTune-based post-training provider for fine-tuning and optimizing models using Meta's TorchTune framework."
+sidebar_label: Torchtune-Gpu
+title: inline::torchtune-gpu
+---
+
 # inline::torchtune-gpu
 
 ## Description

@@ -15,6 +21,4 @@ TorchTune-based post-training provider for fine-tuning and optimizing models using Meta's TorchTune framework.
 
 ```yaml
 checkpoint_format: meta
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "NVIDIA's post-training provider for fine-tuning models on NVIDIA's platform."
+sidebar_label: Remote - Nvidia
+title: remote::nvidia
+---
+
 # remote::nvidia
 
 ## Description

@@ -23,6 +29,4 @@ api_key: ${env.NVIDIA_API_KEY:=}
 dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default}
 project_id: ${env.NVIDIA_PROJECT_ID:=test-project}
 customizer_url: ${env.NVIDIA_CUSTOMIZER_URL:=http://nemo.test}
-
 ```
-

docs/docs/providers/safety/index.mdx (new file, 19 lines)

@@ -0,0 +1,19 @@
---
sidebar_label: Safety
title: Safety
---

# Safety

## Overview

This section contains documentation for all available providers for the **safety** API.

## Providers

- [Code-Scanner](./inline_code-scanner)
- [Llama-Guard](./inline_llama-guard)
- [Prompt-Guard](./inline_prompt-guard)
- [Remote - Bedrock](./remote_bedrock)
- [Remote - Nvidia](./remote_nvidia)
- [Remote - Sambanova](./remote_sambanova)

@@ -1,3 +1,9 @@
+---
+description: "Code Scanner safety provider for detecting security vulnerabilities and unsafe code patterns."
+sidebar_label: Code-Scanner
+title: inline::code-scanner
+---
+
 # inline::code-scanner
 
 ## Description

@@ -8,6 +14,4 @@ Code Scanner safety provider for detecting security vulnerabilities and unsafe code patterns.
 
 ```yaml
 {}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "Llama Guard safety provider for content moderation and safety filtering using Meta's Llama Guard model."
+sidebar_label: Llama-Guard
+title: inline::llama-guard
+---
+
 # inline::llama-guard
 
 ## Description

@@ -14,6 +20,4 @@ Llama Guard safety provider for content moderation and safety filtering using Meta's Llama Guard model.
 
 ```yaml
 excluded_categories: []
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "Prompt Guard safety provider for detecting and filtering unsafe prompts and content."
+sidebar_label: Prompt-Guard
+title: inline::prompt-guard
+---
+
 # inline::prompt-guard
 
 ## Description

@@ -14,6 +20,4 @@ Prompt Guard safety provider for detecting and filtering unsafe prompts and content.
 
 ```yaml
 guard_type: injection
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "AWS Bedrock safety provider for content moderation using AWS's safety services."
+sidebar_label: Remote - Bedrock
+title: remote::bedrock
+---
+
 # remote::bedrock
 
 ## Description

@@ -23,6 +29,4 @@ AWS Bedrock safety provider for content moderation using AWS's safety services.
 
 ```yaml
 {}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "NVIDIA's safety provider for content moderation and safety filtering."
+sidebar_label: Remote - Nvidia
+title: remote::nvidia
+---
+
 # remote::nvidia
 
 ## Description

@@ -16,6 +22,4 @@ NVIDIA's safety provider for content moderation and safety filtering.
 ```yaml
 guardrails_service_url: ${env.GUARDRAILS_SERVICE_URL:=http://localhost:7331}
 config_id: ${env.NVIDIA_GUARDRAILS_CONFIG_ID:=self-check}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "SambaNova's safety provider for content moderation and safety filtering."
+sidebar_label: Remote - Sambanova
+title: remote::sambanova
+---
+
 # remote::sambanova
 
 ## Description

@@ -16,6 +22,4 @@ SambaNova's safety provider for content moderation and safety filtering.
 ```yaml
 url: https://api.sambanova.ai/v1
 api_key: ${env.SAMBANOVA_API_KEY:=}
-
 ```
-

docs/docs/providers/scoring/index.mdx (new file, 16 lines)

@@ -0,0 +1,16 @@
---
sidebar_label: Scoring
title: Scoring
---

# Scoring

## Overview

This section contains documentation for all available providers for the **scoring** API.

## Providers

- [Basic](./inline_basic)
- [Braintrust](./inline_braintrust)
- [Llm-As-Judge](./inline_llm-as-judge)

@@ -1,3 +1,9 @@
+---
+description: "Basic scoring provider for simple evaluation metrics and scoring functions."
+sidebar_label: Basic
+title: inline::basic
+---
+
 # inline::basic
 
 ## Description

@@ -8,6 +14,4 @@ Basic scoring provider for simple evaluation metrics and scoring functions.
 
 ```yaml
 {}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "Braintrust scoring provider for evaluation and scoring using the Braintrust platform."
+sidebar_label: Braintrust
+title: inline::braintrust
+---
+
 # inline::braintrust
 
 ## Description

@@ -14,6 +20,4 @@ Braintrust scoring provider for evaluation and scoring using the Braintrust platform.
 
 ```yaml
 openai_api_key: ${env.OPENAI_API_KEY:=}
-
 ```
-

@@ -1,3 +1,9 @@
+---
+description: "LLM-as-judge scoring provider that uses language models to evaluate and score responses."
+sidebar_label: Llm-As-Judge
+title: inline::llm-as-judge
+---
+
 # inline::llm-as-judge
 
 ## Description

@@ -8,6 +14,4 @@ LLM-as-judge scoring provider that uses language models to evaluate and score responses.
 
 ```yaml
 {}
-
 ```
-
@ -1,3 +1,8 @@
|
|||
---
|
||||
sidebar_label: Telemetry
|
||||
title: Telemetry
|
||||
---
|
||||
|
||||
# Telemetry
|
||||
|
||||
## Overview
|
||||
|
@ -6,8 +11,4 @@ This section contains documentation for all available providers for the **teleme
|
|||
|
||||
## Providers
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
|
||||
inline_meta-reference
|
||||
```
|
||||
- [Meta-Reference](./inline_meta-reference)
|

@@ -1,3 +1,9 @@
+---
+description: "Meta's reference implementation of telemetry and observability using OpenTelemetry."
+sidebar_label: Meta-Reference
+title: inline::meta-reference
+---
+
 # inline::meta-reference
 
 ## Description

@@ -10,7 +16,7 @@ Meta's reference implementation of telemetry and observability using OpenTelemetry.
 |-------|------|----------|---------|-------------|
 | `otel_exporter_otlp_endpoint` | `str \| None` | No | | The OpenTelemetry collector endpoint URL (base URL for traces, metrics, and logs). If not set, the SDK will use OTEL_EXPORTER_OTLP_ENDPOINT environment variable. |
 | `service_name` | `<class 'str'>` | No | | The service name to use for telemetry |
 | `sinks` | `list[inline.telemetry.meta_reference.config.TelemetrySink` | No | [<TelemetrySink.CONSOLE: 'console'>, <TelemetrySink.SQLITE: 'sqlite'>] | List of telemetry sinks to enable (possible values: otel_trace, otel_metric, sqlite, console) |
 | `sqlite_db_path` | `<class 'str'>` | No | ~/.llama/runtime/trace_store.db | The path to the SQLite database to use for storing traces |

@@ -20,6 +26,4 @@ service_name: "${env.OTEL_SERVICE_NAME:=\u200B}"
 sinks: ${env.TELEMETRY_SINKS:=console,sqlite}
 sqlite_db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/trace_store.db
 otel_exporter_otlp_endpoint: ${env.OTEL_EXPORTER_OTLP_ENDPOINT:=}
-
 ```
-
19
docs/docs/providers/tool_runtime/index.mdx
Normal file
@ -0,0 +1,19 @@
---
sidebar_label: Tool Runtime
title: Tool_Runtime
---

# Tool_Runtime

## Overview

This section contains documentation for all available providers for the **tool_runtime** API.

## Providers

- [Rag-Runtime](./inline_rag-runtime)
- [Remote - Bing-Search](./remote_bing-search)
- [Remote - Brave-Search](./remote_brave-search)
- [Remote - Model-Context-Protocol](./remote_model-context-protocol)
- [Remote - Tavily-Search](./remote_tavily-search)
- [Remote - Wolfram-Alpha](./remote_wolfram-alpha)
@ -1,3 +1,9 @@
---
description: "RAG (Retrieval-Augmented Generation) tool runtime for document ingestion, chunking, and semantic search."
sidebar_label: Rag-Runtime
title: inline::rag-runtime
---

# inline::rag-runtime

## Description

@ -8,6 +14,4 @@ RAG (Retrieval-Augmented Generation) tool runtime for document ingestion, chunki

```yaml
{}
```
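Since this page only shows an empty config, here is a hedged sketch of the typical ingestion-and-query flow through the RAG tool runtime (the same `rag_tool.insert()` entry point referenced in the sqlite-vec comparison later in this document). The vector database id, document contents, and the `query` signature are illustrative assumptions.

```python
# Hedged sketch; assumes a running Llama Stack server with a registered
# vector database "my_db". Field names follow the RAG tool's document shape.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

client.tool_runtime.rag_tool.insert(
    documents=[
        {
            "document_id": "doc-1",
            "content": "Llama Stack unifies inference, safety, and retrieval APIs.",
            "mime_type": "text/plain",
            "metadata": {},
        }
    ],
    vector_db_id="my_db",
    chunk_size_in_tokens=512,
)

result = client.tool_runtime.rag_tool.query(
    vector_db_ids=["my_db"],
    content="What does Llama Stack unify?",
)
print(result.content)
```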
@ -1,3 +1,9 @@
---
description: "Bing Search tool for web search capabilities using Microsoft's search engine."
sidebar_label: Remote - Bing-Search
title: remote::bing-search
---

# remote::bing-search

## Description

@ -15,6 +21,4 @@ Bing Search tool for web search capabilities using Microsoft's search engine.

```yaml
api_key: ${env.BING_API_KEY:=}
```
@ -1,3 +1,9 @@
---
description: "Brave Search tool for web search capabilities with privacy-focused results."
sidebar_label: Remote - Brave-Search
title: remote::brave-search
---

# remote::brave-search

## Description

@ -16,6 +22,4 @@ Brave Search tool for web search capabilities with privacy-focused results.
```yaml
api_key: ${env.BRAVE_SEARCH_API_KEY:=}
max_results: 3
```
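Once the API key is configured, the tool can be exercised directly through the tool runtime. The sketch below is hedged: the tool name `web_search` and the exact client call are assumptions based on the client SDK's tool-runtime surface.

```python
# Hedged sketch; assumes BRAVE_SEARCH_API_KEY is set for the running server
# and that the provider's search tool is registered as "web_search".
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

result = client.tool_runtime.invoke_tool(
    tool_name="web_search",
    kwargs={"query": "latest Llama Stack release"},
)
print(result.content)
```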
@ -0,0 +1,17 @@
---
description: "Model Context Protocol (MCP) tool for standardized tool calling and context management."
sidebar_label: Remote - Model-Context-Protocol
title: remote::model-context-protocol
---

# remote::model-context-protocol

## Description

Model Context Protocol (MCP) tool for standardized tool calling and context management.

## Sample Configuration

```yaml
{}
```
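The empty config reflects that MCP servers are attached at registration time rather than in the provider config. As a hedged sketch (the toolgroup id and SSE endpoint URL are illustrative assumptions), registering an MCP server as a tool group might look like:

```python
# Hedged sketch; assumes a running Llama Stack server and an MCP server
# exposing an SSE endpoint. Identifiers are illustrative.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

client.toolgroups.register(
    toolgroup_id="mcp::my_tools",
    provider_id="model-context-protocol",
    mcp_endpoint={"uri": "http://localhost:8000/sse"},
)
```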
@ -1,3 +1,9 @@
---
description: "Tavily Search tool for AI-optimized web search with structured results."
sidebar_label: Remote - Tavily-Search
title: remote::tavily-search
---

# remote::tavily-search

## Description

@ -16,6 +22,4 @@ Tavily Search tool for AI-optimized web search with structured results.
```yaml
api_key: ${env.TAVILY_SEARCH_API_KEY:=}
max_results: 3
```
@ -1,3 +1,9 @@
---
description: "Wolfram Alpha tool for computational knowledge and mathematical calculations."
sidebar_label: Remote - Wolfram-Alpha
title: remote::wolfram-alpha
---

# remote::wolfram-alpha

## Description

@ -14,6 +20,4 @@ Wolfram Alpha tool for computational knowledge and mathematical calculations.

```yaml
api_key: ${env.WOLFRAM_ALPHA_API_KEY:=}
```
25
docs/docs/providers/vector_io/index.mdx
Normal file
@ -0,0 +1,25 @@
---
sidebar_label: Vector Io
title: Vector_Io
---

# Vector_Io

## Overview

This section contains documentation for all available providers for the **vector_io** API.

## Providers

- [Chromadb](./inline_chromadb)
- [Faiss](./inline_faiss)
- [Meta-Reference](./inline_meta-reference)
- [Milvus](./inline_milvus)
- [Qdrant](./inline_qdrant)
- [Sqlite-Vec](./inline_sqlite-vec)
- [Sqlite Vec](./inline_sqlite_vec)
- [Remote - Chromadb](./remote_chromadb)
- [Remote - Milvus](./remote_milvus)
- [Remote - Pgvector](./remote_pgvector)
- [Remote - Qdrant](./remote_qdrant)
- [Remote - Weaviate](./remote_weaviate)
@ -1,3 +1,40 @@
---
description: |
  [Chroma](https://www.trychroma.com/) is an inline and remote vector
  database provider for Llama Stack. It allows you to store and query vectors directly within a Chroma database.
  That means you're not limited to storing vectors in memory or in a separate service.

  ## Features
  Chroma supports:
  - Store embeddings and their metadata
  - Vector search
  - Full-text search
  - Document storage
  - Metadata filtering
  - Multi-modal retrieval

  ## Usage

  To use Chroma in your Llama Stack project, follow these steps:

  1. Install the necessary dependencies.
  2. Configure your Llama Stack project to use Chroma.
  3. Start storing and querying vectors.

  ## Installation

  You can install Chroma using pip:

  ```bash
  pip install chromadb
  ```

  ## Documentation
  See [Chroma's documentation](https://docs.trychroma.com/docs/overview/introduction) for more details about Chroma in general.
sidebar_label: Chromadb
title: inline::chromadb
---

# inline::chromadb

## Description

@ -51,6 +88,4 @@ db_path: ${env.CHROMADB_PATH}
kvstore:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/chroma_inline_registry.db
```
@ -1,3 +1,49 @@
---
description: |
  [Faiss](https://github.com/facebookresearch/faiss) is an inline vector database provider for Llama Stack. It
  allows you to store and query vectors directly in memory.
  That means you'll get fast and efficient vector retrieval.

  ## Features

  - Lightweight and easy to use
  - Fully integrated with Llama Stack
  - GPU support
  - **Vector search** - FAISS supports pure vector similarity search using embeddings

  ## Search Modes

  **Supported:**
  - **Vector Search** (`mode="vector"`): Performs vector similarity search using embeddings

  **Not Supported:**
  - **Keyword Search** (`mode="keyword"`): Not supported by FAISS
  - **Hybrid Search** (`mode="hybrid"`): Not supported by FAISS

  > **Note**: FAISS is designed as a pure vector similarity search library. See the [FAISS GitHub repository](https://github.com/facebookresearch/faiss) for more details about FAISS's core functionality.

  ## Usage

  To use Faiss in your Llama Stack project, follow these steps:

  1. Install the necessary dependencies.
  2. Configure your Llama Stack project to use Faiss.
  3. Start storing and querying vectors.

  ## Installation

  You can install Faiss using pip:

  ```bash
  pip install faiss-cpu
  ```

  ## Documentation
  See [Faiss' documentation](https://faiss.ai/) or the [Faiss Wiki](https://github.com/facebookresearch/faiss/wiki) for
  more details about Faiss in general.
sidebar_label: Faiss
title: inline::faiss
---

# inline::faiss

## Description

@ -57,6 +103,4 @@ more details about Faiss in general.
kvstore:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/faiss_store.db
```
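To tie the configuration above to actual usage, here is a short hedged sketch of registering a Faiss-backed vector database through the client; the embedding model name and dimension are illustrative assumptions, and querying then follows the `query_chunks` examples shown for sqlite-vec later in this document.

```python
# Sketch: register a Faiss-backed vector database. Assumes a running server
# with the inline::faiss provider enabled; the embedding model and dimension
# are illustrative assumptions.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

client.vector_dbs.register(
    vector_db_id="my_faiss_db",
    provider_id="faiss",
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
)
```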
@ -1,3 +1,9 @@
---
description: "Meta's reference implementation of a vector database."
sidebar_label: Meta-Reference
title: inline::meta-reference
---

# inline::meta-reference

## Description

@ -16,12 +22,9 @@ Meta's reference implementation of a vector database.
kvstore:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/faiss_store.db
```

## Deprecation Notice

:::warning
Please use the `inline::faiss` provider instead.
:::
@ -1,3 +1,9 @@
---
description: "Please refer to the remote provider documentation."
sidebar_label: Milvus
title: inline::milvus
---

# inline::milvus

## Description

@ -21,6 +27,4 @@ db_path: ${env.MILVUS_DB_PATH:=~/.llama/dummy}/milvus.db
kvstore:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/milvus_registry.db
```
@ -1,3 +1,50 @@
---
description: |
  [Qdrant](https://qdrant.tech/documentation/) is an inline and remote vector database provider for Llama Stack. It
  allows you to store and query vectors directly in memory.
  That means you'll get fast and efficient vector retrieval.

  > By default, Qdrant stores vectors in RAM, delivering incredibly fast access for datasets that fit comfortably in
  > memory. But when your dataset exceeds RAM capacity, Qdrant offers Memmap as an alternative.
  >
  > \[[An Introduction to Vector Databases](https://qdrant.tech/articles/what-is-a-vector-database/)\]

  ## Features

  - Lightweight and easy to use
  - Fully integrated with Llama Stack
  - Apache 2.0 license terms
  - Store embeddings and their metadata
  - Supports search by
    [Keyword](https://qdrant.tech/articles/qdrant-introduces-full-text-filters-and-indexes/)
    and [Hybrid](https://qdrant.tech/articles/hybrid-search/#building-a-hybrid-search-system-in-qdrant) search
  - [Multilingual and Multimodal retrieval](https://qdrant.tech/documentation/multimodal-search/)
  - [Metadata filtering](https://qdrant.tech/articles/vector-search-filtering/)
  - [GPU support](https://qdrant.tech/documentation/guides/running-with-gpu/)

  ## Usage

  To use Qdrant in your Llama Stack project, follow these steps:

  1. Install the necessary dependencies.
  2. Configure your Llama Stack project to use Qdrant.
  3. Start storing and querying vectors.

  ## Installation

  You can install Qdrant using docker:

  ```bash
  docker pull qdrant/qdrant
  ```

  ## Documentation
  See the [Qdrant documentation](https://qdrant.tech/documentation/) for more details about Qdrant in general.
sidebar_label: Qdrant
title: inline::qdrant
---

# inline::qdrant

## Description

@ -60,6 +107,4 @@ path: ${env.QDRANT_PATH:=~/.llama/~/.llama/dummy}/qdrant.db
kvstore:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/qdrant_registry.db
```
@ -1,3 +1,205 @@
---
description: |
  [SQLite-Vec](https://github.com/asg017/sqlite-vec) is an inline vector database provider for Llama Stack. It
  allows you to store and query vectors directly within an SQLite database.
  That means you're not limited to storing vectors in memory or in a separate service.

  ## Features

  - Lightweight and easy to use
  - Fully integrated with Llama Stack
  - Uses disk-based storage for persistence, allowing for larger vector storage

  ### Comparison to Faiss

  The choice between Faiss and sqlite-vec should be made based on the needs of your application,
  as they have different strengths.

  #### Choosing the Right Provider

  | Scenario | Recommended Tool | Reason |
  |----------|------------------|--------|
  | Online Analytical Processing (OLAP) | Faiss | Fast, in-memory searches |
  | Online Transaction Processing (OLTP) | sqlite-vec | Frequent writes and reads |
  | Frequent writes | sqlite-vec | Efficient disk-based storage and incremental indexing |
  | Large datasets | sqlite-vec | Disk-based storage for larger vector storage |
  | Datasets that can fit in memory, frequent reads | Faiss | Optimized for speed, indexing, and GPU acceleration |
  #### Empirical Example

  Consider the histogram below in which 10,000 randomly generated strings were inserted
  in batches of 100 into both Faiss and sqlite-vec using `client.tool_runtime.rag_tool.insert()`.

  ```{image} ../../../../_static/providers/vector_io/write_time_comparison_sqlite-vec-faiss.png
  :alt: Comparison of SQLite-Vec and Faiss write times
  :width: 400px
  ```

  You will notice that the average write time for `sqlite-vec` was 788ms, compared to
  47,640ms for Faiss. While the gap is jarring at first glance, the distribution shows that
  Faiss's write times are spread rather uniformly across the [1500, 100000] ms interval.

  Looking at each individual write in the order the documents were inserted, you'll see
  write times grow as Faiss re-indexes the vectors after each write.
  ```{image} ../../../../_static/providers/vector_io/write_time_sequence_sqlite-vec-faiss.png
  :alt: Comparison of SQLite-Vec and Faiss write times
  :width: 400px
  ```

  In comparison, the read times for Faiss were on average 10% faster than for sqlite-vec.
  The modes of the two distributions make the difference clearer still: Faiss
  will likely yield faster read performance.

  ```{image} ../../../../_static/providers/vector_io/read_time_comparison_sqlite-vec-faiss.png
  :alt: Comparison of SQLite-Vec and Faiss read times
  :width: 400px
  ```
  ## Usage

  To use sqlite-vec in your Llama Stack project, follow these steps:

  1. Install the necessary dependencies.
  2. Configure your Llama Stack project to use SQLite-Vec.
  3. Start storing and querying vectors.

  The SQLite-Vec provider supports three search modes:

  1. **Vector Search** (`mode="vector"`): Performs pure vector similarity search using the embeddings.
  2. **Keyword Search** (`mode="keyword"`): Performs full-text search using SQLite's FTS5.
  3. **Hybrid Search** (`mode="hybrid"`): Combines both vector and keyword search for better results. First performs keyword search to get candidate matches, then applies vector similarity search on those candidates.

  Example with hybrid search:
  ```python
  response = await vector_io.query_chunks(
      vector_db_id="my_db",
      query="your query here",
      params={"mode": "hybrid", "max_chunks": 3, "score_threshold": 0.7},
  )

  # Using RRF ranker
  response = await vector_io.query_chunks(
      vector_db_id="my_db",
      query="your query here",
      params={
          "mode": "hybrid",
          "max_chunks": 3,
          "score_threshold": 0.7,
          "ranker": {"type": "rrf", "impact_factor": 60.0},
      },
  )

  # Using weighted ranker
  response = await vector_io.query_chunks(
      vector_db_id="my_db",
      query="your query here",
      params={
          "mode": "hybrid",
          "max_chunks": 3,
          "score_threshold": 0.7,
          "ranker": {"type": "weighted", "alpha": 0.7},  # 70% vector, 30% keyword
      },
  )
  ```

  Example with explicit vector search:
  ```python
  response = await vector_io.query_chunks(
      vector_db_id="my_db",
      query="your query here",
      params={"mode": "vector", "max_chunks": 3, "score_threshold": 0.7},
  )
  ```

  Example with keyword search:
  ```python
  response = await vector_io.query_chunks(
      vector_db_id="my_db",
      query="your query here",
      params={"mode": "keyword", "max_chunks": 3, "score_threshold": 0.7},
  )
  ```

  ## Supported Search Modes

  The SQLite vector store supports three search modes:

  1. **Vector Search** (`mode="vector"`): Uses vector similarity to find relevant chunks
  2. **Keyword Search** (`mode="keyword"`): Uses keyword matching to find relevant chunks
  3. **Hybrid Search** (`mode="hybrid"`): Combines both vector and keyword scores using a ranker

  ### Hybrid Search

  Hybrid search combines the strengths of both vector and keyword search by:
  - Computing vector similarity scores
  - Computing keyword match scores
  - Using a ranker to combine these scores (a small illustrative sketch follows this list)

  Two ranker types are supported:

  1. **RRF (Reciprocal Rank Fusion)**:
     - Combines ranks from both vector and keyword results
     - Uses an impact factor (default: 60.0) to control the weight of higher-ranked results
     - Good for balancing between vector and keyword results
     - The default impact factor of 60.0 comes from the original RRF paper by Cormack et al. (2009) [^1], which found this value to provide optimal performance across various retrieval tasks

  2. **Weighted**:
     - Linearly combines normalized vector and keyword scores
     - Uses an alpha parameter (0-1) to control the blend:
       - alpha=0: Only use keyword scores
       - alpha=1: Only use vector scores
       - alpha=0.5: Equal weight to both (default)
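  To make the two rankers concrete, here is a small illustrative sketch of the scoring
  arithmetic they describe (this is not the provider's actual implementation):

  ```python
  # Illustrative only: the fusion rules described above, in plain Python.

  def rrf_score(rank_vector: int, rank_keyword: int, impact_factor: float = 60.0) -> float:
      # Reciprocal Rank Fusion: sum 1 / (impact_factor + rank) over both result lists.
      return 1.0 / (impact_factor + rank_vector) + 1.0 / (impact_factor + rank_keyword)

  def weighted_score(vector_score: float, keyword_score: float, alpha: float = 0.5) -> float:
      # Linear blend of normalized scores; alpha weights the vector side.
      return alpha * vector_score + (1 - alpha) * keyword_score

  # A chunk ranked 1st by vector search and 3rd by keyword search:
  print(rrf_score(1, 3))                      # ~0.0323 with the default impact factor
  print(weighted_score(0.9, 0.4, alpha=0.7))  # 0.75
  ```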
  Example using RAGQueryConfig with different search modes:

  ```python
  from llama_stack.apis.tools import RAGQueryConfig, RRFRanker, WeightedRanker

  # Vector search
  config = RAGQueryConfig(mode="vector", max_chunks=5)

  # Keyword search
  config = RAGQueryConfig(mode="keyword", max_chunks=5)

  # Hybrid search with custom RRF ranker
  config = RAGQueryConfig(
      mode="hybrid",
      max_chunks=5,
      ranker=RRFRanker(impact_factor=50.0),  # Custom impact factor
  )

  # Hybrid search with weighted ranker
  config = RAGQueryConfig(
      mode="hybrid",
      max_chunks=5,
      ranker=WeightedRanker(alpha=0.7),  # 70% vector, 30% keyword
  )

  # Hybrid search with default RRF ranker
  config = RAGQueryConfig(
      mode="hybrid", max_chunks=5
  )  # Will use RRF with impact_factor=60.0
  ```

  Note: The ranker configuration is only used in hybrid mode. For vector or keyword modes, the ranker parameter is ignored.

  ## Installation

  You can install SQLite-Vec using pip:

  ```bash
  pip install sqlite-vec
  ```

  ## Documentation

  See [sqlite-vec's GitHub repo](https://github.com/asg017/sqlite-vec/tree/main) for more details about sqlite-vec in general.

  [^1]: Cormack, G. V., Clarke, C. L., & Buettcher, S. (2009). [Reciprocal rank fusion outperforms condorcet and individual rank learning methods](https://dl.acm.org/doi/10.1145/1571941.1572114). In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (pp. 758-759).
sidebar_label: Sqlite-Vec
title: inline::sqlite-vec
---

# inline::sqlite-vec

## Description

@ -215,6 +417,4 @@ db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/sqlite_vec.db
kvstore:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/sqlite_vec_registry.db
```
@ -1,3 +1,9 @@
---
description: "Please refer to the sqlite-vec provider documentation."
sidebar_label: Sqlite Vec
title: inline::sqlite_vec
---

# inline::sqlite_vec

## Description

@ -20,12 +26,9 @@ db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/sqlite_vec.db
kvstore:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/sqlite_vec_registry.db
```

## Deprecation Notice

:::warning
Please use the `inline::sqlite-vec` provider (notice the hyphen instead of underscore) instead.
:::
@ -1,3 +1,40 @@
---
description: |
  [Chroma](https://www.trychroma.com/) is an inline and remote vector
  database provider for Llama Stack. It allows you to store and query vectors directly within a Chroma database.
  That means you're not limited to storing vectors in memory or in a separate service.

  ## Features
  Chroma supports:
  - Store embeddings and their metadata
  - Vector search
  - Full-text search
  - Document storage
  - Metadata filtering
  - Multi-modal retrieval

  ## Usage

  To use Chroma in your Llama Stack project, follow these steps:

  1. Install the necessary dependencies.
  2. Configure your Llama Stack project to use Chroma.
  3. Start storing and querying vectors.

  ## Installation

  You can install Chroma using pip:

  ```bash
  pip install chromadb
  ```

  ## Documentation
  See [Chroma's documentation](https://docs.trychroma.com/docs/overview/introduction) for more details about Chroma in general.
sidebar_label: Remote - Chromadb
title: remote::chromadb
---

# remote::chromadb

## Description

@ -50,6 +87,4 @@ url: ${env.CHROMADB_URL}
kvstore:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/chroma_remote_registry.db
```
@ -1,3 +1,204 @@
---
description: |
  [Milvus](https://milvus.io/) is an inline and remote vector database provider for Llama Stack. It
  allows you to store and query vectors directly within a Milvus database.
  That means you're not limited to storing vectors in memory or in a separate service.

  ## Features

  - Easy to use
  - Fully integrated with Llama Stack
  - Supports all search modes: vector, keyword, and hybrid search (both inline and remote configurations)

  ## Usage

  To use Milvus in your Llama Stack project, follow these steps:

  1. Install the necessary dependencies.
  2. Configure your Llama Stack project to use Milvus.
  3. Start storing and querying vectors.

  ## Installation

  If you want to use inline Milvus, you can install:

  ```bash
  pip install pymilvus[milvus-lite]
  ```

  If you want to use remote Milvus, you can install:

  ```bash
  pip install pymilvus
  ```

  ## Configuration

  In Llama Stack, Milvus can be configured in two ways:
  - **Inline (Local) Configuration** - Uses Milvus-Lite for local storage
  - **Remote Configuration** - Connects to a remote Milvus server

  ### Inline (Local) Configuration

  The simplest method is local configuration, which requires setting `db_path`, a path for locally storing Milvus-Lite files:

  ```yaml
  vector_io:
    - provider_id: milvus
      provider_type: inline::milvus
      config:
        db_path: ~/.llama/distributions/together/milvus_store.db
  ```

  ### Remote Configuration

  Remote configuration is suitable for larger data storage requirements:

  #### Standard Remote Connection

  ```yaml
  vector_io:
    - provider_id: milvus
      provider_type: remote::milvus
      config:
        uri: "http://<host>:<port>"
        token: "<user>:<password>"
  ```

  #### TLS-Enabled Remote Connection (One-way TLS)

  For connections to Milvus instances with one-way TLS enabled:

  ```yaml
  vector_io:
    - provider_id: milvus
      provider_type: remote::milvus
      config:
        uri: "https://<host>:<port>"
        token: "<user>:<password>"
        secure: True
        server_pem_path: "/path/to/server.pem"
  ```

  #### Mutual TLS (mTLS) Remote Connection

  For connections to Milvus instances with mutual TLS (mTLS) enabled:

  ```yaml
  vector_io:
    - provider_id: milvus
      provider_type: remote::milvus
      config:
        uri: "https://<host>:<port>"
        token: "<user>:<password>"
        secure: True
        ca_pem_path: "/path/to/ca.pem"
        client_pem_path: "/path/to/client.pem"
        client_key_path: "/path/to/client.key"
  ```

  #### Key Parameters for TLS Configuration

  - **`secure`**: Enables TLS encryption when set to `true`. Defaults to `false`.
  - **`server_pem_path`**: Path to the **server certificate** for verifying the server's identity (used in one-way TLS).
  - **`ca_pem_path`**: Path to the **Certificate Authority (CA) certificate** for validating the server certificate (required in mTLS).
  - **`client_pem_path`**: Path to the **client certificate** file (required for mTLS).
  - **`client_key_path`**: Path to the **client private key** file (required for mTLS).

  ## Search Modes

  Milvus supports three different search modes for both inline and remote configurations:

  ### Vector Search
  Vector search uses semantic similarity to find the most relevant chunks based on embedding vectors. This is the default search mode and works well for finding conceptually similar content.

  ```python
  # Vector search example
  search_response = client.vector_stores.search(
      vector_store_id=vector_store.id,
      query="What is machine learning?",
      search_mode="vector",
      max_num_results=5,
  )
  ```

  ### Keyword Search
  Keyword search uses traditional text-based matching to find chunks containing specific terms or phrases. This is useful when you need exact term matches.

  ```python
  # Keyword search example
  search_response = client.vector_stores.search(
      vector_store_id=vector_store.id,
      query="Python programming language",
      search_mode="keyword",
      max_num_results=5,
  )
  ```

  ### Hybrid Search
  Hybrid search combines both vector and keyword search methods to provide more comprehensive results. It leverages the strengths of both semantic similarity and exact term matching.

  #### Basic Hybrid Search
  ```python
  # Basic hybrid search example (uses RRF ranker with default impact_factor=60.0)
  search_response = client.vector_stores.search(
      vector_store_id=vector_store.id,
      query="neural networks in Python",
      search_mode="hybrid",
      max_num_results=5,
  )
  ```

  **Note**: The default `impact_factor` value of 60.0 was empirically determined to be optimal in the original RRF research paper: ["Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods"](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) (Cormack et al., 2009).

  #### Hybrid Search with RRF (Reciprocal Rank Fusion) Ranker
  RRF combines rankings from vector and keyword search by using reciprocal ranks. The impact factor controls how much weight is given to higher-ranked results.

  ```python
  # Hybrid search with custom RRF parameters
  search_response = client.vector_stores.search(
      vector_store_id=vector_store.id,
      query="neural networks in Python",
      search_mode="hybrid",
      max_num_results=5,
      ranking_options={
          "ranker": {
              "type": "rrf",
              "impact_factor": 100.0,  # Higher values give more weight to top-ranked results
          }
      },
  )
  ```

  #### Hybrid Search with Weighted Ranker
  Weighted ranker linearly combines normalized scores from vector and keyword search. The alpha parameter controls the balance between the two search methods.

  ```python
  # Hybrid search with weighted ranker
  search_response = client.vector_stores.search(
      vector_store_id=vector_store.id,
      query="neural networks in Python",
      search_mode="hybrid",
      max_num_results=5,
      ranking_options={
          "ranker": {
              "type": "weighted",
              "alpha": 0.7,  # 70% vector search, 30% keyword search
          }
      },
  )
  ```

  For detailed documentation on RRF and Weighted rankers, please refer to the [Milvus Reranking Guide](https://milvus.io/docs/reranking.md).

  ## Documentation
  See the [Milvus documentation](https://milvus.io/docs/install-overview.md) for more details about Milvus in general.

  For more details on TLS configuration, refer to the [TLS setup guide](https://milvus.io/docs/tls.md).
sidebar_label: Remote - Milvus
title: remote::milvus
---

# remote::milvus

## Description

@ -208,12 +409,11 @@ For more details on TLS configuration, refer to the [TLS setup guide](https://mi
| `token` | `str \| None` | No | | The token of the Milvus server |
| `consistency_level` | `<class 'str'>` | No | Strong | The consistency level of the Milvus server |
| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend |
| `config` | `dict` | No | {} | This configuration allows additional fields to be passed through to the underlying Milvus client. See the [Milvus](https://milvus.io/docs/install-overview.md) documentation for more details about Milvus in general. |

:::note
This configuration class accepts additional fields beyond those listed above. You can pass any additional configuration options that will be forwarded to the underlying provider.
:::

## Sample Configuration

@ -223,6 +423,4 @@ token: ${env.MILVUS_TOKEN}
kvstore:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/milvus_remote_registry.db
```
@ -1,3 +1,108 @@
---
description: |
  [PGVector](https://github.com/pgvector/pgvector) is a remote vector database provider for Llama Stack. It
  allows you to store and query vectors directly within a PostgreSQL database.
  That means you're not limited to storing vectors in memory or in a separate service.

  ## Features

  - Easy to use
  - Fully integrated with Llama Stack

  There are three implementations of search for PGVectorIndex available:

  1. Vector Search:
  - How it works:
    - Uses PostgreSQL's vector extension (pgvector) to perform similarity search
    - Compares query embeddings against stored embeddings using Cosine distance or other distance metrics
    - Eg. SQL query: SELECT document, embedding <=> %s::vector AS distance FROM table ORDER BY distance

  - Characteristics:
    - Semantic understanding - finds documents similar in meaning even if they don't share keywords
    - Works with high-dimensional vector embeddings (typically 768, 1024, or higher dimensions)
    - Best for: Finding conceptually related content, handling synonyms, cross-language search

  2. Keyword Search
  - How it works:
    - Uses PostgreSQL's full-text search capabilities with tsvector and ts_rank
    - Converts text to searchable tokens using to_tsvector('english', text). Default language is English.
    - Eg. SQL query: SELECT document, ts_rank(tokenized_content, plainto_tsquery('english', %s)) AS score

  - Characteristics:
    - Lexical matching - finds exact keyword matches and variations
    - Uses GIN (Generalized Inverted Index) for fast text search performance
    - Scoring: Uses PostgreSQL's ts_rank function for relevance scoring
    - Best for: Exact term matching, proper names, technical terms, Boolean-style queries

  3. Hybrid Search
  - How it works:
    - Combines both vector and keyword search results
    - Runs both searches independently, then merges results using configurable reranking

  - Two reranking strategies available:
    - Reciprocal Rank Fusion (RRF) - (default: 60.0)
    - Weighted Average - (default: 0.5)

  - Characteristics:
    - Best of both worlds: semantic understanding + exact matching
    - Documents appearing in both searches get boosted scores
    - Configurable balance between semantic and lexical matching
    - Best for: General-purpose search where you want both precision and recall

  4. Database Schema
  The PGVector implementation stores data optimized for all three search types:

  CREATE TABLE vector_store_xxx (
      id TEXT PRIMARY KEY,
      document JSONB,               -- Original document
      embedding vector(dimension),  -- For vector search
      content_text TEXT,            -- Raw text content
      tokenized_content TSVECTOR    -- For keyword search
  );

  -- Indexes for performance
  CREATE INDEX content_gin_idx ON table USING GIN(tokenized_content);  -- Keyword search
  -- Vector index created automatically by pgvector

  ## Usage

  To use PGVector in your Llama Stack project, follow these steps:

  1. Install the necessary dependencies.
  2. Configure your Llama Stack project to use pgvector (e.g. `remote::pgvector`).
  3. Start storing and querying vectors.

  ## Example: Setting up your environment for PGVector

  1. Export env vars:
  ```bash
  export ENABLE_PGVECTOR=true
  export PGVECTOR_HOST=localhost
  export PGVECTOR_PORT=5432
  export PGVECTOR_DB=llamastack
  export PGVECTOR_USER=llamastack
  export PGVECTOR_PASSWORD=llamastack
  ```

  2. Create DB:
  ```bash
  psql -h localhost -U postgres -c "CREATE ROLE llamastack LOGIN PASSWORD 'llamastack';"
  psql -h localhost -U postgres -c "CREATE DATABASE llamastack OWNER llamastack;"
  psql -h localhost -U llamastack -d llamastack -c "CREATE EXTENSION IF NOT EXISTS vector;"
  ```

  ## Installation

  You can install PGVector using docker:

  ```bash
  docker pull pgvector/pgvector:pg17
  ```

  ## Documentation
  See [PGVector's documentation](https://github.com/pgvector/pgvector) for more details about PGVector in general.
sidebar_label: Remote - Pgvector
title: remote::pgvector
---

# remote::pgvector

## Description

@ -18,7 +123,7 @@ There are three implementations of search for PGVectoIndex available:
- How it works:
  - Uses PostgreSQL's vector extension (pgvector) to perform similarity search
  - Compares query embeddings against stored embeddings using Cosine distance or other distance metrics
  - Eg. SQL query: SELECT document, embedding <=> %s::vector AS distance FROM table ORDER BY distance

- Characteristics:
  - Semantic understanding - finds documents similar in meaning even if they don't share keywords

@ -126,6 +231,4 @@ password: ${env.PGVECTOR_PASSWORD}
kvstore:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/pgvector_registry.db
```
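As a hedged usage sketch (mirroring the vector-store search examples shown for Milvus above), the snippet below exercises the keyword search path described on this page; `client` and `vector_store` are assumed to have been created beforehand.

```python
# Hedged sketch; assumes an existing client and a vector store backed by
# remote::pgvector. This exercises the tsvector/ts_rank keyword path above.
search_response = client.vector_stores.search(
    vector_store_id=vector_store.id,
    query="PostgreSQL full-text search",
    search_mode="keyword",
    max_num_results=5,
)
```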
@ -1,3 +1,9 @@
---
description: "Please refer to the inline provider documentation."
sidebar_label: Remote - Qdrant
title: remote::qdrant
---

# remote::qdrant

## Description

@ -29,6 +35,4 @@ api_key: ${env.QDRANT_API_KEY:=}
kvstore:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/qdrant_registry.db
```
@ -1,3 +1,38 @@
---
description: |
  [Weaviate](https://weaviate.io/) is a vector database provider for Llama Stack.
  It allows you to store and query vectors directly within a Weaviate database.
  That means you're not limited to storing vectors in memory or in a separate service.

  ## Features
  Weaviate supports:
  - Store embeddings and their metadata
  - Vector search
  - Full-text search
  - Hybrid search
  - Document storage
  - Metadata filtering
  - Multi-modal retrieval

  ## Usage

  To use Weaviate in your Llama Stack project, follow these steps:

  1. Install the necessary dependencies.
  2. Configure your Llama Stack project to use Weaviate.
  3. Start storing and querying vectors.

  ## Installation

  To install Weaviate see the [Weaviate quickstart documentation](https://weaviate.io/developers/weaviate/quickstart).

  ## Documentation
  See [Weaviate's documentation](https://weaviate.io/developers/weaviate) for more details about Weaviate in general.
sidebar_label: Remote - Weaviate
title: remote::weaviate
---

# remote::weaviate

## Description

@ -50,6 +85,4 @@ weaviate_cluster_url: ${env.WEAVIATE_CLUSTER_URL:=localhost:8080}
kvstore:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/weaviate_registry.db
```
@ -1,15 +0,0 @@
|
|||
# Datasetio
|
||||
|
||||
## Overview
|
||||
|
||||
This section contains documentation for all available providers for the **datasetio** API.
|
||||
|
||||
## Providers
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
|
||||
inline_localfs
|
||||
remote_huggingface
|
||||
remote_nvidia
|
||||
```
|
|
@ -1,16 +0,0 @@
|
|||
# Eval
|
||||
|
||||
## Overview
|
||||
|
||||
Llama Stack Evaluation API for running evaluations on model and agent candidates.
|
||||
|
||||
This section contains documentation for all available providers for the **eval** API.
|
||||
|
||||
## Providers
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
|
||||
inline_meta-reference
|
||||
remote_nvidia
|
||||
```
|
|
@ -1,42 +0,0 @@
|
|||
# Inference
|
||||
|
||||
## Overview
|
||||
|
||||
Llama Stack Inference API for generating completions, chat completions, and embeddings.
|
||||
|
||||
This API provides the raw interface to the underlying models. Two kinds of models are supported:
|
||||
- LLM models: these models generate "raw" and "chat" (conversational) completions.
|
||||
- Embedding models: these models generate embeddings to be used for semantic search.
|
||||
|
||||
This section contains documentation for all available providers for the **inference** API.
|
||||
|
||||
## Providers
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
|
||||
inline_meta-reference
|
||||
inline_sentence-transformers
|
||||
remote_anthropic
|
||||
remote_azure
|
||||
remote_bedrock
|
||||
remote_cerebras
|
||||
remote_databricks
|
||||
remote_fireworks
|
||||
remote_gemini
|
||||
remote_groq
|
||||
remote_hf_endpoint
|
||||
remote_hf_serverless
|
||||
remote_llama-openai-compat
|
||||
remote_nvidia
|
||||
remote_ollama
|
||||
remote_openai
|
||||
remote_passthrough
|
||||
remote_runpod
|
||||
remote_sambanova
|
||||
remote_tgi
|
||||
remote_together
|
||||
remote_vertexai
|
||||
remote_vllm
|
||||
remote_watsonx
|
||||
```
|
|
@ -1,13 +0,0 @@
|
|||
# inline::sentence-transformers
|
||||
|
||||
## Description
|
||||
|
||||
Sentence Transformers inference provider for text embeddings and similarity search.
|
||||
|
||||
## Sample Configuration
|
||||
|
||||
```yaml
|
||||
{}
|
||||
|
||||
```
|
||||
|
|
@ -1,16 +0,0 @@
|
|||
# Post_Training
|
||||
|
||||
## Overview
|
||||
|
||||
This section contains documentation for all available providers for the **post_training** API.
|
||||
|
||||
## Providers
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
|
||||
inline_huggingface-gpu
|
||||
inline_torchtune-cpu
|
||||
inline_torchtune-gpu
|
||||
remote_nvidia
|
||||
```
|
|
@ -1,18 +0,0 @@
|
|||
# Safety
|
||||
|
||||
## Overview
|
||||
|
||||
This section contains documentation for all available providers for the **safety** API.
|
||||
|
||||
## Providers
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
|
||||
inline_code-scanner
|
||||
inline_llama-guard
|
||||
inline_prompt-guard
|
||||
remote_bedrock
|
||||
remote_nvidia
|
||||
remote_sambanova
|
||||
```
|
|
@ -1,15 +0,0 @@
|
|||
# Scoring
|
||||
|
||||
## Overview
|
||||
|
||||
This section contains documentation for all available providers for the **scoring** API.
|
||||
|
||||
## Providers
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
|
||||
inline_basic
|
||||
inline_braintrust
|
||||
inline_llm-as-judge
|
||||
```
|
|
@ -1,18 +0,0 @@
|
|||
# Tool_Runtime
|
||||
|
||||
## Overview
|
||||
|
||||
This section contains documentation for all available providers for the **tool_runtime** API.
|
||||
|
||||
## Providers
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
|
||||
inline_rag-runtime
|
||||
remote_bing-search
|
||||
remote_brave-search
|
||||
remote_model-context-protocol
|
||||
remote_tavily-search
|
||||
remote_wolfram-alpha
|
||||
```
|
|
@ -1,13 +0,0 @@
|
|||
# remote::model-context-protocol
|
||||
|
||||
## Description
|
||||
|
||||
Model Context Protocol (MCP) tool for standardized tool calling and context management.
|
||||
|
||||
## Sample Configuration
|
||||
|
||||
```yaml
|
||||
{}
|
||||
|
||||
```
|
||||
|
|
@ -1,24 +0,0 @@
|
|||
# Vector_Io
|
||||
|
||||
## Overview
|
||||
|
||||
This section contains documentation for all available providers for the **vector_io** API.
|
||||
|
||||
## Providers
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
|
||||
inline_chromadb
|
||||
inline_faiss
|
||||
inline_meta-reference
|
||||
inline_milvus
|
||||
inline_qdrant
|
||||
inline_sqlite-vec
|
||||
inline_sqlite_vec
|
||||
remote_chromadb
|
||||
remote_milvus
|
||||
remote_pgvector
|
||||
remote_qdrant
|
||||
remote_weaviate
|
||||
```
|
|
@ -49,22 +49,22 @@ The deployed platform includes the NIM Proxy microservice, which is the service
### Datasetio API: NeMo Data Store
The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposes APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use the client to interact with Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint.

See the [NVIDIA Datasetio docs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/datasetio/nvidia/README.md) for supported features and example usage.

### Eval API: NeMo Evaluator
The NeMo Evaluator microservice supports evaluation of LLMs. Launching an Evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint.

See the [NVIDIA Eval docs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/eval/nvidia/README.md) for supported features and example usage.

### Post-Training API: NeMo Customizer
The NeMo Customizer microservice supports fine-tuning models. You can reference [this list of supported models](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/post_training/nvidia/models.py) that can be fine-tuned using Llama Stack. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint.

See the [NVIDIA Post-Training docs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/post_training/nvidia/README.md) for supported features and example usage.

### Safety API: NeMo Guardrails
The NeMo Guardrails microservice sits between your application and the LLM, and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint.

See the [NVIDIA Safety docs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/safety/nvidia/README.md) for supported features and example usage.

## Deploying models
In order to use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`.
@ -138,4 +138,4 @@ llama stack run ./run.yaml \
```

## Example Notebooks
For examples of how to use the NVIDIA Distribution to run inference, fine-tune, evaluate, and run safety checks on your LLMs, you can reference the example notebooks in [docs/notebooks/nvidia](https://github.com/meta-llama/llama-stack/tree/main/docs/notebooks/nvidia).