Merge branch 'main' into opengauss-add

2025-12-24 07:28:03 +00:00 · 2025-08-08 20:58:48 +08:00 · 2025-08-08 20:58:48 +08:00 · 39e49ab97a
commit 39e49ab97a
parent 5e9c394500 9e78f2da96
807 changed files with 79555 additions and 26772 deletions
--- a/docs/source/providers/agents/index.md
+++ b/docs/source/providers/agents/index.md
@ -1,5 +1,13 @@
-# Agents Providers
+# Agents
+
+## Overview

 This section contains documentation for all available providers for the **agents** API.

- [inline::meta-reference](inline_meta-reference.md)
+## Providers
+
+```{toctree}
+:maxdepth: 1
+
+inline_meta-reference
+```
--- a/docs/source/providers/datasetio/index.md
+++ b/docs/source/providers/datasetio/index.md
@ -1,7 +1,15 @@
-# Datasetio Providers
+# Datasetio
+
+## Overview

 This section contains documentation for all available providers for the **datasetio** API.

- [inline::localfs](inline_localfs.md)
- [remote::huggingface](remote_huggingface.md)
- [remote::nvidia](remote_nvidia.md)
+## Providers
+
+```{toctree}
+:maxdepth: 1
+
+inline_localfs
+remote_huggingface
+remote_nvidia
+```
--- a/docs/source/providers/eval/index.md
+++ b/docs/source/providers/eval/index.md
@ -1,6 +1,14 @@
-# Eval Providers
+# Eval
+
+## Overview

 This section contains documentation for all available providers for the **eval** API.

- [inline::meta-reference](inline_meta-reference.md)
- [remote::nvidia](remote_nvidia.md)
+## Providers
+
+```{toctree}
+:maxdepth: 1
+
+inline_meta-reference
+remote_nvidia
+```
--- a/docs/source/providers/external/external-providers-guide.md
+++ b/docs/source/providers/external/external-providers-guide.md
@ -1,13 +1,17 @@
-# External Providers Guide
-
-Llama Stack supports external providers that live outside of the main codebase. This allows you to:
- Create and maintain your own providers independently
- Share providers with others without contributing to the main codebase
- Keep provider-specific code separate from the core Llama Stack code
+# Creating External Providers

 ## Configuration

-To enable external providers, you need to configure the `external_providers_dir` in your Llama Stack configuration. This directory should contain your external provider specifications:
+To enable external providers, you need to add `module` into your build yaml, allowing Llama Stack to install the required package corresponding to the external provider.
+
+an example entry in your build.yaml should look like:
+
+```
+- provider_type: remote::ramalama
+  module: ramalama_stack
+```
+
+Additionally you can configure the `external_providers_dir` in your Llama Stack configuration. This method is in the process of being deprecated in favor of the `module` method. If using this method, the external provider directory should contain your external provider specifications:

 ```yaml
 external_providers_dir: ~/.llama/providers.d/
@ -46,17 +50,6 @@ Llama Stack supports two types of external providers:
 1. **Remote Providers**: Providers that communicate with external services (e.g., cloud APIs)
 2. **Inline Providers**: Providers that run locally within the Llama Stack process

-## Known External Providers
-
-Here's a list of known external providers that you can use with Llama Stack:
-
-| Name | Description | API | Type | Repository |
-|------|-------------|-----|------|------------|
-| KubeFlow Training | Train models with KubeFlow | Post Training | Remote | [llama-stack-provider-kft](https://github.com/opendatahub-io/llama-stack-provider-kft) |
-| KubeFlow Pipelines | Train models with KubeFlow Pipelines | Post Training | Inline **and** Remote | [llama-stack-provider-kfp-trainer](https://github.com/opendatahub-io/llama-stack-provider-kfp-trainer) |
-| RamaLama | Inference models with RamaLama | Inference | Remote | [ramalama-stack](https://github.com/containers/ramalama-stack) |
-| TrustyAI LM-Eval | Evaluate models with TrustyAI LM-Eval | Eval | Remote | [llama-stack-provider-lmeval](https://github.com/trustyai-explainability/llama-stack-provider-lmeval) |
-
 ### Remote Provider Specification

 Remote providers are used when you need to communicate with external services. Here's an example for a custom Ollama provider:
@ -110,9 +103,34 @@ container_image: custom-vector-store:latest  # optional
 - `provider_data_validator`: Optional validator for provider data
 - `container_image`: Optional container image to use instead of pip packages

-## Required Implementation
+## Required Fields

-### Remote Providers
+### All Providers
+
+All providers must contain a `get_provider_spec` function in their `provider` module. This is a standardized structure that Llama Stack expects and is necessary for getting things such as the config class. The `get_provider_spec` method returns a structure identical to the `adapter`. An example function may look like:
+
+```python
+from llama_stack.providers.datatypes import (
+    ProviderSpec,
+    Api,
+    AdapterSpec,
+    remote_provider_spec,
+)
+
+
+def get_provider_spec() -> ProviderSpec:
+    return remote_provider_spec(
+        api=Api.inference,
+        adapter=AdapterSpec(
+            adapter_type="ramalama",
+            pip_packages=["ramalama>=0.8.5", "pymilvus"],
+            config_class="ramalama_stack.config.RamalamaImplConfig",
+            module="ramalama_stack",
+        ),
+    )
+```
+
+#### Remote Providers

 Remote providers must expose a `get_adapter_impl()` function in their module that takes two arguments:
 1. `config`: An instance of the provider's config class
@ -128,7 +146,7 @@ async def get_adapter_impl(
    return OllamaInferenceAdapter(config)
 ```

-### Inline Providers
+#### Inline Providers

 Inline providers must expose a `get_provider_impl()` function in their module that takes two arguments:
 1. `config`: An instance of the provider's config class
@ -155,7 +173,40 @@ Version: 0.1.0
 Location: /path/to/venv/lib/python3.10/site-packages
 ```

-## Example: Custom Ollama Provider
+## Best Practices
+
+1. **Package Naming**: Use the prefix `llama-stack-provider-` for your provider packages to make them easily identifiable.
+
+2. **Version Management**: Keep your provider package versioned and compatible with the Llama Stack version you're using.
+
+3. **Dependencies**: Only include the minimum required dependencies in your provider package.
+
+4. **Documentation**: Include clear documentation in your provider package about:
+   - Installation requirements
+   - Configuration options
+   - Usage examples
+   - Any limitations or known issues
+
+5. **Testing**: Include tests in your provider package to ensure it works correctly with Llama Stack.
+You can refer to the [integration tests
+guide](https://github.com/meta-llama/llama-stack/blob/main/tests/integration/README.md) for more
+information. Execute the test for the Provider type you are developing.
+
+## Troubleshooting
+
+If your external provider isn't being loaded:
+
+1. Check that `module` points to a published pip package with a top level `provider` module including `get_provider_spec`.
+1. Check that the `external_providers_dir` path is correct and accessible.
+2. Verify that the YAML files are properly formatted.
+3. Ensure all required Python packages are installed.
+4. Check the Llama Stack server logs for any error messages - turn on debug logging to get more
+   information using `LLAMA_STACK_LOGGING=all=debug`.
+5. Verify that the provider package is installed in your Python environment if using `external_providers_dir`.
+
+## Examples
+
+### Example using `external_providers_dir`: Custom Ollama Provider

 Here's a complete example of creating and using a custom Ollama provider:

@ -206,32 +257,30 @@ external_providers_dir: ~/.llama/providers.d/

 The provider will now be available in Llama Stack with the type `remote::custom_ollama`.

-## Best Practices

-1. **Package Naming**: Use the prefix `llama-stack-provider-` for your provider packages to make them easily identifiable.
+### Example using `module`: ramalama-stack

-2. **Version Management**: Keep your provider package versioned and compatible with the Llama Stack version you're using.
+[ramalama-stack](https://github.com/containers/ramalama-stack) is a recognized external provider that supports installation via module.

-3. **Dependencies**: Only include the minimum required dependencies in your provider package.
+To install Llama Stack with this external provider a user can provider the following build.yaml:

-4. **Documentation**: Include clear documentation in your provider package about:
-   - Installation requirements
-   - Configuration options
-   - Usage examples
-   - Any limitations or known issues
+```yaml
+version: 2
+distribution_spec:
+  description: Use (an external) Ramalama server for running LLM inference
+  container_image: null
+  providers:
+    inference:
+    - provider_type: remote::ramalama
+      module: ramalama_stack==0.3.0a0
+image_type: venv
+image_name: null
+external_providers_dir: null
+additional_pip_packages:
+- aiosqlite
+- sqlalchemy[asyncio]
+```

-5. **Testing**: Include tests in your provider package to ensure it works correctly with Llama Stack.
-You can refer to the [integration tests
-guide](https://github.com/meta-llama/llama-stack/blob/main/tests/integration/README.md) for more
-information. Execute the test for the Provider type you are developing.
+No other steps are required other than `llama stack build` and `llama stack run`. The build process will use `module` to install all of the provider dependencies, retrieve the spec, etc.

-## Troubleshooting
-
-If your external provider isn't being loaded:
-
-1. Check that the `external_providers_dir` path is correct and accessible.
-2. Verify that the YAML files are properly formatted.
-3. Ensure all required Python packages are installed.
-4. Check the Llama Stack server logs for any error messages - turn on debug logging to get more
-   information using `LLAMA_STACK_LOGGING=all=debug`.
-5. Verify that the provider package is installed in your Python environment.
+The provider will now be available in Llama Stack with the type `remote::ramalama`.
--- a/docs/source/providers/external/external-providers-list.md
+++ b/docs/source/providers/external/external-providers-list.md
@ -0,0 +1,10 @@
+# Known External Providers
+
+Here's a list of known external providers that you can use with Llama Stack:
+
+| Name | Description | API | Type | Repository |
+|------|-------------|-----|------|------------|
+| KubeFlow Training | Train models with KubeFlow | Post Training | Remote | [llama-stack-provider-kft](https://github.com/opendatahub-io/llama-stack-provider-kft) |
+| KubeFlow Pipelines | Train models with KubeFlow Pipelines | Post Training | Inline **and** Remote | [llama-stack-provider-kfp-trainer](https://github.com/opendatahub-io/llama-stack-provider-kfp-trainer) |
+| RamaLama | Inference models with RamaLama | Inference | Remote | [ramalama-stack](https://github.com/containers/ramalama-stack) |
+| TrustyAI LM-Eval | Evaluate models with TrustyAI LM-Eval | Eval | Remote | [llama-stack-provider-lmeval](https://github.com/trustyai-explainability/llama-stack-provider-lmeval) |
--- a/docs/source/providers/external/index.md
+++ b/docs/source/providers/external/index.md
@ -0,0 +1,13 @@
+# External Providers
+
+Llama Stack supports external providers that live outside of the main codebase. This allows you to:
+- Create and maintain your own providers independently
+- Share providers with others without contributing to the main codebase
+- Keep provider-specific code separate from the core Llama Stack code
+
+```{toctree}
+:maxdepth: 1
+
+external-providers-list
+external-providers-guide
+```
--- a/docs/source/providers/files/index.md
+++ b/docs/source/providers/files/index.md
@ -1,5 +1,13 @@
-# Files Providers
+# Files
+
+## Overview

 This section contains documentation for all available providers for the **files** API.

- [inline::localfs](inline_localfs.md)
+## Providers
+
+```{toctree}
+:maxdepth: 1
+
+inline_localfs
+```
--- a/docs/source/providers/files/inline_localfs.md
+++ b/docs/source/providers/files/inline_localfs.md
@ -8,7 +8,7 @@ Local filesystem-based file storage provider for managing files and documents lo

 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
-| `storage_dir` | `<class 'str'>` | No | PydanticUndefined | Directory to store uploaded files |
+| `storage_dir` | `<class 'str'>` | No |  | Directory to store uploaded files |
 | `metadata_store` | `utils.sqlstore.sqlstore.SqliteSqlStoreConfig \| utils.sqlstore.sqlstore.PostgresSqlStoreConfig` | No | sqlite | SQL store configuration for file metadata |
 | `ttl_secs` | `<class 'int'>` | No | 31536000 |  |

--- a/docs/source/providers/index.md
+++ b/docs/source/providers/index.md
@ -1,4 +1,4 @@
-# Providers Overview
+# API Providers

 The goal of Llama Stack is to build an ecosystem where users can easily swap out different implementations for the same API. Examples for these include:
 - LLM inference providers (e.g., Meta Reference, Ollama, Fireworks, Together, AWS Bedrock, Groq, Cerebras, SambaNova, vLLM, OpenAI, Anthropic, Gemini, WatsonX, etc.),
@ -12,105 +12,17 @@ Providers come in two flavors:

 Importantly, Llama Stack always strives to provide at least one fully inline provider for each API so you can iterate on a fully featured environment locally.

-## External Providers
-
-Llama Stack supports external providers that live outside of the main codebase. This allows you to create and maintain your own providers independently.
-
-```{toctree}
-:maxdepth: 1
-
-external
-```
-
-## Agents
-Run multi-step agentic workflows with LLMs with tool usage, memory (RAG), etc.
-
-```{toctree}
-:maxdepth: 1
-
-agents/index
-```
-
-## DatasetIO
-Interfaces with datasets and data loaders.
-
-```{toctree}
-:maxdepth: 1
-
-datasetio/index
-```
-
-## Eval
-Generates outputs (via Inference or Agents) and perform scoring.
-
-```{toctree}
-:maxdepth: 1
-
-eval/index
-```
-
-## Inference
-Runs inference with an LLM.
-
 ```{toctree}
 :maxdepth: 1

+external/index
+openai
 inference/index
-```
-
-## Post Training
-Fine-tunes a model.
-
-```{toctree}
-:maxdepth: 1
-
-post_training/index
-```
-
-## Safety
-Applies safety policies to the output at a Systems (not only model) level.
-
-```{toctree}
-:maxdepth: 1
-
+agents/index
+datasetio/index
 safety/index
-```
-
-## Scoring
-Evaluates the outputs of the system.
-
-```{toctree}
-:maxdepth: 1
-
-scoring/index
-```
-
-## Telemetry
-Collects telemetry data from the system.
-
-```{toctree}
-:maxdepth: 1
-
 telemetry/index
-```
-
-## Tool Runtime
-Is associated with the ToolGroup resouces.
-
-```{toctree}
-:maxdepth: 1
-
-tool_runtime/index
-```
-
-## Vector IO
-
-Vector IO refers to operations on vector databases, such as adding documents, searching, and deleting documents.
-Vector IO plays a crucial role in [Retreival Augmented Generation (RAG)](../..//building_applications/rag), where the vector
-io and database are used to store and retrieve documents for retrieval.
-
-```{toctree}
-:maxdepth: 1
-
 vector_io/index
+tool_runtime/index
+files/index
 ```
--- a/docs/source/providers/inference/index.md
+++ b/docs/source/providers/inference/index.md
@ -1,32 +1,34 @@
-# Inference Providers
+# Inference
+
+## Overview

 This section contains documentation for all available providers for the **inference** API.

- [inline::meta-reference](inline_meta-reference.md)
- [inline::sentence-transformers](inline_sentence-transformers.md)
- [inline::vllm](inline_vllm.md)
- [remote::anthropic](remote_anthropic.md)
- [remote::bedrock](remote_bedrock.md)
- [remote::cerebras](remote_cerebras.md)
- [remote::cerebras-openai-compat](remote_cerebras-openai-compat.md)
- [remote::databricks](remote_databricks.md)
- [remote::fireworks](remote_fireworks.md)
- [remote::fireworks-openai-compat](remote_fireworks-openai-compat.md)
- [remote::gemini](remote_gemini.md)
- [remote::groq](remote_groq.md)
- [remote::groq-openai-compat](remote_groq-openai-compat.md)
- [remote::hf::endpoint](remote_hf_endpoint.md)
- [remote::hf::serverless](remote_hf_serverless.md)
- [remote::llama-openai-compat](remote_llama-openai-compat.md)
- [remote::nvidia](remote_nvidia.md)
- [remote::ollama](remote_ollama.md)
- [remote::openai](remote_openai.md)
- [remote::passthrough](remote_passthrough.md)
- [remote::runpod](remote_runpod.md)
- [remote::sambanova](remote_sambanova.md)
- [remote::sambanova-openai-compat](remote_sambanova-openai-compat.md)
- [remote::tgi](remote_tgi.md)
- [remote::together](remote_together.md)
- [remote::together-openai-compat](remote_together-openai-compat.md)
- [remote::vllm](remote_vllm.md)
- [remote::watsonx](remote_watsonx.md)
+## Providers
+
+```{toctree}
+:maxdepth: 1
+
+inline_meta-reference
+inline_sentence-transformers
+remote_anthropic
+remote_bedrock
+remote_cerebras
+remote_databricks
+remote_fireworks
+remote_gemini
+remote_groq
+remote_hf_endpoint
+remote_hf_serverless
+remote_llama-openai-compat
+remote_nvidia
+remote_ollama
+remote_openai
+remote_passthrough
+remote_runpod
+remote_sambanova
+remote_tgi
+remote_together
+remote_vllm
+remote_watsonx
+```
--- a/docs/source/providers/inference/inline_vllm.md
+++ b/docs/source/providers/inference/inline_vllm.md
@ -1,29 +0,0 @@
-# inline::vllm
-
-## Description
-
-vLLM inference provider for high-performance model serving with PagedAttention and continuous batching.
-
-## Configuration
-
-| Field | Type | Required | Default | Description |
-|-------|------|----------|---------|-------------|
-| `tensor_parallel_size` | `<class 'int'>` | No | 1 | Number of tensor parallel replicas (number of GPUs to use). |
-| `max_tokens` | `<class 'int'>` | No | 4096 | Maximum number of tokens to generate. |
-| `max_model_len` | `<class 'int'>` | No | 4096 | Maximum context length to use during serving. |
-| `max_num_seqs` | `<class 'int'>` | No | 4 | Maximum parallel batch size for generation. |
-| `enforce_eager` | `<class 'bool'>` | No | False | Whether to use eager mode for inference (otherwise cuda graphs are used). |
-| `gpu_memory_utilization` | `<class 'float'>` | No | 0.3 | How much GPU memory will be allocated when this provider has finished loading, including memory that was already allocated before loading. |
-
-## Sample Configuration
-
-```yaml
-tensor_parallel_size: ${env.TENSOR_PARALLEL_SIZE:=1}
-max_tokens: ${env.MAX_TOKENS:=4096}
-max_model_len: ${env.MAX_MODEL_LEN:=4096}
-max_num_seqs: ${env.MAX_NUM_SEQS:=4}
-enforce_eager: ${env.ENFORCE_EAGER:=False}
-gpu_memory_utilization: ${env.GPU_MEMORY_UTILIZATION:=0.3}
-
-```
-
--- a/docs/source/providers/inference/remote_anthropic.md
+++ b/docs/source/providers/inference/remote_anthropic.md
@ -13,7 +13,7 @@ Anthropic inference provider for accessing Claude models and Anthropic's AI serv
 ## Sample Configuration

 ```yaml
-api_key: ${env.ANTHROPIC_API_KEY}
+api_key: ${env.ANTHROPIC_API_KEY:=}

 ```

--- a/docs/source/providers/inference/remote_cerebras-openai-compat.md
+++ b/docs/source/providers/inference/remote_cerebras-openai-compat.md
@ -1,21 +0,0 @@
-# remote::cerebras-openai-compat
-
-## Description
-
-Cerebras OpenAI-compatible provider for using Cerebras models with OpenAI API format.
-
-## Configuration
-
-| Field | Type | Required | Default | Description |
-|-------|------|----------|---------|-------------|
-| `api_key` | `str \| None` | No |  | The Cerebras API key |
-| `openai_compat_api_base` | `<class 'str'>` | No | https://api.cerebras.ai/v1 | The URL for the Cerebras API server |
-
-## Sample Configuration
-
-```yaml
-openai_compat_api_base: https://api.cerebras.ai/v1
-api_key: ${env.CEREBRAS_API_KEY}
-
-```
-
--- a/docs/source/providers/inference/remote_cerebras.md
+++ b/docs/source/providers/inference/remote_cerebras.md
@ -15,7 +15,7 @@ Cerebras inference provider for running models on Cerebras Cloud platform.

 ```yaml
 base_url: https://api.cerebras.ai
-api_key: ${env.CEREBRAS_API_KEY}
+api_key: ${env.CEREBRAS_API_KEY:=}

 ```

--- a/docs/source/providers/inference/remote_databricks.md
+++ b/docs/source/providers/inference/remote_databricks.md
@ -14,8 +14,8 @@ Databricks inference provider for running models on Databricks' unified analytic
 ## Sample Configuration

 ```yaml
-url: ${env.DATABRICKS_URL}
-api_token: ${env.DATABRICKS_API_TOKEN}
+url: ${env.DATABRICKS_URL:=}
+api_token: ${env.DATABRICKS_API_TOKEN:=}

 ```

--- a/docs/source/providers/inference/remote_fireworks-openai-compat.md
+++ b/docs/source/providers/inference/remote_fireworks-openai-compat.md
@ -1,21 +0,0 @@
-# remote::fireworks-openai-compat
-
-## Description
-
-Fireworks AI OpenAI-compatible provider for using Fireworks models with OpenAI API format.
-
-## Configuration
-
-| Field | Type | Required | Default | Description |
-|-------|------|----------|---------|-------------|
-| `api_key` | `str \| None` | No |  | The Fireworks API key |
-| `openai_compat_api_base` | `<class 'str'>` | No | https://api.fireworks.ai/inference/v1 | The URL for the Fireworks API server |
-
-## Sample Configuration
-
-```yaml
-openai_compat_api_base: https://api.fireworks.ai/inference/v1
-api_key: ${env.FIREWORKS_API_KEY}
-
-```
-
--- a/docs/source/providers/inference/remote_fireworks.md
+++ b/docs/source/providers/inference/remote_fireworks.md
@ -8,6 +8,7 @@ Fireworks AI inference provider for Llama models and other AI models on the Fire

 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
+| `allowed_models` | `list[str \| None` | No |  | List of models that should be registered with the model registry. If None, all models are allowed. |
 | `url` | `<class 'str'>` | No | https://api.fireworks.ai/inference/v1 | The URL for the Fireworks server |
 | `api_key` | `pydantic.types.SecretStr \| None` | No |  | The Fireworks.ai API Key |

@ -15,7 +16,7 @@ Fireworks AI inference provider for Llama models and other AI models on the Fire

 ```yaml
 url: https://api.fireworks.ai/inference/v1
-api_key: ${env.FIREWORKS_API_KEY}
+api_key: ${env.FIREWORKS_API_KEY:=}

 ```

--- a/docs/source/providers/inference/remote_gemini.md
+++ b/docs/source/providers/inference/remote_gemini.md
@ -13,7 +13,7 @@ Google Gemini inference provider for accessing Gemini models and Google's AI ser
 ## Sample Configuration

 ```yaml
-api_key: ${env.GEMINI_API_KEY}
+api_key: ${env.GEMINI_API_KEY:=}

 ```

--- a/docs/source/providers/inference/remote_groq-openai-compat.md
+++ b/docs/source/providers/inference/remote_groq-openai-compat.md
@ -1,21 +0,0 @@
-# remote::groq-openai-compat
-
-## Description
-
-Groq OpenAI-compatible provider for using Groq models with OpenAI API format.
-
-## Configuration
-
-| Field | Type | Required | Default | Description |
-|-------|------|----------|---------|-------------|
-| `api_key` | `str \| None` | No |  | The Groq API key |
-| `openai_compat_api_base` | `<class 'str'>` | No | https://api.groq.com/openai/v1 | The URL for the Groq API server |
-
-## Sample Configuration
-
-```yaml
-openai_compat_api_base: https://api.groq.com/openai/v1
-api_key: ${env.GROQ_API_KEY}
-
-```
-
--- a/docs/source/providers/inference/remote_groq.md
+++ b/docs/source/providers/inference/remote_groq.md
@ -15,7 +15,7 @@ Groq inference provider for ultra-fast inference using Groq's LPU technology.

 ```yaml
 url: https://api.groq.com
-api_key: ${env.GROQ_API_KEY}
+api_key: ${env.GROQ_API_KEY:=}

 ```

--- a/docs/source/providers/inference/remote_hf_endpoint.md
+++ b/docs/source/providers/inference/remote_hf_endpoint.md
@ -8,7 +8,7 @@ HuggingFace Inference Endpoints provider for dedicated model serving.

 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
-| `endpoint_name` | `<class 'str'>` | No | PydanticUndefined | The name of the Hugging Face Inference Endpoint in the format of '{namespace}/{endpoint_name}' (e.g. 'my-cool-org/meta-llama-3-1-8b-instruct-rce'). Namespace is optional and will default to the user account if not provided. |
+| `endpoint_name` | `<class 'str'>` | No |  | The name of the Hugging Face Inference Endpoint in the format of '{namespace}/{endpoint_name}' (e.g. 'my-cool-org/meta-llama-3-1-8b-instruct-rce'). Namespace is optional and will default to the user account if not provided. |
 | `api_token` | `pydantic.types.SecretStr \| None` | No |  | Your Hugging Face user access token (will default to locally saved token if not provided) |

 ## Sample Configuration
--- a/docs/source/providers/inference/remote_hf_serverless.md
+++ b/docs/source/providers/inference/remote_hf_serverless.md
@ -8,7 +8,7 @@ HuggingFace Inference API serverless provider for on-demand model inference.

 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
-| `huggingface_repo` | `<class 'str'>` | No | PydanticUndefined | The model ID of the model on the Hugging Face Hub (e.g. 'meta-llama/Meta-Llama-3.1-70B-Instruct') |
+| `huggingface_repo` | `<class 'str'>` | No |  | The model ID of the model on the Hugging Face Hub (e.g. 'meta-llama/Meta-Llama-3.1-70B-Instruct') |
 | `api_token` | `pydantic.types.SecretStr \| None` | No |  | Your Hugging Face user access token (will default to locally saved token if not provided) |

 ## Sample Configuration
--- a/docs/source/providers/inference/remote_ollama.md
+++ b/docs/source/providers/inference/remote_ollama.md
@ -9,6 +9,7 @@ Ollama inference provider for running local models through the Ollama runtime.
 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
 | `url` | `<class 'str'>` | No | http://localhost:11434 |  |
+| `refresh_models` | `<class 'bool'>` | No | False | Whether to refresh models periodically |

 ## Sample Configuration

--- a/docs/source/providers/inference/remote_openai.md
+++ b/docs/source/providers/inference/remote_openai.md
@ -9,11 +9,13 @@ OpenAI inference provider for accessing GPT models and other OpenAI services.
 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
 | `api_key` | `str \| None` | No |  | API key for OpenAI models |
+| `base_url` | `<class 'str'>` | No | https://api.openai.com/v1 | Base URL for OpenAI API |

 ## Sample Configuration

 ```yaml
-api_key: ${env.OPENAI_API_KEY}
+api_key: ${env.OPENAI_API_KEY:=}
+base_url: ${env.OPENAI_BASE_URL:=https://api.openai.com/v1}

 ```

--- a/docs/source/providers/inference/remote_sambanova-openai-compat.md
+++ b/docs/source/providers/inference/remote_sambanova-openai-compat.md
@ -15,7 +15,7 @@ SambaNova OpenAI-compatible provider for using SambaNova models with OpenAI API

 ```yaml
 openai_compat_api_base: https://api.sambanova.ai/v1
-api_key: ${env.SAMBANOVA_API_KEY}
+api_key: ${env.SAMBANOVA_API_KEY:=}

 ```

--- a/docs/source/providers/inference/remote_sambanova.md
+++ b/docs/source/providers/inference/remote_sambanova.md
@ -15,7 +15,7 @@ SambaNova inference provider for running models on SambaNova's dataflow architec

 ```yaml
 url: https://api.sambanova.ai/v1
-api_key: ${env.SAMBANOVA_API_KEY}
+api_key: ${env.SAMBANOVA_API_KEY:=}

 ```

--- a/docs/source/providers/inference/remote_tgi.md
+++ b/docs/source/providers/inference/remote_tgi.md
@ -8,12 +8,12 @@ Text Generation Inference (TGI) provider for HuggingFace model serving.

 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
-| `url` | `<class 'str'>` | No | PydanticUndefined | The URL for the TGI serving endpoint |
+| `url` | `<class 'str'>` | No |  | The URL for the TGI serving endpoint |

 ## Sample Configuration

 ```yaml
-url: ${env.TGI_URL}
+url: ${env.TGI_URL:=}

 ```

--- a/docs/source/providers/inference/remote_together-openai-compat.md
+++ b/docs/source/providers/inference/remote_together-openai-compat.md
@ -1,21 +0,0 @@
-# remote::together-openai-compat
-
-## Description
-
-Together AI OpenAI-compatible provider for using Together models with OpenAI API format.
-
-## Configuration
-
-| Field | Type | Required | Default | Description |
-|-------|------|----------|---------|-------------|
-| `api_key` | `str \| None` | No |  | The Together API key |
-| `openai_compat_api_base` | `<class 'str'>` | No | https://api.together.xyz/v1 | The URL for the Together API server |
-
-## Sample Configuration
-
-```yaml
-openai_compat_api_base: https://api.together.xyz/v1
-api_key: ${env.TOGETHER_API_KEY}
-
-```
-
--- a/docs/source/providers/inference/remote_together.md
+++ b/docs/source/providers/inference/remote_together.md
@ -8,6 +8,7 @@ Together AI inference provider for open-source models and collaborative AI devel

 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
+| `allowed_models` | `list[str \| None` | No |  | List of models that should be registered with the model registry. If None, all models are allowed. |
 | `url` | `<class 'str'>` | No | https://api.together.xyz/v1 | The URL for the Together AI server |
 | `api_key` | `pydantic.types.SecretStr \| None` | No |  | The Together AI API Key |

@ -15,7 +16,7 @@ Together AI inference provider for open-source models and collaborative AI devel

 ```yaml
 url: https://api.together.xyz/v1
-api_key: ${env.TOGETHER_API_KEY}
+api_key: ${env.TOGETHER_API_KEY:=}

 ```

--- a/docs/source/providers/inference/remote_vllm.md
+++ b/docs/source/providers/inference/remote_vllm.md
@ -12,11 +12,12 @@ Remote vLLM inference provider for connecting to vLLM servers.
 | `max_tokens` | `<class 'int'>` | No | 4096 | Maximum number of tokens to generate. |
 | `api_token` | `str \| None` | No | fake | The API token |
 | `tls_verify` | `bool \| str` | No | True | Whether to verify TLS certificates. Can be a boolean or a path to a CA certificate file. |
+| `refresh_models` | `<class 'bool'>` | No | False | Whether to refresh models periodically |

 ## Sample Configuration

 ```yaml
-url: ${env.VLLM_URL}
+url: ${env.VLLM_URL:=}
 max_tokens: ${env.VLLM_MAX_TOKENS:=4096}
 api_token: ${env.VLLM_API_TOKEN:=fake}
 tls_verify: ${env.VLLM_TLS_VERIFY:=true}
--- a/docs/source/providers/openai.md
+++ b/docs/source/providers/openai.md
@ -0,0 +1,193 @@
+## OpenAI API Compatibility
+
+### Server path
+
+Llama Stack exposes an OpenAI-compatible API endpoint at `/v1/openai/v1`. So, for a Llama Stack server running locally on port `8321`, the full url to the OpenAI-compatible API endpoint is `http://localhost:8321/v1/openai/v1`.
+
+### Clients
+
+You should be able to use any client that speaks OpenAI APIs with Llama Stack. We regularly test with the official Llama Stack clients as well as OpenAI's official Python client.
+
+#### Llama Stack Client
+
+When using the Llama Stack client, set the `base_url` to the root of your Llama Stack server. It will automatically route OpenAI-compatible requests to the right server endpoint for you.
+
+```python
+from llama_stack_client import LlamaStackClient
+
+client = LlamaStackClient(base_url="http://localhost:8321")
+```
+
+#### OpenAI Client
+
+When using an OpenAI client, set the `base_url` to the `/v1/openai/v1` path on your Llama Stack server.
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")
+```
+
+Regardless of the client you choose, the following code examples should all work the same.
+
+### APIs implemented
+
+#### Models
+
+Many of the APIs require you to pass in a model parameter. To see the list of models available in your Llama Stack server:
+
+```python
+models = client.models.list()
+```
+
+#### Responses
+
+:::{note}
+The Responses API implementation is still in active development. While it is quite usable, there are still unimplemented parts of the API. We'd love feedback on any use-cases you try that do not work to help prioritize the pieces left to implement. Please open issues in the [meta-llama/llama-stack](https://github.com/meta-llama/llama-stack) GitHub repository with details of anything that does not work.
+:::
+
+##### Simple inference
+
+Request:
+
+```
+response = client.responses.create(
+    model="meta-llama/Llama-3.2-3B-Instruct",
+    input="Write a haiku about coding."
+)
+
+print(response.output_text)
+```
+Example output:
+
+```text
+Pixels dancing slow
+Syntax whispers secrets sweet
+Code's gentle silence
+```
+
+##### Structured Output
+
+Request:
+
+```python
+response = client.responses.create(
+    model="meta-llama/Llama-3.2-3B-Instruct",
+    input=[
+        {
+            "role": "system",
+            "content": "Extract the participants from the event information.",
+        },
+        {
+            "role": "user",
+            "content": "Alice and Bob are going to a science fair on Friday.",
+        },
+    ],
+    text={
+        "format": {
+            "type": "json_schema",
+            "name": "participants",
+            "schema": {
+                "type": "object",
+                "properties": {
+                    "participants": {"type": "array", "items": {"type": "string"}}
+                },
+                "required": ["participants"],
+            },
+        }
+    },
+)
+print(response.output_text)
+```
+
+Example output:
+
+```text
+{ "participants": ["Alice", "Bob"] }
+```
+
+#### Chat Completions
+
+##### Simple inference
+
+Request:
+
+```python
+chat_completion = client.chat.completions.create(
+    model="meta-llama/Llama-3.2-3B-Instruct",
+    messages=[{"role": "user", "content": "Write a haiku about coding."}],
+)
+
+print(chat_completion.choices[0].message.content)
+```
+
+Example output:
+
+```text
+Lines of code unfold
+Logic flows like a river
+Code's gentle beauty
+```
+
+##### Structured Output
+
+Request:
+
+```python
+chat_completion = client.chat.completions.create(
+    model="meta-llama/Llama-3.2-3B-Instruct",
+    messages=[
+        {
+            "role": "system",
+            "content": "Extract the participants from the event information.",
+        },
+        {
+            "role": "user",
+            "content": "Alice and Bob are going to a science fair on Friday.",
+        },
+    ],
+    response_format={
+        "type": "json_schema",
+        "json_schema": {
+            "name": "participants",
+            "schema": {
+                "type": "object",
+                "properties": {
+                    "participants": {"type": "array", "items": {"type": "string"}}
+                },
+                "required": ["participants"],
+            },
+        },
+    },
+)
+
+print(chat_completion.choices[0].message.content)
+```
+
+Example output:
+
+```text
+{ "participants": ["Alice", "Bob"] }
+```
+
+#### Completions
+
+##### Simple inference
+
+Request:
+
+```python
+completion = client.completions.create(
+    model="meta-llama/Llama-3.2-3B-Instruct", prompt="Write a haiku about coding."
+)
+
+print(completion.choices[0].text)
+```
+
+Example output:
+
+```text
+Lines of code unfurl
+Logic whispers in the dark
+Art in hidden form
+```
--- a/docs/source/providers/post_training/huggingface.md
+++ b/docs/source/providers/post_training/huggingface.md
@ -1,122 +0,0 @@
---
-orphan: true
---
-# HuggingFace SFTTrainer
-
-[HuggingFace SFTTrainer](https://huggingface.co/docs/trl/en/sft_trainer) is an inline post training provider for Llama Stack. It allows you to run supervised fine tuning on a variety of models using many datasets
-
-## Features
-
- Simple access through the post_training API
- Fully integrated with Llama Stack
- GPU support, CPU support, and MPS support (MacOS Metal Performance Shaders)
-
-## Usage
-
-To use the HF SFTTrainer in your Llama Stack project, follow these steps:
-
-1. Configure your Llama Stack project to use this provider.
-2. Kick off a SFT job using the Llama Stack post_training API.
-
-## Setup
-
-You can access the HuggingFace trainer via the `ollama` distribution:
-
-```bash
-llama stack build --template starter --image-type venv
-llama stack run --image-type venv ~/.llama/distributions/ollama/ollama-run.yaml
-```
-
-## Run Training
-
-You can access the provider and the `supervised_fine_tune` method via the post_training API:
-
-```python
-import time
-import uuid
-
-
-from llama_stack_client.types import (
-    post_training_supervised_fine_tune_params,
-    algorithm_config_param,
-)
-
-
-def create_http_client():
-    from llama_stack_client import LlamaStackClient
-
-    return LlamaStackClient(base_url="http://localhost:8321")
-
-
-client = create_http_client()
-
-# Example Dataset
-client.datasets.register(
-    purpose="post-training/messages",
-    source={
-        "type": "uri",
-        "uri": "huggingface://datasets/llamastack/simpleqa?split=train",
-    },
-    dataset_id="simpleqa",
-)
-
-training_config = post_training_supervised_fine_tune_params.TrainingConfig(
-    data_config=post_training_supervised_fine_tune_params.TrainingConfigDataConfig(
-        batch_size=32,
-        data_format="instruct",
-        dataset_id="simpleqa",
-        shuffle=True,
-    ),
-    gradient_accumulation_steps=1,
-    max_steps_per_epoch=0,
-    max_validation_steps=1,
-    n_epochs=4,
-)
-
-algorithm_config = algorithm_config_param.LoraFinetuningConfig(  # this config is also currently mandatory but should not be
-    alpha=1,
-    apply_lora_to_mlp=True,
-    apply_lora_to_output=False,
-    lora_attn_modules=["q_proj"],
-    rank=1,
-    type="LoRA",
-)
-
-job_uuid = f"test-job{uuid.uuid4()}"
-
-# Example Model
-training_model = "ibm-granite/granite-3.3-8b-instruct"
-
-start_time = time.time()
-response = client.post_training.supervised_fine_tune(
-    job_uuid=job_uuid,
-    logger_config={},
-    model=training_model,
-    hyperparam_search_config={},
-    training_config=training_config,
-    algorithm_config=algorithm_config,
-    checkpoint_dir="output",
-)
-print("Job: ", job_uuid)
-
-
-# Wait for the job to complete!
-while True:
-    status = client.post_training.job.status(job_uuid=job_uuid)
-    if not status:
-        print("Job not found")
-        break
-
-    print(status)
-    if status.status == "completed":
-        break
-
-    print("Waiting for job to complete...")
-    time.sleep(5)
-
-end_time = time.time()
-print("Job completed in", end_time - start_time, "seconds!")
-
-print("Artifacts:")
-print(client.post_training.job.artifacts(job_uuid=job_uuid))
-```
--- a/docs/source/providers/post_training/index.md
+++ b/docs/source/providers/post_training/index.md
@ -1,7 +1,15 @@
-# Post_Training Providers
+# Post_Training
+
+## Overview

 This section contains documentation for all available providers for the **post_training** API.

- [inline::huggingface](inline_huggingface.md)
- [inline::torchtune](inline_torchtune.md)
- [remote::nvidia](remote_nvidia.md)
+## Providers
+
+```{toctree}
+:maxdepth: 1
+
+inline_huggingface
+inline_torchtune
+remote_nvidia
+```
--- a/docs/source/providers/post_training/inline_huggingface.md
+++ b/docs/source/providers/post_training/inline_huggingface.md
@ -24,6 +24,10 @@ HuggingFace-based post-training provider for fine-tuning models using the Huggin
 | `weight_decay` | `<class 'float'>` | No | 0.01 |  |
 | `dataloader_num_workers` | `<class 'int'>` | No | 4 |  |
 | `dataloader_pin_memory` | `<class 'bool'>` | No | True |  |
+| `dpo_beta` | `<class 'float'>` | No | 0.1 |  |
+| `use_reference_model` | `<class 'bool'>` | No | True |  |
+| `dpo_loss_type` | `Literal['sigmoid', 'hinge', 'ipo', 'kto_pair'` | No | sigmoid |  |
+| `dpo_output_dir` | `<class 'str'>` | No |  |  |

 ## Sample Configuration

@ -31,6 +35,7 @@ HuggingFace-based post-training provider for fine-tuning models using the Huggin
 checkpoint_format: huggingface
 distributed_backend: null
 device: cpu
+dpo_output_dir: ~/.llama/dummy/dpo_output

 ```

--- a/docs/source/providers/post_training/nvidia_nemo.md
+++ b/docs/source/providers/post_training/nvidia_nemo.md
@ -1,163 +0,0 @@
---
-orphan: true
---
-# NVIDIA NEMO
-
-[NVIDIA NEMO](https://developer.nvidia.com/nemo-framework) is a remote post training provider for Llama Stack. It provides enterprise-grade fine-tuning capabilities through NVIDIA's NeMo Customizer service.
-
-## Features
-
- Enterprise-grade fine-tuning capabilities
- Support for LoRA and SFT fine-tuning
- Integration with NVIDIA's NeMo Customizer service
- Support for various NVIDIA-optimized models
- Efficient training with NVIDIA hardware acceleration
-
-## Usage
-
-To use NVIDIA NEMO in your Llama Stack project, follow these steps:
-
-1. Configure your Llama Stack project to use this provider.
-2. Set up your NVIDIA API credentials.
-3. Kick off a fine-tuning job using the Llama Stack post_training API.
-
-## Setup
-
-You'll need to set the following environment variables:
-
-```bash
-export NVIDIA_API_KEY="your-api-key"
-export NVIDIA_DATASET_NAMESPACE="default"
-export NVIDIA_CUSTOMIZER_URL="your-customizer-url"
-export NVIDIA_PROJECT_ID="your-project-id"
-export NVIDIA_OUTPUT_MODEL_DIR="your-output-model-dir"
-```
-
-## Run Training
-
-You can access the provider and the `supervised_fine_tune` method via the post_training API:
-
-```python
-import time
-import uuid
-
-from llama_stack_client.types import (
-    post_training_supervised_fine_tune_params,
-    algorithm_config_param,
-)
-
-
-def create_http_client():
-    from llama_stack_client import LlamaStackClient
-
-    return LlamaStackClient(base_url="http://localhost:8321")
-
-
-client = create_http_client()
-
-# Example Dataset
-client.datasets.register(
-    purpose="post-training/messages",
-    source={
-        "type": "uri",
-        "uri": "huggingface://datasets/llamastack/simpleqa?split=train",
-    },
-    dataset_id="simpleqa",
-)
-
-training_config = post_training_supervised_fine_tune_params.TrainingConfig(
-    data_config=post_training_supervised_fine_tune_params.TrainingConfigDataConfig(
-        batch_size=8,  # Default batch size for NEMO
-        data_format="instruct",
-        dataset_id="simpleqa",
-        shuffle=True,
-    ),
-    n_epochs=50,  # Default epochs for NEMO
-    optimizer_config=post_training_supervised_fine_tune_params.TrainingConfigOptimizerConfig(
-        lr=0.0001,  # Default learning rate
-        weight_decay=0.01,  # NEMO-specific parameter
-    ),
-    # NEMO-specific parameters
-    log_every_n_steps=None,
-    val_check_interval=0.25,
-    sequence_packing_enabled=False,
-    hidden_dropout=None,
-    attention_dropout=None,
-    ffn_dropout=None,
-)
-
-algorithm_config = algorithm_config_param.LoraFinetuningConfig(
-    alpha=16,  # Default alpha for NEMO
-    type="LoRA",
-)
-
-job_uuid = f"test-job{uuid.uuid4()}"
-
-# Example Model - must be a supported NEMO model
-training_model = "meta/llama-3.1-8b-instruct"
-
-start_time = time.time()
-response = client.post_training.supervised_fine_tune(
-    job_uuid=job_uuid,
-    logger_config={},
-    model=training_model,
-    hyperparam_search_config={},
-    training_config=training_config,
-    algorithm_config=algorithm_config,
-    checkpoint_dir="output",
-)
-print("Job: ", job_uuid)
-
-# Wait for the job to complete!
-while True:
-    status = client.post_training.job.status(job_uuid=job_uuid)
-    if not status:
-        print("Job not found")
-        break
-
-    print(status)
-    if status.status == "completed":
-        break
-
-    print("Waiting for job to complete...")
-    time.sleep(5)
-
-end_time = time.time()
-print("Job completed in", end_time - start_time, "seconds!")
-
-print("Artifacts:")
-print(client.post_training.job.artifacts(job_uuid=job_uuid))
-```
-
-## Supported Models
-
-Currently supports the following models:
- meta/llama-3.1-8b-instruct
- meta/llama-3.2-1b-instruct
-
-## Supported Parameters
-
-### TrainingConfig
- n_epochs (default: 50)
- data_config
- optimizer_config
- log_every_n_steps
- val_check_interval (default: 0.25)
- sequence_packing_enabled (default: False)
- hidden_dropout (0.0-1.0)
- attention_dropout (0.0-1.0)
- ffn_dropout (0.0-1.0)
-
-### DataConfig
- dataset_id
- batch_size (default: 8)
-
-### OptimizerConfig
- lr (default: 0.0001)
- weight_decay (default: 0.01)
-
-### LoRA Config
- alpha (default: 16)
- type (must be "LoRA")
-
-Note: Some parameters from the standard Llama Stack API are not supported and will be ignored with a warning.
--- a/docs/source/providers/post_training/torchtune.md
+++ b/docs/source/providers/post_training/torchtune.md
@ -1,125 +0,0 @@
---
-orphan: true
---
-# TorchTune
-
-[TorchTune](https://github.com/pytorch/torchtune) is an inline post training provider for Llama Stack. It provides a simple and efficient way to fine-tune language models using PyTorch.
-
-## Features
-
- Simple access through the post_training API
- Fully integrated with Llama Stack
- GPU support and single device capabilities.
- Support for LoRA
-
-## Usage
-
-To use TorchTune in your Llama Stack project, follow these steps:
-
-1. Configure your Llama Stack project to use this provider.
-2. Kick off a fine-tuning job using the Llama Stack post_training API.
-
-## Setup
-
-You can access the TorchTune trainer by writing your own yaml pointing to the provider:
-
-```yaml
-post_training:
-  - provider_id: torchtune
-    provider_type: inline::torchtune
-    config: {}
-```
-
-you can then build and run your own stack with this provider.
-
-## Run Training
-
-You can access the provider and the `supervised_fine_tune` method via the post_training API:
-
-```python
-import time
-import uuid
-
-from llama_stack_client.types import (
-    post_training_supervised_fine_tune_params,
-    algorithm_config_param,
-)
-
-
-def create_http_client():
-    from llama_stack_client import LlamaStackClient
-
-    return LlamaStackClient(base_url="http://localhost:8321")
-
-
-client = create_http_client()
-
-# Example Dataset
-client.datasets.register(
-    purpose="post-training/messages",
-    source={
-        "type": "uri",
-        "uri": "huggingface://datasets/llamastack/simpleqa?split=train",
-    },
-    dataset_id="simpleqa",
-)
-
-training_config = post_training_supervised_fine_tune_params.TrainingConfig(
-    data_config=post_training_supervised_fine_tune_params.TrainingConfigDataConfig(
-        batch_size=32,
-        data_format="instruct",
-        dataset_id="simpleqa",
-        shuffle=True,
-    ),
-    gradient_accumulation_steps=1,
-    max_steps_per_epoch=0,
-    max_validation_steps=1,
-    n_epochs=4,
-)
-
-algorithm_config = algorithm_config_param.LoraFinetuningConfig(
-    alpha=1,
-    apply_lora_to_mlp=True,
-    apply_lora_to_output=False,
-    lora_attn_modules=["q_proj"],
-    rank=1,
-    type="LoRA",
-)
-
-job_uuid = f"test-job{uuid.uuid4()}"
-
-# Example Model
-training_model = "meta-llama/Llama-2-7b-hf"
-
-start_time = time.time()
-response = client.post_training.supervised_fine_tune(
-    job_uuid=job_uuid,
-    logger_config={},
-    model=training_model,
-    hyperparam_search_config={},
-    training_config=training_config,
-    algorithm_config=algorithm_config,
-    checkpoint_dir="output",
-)
-print("Job: ", job_uuid)
-
-# Wait for the job to complete!
-while True:
-    status = client.post_training.job.status(job_uuid=job_uuid)
-    if not status:
-        print("Job not found")
-        break
-
-    print(status)
-    if status.status == "completed":
-        break
-
-    print("Waiting for job to complete...")
-    time.sleep(5)
-
-end_time = time.time()
-print("Job completed in", end_time - start_time, "seconds!")
-
-print("Artifacts:")
-print(client.post_training.job.artifacts(job_uuid=job_uuid))
-```
--- a/docs/source/providers/safety/index.md
+++ b/docs/source/providers/safety/index.md
@ -1,10 +1,18 @@
-# Safety Providers
+# Safety
+
+## Overview

 This section contains documentation for all available providers for the **safety** API.

- [inline::code-scanner](inline_code-scanner.md)
- [inline::llama-guard](inline_llama-guard.md)
- [inline::prompt-guard](inline_prompt-guard.md)
- [remote::bedrock](remote_bedrock.md)
- [remote::nvidia](remote_nvidia.md)
- [remote::sambanova](remote_sambanova.md)
+## Providers
+
+```{toctree}
+:maxdepth: 1
+
+inline_code-scanner
+inline_llama-guard
+inline_prompt-guard
+remote_bedrock
+remote_nvidia
+remote_sambanova
+```
--- a/docs/source/providers/safety/remote_sambanova.md
+++ b/docs/source/providers/safety/remote_sambanova.md
@ -15,7 +15,7 @@ SambaNova's safety provider for content moderation and safety filtering.

 ```yaml
 url: https://api.sambanova.ai/v1
-api_key: ${env.SAMBANOVA_API_KEY}
+api_key: ${env.SAMBANOVA_API_KEY:=}

 ```

--- a/docs/source/providers/scoring/index.md
+++ b/docs/source/providers/scoring/index.md
@ -1,7 +1,15 @@
-# Scoring Providers
+# Scoring
+
+## Overview

 This section contains documentation for all available providers for the **scoring** API.

- [inline::basic](inline_basic.md)
- [inline::braintrust](inline_braintrust.md)
- [inline::llm-as-judge](inline_llm-as-judge.md)
+## Providers
+
+```{toctree}
+:maxdepth: 1
+
+inline_basic
+inline_braintrust
+inline_llm-as-judge
+```
--- a/docs/source/providers/telemetry/index.md
+++ b/docs/source/providers/telemetry/index.md
@ -1,5 +1,13 @@
-# Telemetry Providers
+# Telemetry
+
+## Overview

 This section contains documentation for all available providers for the **telemetry** API.

- [inline::meta-reference](inline_meta-reference.md)
+## Providers
+
+```{toctree}
+:maxdepth: 1
+
+inline_meta-reference
+```
--- a/docs/source/providers/tool_runtime/index.md
+++ b/docs/source/providers/tool_runtime/index.md
@ -1,10 +1,18 @@
-# Tool_Runtime Providers
+# Tool_Runtime
+
+## Overview

 This section contains documentation for all available providers for the **tool_runtime** API.

- [inline::rag-runtime](inline_rag-runtime.md)
- [remote::bing-search](remote_bing-search.md)
- [remote::brave-search](remote_brave-search.md)
- [remote::model-context-protocol](remote_model-context-protocol.md)
- [remote::tavily-search](remote_tavily-search.md)
- [remote::wolfram-alpha](remote_wolfram-alpha.md)
+## Providers
+
+```{toctree}
+:maxdepth: 1
+
+inline_rag-runtime
+remote_bing-search
+remote_brave-search
+remote_model-context-protocol
+remote_tavily-search
+remote_wolfram-alpha
+```
--- a/docs/source/providers/vector_io/index.md
+++ b/docs/source/providers/vector_io/index.md
@ -1,17 +1,17 @@
-# Vector_Io Providers
+## Providers

-This section contains documentation for all available providers for the **vector_io** API.
+```{toctree}
+:maxdepth: 1

- [inline::chromadb](inline_chromadb.md)
- [inline::faiss](inline_faiss.md)
- [inline::meta-reference](inline_meta-reference.md)
- [inline::milvus](inline_milvus.md)
- [inline::qdrant](inline_qdrant.md)
- [inline::sqlite-vec](inline_sqlite-vec.md)
- [inline::sqlite_vec](inline_sqlite_vec.md)
- [remote::chromadb](remote_chromadb.md)
- [remote::milvus](remote_milvus.md)
- [remote::opengauss](remote_opengauss.md)
- [remote::pgvector](remote_pgvector.md)
- [remote::qdrant](remote_qdrant.md)
- [remote::weaviate](remote_weaviate.md)
+inline_chromadb
+inline_faiss
+inline_meta-reference
+inline_milvus
+inline_qdrant
+inline_sqlite-vec
+remote_chromadb
+remote_milvus
+remote_opengauss
+remote_pgvector
+remote_qdrant
+remote_weaviate
--- a/docs/source/providers/vector_io/inline_chromadb.md
+++ b/docs/source/providers/vector_io/inline_chromadb.md
@ -41,12 +41,16 @@ See [Chroma's documentation](https://docs.trychroma.com/docs/overview/introducti

 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
-| `db_path` | `<class 'str'>` | No | PydanticUndefined |  |
+| `db_path` | `<class 'str'>` | No |  |  |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend |

 ## Sample Configuration

 ```yaml
 db_path: ${env.CHROMADB_PATH}
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/chroma_inline_registry.db

 ```

--- a/docs/source/providers/vector_io/inline_milvus.md
+++ b/docs/source/providers/vector_io/inline_milvus.md
@ -10,7 +10,7 @@ Please refer to the remote provider documentation.

 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
-| `db_path` | `<class 'str'>` | No | PydanticUndefined |  |
+| `db_path` | `<class 'str'>` | No |  |  |
 | `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend (SQLite only for now) |
 | `consistency_level` | `<class 'str'>` | No | Strong | The consistency level of the Milvus server |

--- a/docs/source/providers/vector_io/inline_qdrant.md
+++ b/docs/source/providers/vector_io/inline_qdrant.md
@ -50,12 +50,16 @@ See the [Qdrant documentation](https://qdrant.tech/documentation/) for more deta

 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
-| `path` | `<class 'str'>` | No | PydanticUndefined |  |
+| `path` | `<class 'str'>` | No |  |  |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite |  |

 ## Sample Configuration

 ```yaml
 path: ${env.QDRANT_PATH:=~/.llama/~/.llama/dummy}/qdrant.db
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/qdrant_registry.db

 ```

--- a/docs/source/providers/vector_io/inline_sqlite-vec.md
+++ b/docs/source/providers/vector_io/inline_sqlite-vec.md
@ -205,7 +205,7 @@ See [sqlite-vec's GitHub repo](https://github.com/asg017/sqlite-vec/tree/main) f

 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
-| `db_path` | `<class 'str'>` | No | PydanticUndefined | Path to the SQLite database file |
+| `db_path` | `<class 'str'>` | No |  | Path to the SQLite database file |
 | `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend (SQLite only for now) |

 ## Sample Configuration
--- a/docs/source/providers/vector_io/inline_sqlite_vec.md
+++ b/docs/source/providers/vector_io/inline_sqlite_vec.md
@ -10,7 +10,7 @@ Please refer to the sqlite-vec provider documentation.

 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
-| `db_path` | `<class 'str'>` | No | PydanticUndefined | Path to the SQLite database file |
+| `db_path` | `<class 'str'>` | No |  | Path to the SQLite database file |
 | `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend (SQLite only for now) |

 ## Sample Configuration
--- a/docs/source/providers/vector_io/remote_chromadb.md
+++ b/docs/source/providers/vector_io/remote_chromadb.md
@ -40,12 +40,16 @@ See [Chroma's documentation](https://docs.trychroma.com/docs/overview/introducti

 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
-| `url` | `str \| None` | No | PydanticUndefined |  |
+| `url` | `str \| None` | No |  |  |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend |

 ## Sample Configuration

 ```yaml
 url: ${env.CHROMADB_URL}
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/chroma_remote_registry.db

 ```

--- a/docs/source/providers/vector_io/remote_milvus.md
+++ b/docs/source/providers/vector_io/remote_milvus.md
@ -111,10 +111,10 @@ For more details on TLS configuration, refer to the [TLS setup guide](https://mi

 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
-| `uri` | `<class 'str'>` | No | PydanticUndefined | The URI of the Milvus server |
-| `token` | `str \| None` | No | PydanticUndefined | The token of the Milvus server |
+| `uri` | `<class 'str'>` | No |  | The URI of the Milvus server |
+| `token` | `str \| None` | No |  | The token of the Milvus server |
 | `consistency_level` | `<class 'str'>` | No | Strong | The consistency level of the Milvus server |
-| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig, annotation=NoneType, required=False, default='sqlite', discriminator='type'` | No |  | Config for KV store backend (SQLite only for now) |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend |
 | `config` | `dict` | No | {} | This configuration allows additional fields to be passed through to the underlying Milvus client. See the [Milvus](https://milvus.io/docs/install-overview.md) documentation for more details about Milvus in general. |

 > **Note**: This configuration class accepts additional fields beyond those listed above. You can pass any additional configuration options that will be forwarded to the underlying provider.
@ -124,6 +124,9 @@ For more details on TLS configuration, refer to the [TLS setup guide](https://mi
 ```yaml
 uri: ${env.MILVUS_ENDPOINT}
 token: ${env.MILVUS_TOKEN}
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/milvus_remote_registry.db

 ```

--- a/docs/source/providers/vector_io/remote_pgvector.md
+++ b/docs/source/providers/vector_io/remote_pgvector.md
@ -17,7 +17,7 @@ That means you'll get fast and efficient vector retrieval.
 To use PGVector in your Llama Stack project, follow these steps:

 1. Install the necessary dependencies.
-2. Configure your Llama Stack project to use Faiss.
+2. Configure your Llama Stack project to use pgvector. (e.g. remote::pgvector).
 3. Start storing and querying vectors.

 ## Installation
@ -40,6 +40,7 @@ See [PGVector's documentation](https://github.com/pgvector/pgvector) for more de
 | `db` | `str \| None` | No | postgres |  |
 | `user` | `str \| None` | No | postgres |  |
 | `password` | `str \| None` | No | mysecretpassword |  |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig, annotation=NoneType, required=False, default='sqlite', discriminator='type'` | No |  | Config for KV store backend (SQLite only for now) |

 ## Sample Configuration

@ -49,6 +50,9 @@ port: ${env.PGVECTOR_PORT:=5432}
 db: ${env.PGVECTOR_DB}
 user: ${env.PGVECTOR_USER}
 password: ${env.PGVECTOR_PASSWORD}
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/pgvector_registry.db

 ```

--- a/docs/source/providers/vector_io/remote_qdrant.md
+++ b/docs/source/providers/vector_io/remote_qdrant.md
@ -20,11 +20,15 @@ Please refer to the inline provider documentation.
 | `prefix` | `str \| None` | No |  |  |
 | `timeout` | `int \| None` | No |  |  |
 | `host` | `str \| None` | No |  |  |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite |  |

 ## Sample Configuration

 ```yaml
-api_key: ${env.QDRANT_API_KEY}
+api_key: ${env.QDRANT_API_KEY:=}
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/qdrant_registry.db

 ```

--- a/docs/source/providers/vector_io/remote_weaviate.md
+++ b/docs/source/providers/vector_io/remote_weaviate.md
@ -33,10 +33,22 @@ To install Weaviate see the [Weaviate quickstart documentation](https://weaviate
 See [Weaviate's documentation](https://weaviate.io/developers/weaviate) for more details about Weaviate in general.


+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `weaviate_api_key` | `str \| None` | No |  | The API key for the Weaviate instance |
+| `weaviate_cluster_url` | `str \| None` | No | localhost:8080 | The URL of the Weaviate cluster |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig, annotation=NoneType, required=False, default='sqlite', discriminator='type'` | No |  | Config for KV store backend (SQLite only for now) |
+
 ## Sample Configuration

 ```yaml
-{}
+weaviate_api_key: null
+weaviate_cluster_url: ${env.WEAVIATE_CLUSTER_URL:=localhost:8080}
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/weaviate_registry.db

 ```