Merge branch 'main' into opengauss-add

windy 2025-08-08 20:58:48 +08:00 committed by GitHub
commit 39e49ab97a
807 changed files with 79555 additions and 26772 deletions

View file

@ -0,0 +1,6 @@
# Eval Providers
This section contains documentation for all available providers for the **eval** API.
- [inline::meta-reference](inline_meta-reference.md)
- [remote::nvidia](remote_nvidia.md)

View file

@ -0,0 +1,25 @@
---
orphan: true
---
# inline::meta-reference
## Description
Meta's reference implementation of evaluation tasks with support for multiple languages and evaluation metrics.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | |
## Sample Configuration
```yaml
kvstore:
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/meta_reference_eval.db
```

View file

@ -0,0 +1,23 @@
---
orphan: true
---
# remote::nvidia
## Description
NVIDIA's evaluation provider for running evaluation tasks on NVIDIA's platform.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `evaluator_url` | `<class 'str'>` | No | http://0.0.0.0:7331 | The URL for accessing the evaluator service |
## Sample Configuration
```yaml
evaluator_url: ${env.NVIDIA_EVALUATOR_URL:=http://localhost:7331}
```

View file

@ -43,7 +43,7 @@ We have built-in functionality to run the supported open-benchmarks using llama-
Spin up the Llama Stack server with the 'open-benchmark' template
```
llama stack run llama_stack/templates/open-benchmark/run.yaml
llama stack run llama_stack/distributions/open-benchmark/run.yaml
```

View file

@ -0,0 +1,33 @@
# Advanced APIs
## Post-training
Fine-tunes a model.
```{toctree}
:maxdepth: 1
post_training/index
```
## Eval
Generates outputs (via Inference or Agents) and performs scoring.
```{toctree}
:maxdepth: 1
eval/index
```
```{include} evaluation_concepts.md
:start-after: ## Evaluation Concepts
```
## Scoring
Evaluates the outputs of the system.
```{toctree}
:maxdepth: 1
scoring/index
```

View file

@ -23,7 +23,7 @@ To use the HF SFTTrainer in your Llama Stack project, follow these steps:
You can access the HuggingFace trainer via the `ollama` distribution:
```bash
llama stack build --template starter --image-type venv
llama stack build --distro starter --image-type venv
llama stack run --image-type venv ~/.llama/distributions/ollama/ollama-run.yaml
```

View file

@ -0,0 +1,7 @@
# Post_Training Providers
This section contains documentation for all available providers for the **post_training** API.
- [inline::huggingface](inline_huggingface.md)
- [inline::torchtune](inline_torchtune.md)
- [remote::nvidia](remote_nvidia.md)

View file

@ -0,0 +1,37 @@
---
orphan: true
---
# inline::huggingface
## Description
HuggingFace-based post-training provider for fine-tuning models using the HuggingFace ecosystem.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `device` | `<class 'str'>` | No | cuda | |
| `distributed_backend` | `Literal['fsdp', 'deepspeed']` | No | | |
| `checkpoint_format` | `Literal['full_state', 'huggingface']` | No | huggingface | |
| `chat_template` | `<class 'str'>` | No | | |
| `model_specific_config` | `<class 'dict'>` | No | {'trust_remote_code': True, 'attn_implementation': 'sdpa'} | |
| `max_seq_length` | `<class 'int'>` | No | 2048 | |
| `gradient_checkpointing` | `<class 'bool'>` | No | False | |
| `save_total_limit` | `<class 'int'>` | No | 3 | |
| `logging_steps` | `<class 'int'>` | No | 10 | |
| `warmup_ratio` | `<class 'float'>` | No | 0.1 | |
| `weight_decay` | `<class 'float'>` | No | 0.01 | |
| `dataloader_num_workers` | `<class 'int'>` | No | 4 | |
| `dataloader_pin_memory` | `<class 'bool'>` | No | True | |
## Sample Configuration
```yaml
checkpoint_format: huggingface
distributed_backend: null
device: cpu
```

View file

@ -0,0 +1,24 @@
---
orphan: true
---
# inline::torchtune
## Description
TorchTune-based post-training provider for fine-tuning and optimizing models using Meta's TorchTune framework.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `torch_seed` | `int \| None` | No | | |
| `checkpoint_format` | `Literal['meta', 'huggingface']` | No | meta | |
## Sample Configuration
```yaml
checkpoint_format: meta
```

View file

@ -0,0 +1,32 @@
---
orphan: true
---
# remote::nvidia
## Description
NVIDIA's post-training provider for fine-tuning models on NVIDIA's platform.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | The NVIDIA API key. |
| `dataset_namespace` | `str \| None` | No | default | The NVIDIA dataset namespace. |
| `project_id` | `str \| None` | No | test-example-model@v1 | The NVIDIA project ID. |
| `customizer_url` | `str \| None` | No | | Base URL for the NeMo Customizer API |
| `timeout` | `<class 'int'>` | No | 300 | Timeout for the NVIDIA Post Training API |
| `max_retries` | `<class 'int'>` | No | 3 | Maximum number of retries for the NVIDIA Post Training API |
| `output_model_dir` | `<class 'str'>` | No | test-example-model@v1 | Directory to save the output model |
## Sample Configuration
```yaml
api_key: ${env.NVIDIA_API_KEY:=}
dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default}
project_id: ${env.NVIDIA_PROJECT_ID:=test-project}
customizer_url: ${env.NVIDIA_CUSTOMIZER_URL:=http://nemo.test}
```

View file

@ -0,0 +1,7 @@
# Scoring Providers
This section contains documentation for all available providers for the **scoring** API.
- [inline::basic](inline_basic.md)
- [inline::braintrust](inline_braintrust.md)
- [inline::llm-as-judge](inline_llm-as-judge.md)

View file

@ -0,0 +1,17 @@
---
orphan: true
---
# inline::basic
## Description
Basic scoring provider for simple evaluation metrics and scoring functions.
## Sample Configuration
```yaml
{}
```

View file

@ -0,0 +1,23 @@
---
orphan: true
---
# inline::braintrust
## Description
Braintrust scoring provider for evaluation and scoring using the Braintrust platform.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `openai_api_key` | `str \| None` | No | | The OpenAI API Key |
## Sample Configuration
```yaml
openai_api_key: ${env.OPENAI_API_KEY:=}
```

View file

@ -0,0 +1,17 @@
---
orphan: true
---
# inline::llm-as-judge
## Description
LLM-as-judge scoring provider that uses language models to evaluate and score responses.
## Sample Configuration
```yaml
{}
```

View file

@ -0,0 +1,392 @@
# External APIs
Llama Stack supports external APIs that live outside of the main codebase. This allows you to:
- Create and maintain your own APIs independently
- Share APIs with others without contributing to the main codebase
- Keep API-specific code separate from the core Llama Stack code
## Configuration
To enable external APIs, you need to configure the `external_apis_dir` in your Llama Stack configuration. This directory should contain your external API specifications:
```yaml
external_apis_dir: ~/.llama/apis.d/
```
## Directory Structure
The external APIs directory should follow this structure:
```
apis.d/
custom_api1.yaml
custom_api2.yaml
```
Each YAML file in this directory defines an API specification.
## API Specification
Here's an example of an external API specification for a weather API:
```yaml
module: weather
api_dependencies:
- inference
protocol: WeatherAPI
name: weather
pip_packages:
- llama-stack-api-weather
```
### API Specification Fields
- `module`: Python module containing the API implementation
- `protocol`: Name of the protocol class for the API
- `name`: Name of the API
- `pip_packages`: List of pip packages to install the API, typically a single package
## Required Implementation
External APIs must expose an `available_providers()` function in their module that returns a list of provider specifications:
```python
# llama_stack_api_weather/api.py
from llama_stack.providers.datatypes import Api, InlineProviderSpec, ProviderSpec
def available_providers() -> list[ProviderSpec]:
return [
InlineProviderSpec(
api=Api.weather,
provider_type="inline::darksky",
pip_packages=[],
module="llama_stack_provider_darksky",
config_class="llama_stack_provider_darksky.DarkSkyWeatherImplConfig",
),
]
```
The module must also define the Protocol class named in the spec, like so:
```python
# llama_stack_api_weather/api.py
from typing import Protocol
from llama_stack.schema_utils import webmethod
class WeatherAPI(Protocol):
"""
A protocol for the Weather API.
"""
@webmethod(route="/locations", method="GET")
async def get_available_locations() -> dict[str, list[str]]:
"""
Get the available locations.
"""
...
```
## Example: Custom API
Here's a complete example of creating and using a custom API:
1. First, create the API package:
```bash
mkdir -p llama-stack-api-weather
cd llama-stack-api-weather
mkdir src/llama_stack_api_weather
git init
uv init
```
2. Edit `pyproject.toml`:
```toml
[project]
name = "llama-stack-api-weather"
version = "0.1.0"
description = "Weather API for Llama Stack"
readme = "README.md"
requires-python = ">=3.10"
dependencies = ["llama-stack", "pydantic"]
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"
[tool.setuptools.packages.find]
where = ["src"]
include = ["llama_stack_api_weather", "llama_stack_api_weather.*"]
```
3. Create the initial files:
```bash
touch src/llama_stack_api_weather/__init__.py
touch src/llama_stack_api_weather/api.py
```
```python
# llama-stack-api-weather/src/llama_stack_api_weather/__init__.py
"""Weather API for Llama Stack."""
from .api import WeatherProvider, available_providers
__all__ = ["WeatherProvider", "available_providers"]
```
4. Create the API implementation:
```python
# llama-stack-api-weather/src/llama_stack_api_weather/api.py
from typing import Protocol
from llama_stack.providers.datatypes import (
AdapterSpec,
Api,
ProviderSpec,
RemoteProviderSpec,
)
from llama_stack.schema_utils import webmethod
def available_providers() -> list[ProviderSpec]:
return [
RemoteProviderSpec(
api=Api.weather,
provider_type="remote::kaze",
config_class="llama_stack_provider_kaze.KazeProviderConfig",
adapter=AdapterSpec(
adapter_type="kaze",
module="llama_stack_provider_kaze",
pip_packages=["llama_stack_provider_kaze"],
config_class="llama_stack_provider_kaze.KazeProviderConfig",
),
),
]
class WeatherProvider(Protocol):
"""
A protocol for the Weather API.
"""
@webmethod(route="/weather/locations", method="GET")
async def get_available_locations() -> dict[str, list[str]]:
"""
Get the available locations.
"""
...
```
5. Create the API specification:
```yaml
# ~/.llama/apis.d/weather.yaml
module: llama_stack_api_weather
name: weather
pip_packages: ["llama-stack-api-weather"]
protocol: WeatherProvider
```
6. Install the API package:
```bash
uv pip install -e .
```
7. Configure Llama Stack to use external APIs:
```yaml
version: "2"
image_name: "llama-stack-api-weather"
apis:
- weather
providers: {}
external_apis_dir: ~/.llama/apis.d
```
The API will now be available at `/v1/weather/locations`.
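Once a provider for the API is registered (see the next section), the endpoint can also be exercised from Python. This is a minimal sketch, assuming the stack server is running on the default port 8321 and that the `requests` package is installed:
```python
# Minimal smoke test for the external weather API. Assumes a running Llama Stack
# server on localhost:8321 with a weather provider configured, and `requests` installed.
import requests

resp = requests.get("http://127.0.0.1:8321/v1/weather/locations", timeout=10)
resp.raise_for_status()
print(resp.json())  # expected shape: {"locations": [...]}
```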
## Example: custom provider for the weather API
1. Create the provider package:
```bash
mkdir -p llama-stack-provider-kaze
cd llama-stack-provider-kaze
uv init
```
2. Edit `pyproject.toml`:
```toml
[project]
name = "llama-stack-provider-kaze"
version = "0.1.0"
description = "Kaze weather provider for Llama Stack"
readme = "README.md"
requires-python = ">=3.10"
dependencies = ["llama-stack", "pydantic", "aiohttp"]
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"
[tool.setuptools.packages.find]
where = ["src"]
include = ["llama_stack_provider_kaze", "llama_stack_provider_kaze.*"]
```
3. Create the initial files:
```bash
touch src/llama_stack_provider_kaze/__init__.py
touch src/llama_stack_provider_kaze/kaze.py
```
4. Create the provider implementation:
Initialization function:
```python
# llama-stack-provider-kaze/src/llama_stack_provider_kaze/__init__.py
"""Kaze weather provider for Llama Stack."""
from .config import KazeProviderConfig
from .kaze import WeatherKazeAdapter
__all__ = ["KazeProviderConfig", "WeatherKazeAdapter"]
async def get_adapter_impl(config: KazeProviderConfig, _deps):
from .kaze import WeatherKazeAdapter
impl = WeatherKazeAdapter(config)
await impl.initialize()
return impl
```
Configuration:
```python
# llama-stack-provider-kaze/src/llama_stack_provider_kaze/config.py
from pydantic import BaseModel, Field
class KazeProviderConfig(BaseModel):
"""Configuration for the Kaze weather provider."""
base_url: str = Field(
"https://api.kaze.io/v1",
description="Base URL for the Kaze weather API",
)
```
Main implementation:
```python
# llama-stack-provider-kaze/src/llama_stack_provider_kaze/kaze.py
from llama_stack_api_weather.api import WeatherProvider
from .config import KazeProviderConfig
class WeatherKazeAdapter(WeatherProvider):
"""Kaze weather provider implementation."""
def __init__(
self,
config: KazeProviderConfig,
) -> None:
self.config = config
async def initialize(self) -> None:
pass
async def get_available_locations(self) -> dict[str, list[str]]:
"""Get available weather locations."""
return {"locations": ["Paris", "Tokyo"]}
```
5. Create the provider specification:
```yaml
# ~/.llama/providers.d/remote/weather/kaze.yaml
adapter:
adapter_type: kaze
pip_packages: ["llama_stack_provider_kaze"]
config_class: llama_stack_provider_kaze.config.KazeProviderConfig
module: llama_stack_provider_kaze
optional_api_dependencies: []
```
6. Install the provider package:
```bash
uv pip install -e .
```
7. Configure Llama Stack to use the provider:
```yaml
# ~/.llama/run-byoa.yaml
version: "2"
image_name: "llama-stack-api-weather"
apis:
- weather
providers:
weather:
- provider_id: kaze
provider_type: remote::kaze
config: {}
external_apis_dir: ~/.llama/apis.d
external_providers_dir: ~/.llama/providers.d
server:
port: 8321
```
8. Run the server:
```bash
python -m llama_stack.core.server.server --yaml-config ~/.llama/run-byoa.yaml
```
9. Test the API:
```bash
curl -sSf http://127.0.0.1:8321/v1/weather/locations
{"locations":["Paris","Tokyo"]}%
```
## Best Practices
1. **Package Naming**: Use a clear and descriptive name for your API package.
2. **Version Management**: Keep your API package versioned and compatible with the Llama Stack version you're using.
3. **Dependencies**: Only include the minimum required dependencies in your API package.
4. **Documentation**: Include clear documentation in your API package about:
- Installation requirements
- Configuration options
- API endpoints and usage
- Any limitations or known issues
5. **Testing**: Include tests in your API package to ensure it works correctly with Llama Stack.
## Troubleshooting
If your external API isn't being loaded:
1. Check that the `external_apis_dir` path is correct and accessible.
2. Verify that the YAML files are properly formatted.
3. Ensure all required Python packages are installed.
4. Check the Llama Stack server logs for any error messages - turn on debug logging to get more information using `LLAMA_STACK_LOGGING=all=debug`.
5. Verify that the API package is installed in your Python environment.

View file

@ -1,4 +1,4 @@
# Building AI Applications (Examples)
# AI Application Examples
Llama Stack provides all the building blocks needed to create sophisticated AI applications.
@ -11,6 +11,7 @@ Here are some key topics that will help you build effective agents:
- **[RAG (Retrieval-Augmented Generation)](rag)**: Learn how to enhance your agents with external knowledge through retrieval mechanisms.
- **[Agent](agent)**: Understand the components and design patterns of the Llama Stack agent framework.
- **[Agent Execution Loop](agent_execution_loop)**: Understand how agents process information, make decisions, and execute actions in a continuous loop.
- **[Agents vs Responses API](responses_vs_agents)**: Learn the differences between the Agents API and Responses API, and when to use each one.
- **[Tools](tools)**: Extend your agents' capabilities by integrating with external tools and APIs.
- **[Evals](evals)**: Evaluate your agents' effectiveness and identify areas for improvement.
- **[Telemetry](telemetry)**: Monitor and analyze your agents' performance and behavior.
@ -23,8 +24,10 @@ Here are some key topics that will help you build effective agents:
rag
agent
agent_execution_loop
responses_vs_agents
tools
evals
telemetry
safety
```
playground/index
```

View file

@ -1,4 +1,4 @@
# Llama Stack Playground
## Llama Stack Playground
```{note}
The Llama Stack Playground is currently experimental and subject to change. We welcome feedback and contributions to help improve it.
@ -9,7 +9,7 @@ The Llama Stack Playground is a simple interface which aims to:
- Demo **end-to-end** application code to help users get started building their own applications
- Provide a **UI** to help users inspect and understand Llama Stack API providers and resources
## Key Features
### Key Features
#### Playground
Interactive pages for users to play with and explore Llama Stack API capabilities.
@ -90,18 +90,18 @@ Interactive pages for users to play with and explore Llama Stack API capabilitie
- Under the hood, it uses Llama Stack's `/<resources>/list` API to get information about each resource.
- Please visit [Core Concepts](https://llama-stack.readthedocs.io/en/latest/concepts/index.html) for more details about the resources.
## Starting the Llama Stack Playground
### Starting the Llama Stack Playground
To start the Llama Stack Playground, run the following commands:
1. Start up the Llama Stack API server
```bash
llama stack build --template together --image-type conda
llama stack build --distro together --image-type venv
llama stack run together
```
2. Start Streamlit UI
```bash
uv run --with ".[ui]" streamlit run llama_stack/distribution/ui/app.py
uv run --with ".[ui]" streamlit run llama_stack/core/ui/app.py
```

View file

@ -0,0 +1,177 @@
# Agents vs OpenAI Responses API
Llama Stack (LLS) provides two different APIs for building AI applications with tool calling capabilities: the **Agents API** and the **OpenAI Responses API**. While both enable AI systems to use tools and maintain full conversation history, they serve different use cases and have distinct characteristics.
> **Note:** For simple, basic inference, you may want to use the [Chat Completions API](https://llama-stack.readthedocs.io/en/latest/providers/index.html#chat-completions) directly before progressing to the Agents or Responses APIs.
## Overview
### LLS Agents API
The Agents API is a full-featured, stateful system designed for complex, multi-turn conversations. It maintains conversation state through persistent sessions identified by a unique session ID. The API supports comprehensive agent lifecycle management, detailed execution tracking, and rich metadata about each interaction through a structured session/turn/step hierarchy. The API can orchestrate multiple tool calls within a single turn.
### OpenAI Responses API
The OpenAI Responses API is a full-featured, stateful system designed for complex, multi-turn conversations, offering direct compatibility with OpenAI's conversational patterns enhanced by Llama Stack's tool calling capabilities. It maintains conversation state by chaining responses through a `previous_response_id`, allowing interactions to branch or continue from any prior point. Each response can perform multiple tool calls within a single turn.
### Key Differences
The LLS Agents API uses the Chat Completions API on the backend for inference as it's the industry standard for building AI applications and most LLM providers are compatible with this API. For a detailed comparison between Responses and Chat Completions, see [OpenAI's documentation](https://platform.openai.com/docs/guides/responses-vs-chat-completions).
Additionally, Agents let you specify input/output shields whereas Responses do not (though support is planned). Agents use a linear conversation model referenced by a single session ID. Responses, on the other hand, support branching, where each response can serve as a fork point, and conversations are tracked by the latest response ID. The Responses API also lets you dynamically choose the model, vector store, files, MCP servers, and more on each inference call, enabling more complex workflows. Agents require a static configuration for these components at the start of the session.
Today the Agents and Responses APIs can be used independently, depending on the use case. But it is also productive to treat the APIs as complementary. It is not currently supported, but it is planned for the LLS Agents API to alternatively use the Responses API as its backend instead of the default Chat Completions API, i.e., enabling a combination of the safety features of Agents with the dynamic configuration and branching capabilities of Responses.
| Feature | LLS Agents API | OpenAI Responses API |
|---------|------------|---------------------|
| **Conversation Management** | Linear persistent sessions | Can branch from any previous response ID |
| **Input/Output Safety Shields** | Supported | Not yet supported |
| **Per-call Flexibility** | Static per-session configuration | Dynamic per-call configuration |
## Use Case Example: Research with Multiple Search Methods
Let's compare how both APIs handle a research task where we need to:
1. Search for current information and examples
2. Access different information sources dynamically
3. Continue the conversation based on search results
### Agents API: Session-based configuration with safety shields
```python
# Create agent with static session configuration
agent = Agent(
client,
model="Llama3.2-3B-Instruct",
instructions="You are a helpful coding assistant",
tools=[
{
"name": "builtin::rag/knowledge_search",
"args": {"vector_db_ids": ["code_docs"]},
},
"builtin::code_interpreter",
],
input_shields=["llama_guard"],
output_shields=["llama_guard"],
)
session_id = agent.create_session("code_session")
# First turn: Search and execute
response1 = agent.create_turn(
messages=[
{
"role": "user",
"content": "Find examples of sorting algorithms and run a bubble sort on [3,1,4,1,5]",
},
],
session_id=session_id,
)
# Continue conversation in same session
response2 = agent.create_turn(
messages=[
{
"role": "user",
"content": "Now optimize that code and test it with a larger dataset",
},
],
session_id=session_id, # Same session, maintains full context
)
# Agents API benefits:
# ✅ Safety shields protect against malicious code execution
# ✅ Session maintains context between code executions
# ✅ Consistent tool configuration throughout conversation
print(f"First result: {response1.output_message.content}")
print(f"Optimization: {response2.output_message.content}")
```
### Responses API: Dynamic per-call configuration with branching
```python
# First response: Use web search for latest algorithms
response1 = client.responses.create(
model="Llama3.2-3B-Instruct",
input="Search for the latest efficient sorting algorithms and their performance comparisons",
tools=[
{
"type": "web_search",
},
], # Web search for current information
)
# Continue conversation: Switch to file search for local docs
response2 = client.responses.create(
model="Llama3.2-1B-Instruct", # Switch to faster model
input="Now search my uploaded files for existing sorting implementations",
tools=[
{ # Using Responses API built-in tools
"type": "file_search",
"vector_store_ids": ["vs_abc123"], # Vector store containing uploaded files
},
],
previous_response_id=response1.id,
)
# Branch from first response: Try different search approach
response3 = client.responses.create(
model="Llama3.2-3B-Instruct",
input="Instead, search the web for Python-specific sorting best practices",
tools=[{"type": "web_search"}], # Different web search query
previous_response_id=response1.id, # Branch from response1
)
# Responses API benefits:
# ✅ Dynamic tool switching (web search ↔ file search per call)
# ✅ OpenAI-compatible tool patterns (web_search, file_search)
# ✅ Branch conversations to explore different information sources
# ✅ Model flexibility per search type
print(f"Web search results: {response1.output_message.content}")
print(f"File search results: {response2.output_message.content}")
print(f"Alternative web search: {response3.output_message.content}")
```
Both APIs demonstrate distinct strengths that make them valuable on their own for different scenarios. The Agents API excels at providing structured, safety-conscious workflows with persistent session management, while the Responses API offers flexibility through dynamic configuration and OpenAI-compatible tool patterns.
## Use Case Examples
### 1. **Research and Analysis with Safety Controls**
**Best Choice: Agents API**
**Scenario:** You're building a research assistant for a financial institution that needs to analyze market data, execute code to process financial models, and search through internal compliance documents. The system must ensure all interactions are logged for regulatory compliance and protected by safety shields to prevent malicious code execution or data leaks.
**Why Agents API?** The Agents API provides persistent session management for iterative research workflows, built-in safety shields to protect against malicious code in financial models, and structured execution logs (session/turn/step) required for regulatory compliance. The static tool configuration ensures consistent access to your knowledge base and code interpreter throughout the entire research session.
### 2. **Dynamic Information Gathering with Branching Exploration**
**Best Choice: Responses API**
**Scenario:** You're building a competitive intelligence tool that helps businesses research market trends. Users need to dynamically switch between web search for current market data and file search through uploaded industry reports. They also want to branch conversations to explore different market segments simultaneously and experiment with different models for various analysis types.
**Why Responses API?** The Responses API's branching capability lets users explore multiple market segments from any research point. Dynamic per-call configuration allows switching between web search and file search as needed, while experimenting with different models (faster models for quick searches, more powerful models for deep analysis). The OpenAI-compatible tool patterns make integration straightforward.
### 3. **OpenAI Migration with Advanced Tool Capabilities**
**Best Choice: Responses API**
**Scenario:** You have an existing application built with OpenAI's Assistants API that uses file search and web search capabilities. You want to migrate to Llama Stack for better performance and cost control while maintaining the same tool calling patterns and adding new capabilities like dynamic vector store selection.
**Why Responses API?** The Responses API provides full OpenAI tool compatibility (`web_search`, `file_search`) with identical syntax, making migration seamless. The dynamic per-call configuration enables advanced features like switching vector stores per query or changing models based on query complexity - capabilities that extend beyond basic OpenAI functionality while maintaining compatibility.
### 4. **Educational Programming Tutor**
**Best Choice: Agents API**
**Scenario:** You're building a programming tutor that maintains student context across multiple sessions, safely executes code exercises, and tracks learning progress with audit trails for educators.
**Why Agents API?** Persistent sessions remember student progress across multiple interactions, safety shields prevent malicious code execution while allowing legitimate programming exercises, and structured execution logs help educators track learning patterns.
### 5. **Advanced Software Debugging Assistant**
**Best Choice: Agents API with Responses Backend**
**Scenario:** You're building a debugging assistant that helps developers troubleshoot complex issues. It needs to maintain context throughout a debugging session, safely execute diagnostic code, switch between different analysis tools dynamically, and branch conversations to explore multiple potential causes simultaneously.
**Why Agents + Responses?** The Agent provides safety shields for code execution and session management for the overall debugging workflow. The underlying Responses API enables dynamic model selection and flexible tool configuration per query, while branching lets you explore different theories (memory leak vs. concurrency issue) from the same debugging point and compare results.
> **Note:** The ability to use Responses API as the backend for Agents is not yet implemented but is planned for a future release. Currently, Agents use Chat Completions API as their backend by default.
## For More Information
- **LLS Agents API**: For detailed information on creating and managing agents, see the [Agents documentation](https://llama-stack.readthedocs.io/en/latest/building_applications/agent.html)
- **OpenAI Responses API**: For information on using the OpenAI-compatible responses API, see the [OpenAI API documentation](https://platform.openai.com/docs/api-reference/responses)
- **Chat Completions API**: For the default backend API used by Agents, see the [Chat Completions providers documentation](https://llama-stack.readthedocs.io/en/latest/providers/index.html#chat-completions)
- **Agent Execution Loop**: For understanding how agents process turns and steps in their execution, see the [Agent Execution Loop documentation](https://llama-stack.readthedocs.io/en/latest/building_applications/agent_execution_loop.html)

View file

@ -10,9 +10,11 @@ A Llama Stack API is described as a collection of REST endpoints. We currently s
- **Eval**: generate outputs (via Inference or Agents) and perform scoring
- **VectorIO**: perform operations on vector stores, such as adding documents, searching, and deleting documents
- **Telemetry**: collect telemetry data from the system
- **Post Training**: fine-tune a model
- **Tool Runtime**: interact with various tools and protocols
- **Responses**: generate responses from an LLM using this OpenAI compatible API.
We are working on adding a few more APIs to complete the application lifecycle. These will include:
- **Batch Inference**: run inference on a dataset of inputs
- **Batch Agents**: run agents on a dataset of inputs
- **Post Training**: fine-tune a model
- **Synthetic Data Generation**: generate synthetic data for model development

View file

@ -1,31 +1,39 @@
# Why Llama Stack?
## Llama Stack architecture
Building production AI applications today requires solving multiple challenges:
**Infrastructure Complexity**
- Running large language models efficiently requires specialized infrastructure.
- Different deployment scenarios (local development, cloud, edge) need different solutions.
- Moving from development to production often requires significant rework.
**Essential Capabilities**
- Safety guardrails and content filtering are necessary in an enterprise setting.
- Just model inference is not enough - Knowledge retrieval and RAG capabilities are required.
- Nearly any application needs composable multi-step workflows.
- Finally, without monitoring, observability and evaluation, you end up operating in the dark.
**Lack of Flexibility and Choice**
- Directly integrating with multiple providers creates tight coupling.
- Different providers have different APIs and abstractions.
- Changing providers requires significant code changes.
### Our Solution: A Universal Stack
Llama Stack allows you to build different layers of distributions for your AI workloads using various SDKs and API providers.
```{image} ../../_static/llama-stack.png
:alt: Llama Stack
:width: 400px
```
### Benefits of Llama Stack
#### Current challenges in custom AI applications
Building production AI applications today requires solving multiple challenges:
**Infrastructure Complexity**
- Running large language models efficiently requires specialized infrastructure.
- Different deployment scenarios (local development, cloud, edge) need different solutions.
- Moving from development to production often requires significant rework.
**Essential Capabilities**
- Safety guardrails and content filtering are necessary in an enterprise setting.
- Just model inference is not enough - Knowledge retrieval and RAG capabilities are required.
- Nearly any application needs composable multi-step workflows.
- Without monitoring, observability and evaluation, you end up operating in the dark.
**Lack of Flexibility and Choice**
- Directly integrating with multiple providers creates tight coupling.
- Different providers have different APIs and abstractions.
- Changing providers requires significant code changes.
#### Our Solution: A Universal Stack
Llama Stack addresses these challenges through a service-oriented, API-first approach:
**Develop Anywhere, Deploy Everywhere**
@ -59,4 +67,4 @@ Llama Stack addresses these challenges through a service-oriented, API-first app
- **Turnkey Solutions**: Easy-to-deploy, built-in solutions for popular deployment scenarios
With Llama Stack, you can focus on building your application while we handle the infrastructure complexity, essential capabilities, and provider integrations.
With Llama Stack, you can focus on building your application while we handle the infrastructure complexity, essential capabilities, and provider integrations.

View file

@ -2,6 +2,10 @@
Given Llama Stack's service-oriented philosophy, a few concepts and workflows arise which may not feel completely natural in the LLM landscape, especially if you are coming with a background in other frameworks.
```{include} architecture.md
:start-after: ## Llama Stack architecture
```
```{include} apis.md
:start-after: ## APIs
```
@ -10,14 +14,10 @@ Given Llama Stack's service-oriented philosophy, a few concepts and workflows ar
:start-after: ## API Providers
```
```{include} resources.md
:start-after: ## Resources
```
```{include} distributions.md
:start-after: ## Distributions
```
```{include} evaluation_concepts.md
:start-after: ## Evaluation Concepts
```{include} resources.md
:start-after: ## Resources
```

View file

@ -52,7 +52,18 @@ extensions = [
"sphinxcontrib.redoc",
"sphinxcontrib.mermaid",
"sphinxcontrib.video",
"sphinx_reredirects"
]
redirects = {
"providers/post_training/index": "../../advanced_apis/post_training/index.html",
"providers/eval/index": "../../advanced_apis/eval/index.html",
"providers/scoring/index": "../../advanced_apis/scoring/index.html",
"playground/index": "../../building_applications/playground/index.html",
"openai/index": "../../providers/index.html#openai-api-compatibility",
"introduction/index": "../concepts/index.html#llama-stack-architecture"
}
myst_enable_extensions = ["colon_fence"]
html_theme = "sphinx_rtd_theme"

View file

@ -11,4 +11,5 @@ See the [Adding a New API Provider](new_api_provider.md) which describes how to
:hidden:
new_api_provider
testing
```

View file

@ -6,7 +6,7 @@ This guide will walk you through the process of adding a new API provider to Lla
- Begin by reviewing the [core concepts](../concepts/index.md) of Llama Stack and choose the API your provider belongs to (Inference, Safety, VectorIO, etc.)
- Determine the provider type ({repopath}`Remote::llama_stack/providers/remote` or {repopath}`Inline::llama_stack/providers/inline`). Remote providers make requests to external services, while inline providers execute implementation locally.
- Add your provider to the appropriate {repopath}`Registry::llama_stack/providers/registry/`. Specify pip dependencies necessary.
- Update any distribution {repopath}`Templates::llama_stack/templates/` `build.yaml` and `run.yaml` files if they should include your provider by default. Run {repopath}`./scripts/distro_codegen.py` if necessary. Note that `distro_codegen.py` will fail if the new provider causes any distribution template to attempt to import provider-specific dependencies. This usually means the distribution's `get_distribution_template()` code path should only import any necessary Config or model alias definitions from each provider and not the provider's actual implementation.
- Update any distribution {repopath}`Templates::llama_stack/distributions/` `build.yaml` and `run.yaml` files if they should include your provider by default. Run {repopath}`./scripts/distro_codegen.py` if necessary. Note that `distro_codegen.py` will fail if the new provider causes any distribution template to attempt to import provider-specific dependencies. This usually means the distribution's `get_distribution_template()` code path should only import any necessary Config or model alias definitions from each provider and not the provider's actual implementation.
Here are some example PRs to help you get started:
@ -14,10 +14,45 @@ Here are some example PRs to help you get started:
- [Nvidia Inference Implementation](https://github.com/meta-llama/llama-stack/pull/355)
- [Model context protocol Tool Runtime](https://github.com/meta-llama/llama-stack/pull/665)
## Inference Provider Patterns
When implementing Inference providers for OpenAI-compatible APIs, Llama Stack provides several mixin classes to simplify development and ensure consistent behavior across providers.
### OpenAIMixin
The `OpenAIMixin` class provides direct OpenAI API functionality for providers that work with OpenAI-compatible endpoints. It includes:
#### Direct API Methods
- **`openai_completion()`**: Legacy text completion API with full parameter support
- **`openai_chat_completion()`**: Chat completion API supporting streaming, tools, and function calling
- **`openai_embeddings()`**: Text embeddings generation with customizable encoding and dimensions
#### Model Management
- **`check_model_availability()`**: Queries the API endpoint to verify if a model exists and is accessible
#### Client Management
- **`client` property**: Automatically creates and configures AsyncOpenAI client instances using your provider's credentials
#### Required Implementation
To use `OpenAIMixin`, your provider must implement these abstract methods:
```python
@abstractmethod
def get_api_key(self) -> str:
"""Return the API key for authentication"""
pass
@abstractmethod
def get_base_url(self) -> str:
"""Return the OpenAI-compatible API base URL"""
pass
```
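For illustration, here is a minimal sketch of a provider adapter built on `OpenAIMixin`. The adapter and config names are hypothetical, and the import path is an assumption based on recent releases:
```python
# Hypothetical adapter for an OpenAI-compatible endpoint; class and config names
# are illustrative only, and the OpenAIMixin import path is assumed.
from pydantic import BaseModel

from llama_stack.providers.utils.inference.openai_mixin import OpenAIMixin


class AcmeImplConfig(BaseModel):
    api_key: str
    url: str = "https://api.acme.example/v1"


class AcmeInferenceAdapter(OpenAIMixin):
    """Inference adapter for a hypothetical OpenAI-compatible service."""

    def __init__(self, config: AcmeImplConfig) -> None:
        self.config = config

    def get_api_key(self) -> str:
        # Credentials typically come from provider config or request provider_data
        return self.config.api_key

    def get_base_url(self) -> str:
        return self.config.url
```
With these two methods defined, the mixin's `client` property and the `openai_*` methods described above can be used as documented.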
## Testing the Provider
Before running tests, you must have required dependencies installed. This depends on the providers or distributions you are testing. For example, if you are testing the `together` distribution, you should install dependencies via `llama stack build --template together`.
Before running tests, you must have required dependencies installed. This depends on the providers or distributions you are testing. For example, if you are testing the `together` distribution, you should install dependencies via `llama stack build --distro together`.
### 1. Integration Testing

View file

@ -0,0 +1,4 @@
# Deployment Examples
```{include} kubernetes_deployment.md
```

View file

@ -1,4 +1,4 @@
# Kubernetes Deployment Guide
## Kubernetes Deployment Guide
Instead of starting the Llama Stack and vLLM servers locally, we can deploy them in a Kubernetes cluster.
@ -174,7 +174,7 @@ spec:
- name: llama-stack
image: localhost/llama-stack-run-k8s:latest
imagePullPolicy: IfNotPresent
command: ["python", "-m", "llama_stack.distribution.server.server", "--config", "/app/config.yaml"]
command: ["python", "-m", "llama_stack.core.server.server", "--config", "/app/config.yaml"]
ports:
- containerPort: 5000
volumeMounts:
@ -222,10 +222,21 @@ llama-stack-client --endpoint http://localhost:5000 inference chat-completion --
## Deploying Llama Stack Server in AWS EKS
We've also provided a script to deploy the Llama Stack server in an AWS EKS cluster. Once you have an [EKS cluster](https://docs.aws.amazon.com/eks/latest/userguide/getting-started.html), you can run the following script to deploy the Llama Stack server.
We've also provided a script to deploy the Llama Stack server in an AWS EKS cluster.
Prerequisites:
- Set up an [EKS cluster](https://docs.aws.amazon.com/eks/latest/userguide/getting-started.html).
- Create a [GitHub OAuth app](https://docs.github.com/en/apps/oauth-apps/building-oauth-apps/creating-an-oauth-app) and get the client ID and client secret.
- Set the `Authorization callback URL` to `http://<your-llama-stack-ui-url>/api/auth/callback/`
Run the following script to deploy the Llama Stack server:
```
export HF_TOKEN=<your-huggingface-token>
export GITHUB_CLIENT_ID=<your-github-client-id>
export GITHUB_CLIENT_SECRET=<your-github-client-secret>
export LLAMA_STACK_UI_URL=<your-llama-stack-ui-url>
cd docs/source/distributions/eks
./apply.sh
```

View file

@ -47,30 +47,37 @@ pip install -e .
```
Use the CLI to build your distribution.
The main points to consider are:
1. **Image Type** - Do you want a Conda / venv environment or a Container (eg. Docker)
1. **Image Type** - Do you want a venv environment or a Container (e.g., Docker)?
2. **Template** - Do you want to use a template to build your distribution, or start from scratch?
3. **Config** - Do you want to use a pre-existing config file to build your distribution?
```
llama stack build -h
usage: llama stack build [-h] [--config CONFIG] [--template TEMPLATE] [--list-templates] [--image-type {conda,container,venv}] [--image-name IMAGE_NAME] [--print-deps-only] [--run]
usage: llama stack build [-h] [--config CONFIG] [--template TEMPLATE] [--distro DISTRIBUTION] [--list-distros] [--image-type {container,venv}] [--image-name IMAGE_NAME] [--print-deps-only]
[--run] [--providers PROVIDERS]
Build a Llama stack container
options:
-h, --help show this help message and exit
--config CONFIG Path to a config file to use for the build. You can find example configs in llama_stack/distributions/**/build.yaml. If this argument is not provided, you will
be prompted to enter information interactively (default: None)
--template TEMPLATE Name of the example template config to use for build. You may use `llama stack build --list-templates` to check out the available templates (default: None)
--list-templates Show the available templates for building a Llama Stack distribution (default: False)
--image-type {conda,container,venv}
--config CONFIG Path to a config file to use for the build. You can find example configs in llama_stack.cores/**/build.yaml. If this argument is not provided, you will be prompted to
enter information interactively (default: None)
--template TEMPLATE (deprecated) Name of the example template config to use for build. You may use `llama stack build --list-distros` to check out the available distributions (default:
None)
--distro DISTRIBUTION, --distribution DISTRIBUTION
Name of the distribution to use for build. You may use `llama stack build --list-distros` to check out the available distributions (default: None)
--list-distros, --list-distributions
Show the available distributions for building a Llama Stack distribution (default: False)
--image-type {container,venv}
Image Type to use for the build. If not specified, will use the image type from the template config. (default: None)
--image-name IMAGE_NAME
[for image-type=conda|container|venv] Name of the conda or virtual environment to use for the build. If not specified, currently active environment will be used if
found. (default: None)
[for image-type=container|venv] Name of the virtual environment to use for the build. If not specified, currently active environment will be used if found. (default:
None)
--print-deps-only Print the dependencies for the stack only, without building the stack (default: False)
--run Run the stack after building using the same image type, name, and other applicable arguments (default: False)
--providers PROVIDERS
Build a config for a list of providers and only those providers. This list is formatted like: api1=provider1,api2=provider2. Where there can be multiple providers per
API. (default: None)
```
After this step is complete, a file named `<name>-build.yaml` and template file `<name>-run.yaml` will be generated and saved at the output file path specified at the end of the command.
@ -141,10 +148,14 @@ You may then pick a template to build your distribution with providers fitted to
For example, to build a distribution with TGI as the inference provider, you can run:
```
$ llama stack build --template starter
$ llama stack build --distro starter
...
You can now edit ~/.llama/distributions/llamastack-starter/starter-run.yaml and run `llama stack run ~/.llama/distributions/llamastack-starter/starter-run.yaml`
```
```{tip}
The generated `run.yaml` file is a starting point for your configuration. For comprehensive guidance on customizing it for your specific needs, infrastructure, and deployment scenarios, see [Customizing Your run.yaml Configuration](customizing_run_yaml.md).
```
:::
:::{tab-item} Building from Scratch
@ -155,7 +166,7 @@ It would be best to start with a template and understand the structure of the co
llama stack build
> Enter a name for your Llama Stack (e.g. my-local-stack): my-stack
> Enter the image type you want your Llama Stack to be built as (container or conda or venv): conda
> Enter the image type you want your Llama Stack to be built as (container or venv): venv
Llama Stack is composed of several APIs working together. Let's select
the provider types (implementations) you want to use for these APIs.
@ -180,10 +191,10 @@ You can now edit ~/.llama/distributions/llamastack-my-local-stack/my-local-stack
:::{tab-item} Building from a pre-existing build config file
- In addition to templates, you may customize the build to your liking through editing config files and build from config files with the following command.
- The config file will be of contents like the ones in `llama_stack/templates/*build.yaml`.
- The config file will have contents like the ones in `llama_stack/distributions/*build.yaml`.
```
llama stack build --config llama_stack/templates/starter/build.yaml
llama stack build --config llama_stack/distributions/starter/build.yaml
```
:::
@ -249,11 +260,11 @@ Podman is supported as an alternative to Docker. Set `CONTAINER_BINARY` to `podm
To build a container image, you may start off from a template and use the `--image-type container` flag to specify `container` as the build image type.
```
llama stack build --template starter --image-type container
llama stack build --distro starter --image-type container
```
```
$ llama stack build --template starter --image-type container
$ llama stack build --distro starter --image-type container
...
Containerfile created successfully in /tmp/tmp.viA3a3Rdsg/Containerfile
FROM python:3.10-slim
...
@ -308,7 +319,7 @@ Now, let's start the Llama Stack Distribution Server. You will need the YAML con
```
llama stack run -h
usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--env KEY=VALUE]
[--image-type {conda,venv}] [--enable-ui]
[--image-type {venv}] [--enable-ui]
[config | template]
Start the server for a Llama Stack Distribution. You should have already built (or downloaded) and configured the distribution.
@ -322,8 +333,8 @@ options:
--image-name IMAGE_NAME
Name of the image to run. Defaults to the current environment (default: None)
--env KEY=VALUE Environment variables to pass to the server in KEY=VALUE format. Can be specified multiple times. (default: None)
--image-type {conda,venv}
Image Type used during the build. This can be either conda or venv. (default: None)
--image-type {venv}
Image Type used during the build. This should be venv. (default: None)
--enable-ui Start the UI server (default: False)
```
@ -338,9 +349,6 @@ llama stack run ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-
# Start using a venv
llama stack run --image-type venv ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml
# Start using a conda environment
llama stack run --image-type conda ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml
```
```

View file

@ -2,11 +2,14 @@
The Llama Stack runtime configuration is specified as a YAML file. Here is a simplified version of an example configuration file for the Ollama distribution:
```{note}
The default `run.yaml` files generated by templates are starting points for your configuration. For guidance on customizing these files for your specific needs, see [Customizing Your run.yaml Configuration](customizing_run_yaml.md).
```
```{dropdown} 👋 Click here for a Sample Configuration File
```yaml
version: 2
conda_env: ollama
apis:
- agents
- inference
@ -381,6 +384,166 @@ And must respond with:
If no access attributes are returned, the token is used as a namespace.
### Access control
When authentication is enabled, access to resources is controlled
through the `access_policy` attribute of the auth config section under
server. The value for this is a list of access rules.
Each access rule defines a list of actions either to permit or to
forbid. It may specify a principal or a resource that must match for
the rule to take effect.
Valid actions are create, read, update, and delete. The resource to
match should be specified in the form of a type qualified identifier,
e.g. model::my-model or vector_db::some-db, or a wildcard for all
resources of a type, e.g. model::*. If the principal or resource are
not specified, they will match all requests.
The valid resource types are model, shield, vector_db, dataset,
scoring_function, benchmark, tool, tool_group and session.
A rule may also specify a condition, either a 'when' or an 'unless',
with additional constraints as to where the rule applies. The
constraints supported at present are:
- 'user with <attr-value> in <attr-name>'
- 'user with <attr-value> not in <attr-name>'
- 'user is owner'
- 'user is not owner'
- 'user in owners <attr-name>'
- 'user not in owners <attr-name>'
The attributes defined for a user will depend on how the auth
configuration is defined.
When checking whether a particular action is allowed by the current
user for a resource, all the defined rules are tested in order to find
a match. If a match is found, the request is permitted or forbidden
depending on the type of rule. If no match is found, the request is
denied.
If no explicit rules are specified, a default policy is defined with
which all users can access all resources defined in config but
resources created dynamically can only be accessed by the user that
created them.
Examples:
The following restricts access to particular github users:
```yaml
server:
auth:
provider_config:
type: "github_token"
github_api_base_url: "https://api.github.com"
access_policy:
- permit:
principal: user-1
actions: [create, read, delete]
description: user-1 has full access to all resources
- permit:
principal: user-2
actions: [read]
resource: model::model-1
description: user-2 has read access to model-1 only
```
Similarly, the following restricts access to particular kubernetes
service accounts:
```yaml
server:
auth:
provider_config:
type: "oauth2_token"
audience: https://kubernetes.default.svc.cluster.local
issuer: https://kubernetes.default.svc.cluster.local
tls_cafile: /home/gsim/.minikube/ca.crt
jwks:
uri: https://kubernetes.default.svc.cluster.local:8443/openid/v1/jwks
token: ${env.TOKEN}
access_policy:
- permit:
principal: system:serviceaccount:my-namespace:my-serviceaccount
actions: [create, read, delete]
description: specific serviceaccount has full access to all resources
- permit:
principal: system:serviceaccount:default:default
actions: [read]
resource: model::model-1
description: default account has read access to model-1 only
```
The following policy, which assumes that users are defined with roles
and teams by whichever authentication system is in use, allows any
user with a valid token to use models, create resources other than
models, read and delete resources they created, and read resources
created by users sharing a team with them:
```
access_policy:
- permit:
actions: [read]
resource: model::*
description: all users have read access to models
- forbid:
actions: [create, delete]
resource: model::*
unless: user with admin in roles
description: only user with admin role can create or delete models
- permit:
actions: [create, read, delete]
when: user is owner
description: users can create resources other than models and read and delete those they own
- permit:
actions: [read]
when: user in owner teams
description: any user has read access to any resource created by a user with the same team
```
#### API Endpoint Authorization with Scopes
In addition to resource-based access control, Llama Stack supports endpoint-level authorization using OAuth 2.0 style scopes. When authentication is enabled, specific API endpoints require users to have particular scopes in their authentication token.
**Scope-Gated APIs:**
The following APIs are currently gated by scopes:
- **Telemetry API** (scope: `telemetry.read`):
- `POST /telemetry/traces` - Query traces
- `GET /telemetry/traces/{trace_id}` - Get trace by ID
- `GET /telemetry/traces/{trace_id}/spans/{span_id}` - Get span by ID
- `POST /telemetry/spans/{span_id}/tree` - Get span tree
- `POST /telemetry/spans` - Query spans
- `POST /telemetry/metrics/{metric_name}` - Query metrics
**Authentication Configuration:**
For **JWT/OAuth2 providers**, scopes should be included in the JWT's claims:
```json
{
"sub": "user123",
"scope": "telemetry.read",
"aud": "llama-stack"
}
```
For **custom authentication providers**, the endpoint must return user attributes including the `scopes` array:
```json
{
"principal": "user123",
"attributes": {
"scopes": ["telemetry.read"]
}
}
```
**Behavior:**
- Users without the required scope receive a 403 Forbidden response
- When authentication is disabled, scope checks are bypassed
- Endpoints without `required_scope` work normally for all authenticated users
### Quota Configuration
The `quota` section allows you to enable server-side request throttling for both

View file

@ -0,0 +1,40 @@
# Customizing run.yaml Files
The `run.yaml` files generated by Llama Stack templates are **starting points** designed to be customized for your specific needs. They are not meant to be used as-is in production environments.
## Key Points
- **Templates are starting points**: Generated `run.yaml` files contain defaults for development/testing
- **Customization expected**: Update URLs, credentials, models, and settings for your environment
- **Version control separately**: Keep customized configs in your own repository
- **Environment-specific**: Create different configurations for dev, staging, production
## What You Can Customize
You can customize:
- **Provider endpoints**: Change `http://localhost:8000` to your actual servers
- **Swap providers**: Replace default providers (e.g., swap Tavily with Brave for search)
- **Storage paths**: Move from `/tmp/` to production directories
- **Authentication**: Add API keys, SSL, timeouts
- **Models**: Different model sizes for dev vs prod
- **Database settings**: Switch from SQLite to PostgreSQL
- **Tool configurations**: Add custom tools and integrations
## Best Practices
- Use environment variables for secrets and environment-specific values
- Create separate `run.yaml` files for different environments (dev, staging, prod)
- Document your changes with comments
- Test configurations before deployment
- Keep your customized configs in version control
Example structure:
```
your-project/
├── configs/
│ ├── dev-run.yaml
│ ├── prod-run.yaml
└── README.md
```
The goal is to take the generated template and adapt it to your specific infrastructure and operational needs.

View file

@ -6,14 +6,14 @@ This avoids the overhead of setting up a server.
```bash
# setup
uv pip install llama-stack
llama stack build --template starter --image-type venv
llama stack build --distro starter --image-type venv
```
```python
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
from llama_stack.core.library_client import LlamaStackAsLibraryClient
client = LlamaStackAsLibraryClient(
"ollama",
"starter",
# provider_data is optional, but if you need to pass in any provider specific data, you can do so here.
provider_data={"tavily_search_api_key": os.environ["TAVILY_SEARCH_API_KEY"]},
)
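# A hedged follow-up sketch: initialize the library client and list the models it
# exposes. `initialize()` and `models.list()` are assumed to be available on the
# client, as in recent llama-stack / llama-stack-client releases.
client.initialize()

for model in client.models.list():
    print(model.identifier)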

View file

@ -6,13 +6,10 @@ This section provides an overview of the distributions available in Llama Stack.
```{toctree}
:maxdepth: 3
list_of_distributions
building_distro
customizing_run_yaml
starting_llama_stack_server
importing_as_library
configuration
list_of_distributions
kubernetes_deployment
building_distro
on_device_distro
remote_hosted_distro
self_hosted_distro
```

View file

@ -21,6 +21,24 @@ else
exit 1
fi
if [ -z "${GITHUB_CLIENT_ID:-}" ]; then
echo "ERROR: GITHUB_CLIENT_ID not set. You need it for Github login to work. Refer to https://llama-stack.readthedocs.io/en/latest/deploying/index.html#kubernetes-deployment-guide"
exit 1
fi
if [ -z "${GITHUB_CLIENT_SECRET:-}" ]; then
echo "ERROR: GITHUB_CLIENT_SECRET not set. You need it for Github login to work. Refer to https://llama-stack.readthedocs.io/en/latest/deploying/index.html#kubernetes-deployment-guide"
exit 1
fi
if [ -z "${LLAMA_STACK_UI_URL:-}" ]; then
echo "ERROR: LLAMA_STACK_UI_URL not set. Should be set to the external URL of the UI (excluding port). You need it for Github login to work. Refer to https://llama-stack.readthedocs.io/en/latest/deploying/index.html#kubernetes-deployment-guide"
exit 1
fi
set -euo pipefail
set -x

View file

@ -34,6 +34,13 @@ data:
provider_type: remote::chromadb
config:
url: ${env.CHROMADB_URL:=}
kvstore:
type: postgres
host: ${env.POSTGRES_HOST:=localhost}
port: ${env.POSTGRES_PORT:=5432}
db: ${env.POSTGRES_DB:=llamastack}
user: ${env.POSTGRES_USER:=llamastack}
password: ${env.POSTGRES_PASSWORD:=llamastack}
safety:
- provider_id: llama-guard
provider_type: inline::llama-guard
@ -122,6 +129,9 @@ data:
provider_id: rag-runtime
server:
port: 8321
auth:
provider_config:
type: github_token
kind: ConfigMap
metadata:
creationTimestamp: null

View file

@ -27,7 +27,7 @@ spec:
spec:
containers:
- name: llama-stack
image: llamastack/distribution-remote-vllm:latest
image: llamastack/distribution-starter:latest
imagePullPolicy: Always # since we have specified latest instead of a version
env:
- name: ENABLE_CHROMADB
@ -52,7 +52,7 @@ spec:
value: "${SAFETY_MODEL}"
- name: TAVILY_SEARCH_API_KEY
value: "${TAVILY_SEARCH_API_KEY}"
command: ["python", "-m", "llama_stack.distribution.server.server", "--config", "/etc/config/stack_run_config.yaml", "--port", "8321"]
command: ["python", "-m", "llama_stack.core.server.server", "--config", "/etc/config/stack_run_config.yaml", "--port", "8321"]
ports:
- containerPort: 8321
volumeMounts:

View file

@ -31,6 +31,13 @@ providers:
provider_type: remote::chromadb
config:
url: ${env.CHROMADB_URL:=}
kvstore:
type: postgres
host: ${env.POSTGRES_HOST:=localhost}
port: ${env.POSTGRES_PORT:=5432}
db: ${env.POSTGRES_DB:=llamastack}
user: ${env.POSTGRES_USER:=llamastack}
password: ${env.POSTGRES_PASSWORD:=llamastack}
safety:
- provider_id: llama-guard
provider_type: inline::llama-guard
@ -119,3 +126,6 @@ tool_groups:
provider_id: rag-runtime
server:
port: 8321
auth:
provider_config:
type: github_token

View file

@ -26,6 +26,12 @@ spec:
value: "http://llama-stack-service:8321"
- name: LLAMA_STACK_UI_PORT
value: "8322"
- name: GITHUB_CLIENT_ID
value: "${GITHUB_CLIENT_ID}"
- name: GITHUB_CLIENT_SECRET
value: "${GITHUB_CLIENT_SECRET}"
- name: NEXTAUTH_URL
value: "${LLAMA_STACK_UI_URL}:8322"
args:
- -c
- |

View file

@ -56,12 +56,12 @@ Breaking down the demo app, this section will show the core pieces that are used
### Setup Remote Inferencing
Start a Llama Stack server on localhost. Here is an example of how you can do this using the fireworks.ai distribution:
```
conda create -n stack-fireworks python=3.10
conda activate stack-fireworks
uv venv starter --python 3.12
source starter/bin/activate # On Windows: starter\Scripts\activate
pip install --no-cache llama-stack==0.2.2
llama stack build --template fireworks --image-type conda
llama stack build --distro starter --image-type venv
export FIREWORKS_API_KEY=<SOME_KEY>
llama stack run fireworks --port 5050
llama stack run starter --port 5050
```
Ensure the Llama Stack server version is the same as the Kotlin SDK Library for maximum compatibility.

View file

@ -57,7 +57,7 @@ Make sure you have access to a watsonx API Key. You can get one by referring [wa
## Running Llama Stack with watsonx
You can do this via Conda (build code), venv or Docker which has a pre-built image.
You can do this via venv or Docker which has a pre-built image.
### Via Docker
@ -76,13 +76,3 @@ docker run \
--env WATSONX_PROJECT_ID=$WATSONX_PROJECT_ID \
--env WATSONX_BASE_URL=$WATSONX_BASE_URL
```
### Via Conda
```bash
llama stack build --template watsonx --image-type conda
llama stack run ./run.yaml \
--port $LLAMA_STACK_PORT \
--env WATSONX_API_KEY=$WATSONX_API_KEY \
--env WATSONX_PROJECT_ID=$WATSONX_PROJECT_ID
```

View file

@ -114,7 +114,7 @@ podman run --rm -it \
## Running Llama Stack
Now you are ready to run Llama Stack with TGI as the inference provider. You can do this via Conda (build code) or Docker which has a pre-built image.
Now you are ready to run Llama Stack with TGI as the inference provider. You can do this via venv or Docker which has a pre-built image.
### Via Docker
@ -153,7 +153,7 @@ docker run \
--pull always \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v $HOME/.llama:/root/.llama \
-v ./llama_stack/templates/tgi/run-with-safety.yaml:/root/my-run.yaml \
-v ./llama_stack/distributions/tgi/run-with-safety.yaml:/root/my-run.yaml \
llamastack/distribution-dell \
--config /root/my-run.yaml \
--port $LLAMA_STACK_PORT \
@ -164,12 +164,12 @@ docker run \
--env CHROMA_URL=$CHROMA_URL
```
### Via Conda
### Via venv
Make sure you have done `pip install llama-stack` and have the Llama Stack CLI available.
```bash
llama stack build --template dell --image-type conda
llama stack build --distro dell --image-type venv
llama stack run dell
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \

View file

@ -70,7 +70,7 @@ $ llama model list --downloaded
## Running the Distribution
You can do this via Conda (build code) or Docker which has a pre-built image.
You can do this via venv or Docker which has a pre-built image.
### Via Docker
@ -104,12 +104,12 @@ docker run \
--env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
```
### Via Conda
### Via venv
Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.
```bash
llama stack build --template meta-reference-gpu --image-type conda
llama stack build --distro meta-reference-gpu --image-type venv
llama stack run distributions/meta-reference-gpu/run.yaml \
--port 8321 \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct

View file

@ -1,3 +1,6 @@
---
orphan: true
---
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
# NVIDIA Distribution
@ -37,16 +40,16 @@ The following environment variables can be configured:
The following models are available by default:
- `meta/llama3-8b-instruct (aliases: meta-llama/Llama-3-8B-Instruct)`
- `meta/llama3-70b-instruct (aliases: meta-llama/Llama-3-70B-Instruct)`
- `meta/llama-3.1-8b-instruct (aliases: meta-llama/Llama-3.1-8B-Instruct)`
- `meta/llama-3.1-70b-instruct (aliases: meta-llama/Llama-3.1-70B-Instruct)`
- `meta/llama-3.1-405b-instruct (aliases: meta-llama/Llama-3.1-405B-Instruct-FP8)`
- `meta/llama-3.2-1b-instruct (aliases: meta-llama/Llama-3.2-1B-Instruct)`
- `meta/llama-3.2-3b-instruct (aliases: meta-llama/Llama-3.2-3B-Instruct)`
- `meta/llama-3.2-11b-vision-instruct (aliases: meta-llama/Llama-3.2-11B-Vision-Instruct)`
- `meta/llama-3.2-90b-vision-instruct (aliases: meta-llama/Llama-3.2-90B-Vision-Instruct)`
- `meta/llama-3.3-70b-instruct (aliases: meta-llama/Llama-3.3-70B-Instruct)`
- `meta/llama3-8b-instruct `
- `meta/llama3-70b-instruct `
- `meta/llama-3.1-8b-instruct `
- `meta/llama-3.1-70b-instruct `
- `meta/llama-3.1-405b-instruct `
- `meta/llama-3.2-1b-instruct `
- `meta/llama-3.2-3b-instruct `
- `meta/llama-3.2-11b-vision-instruct `
- `meta/llama-3.2-90b-vision-instruct `
- `meta/llama-3.3-70b-instruct `
- `nvidia/llama-3.2-nv-embedqa-1b-v2 `
- `nvidia/nv-embedqa-e5-v5 `
- `nvidia/nv-embedqa-mistral-7b-v2 `
@ -130,7 +133,7 @@ curl -X DELETE "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.1-8b-inst
## Running Llama Stack with NVIDIA
You can do this via Conda or venv (build code), or Docker which has a pre-built image.
You can do this via venv (build code), or Docker which has a pre-built image.
### Via Docker
@ -149,24 +152,13 @@ docker run \
--env NVIDIA_API_KEY=$NVIDIA_API_KEY
```
### Via Conda
```bash
INFERENCE_MODEL=meta-llama/Llama-3.1-8b-Instruct
llama stack build --template nvidia --image-type conda
llama stack run ./run.yaml \
--port 8321 \
--env NVIDIA_API_KEY=$NVIDIA_API_KEY \
--env INFERENCE_MODEL=$INFERENCE_MODEL
```
### Via venv
If you've set up your local development environment, you can also build the image using your local virtual environment.
```bash
INFERENCE_MODEL=meta-llama/Llama-3.1-8b-Instruct
llama stack build --template nvidia --image-type venv
INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
llama stack build --distro nvidia --image-type venv
llama stack run ./run.yaml \
--port 8321 \
--env NVIDIA_API_KEY=$NVIDIA_API_KEY \

View file

@ -100,10 +100,6 @@ The following environment variables can be configured:
### Model Configuration
- `INFERENCE_MODEL`: HuggingFace model for serverless inference
- `INFERENCE_ENDPOINT_NAME`: HuggingFace endpoint name
- `OLLAMA_INFERENCE_MODEL`: Ollama model name
- `OLLAMA_EMBEDDING_MODEL`: Ollama embedding model name
- `OLLAMA_EMBEDDING_DIMENSION`: Ollama embedding dimension (default: `384`)
- `VLLM_INFERENCE_MODEL`: vLLM model name
### Vector Database Configuration
- `SQLITE_STORE_DIR`: SQLite store directory (default: `~/.llama/distributions/starter`)
@ -127,47 +123,29 @@ The following environment variables can be configured:
## Enabling Providers
You can enable specific providers by setting their provider ID to a valid value using environment variables. This is useful when you want to use certain providers or don't have the required API keys.
You can enable specific providers by setting appropriate environment variables. For example,
### Examples of Enabling Providers
#### Enable FAISS Vector Provider
```bash
export ENABLE_FAISS=faiss
# self-hosted
export OLLAMA_URL=http://localhost:11434 # enables the Ollama inference provider
export VLLM_URL=http://localhost:8000/v1 # enables the vLLM inference provider
export TGI_URL=http://localhost:8000/v1 # enables the TGI inference provider
# cloud-hosted requiring API key configuration on the server
export CEREBRAS_API_KEY=your_cerebras_api_key # enables the Cerebras inference provider
export NVIDIA_API_KEY=your_nvidia_api_key # enables the NVIDIA inference provider
# vector providers
export MILVUS_URL=http://localhost:19530 # enables the Milvus vector provider
export CHROMADB_URL=http://localhost:8000/v1 # enables the ChromaDB vector provider
export PGVECTOR_DB=llama_stack_db # enables the PGVector vector provider
```
#### Enable Ollama Models
```bash
export ENABLE_OLLAMA=ollama
```
#### Disable vLLM Models
```bash
export VLLM_INFERENCE_MODEL=__disabled__
```
#### Disable Optional Vector Providers
```bash
export ENABLE_SQLITE_VEC=__disabled__
export ENABLE_CHROMADB=__disabled__
export ENABLE_PGVECTOR=__disabled__
```
### Provider ID Patterns
The starter distribution uses several patterns for provider IDs:
1. **Direct provider IDs**: `faiss`, `ollama`, `vllm`
2. **Environment-based provider IDs**: `${env.ENABLE_SQLITE_VEC+sqlite-vec}`
3. **Model-based provider IDs**: `${env.OLLAMA_INFERENCE_MODEL:__disabled__}`
When using the `+` pattern (like `${env.ENABLE_SQLITE_VEC+sqlite-vec}`), the provider is enabled by default and can be disabled by setting the environment variable to `__disabled__`.
When using the `:` pattern (like `${env.OLLAMA_INFERENCE_MODEL:__disabled__}`), the provider is disabled by default and can be enabled by setting the environment variable to a valid value.
This distribution comes with a default "llama-guard" shield that can be enabled by setting the `SAFETY_MODEL` environment variable to point to an appropriate Llama Guard model id. Use `llama-stack-client models list` to see the list of available models.
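For example (a sketch; the model id below is illustrative, matching the safety examples elsewhere in this guide):
```bash
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
llama-stack-client models list   # confirm the model id is available
```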
## Running the Distribution
You can run the starter distribution via Docker or Conda.
You can run the starter distribution via Docker or venv.
### Via Docker
@ -186,17 +164,12 @@ docker run \
--port $LLAMA_STACK_PORT
```
### Via Conda
### Via venv
Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.
Ensure you have configured the starter distribution using the environment variables explained above.
```bash
llama stack build --template starter --image-type conda
llama stack run distributions/starter/run.yaml \
--port 8321 \
--env OPENAI_API_KEY=your_openai_key \
--env FIREWORKS_API_KEY=your_fireworks_key \
--env TOGETHER_API_KEY=your_together_key
uv run --with llama-stack llama stack build --distro starter --image-type venv --run
```
## Example Usage

View file

@ -11,12 +11,6 @@ This is the simplest way to get started. Using Llama Stack as a library means yo
Another simple way to start interacting with Llama Stack is to just spin up a container (via Docker or Podman) which is pre-built with all the providers you need. We provide a number of pre-built images so you can start a Llama Stack server instantly. You can also build your own custom container. Which distribution to choose depends on the hardware you have. See [Selection of a Distribution](selection) for more details.
## Conda:
If you have a custom or an advanced setup or you are developing on Llama Stack you can also build a custom Llama Stack server. Using `llama stack build` and `llama stack run` you can build/run a custom Llama Stack server containing the exact combination of providers you wish. We have also provided various templates to make getting started easier. See [Building a Custom Distribution](building_distro) for more details.
## Kubernetes:
If you have built a container image and want to deploy it in a Kubernetes cluster instead of starting the Llama Stack server locally, see the [Kubernetes Deployment Guide](kubernetes_deployment) for more details.
@ -28,5 +22,4 @@ If you have built a container image and want to deploy it in a Kubernetes cluste
importing_as_library
configuration
kubernetes_deployment
```

View file

@ -0,0 +1,67 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
from llama_stack_client import Agent, AgentEventLogger, RAGDocument, LlamaStackClient
vector_db_id = "my_demo_vector_db"
client = LlamaStackClient(base_url="http://localhost:8321")
models = client.models.list()
# Select the first LLM and first embedding models
model_id = next(m for m in models if m.model_type == "llm").identifier
embedding_model_id = (
em := next(m for m in models if m.model_type == "embedding")
).identifier
embedding_dimension = em.metadata["embedding_dimension"]
_ = client.vector_dbs.register(
vector_db_id=vector_db_id,
embedding_model=embedding_model_id,
embedding_dimension=embedding_dimension,
provider_id="faiss",
)
source = "https://www.paulgraham.com/greatwork.html"
print("rag_tool> Ingesting document:", source)
document = RAGDocument(
document_id="document_1",
content=source,
mime_type="text/html",
metadata={},
)
client.tool_runtime.rag_tool.insert(
documents=[document],
vector_db_id=vector_db_id,
chunk_size_in_tokens=50,
)
agent = Agent(
client,
model=model_id,
instructions="You are a helpful assistant",
tools=[
{
"name": "builtin::rag/knowledge_search",
"args": {"vector_db_ids": [vector_db_id]},
}
],
)
prompt = "How do you do great work?"
print("prompt>", prompt)
use_stream = True
response = agent.create_turn(
messages=[{"role": "user", "content": prompt}],
session_id=agent.create_session("rag_session"),
stream=use_stream,
)
# Only call `AgentEventLogger().log(response)` for streaming responses.
if use_stream:
for log in AgentEventLogger().log(response):
log.print()
else:
print(response)

View file

@ -1,4 +1,4 @@
# Detailed Tutorial
## Detailed Tutorial
In this guide, we'll walk through how you can use the Llama Stack (server and client SDK) to test a simple agent.
A Llama Stack agent is a simple integrated system that can perform tasks by combining a Llama model for reasoning with
@ -10,7 +10,7 @@ Llama Stack is a stateful service with REST APIs to support seamless transition
In this guide, we'll walk through how to build a RAG agent locally using Llama Stack with [Ollama](https://ollama.com/)
as the inference [provider](../providers/index.md#inference) for a Llama Model.
## Step 1: Installation and Setup
### Step 1: Installation and Setup
Install Ollama by following the instructions on the [Ollama website](https://ollama.com/download), then
download Llama 3.2 3B model, and then start the Ollama service.
@ -45,7 +45,7 @@ Setup your virtual environment.
uv sync --python 3.12
source .venv/bin/activate
```
## Step 2: Run Llama Stack
### Step 2: Run Llama Stack
Llama Stack is a server that exposes multiple APIs, you connect with it using the Llama Stack client SDK.
::::{tab-set}
@ -54,15 +54,15 @@ Llama Stack is a server that exposes multiple APIs, you connect with it using th
You can use Python to build and run the Llama Stack server, which is useful for testing and development.
Llama Stack uses a [YAML configuration file](../distributions/configuration.md) to specify the stack setup,
which defines the providers and their settings.
which defines the providers and their settings. The generated configuration serves as a starting point that you can [customize for your specific needs](../distributions/customizing_run_yaml.md).
Now let's build and run the Llama Stack config for Ollama.
We use `starter` as the template. By default all providers are disabled, so you need to enable Ollama by passing environment variables.
```bash
ENABLE_OLLAMA=ollama OLLAMA_INFERENCE_MODEL="llama3.2:3b" llama stack build --template starter --image-type venv --run
llama stack build --distro starter --image-type venv --run
```
:::
:::{tab-item} Using `conda`
:::{tab-item} Using `venv`
You can use Python to build and run the Llama Stack server, which is useful for testing and development.
Llama Stack uses a [YAML configuration file](../distributions/configuration.md) to specify the stack setup,
@ -70,18 +70,16 @@ which defines the providers and their settings.
Now let's build and run the Llama Stack config for Ollama.
```bash
ENABLE_OLLAMA=ollama INFERENCE_MODEL="llama3.2:3b" llama stack build --template starter --image-type conda --run
llama stack build --distro starter --image-type venv --run
```
:::
:::{tab-item} Using a Container
You can use a container image to run the Llama Stack server. We provide several container images for the server
component that works with different inference providers out of the box. For this guide, we will use
`llamastack/distribution-starter` as the container image. If you'd like to build your own image or customize the
configurations, please check out [this guide](../references/index.md).
configurations, please check out [this guide](../distributions/building_distro.md).
First lets setup some environment variables and create a local directory to mount into the containers file system.
```bash
export INFERENCE_MODEL="llama3.2:3b"
export ENABLE_OLLAMA=ollama
export LLAMA_STACK_PORT=8321
mkdir -p ~/.llama
```
@ -94,7 +92,6 @@ docker run -it \
-v ~/.llama:/root/.llama \
llamastack/distribution-starter \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://host.docker.internal:11434
```
Note to start the container with Podman, you can do the same but replace `docker` at the start of the command with
@ -116,7 +113,6 @@ docker run -it \
--network=host \
llamastack/distribution-starter \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://localhost:11434
```
:::
@ -132,7 +128,7 @@ Now you can use the Llama Stack client to run inference and build agents!
You can reuse the server setup or use the [Llama Stack Client](https://github.com/meta-llama/llama-stack-client-python/).
Note that the client package is already included in the `llama-stack` package.
## Step 3: Run Client CLI
### Step 3: Run Client CLI
Open a new terminal and navigate to the same directory you started the server from. Then set up a new or activate your
existing server virtual environment.
@ -154,13 +150,7 @@ pip install llama-stack-client
```
:::
:::{tab-item} Install with `conda`
```bash
yes | conda create -n stack-client python=3.12
conda activate stack-client
pip install llama-stack-client
```
:::
::::
Now let's use the `llama-stack-client` [CLI](../references/llama_stack_client_cli_reference.md) to check the
@ -232,7 +222,7 @@ OpenAIChatCompletion(
)
```
## Step 4: Run the Demos
### Step 4: Run the Demos
Note that these demos show the [Python Client SDK](../references/python_sdk_reference/index.md).
Other SDKs are also available, please refer to the [Client SDK](../index.md#client-sdks) list for the complete options.
@ -242,7 +232,7 @@ Other SDKs are also available, please refer to the [Client SDK](../index.md#clie
:::{tab-item} Basic Inference
Now you can run inference using the Llama Stack client SDK.
### i. Create the Script
#### i. Create the Script
Create a file `inference.py` and add the following code:
```python
@ -269,7 +259,7 @@ response = client.chat.completions.create(
print(response)
```
### ii. Run the Script
#### ii. Run the Script
Let's run the script using `uv`
```bash
uv run python inference.py
@ -283,7 +273,7 @@ OpenAIChatCompletion(id='chatcmpl-30cd0f28-a2ad-4b6d-934b-13707fc60ebf', choices
:::{tab-item} Build a Simple Agent
Next we can move beyond simple inference and build an agent that can perform tasks using the Llama Stack server.
### i. Create the Script
#### i. Create the Script
Create a file `agent.py` and add the following code:
```python
@ -455,7 +445,7 @@ uv run python agent.py
For our last demo, we can build a RAG agent that can answer questions about the Torchtune project using the documents
in a vector database.
### i. Create the Script
#### i. Create the Script
Create a file `rag_agent.py` and add the following code:
```python
@ -533,7 +523,7 @@ for t in turns:
for event in AgentEventLogger().log(stream):
event.print()
```
### ii. Run the Script
#### ii. Run the Script
Let's run the script using `uv`
```bash
uv run python rag_agent.py

View file

@ -1,123 +1,13 @@
# Quickstart
# Getting Started
Get started with Llama Stack in minutes!
Llama Stack is a stateful service with REST APIs to support the seamless transition of AI applications across different
environments. You can build and test using a local server first and deploy to a hosted endpoint for production.
In this guide, we'll walk through how to build a RAG application locally using Llama Stack with [Ollama](https://ollama.com/)
as the inference [provider](../providers/inference/index) for a Llama Model.
**💡 Notebook Version:** You can also follow this quickstart guide in a Jupyter notebook format: [quick_start.ipynb](https://github.com/meta-llama/llama-stack/blob/main/docs/quick_start.ipynb)
#### Step 1: Install and setup
1. Install [uv](https://docs.astral.sh/uv/)
2. Run inference on a Llama model with [Ollama](https://ollama.com/download)
```bash
ollama run llama3.2:3b --keepalive 60m
```{include} quickstart.md
:start-after: ## Quickstart
```
#### Step 2: Run the Llama Stack server
We will use `uv` to run the Llama Stack server.
```bash
INFERENCE_MODEL=llama3.2:3b uv run --with llama-stack llama stack build --template starter --image-type venv --run
```{include} libraries.md
:start-after: ## Libraries (SDKs)
```
#### Step 3: Run the demo
Now open up a new terminal and copy the following script into a file named `demo_script.py`.
```python
from llama_stack_client import Agent, AgentEventLogger, RAGDocument, LlamaStackClient
vector_db_id = "my_demo_vector_db"
client = LlamaStackClient(base_url="http://localhost:8321")
models = client.models.list()
# Select the first LLM and first embedding models
model_id = next(m for m in models if m.model_type == "llm").identifier
embedding_model_id = (
em := next(m for m in models if m.model_type == "embedding")
).identifier
embedding_dimension = em.metadata["embedding_dimension"]
_ = client.vector_dbs.register(
vector_db_id=vector_db_id,
embedding_model=embedding_model_id,
embedding_dimension=embedding_dimension,
provider_id="faiss",
)
source = "https://www.paulgraham.com/greatwork.html"
print("rag_tool> Ingesting document:", source)
document = RAGDocument(
document_id="document_1",
content=source,
mime_type="text/html",
metadata={},
)
client.tool_runtime.rag_tool.insert(
documents=[document],
vector_db_id=vector_db_id,
chunk_size_in_tokens=50,
)
agent = Agent(
client,
model=model_id,
instructions="You are a helpful assistant",
tools=[
{
"name": "builtin::rag/knowledge_search",
"args": {"vector_db_ids": [vector_db_id]},
}
],
)
prompt = "How do you do great work?"
print("prompt>", prompt)
response = agent.create_turn(
messages=[{"role": "user", "content": prompt}],
session_id=agent.create_session("rag_session"),
stream=True,
)
for log in AgentEventLogger().log(response):
log.print()
```{include} detailed_tutorial.md
:start-after: ## Detailed Tutorial
```
We will use `uv` to run the script
```
uv run --with llama-stack-client,fire,requests demo_script.py
```
And you should see output like below.
```
rag_tool> Ingesting document: https://www.paulgraham.com/greatwork.html
prompt> How do you do great work?
inference> [knowledge_search(query="What is the key to doing great work")]
tool_execution> Tool:knowledge_search Args:{'query': 'What is the key to doing great work'}
tool_execution> Tool:knowledge_search Response:[TextContentItem(text='knowledge_search tool found 5 chunks:\nBEGIN of knowledge_search tool results.\n', type='text'), TextContentItem(text="Result 1:\nDocument_id:docum\nContent: work. Doing great work means doing something important\nso well that you expand people's ideas of what's possible. But\nthere's no threshold for importance. It's a matter of degree, and\noften hard to judge at the time anyway.\n", type='text'), TextContentItem(text="Result 2:\nDocument_id:docum\nContent: work. Doing great work means doing something important\nso well that you expand people's ideas of what's possible. But\nthere's no threshold for importance. It's a matter of degree, and\noften hard to judge at the time anyway.\n", type='text'), TextContentItem(text="Result 3:\nDocument_id:docum\nContent: work. Doing great work means doing something important\nso well that you expand people's ideas of what's possible. But\nthere's no threshold for importance. It's a matter of degree, and\noften hard to judge at the time anyway.\n", type='text'), TextContentItem(text="Result 4:\nDocument_id:docum\nContent: work. Doing great work means doing something important\nso well that you expand people's ideas of what's possible. But\nthere's no threshold for importance. It's a matter of degree, and\noften hard to judge at the time anyway.\n", type='text'), TextContentItem(text="Result 5:\nDocument_id:docum\nContent: work. Doing great work means doing something important\nso well that you expand people's ideas of what's possible. But\nthere's no threshold for importance. It's a matter of degree, and\noften hard to judge at the time anyway.\n", type='text'), TextContentItem(text='END of knowledge_search tool results.\n', type='text')]
inference> Based on the search results, it seems that doing great work means doing something important so well that you expand people's ideas of what's possible. However, there is no clear threshold for importance, and it can be difficult to judge at the time.
To further clarify, I would suggest that doing great work involves:
* Completing tasks with high quality and attention to detail
* Expanding on existing knowledge or ideas
* Making a positive impact on others through your work
* Striving for excellence and continuous improvement
Ultimately, great work is about making a meaningful contribution and leaving a lasting impression.
```
Congratulations! You've successfully built your first RAG application using Llama Stack! 🎉🥳
## Next Steps
Now you're ready to dive deeper into Llama Stack!
- Explore the [Detailed Tutorial](./detailed_tutorial.md).
- Try the [Getting Started Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb).
- Browse more [Notebooks on GitHub](https://github.com/meta-llama/llama-stack/tree/main/docs/notebooks).
- Learn about Llama Stack [Concepts](../concepts/index.md).
- Discover how to [Build Llama Stacks](../distributions/index.md).
- Refer to our [References](../references/index.md) for details on the Llama CLI and Python SDK.
- Check out the [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repository for example applications and tutorials.

View file

@ -0,0 +1,10 @@
## Libraries (SDKs)
We have a number of client-side SDKs available for different languages.
| **Language** | **Client SDK** | **Package** |
| :----: | :----: | :----: |
| Python | [llama-stack-client-python](https://github.com/meta-llama/llama-stack-client-python) | [![PyPI version](https://img.shields.io/pypi/v/llama_stack_client.svg)](https://pypi.org/project/llama_stack_client/)
| Swift | [llama-stack-client-swift](https://github.com/meta-llama/llama-stack-client-swift/tree/latest-release) | [![Swift Package Index](https://img.shields.io/endpoint?url=https%3A%2F%2Fswiftpackageindex.com%2Fapi%2Fpackages%2Fmeta-llama%2Fllama-stack-client-swift%2Fbadge%3Ftype%3Dswift-versions)](https://swiftpackageindex.com/meta-llama/llama-stack-client-swift)
| Node | [llama-stack-client-node](https://github.com/meta-llama/llama-stack-client-node) | [![NPM version](https://img.shields.io/npm/v/llama-stack-client.svg)](https://npmjs.org/package/llama-stack-client)
| Kotlin | [llama-stack-client-kotlin](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release) | [![Maven version](https://img.shields.io/maven-central/v/com.llama.llamastack/llama-stack-client-kotlin)](https://central.sonatype.com/artifact/com.llama.llamastack/llama-stack-client-kotlin)
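As a minimal example, the Python SDK can be used as follows (assuming a Llama Stack server is already running locally on the default port):
```python
from llama_stack_client import LlamaStackClient

# Connect to a locally running Llama Stack server
client = LlamaStackClient(base_url="http://localhost:8321")

# List the models registered with the server
for model in client.models.list():
    print(model.identifier, model.model_type)
```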

View file

@ -0,0 +1,77 @@
## Quickstart
Get started with Llama Stack in minutes!
Llama Stack is a stateful service with REST APIs to support the seamless transition of AI applications across different
environments. You can build and test using a local server first and deploy to a hosted endpoint for production.
In this guide, we'll walk through how to build a RAG application locally using Llama Stack with [Ollama](https://ollama.com/)
as the inference [provider](../providers/inference/index) for a Llama Model.
**💡 Notebook Version:** You can also follow this quickstart guide in a Jupyter notebook format: [quick_start.ipynb](https://github.com/meta-llama/llama-stack/blob/main/docs/quick_start.ipynb)
#### Step 1: Install and setup
1. Install [uv](https://docs.astral.sh/uv/)
2. Run inference on a Llama model with [Ollama](https://ollama.com/download)
```bash
ollama run llama3.2:3b --keepalive 60m
```
#### Step 2: Run the Llama Stack server
We will use `uv` to run the Llama Stack server.
```bash
OLLAMA_URL=http://localhost:11434 \
uv run --with llama-stack llama stack build --distro starter --image-type venv --run
```
#### Step 3: Run the demo
Now open up a new terminal and copy the following script into a file named `demo_script.py`.
```{literalinclude} ./demo_script.py
:language: python
```
We will use `uv` to run the script
```
uv run --with llama-stack-client,fire,requests demo_script.py
```
And you should see output like below.
```
rag_tool> Ingesting document: https://www.paulgraham.com/greatwork.html
prompt> How do you do great work?
inference> [knowledge_search(query="What is the key to doing great work")]
tool_execution> Tool:knowledge_search Args:{'query': 'What is the key to doing great work'}
tool_execution> Tool:knowledge_search Response:[TextContentItem(text='knowledge_search tool found 5 chunks:\nBEGIN of knowledge_search tool results.\n', type='text'), TextContentItem(text="Result 1:\nDocument_id:docum\nContent: work. Doing great work means doing something important\nso well that you expand people's ideas of what's possible. But\nthere's no threshold for importance. It's a matter of degree, and\noften hard to judge at the time anyway.\n", type='text'), TextContentItem(text="Result 2:\nDocument_id:docum\nContent: work. Doing great work means doing something important\nso well that you expand people's ideas of what's possible. But\nthere's no threshold for importance. It's a matter of degree, and\noften hard to judge at the time anyway.\n", type='text'), TextContentItem(text="Result 3:\nDocument_id:docum\nContent: work. Doing great work means doing something important\nso well that you expand people's ideas of what's possible. But\nthere's no threshold for importance. It's a matter of degree, and\noften hard to judge at the time anyway.\n", type='text'), TextContentItem(text="Result 4:\nDocument_id:docum\nContent: work. Doing great work means doing something important\nso well that you expand people's ideas of what's possible. But\nthere's no threshold for importance. It's a matter of degree, and\noften hard to judge at the time anyway.\n", type='text'), TextContentItem(text="Result 5:\nDocument_id:docum\nContent: work. Doing great work means doing something important\nso well that you expand people's ideas of what's possible. But\nthere's no threshold for importance. It's a matter of degree, and\noften hard to judge at the time anyway.\n", type='text'), TextContentItem(text='END of knowledge_search tool results.\n', type='text')]
inference> Based on the search results, it seems that doing great work means doing something important so well that you expand people's ideas of what's possible. However, there is no clear threshold for importance, and it can be difficult to judge at the time.
To further clarify, I would suggest that doing great work involves:
* Completing tasks with high quality and attention to detail
* Expanding on existing knowledge or ideas
* Making a positive impact on others through your work
* Striving for excellence and continuous improvement
Ultimately, great work is about making a meaningful contribution and leaving a lasting impression.
```
Congratulations! You've successfully built your first RAG application using Llama Stack! 🎉🥳
```{admonition} HuggingFace access
:class: tip
If you are getting a **401 Client Error** from HuggingFace for the **all-MiniLM-L6-v2** model, try setting **HF_TOKEN** to a valid HuggingFace token in your environment
```
### Next Steps
Now you're ready to dive deeper into Llama Stack!
- Explore the [Detailed Tutorial](./detailed_tutorial.md).
- Try the [Getting Started Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb).
- Browse more [Notebooks on GitHub](https://github.com/meta-llama/llama-stack/tree/main/docs/notebooks).
- Learn about Llama Stack [Concepts](../concepts/index.md).
- Discover how to [Build Llama Stacks](../distributions/index.md).
- Refer to our [References](../references/index.md) for details on the Llama CLI and Python SDK.
- Check out the [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repository for example applications and tutorials.

View file

@ -40,17 +40,6 @@ Kotlin.
- Ready to build? Check out the [Quick Start](getting_started/index) to get started.
- Want to contribute? See the [Contributing](contributing/index) guide.
## Client SDKs
We have a number of client-side SDKs available for different languages.
| **Language** | **Client SDK** | **Package** |
| :----: | :----: | :----: |
| Python | [llama-stack-client-python](https://github.com/meta-llama/llama-stack-client-python) | [![PyPI version](https://img.shields.io/pypi/v/llama_stack_client.svg)](https://pypi.org/project/llama_stack_client/)
| Swift | [llama-stack-client-swift](https://github.com/meta-llama/llama-stack-client-swift/tree/latest-release) | [![Swift Package Index](https://img.shields.io/endpoint?url=https%3A%2F%2Fswiftpackageindex.com%2Fapi%2Fpackages%2Fmeta-llama%2Fllama-stack-client-swift%2Fbadge%3Ftype%3Dswift-versions)](https://swiftpackageindex.com/meta-llama/llama-stack-client-swift)
| Node | [llama-stack-client-node](https://github.com/meta-llama/llama-stack-client-node) | [![NPM version](https://img.shields.io/npm/v/llama-stack-client.svg)](https://npmjs.org/package/llama-stack-client)
| Kotlin | [llama-stack-client-kotlin](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release) | [![Maven version](https://img.shields.io/maven-central/v/com.llama.llamastack/llama-stack-client-kotlin)](https://central.sonatype.com/artifact/com.llama.llamastack/llama-stack-client-kotlin)
## Supported Llama Stack Implementations
A number of "adapters" are available for some popular Inference and Vector Store providers. For other APIs (particularly Safety and Agents), we provide *reference implementations* you can use to get started. We expect this list to grow over time. We are slowly onboarding more providers to the ecosystem as we get more confidence in the APIs.
@ -133,14 +122,12 @@ A number of "adapters" are available for some popular Inference and Vector Store
self
getting_started/index
getting_started/detailed_tutorial
introduction/index
concepts/index
openai/index
providers/index
distributions/index
advanced_apis/index
building_applications/index
playground/index
deploying/index
contributing/index
references/index
```

View file

@ -1,5 +1,13 @@
# Agents Providers
# Agents
## Overview
This section contains documentation for all available providers for the **agents** API.
- [inline::meta-reference](inline_meta-reference.md)
## Providers
```{toctree}
:maxdepth: 1
inline_meta-reference
```

View file

@ -1,7 +1,15 @@
# Datasetio Providers
# Datasetio
## Overview
This section contains documentation for all available providers for the **datasetio** API.
- [inline::localfs](inline_localfs.md)
- [remote::huggingface](remote_huggingface.md)
- [remote::nvidia](remote_nvidia.md)
## Providers
```{toctree}
:maxdepth: 1
inline_localfs
remote_huggingface
remote_nvidia
```

View file

@ -1,6 +1,14 @@
# Eval Providers
# Eval
## Overview
This section contains documentation for all available providers for the **eval** API.
- [inline::meta-reference](inline_meta-reference.md)
- [remote::nvidia](remote_nvidia.md)
## Providers
```{toctree}
:maxdepth: 1
inline_meta-reference
remote_nvidia
```

View file

@ -1,13 +1,17 @@
# External Providers Guide
Llama Stack supports external providers that live outside of the main codebase. This allows you to:
- Create and maintain your own providers independently
- Share providers with others without contributing to the main codebase
- Keep provider-specific code separate from the core Llama Stack code
# Creating External Providers
## Configuration
To enable external providers, you need to configure the `external_providers_dir` in your Llama Stack configuration. This directory should contain your external provider specifications:
To enable external providers, you need to add a `module` entry to your build YAML, which allows Llama Stack to install the required package for the external provider.
An example entry in your build.yaml should look like:
```
- provider_type: remote::ramalama
module: ramalama_stack
```
Additionally you can configure the `external_providers_dir` in your Llama Stack configuration. This method is in the process of being deprecated in favor of the `module` method. If using this method, the external provider directory should contain your external provider specifications:
```yaml
external_providers_dir: ~/.llama/providers.d/
@ -46,17 +50,6 @@ Llama Stack supports two types of external providers:
1. **Remote Providers**: Providers that communicate with external services (e.g., cloud APIs)
2. **Inline Providers**: Providers that run locally within the Llama Stack process
## Known External Providers
Here's a list of known external providers that you can use with Llama Stack:
| Name | Description | API | Type | Repository |
|------|-------------|-----|------|------------|
| KubeFlow Training | Train models with KubeFlow | Post Training | Remote | [llama-stack-provider-kft](https://github.com/opendatahub-io/llama-stack-provider-kft) |
| KubeFlow Pipelines | Train models with KubeFlow Pipelines | Post Training | Inline **and** Remote | [llama-stack-provider-kfp-trainer](https://github.com/opendatahub-io/llama-stack-provider-kfp-trainer) |
| RamaLama | Inference models with RamaLama | Inference | Remote | [ramalama-stack](https://github.com/containers/ramalama-stack) |
| TrustyAI LM-Eval | Evaluate models with TrustyAI LM-Eval | Eval | Remote | [llama-stack-provider-lmeval](https://github.com/trustyai-explainability/llama-stack-provider-lmeval) |
### Remote Provider Specification
Remote providers are used when you need to communicate with external services. Here's an example for a custom Ollama provider:
@ -110,9 +103,34 @@ container_image: custom-vector-store:latest # optional
- `provider_data_validator`: Optional validator for provider data
- `container_image`: Optional container image to use instead of pip packages
## Required Implementation
## Required Fields
### Remote Providers
### All Providers
All providers must contain a `get_provider_spec` function in their `provider` module. This is a standardized structure that Llama Stack expects and uses to retrieve information such as the provider's config class. The `get_provider_spec` function returns a structure identical to the `adapter` specification. An example function may look like:
```python
from llama_stack.providers.datatypes import (
ProviderSpec,
Api,
AdapterSpec,
remote_provider_spec,
)
def get_provider_spec() -> ProviderSpec:
return remote_provider_spec(
api=Api.inference,
adapter=AdapterSpec(
adapter_type="ramalama",
pip_packages=["ramalama>=0.8.5", "pymilvus"],
config_class="ramalama_stack.config.RamalamaImplConfig",
module="ramalama_stack",
),
)
```
#### Remote Providers
Remote providers must expose a `get_adapter_impl()` function in their module that takes two arguments:
1. `config`: An instance of the provider's config class
@ -128,7 +146,7 @@ async def get_adapter_impl(
return OllamaInferenceAdapter(config)
```
### Inline Providers
#### Inline Providers
Inline providers must expose a `get_provider_impl()` function in their module that takes two arguments:
1. `config`: An instance of the provider's config class
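A minimal sketch of an inline provider entry point is shown below; the second argument is assumed here to be a mapping of API dependencies, and the class and module names are purely illustrative:
```python
from typing import Any


async def get_provider_impl(config: "MyProviderConfig", deps: dict[Any, Any]):
    # Import lazily so the package can be inspected without pulling in heavy dependencies.
    from .my_provider import MyProviderImpl  # hypothetical implementation module

    impl = MyProviderImpl(config, deps)
    await impl.initialize()
    return impl
```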
@ -155,7 +173,40 @@ Version: 0.1.0
Location: /path/to/venv/lib/python3.10/site-packages
```
## Example: Custom Ollama Provider
## Best Practices
1. **Package Naming**: Use the prefix `llama-stack-provider-` for your provider packages to make them easily identifiable.
2. **Version Management**: Keep your provider package versioned and compatible with the Llama Stack version you're using.
3. **Dependencies**: Only include the minimum required dependencies in your provider package.
4. **Documentation**: Include clear documentation in your provider package about:
- Installation requirements
- Configuration options
- Usage examples
- Any limitations or known issues
5. **Testing**: Include tests in your provider package to ensure it works correctly with Llama Stack.
You can refer to the [integration tests
guide](https://github.com/meta-llama/llama-stack/blob/main/tests/integration/README.md) for more
information. Execute the test for the Provider type you are developing.
## Troubleshooting
If your external provider isn't being loaded:
1. Check that `module` points to a published pip package with a top level `provider` module including `get_provider_spec`.
2. Check that the `external_providers_dir` path is correct and accessible.
3. Verify that the YAML files are properly formatted.
4. Ensure all required Python packages are installed.
5. Check the Llama Stack server logs for any error messages - turn on debug logging to get more
information using `LLAMA_STACK_LOGGING=all=debug`.
6. Verify that the provider package is installed in your Python environment if using `external_providers_dir`.
## Examples
### Example using `external_providers_dir`: Custom Ollama Provider
Here's a complete example of creating and using a custom Ollama provider:
@ -206,32 +257,30 @@ external_providers_dir: ~/.llama/providers.d/
The provider will now be available in Llama Stack with the type `remote::custom_ollama`.
## Best Practices
1. **Package Naming**: Use the prefix `llama-stack-provider-` for your provider packages to make them easily identifiable.
### Example using `module`: ramalama-stack
2. **Version Management**: Keep your provider package versioned and compatible with the Llama Stack version you're using.
[ramalama-stack](https://github.com/containers/ramalama-stack) is a recognized external provider that supports installation via module.
3. **Dependencies**: Only include the minimum required dependencies in your provider package.
To install Llama Stack with this external provider a user can provide the following build.yaml:
4. **Documentation**: Include clear documentation in your provider package about:
- Installation requirements
- Configuration options
- Usage examples
- Any limitations or known issues
```yaml
version: 2
distribution_spec:
description: Use (an external) Ramalama server for running LLM inference
container_image: null
providers:
inference:
- provider_type: remote::ramalama
module: ramalama_stack==0.3.0a0
image_type: venv
image_name: null
external_providers_dir: null
additional_pip_packages:
- aiosqlite
- sqlalchemy[asyncio]
```
5. **Testing**: Include tests in your provider package to ensure it works correctly with Llama Stack.
You can refer to the [integration tests
guide](https://github.com/meta-llama/llama-stack/blob/main/tests/integration/README.md) for more
information. Execute the test for the Provider type you are developing.
No other steps are required other than `llama stack build` and `llama stack run`. The build process will use `module` to install all of the provider dependencies, retrieve the spec, etc.
## Troubleshooting
If your external provider isn't being loaded:
1. Check that the `external_providers_dir` path is correct and accessible.
2. Verify that the YAML files are properly formatted.
3. Ensure all required Python packages are installed.
4. Check the Llama Stack server logs for any error messages - turn on debug logging to get more
information using `LLAMA_STACK_LOGGING=all=debug`.
5. Verify that the provider package is installed in your Python environment.
The provider will now be available in Llama Stack with the type `remote::ramalama`.
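As a rough sketch of the `llama stack build` / `llama stack run` steps mentioned above (the build.yaml path and the location of the generated run.yaml are placeholders that depend on your setup):
```bash
# Build using the build.yaml shown above
llama stack build --config ./build.yaml

# Then run the generated configuration (the build step prints its location)
llama stack run ./run.yaml
```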

View file

@ -0,0 +1,10 @@
# Known External Providers
Here's a list of known external providers that you can use with Llama Stack:
| Name | Description | API | Type | Repository |
|------|-------------|-----|------|------------|
| KubeFlow Training | Train models with KubeFlow | Post Training | Remote | [llama-stack-provider-kft](https://github.com/opendatahub-io/llama-stack-provider-kft) |
| KubeFlow Pipelines | Train models with KubeFlow Pipelines | Post Training | Inline **and** Remote | [llama-stack-provider-kfp-trainer](https://github.com/opendatahub-io/llama-stack-provider-kfp-trainer) |
| RamaLama | Inference models with RamaLama | Inference | Remote | [ramalama-stack](https://github.com/containers/ramalama-stack) |
| TrustyAI LM-Eval | Evaluate models with TrustyAI LM-Eval | Eval | Remote | [llama-stack-provider-lmeval](https://github.com/trustyai-explainability/llama-stack-provider-lmeval) |

docs/source/providers/external/index.md
View file

@ -0,0 +1,13 @@
# External Providers
Llama Stack supports external providers that live outside of the main codebase. This allows you to:
- Create and maintain your own providers independently
- Share providers with others without contributing to the main codebase
- Keep provider-specific code separate from the core Llama Stack code
```{toctree}
:maxdepth: 1
external-providers-list
external-providers-guide
```

View file

@ -1,5 +1,13 @@
# Files Providers
# Files
## Overview
This section contains documentation for all available providers for the **files** API.
- [inline::localfs](inline_localfs.md)
## Providers
```{toctree}
:maxdepth: 1
inline_localfs
```

View file

@ -8,7 +8,7 @@ Local filesystem-based file storage provider for managing files and documents lo
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `storage_dir` | `<class 'str'>` | No | PydanticUndefined | Directory to store uploaded files |
| `storage_dir` | `<class 'str'>` | No | | Directory to store uploaded files |
| `metadata_store` | `utils.sqlstore.sqlstore.SqliteSqlStoreConfig \| utils.sqlstore.sqlstore.PostgresSqlStoreConfig` | No | sqlite | SQL store configuration for file metadata |
| `ttl_secs` | `<class 'int'>` | No | 31536000 | |

View file

@ -1,4 +1,4 @@
# Providers Overview
# API Providers
The goal of Llama Stack is to build an ecosystem where users can easily swap out different implementations for the same API. Examples for these include:
- LLM inference providers (e.g., Meta Reference, Ollama, Fireworks, Together, AWS Bedrock, Groq, Cerebras, SambaNova, vLLM, OpenAI, Anthropic, Gemini, WatsonX, etc.),
@ -12,105 +12,17 @@ Providers come in two flavors:
Importantly, Llama Stack always strives to provide at least one fully inline provider for each API so you can iterate on a fully featured environment locally.
## External Providers
Llama Stack supports external providers that live outside of the main codebase. This allows you to create and maintain your own providers independently.
```{toctree}
:maxdepth: 1
external
```
## Agents
Run multi-step agentic workflows with LLMs with tool usage, memory (RAG), etc.
```{toctree}
:maxdepth: 1
agents/index
```
## DatasetIO
Interfaces with datasets and data loaders.
```{toctree}
:maxdepth: 1
datasetio/index
```
## Eval
Generates outputs (via Inference or Agents) and perform scoring.
```{toctree}
:maxdepth: 1
eval/index
```
## Inference
Runs inference with an LLM.
```{toctree}
:maxdepth: 1
external/index
openai
inference/index
```
## Post Training
Fine-tunes a model.
```{toctree}
:maxdepth: 1
post_training/index
```
## Safety
Applies safety policies to the output at a Systems (not only model) level.
```{toctree}
:maxdepth: 1
agents/index
datasetio/index
safety/index
```
## Scoring
Evaluates the outputs of the system.
```{toctree}
:maxdepth: 1
scoring/index
```
## Telemetry
Collects telemetry data from the system.
```{toctree}
:maxdepth: 1
telemetry/index
```
## Tool Runtime
Is associated with the ToolGroup resources.
```{toctree}
:maxdepth: 1
tool_runtime/index
```
## Vector IO
Vector IO refers to operations on vector databases, such as adding documents, searching, and deleting documents.
Vector IO plays a crucial role in [Retrieval Augmented Generation (RAG)](../../building_applications/rag), where the vector
IO layer and database are used to store and retrieve documents.
```{toctree}
:maxdepth: 1
vector_io/index
tool_runtime/index
files/index
```

View file

@ -1,32 +1,34 @@
# Inference Providers
# Inference
## Overview
This section contains documentation for all available providers for the **inference** API.
- [inline::meta-reference](inline_meta-reference.md)
- [inline::sentence-transformers](inline_sentence-transformers.md)
- [inline::vllm](inline_vllm.md)
- [remote::anthropic](remote_anthropic.md)
- [remote::bedrock](remote_bedrock.md)
- [remote::cerebras](remote_cerebras.md)
- [remote::cerebras-openai-compat](remote_cerebras-openai-compat.md)
- [remote::databricks](remote_databricks.md)
- [remote::fireworks](remote_fireworks.md)
- [remote::fireworks-openai-compat](remote_fireworks-openai-compat.md)
- [remote::gemini](remote_gemini.md)
- [remote::groq](remote_groq.md)
- [remote::groq-openai-compat](remote_groq-openai-compat.md)
- [remote::hf::endpoint](remote_hf_endpoint.md)
- [remote::hf::serverless](remote_hf_serverless.md)
- [remote::llama-openai-compat](remote_llama-openai-compat.md)
- [remote::nvidia](remote_nvidia.md)
- [remote::ollama](remote_ollama.md)
- [remote::openai](remote_openai.md)
- [remote::passthrough](remote_passthrough.md)
- [remote::runpod](remote_runpod.md)
- [remote::sambanova](remote_sambanova.md)
- [remote::sambanova-openai-compat](remote_sambanova-openai-compat.md)
- [remote::tgi](remote_tgi.md)
- [remote::together](remote_together.md)
- [remote::together-openai-compat](remote_together-openai-compat.md)
- [remote::vllm](remote_vllm.md)
- [remote::watsonx](remote_watsonx.md)
## Providers
```{toctree}
:maxdepth: 1
inline_meta-reference
inline_sentence-transformers
remote_anthropic
remote_bedrock
remote_cerebras
remote_databricks
remote_fireworks
remote_gemini
remote_groq
remote_hf_endpoint
remote_hf_serverless
remote_llama-openai-compat
remote_nvidia
remote_ollama
remote_openai
remote_passthrough
remote_runpod
remote_sambanova
remote_tgi
remote_together
remote_vllm
remote_watsonx
```

View file

@ -1,29 +0,0 @@
# inline::vllm
## Description
vLLM inference provider for high-performance model serving with PagedAttention and continuous batching.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `tensor_parallel_size` | `<class 'int'>` | No | 1 | Number of tensor parallel replicas (number of GPUs to use). |
| `max_tokens` | `<class 'int'>` | No | 4096 | Maximum number of tokens to generate. |
| `max_model_len` | `<class 'int'>` | No | 4096 | Maximum context length to use during serving. |
| `max_num_seqs` | `<class 'int'>` | No | 4 | Maximum parallel batch size for generation. |
| `enforce_eager` | `<class 'bool'>` | No | False | Whether to use eager mode for inference (otherwise cuda graphs are used). |
| `gpu_memory_utilization` | `<class 'float'>` | No | 0.3 | How much GPU memory will be allocated when this provider has finished loading, including memory that was already allocated before loading. |
## Sample Configuration
```yaml
tensor_parallel_size: ${env.TENSOR_PARALLEL_SIZE:=1}
max_tokens: ${env.MAX_TOKENS:=4096}
max_model_len: ${env.MAX_MODEL_LEN:=4096}
max_num_seqs: ${env.MAX_NUM_SEQS:=4}
enforce_eager: ${env.ENFORCE_EAGER:=False}
gpu_memory_utilization: ${env.GPU_MEMORY_UTILIZATION:=0.3}
```

View file

@ -13,7 +13,7 @@ Anthropic inference provider for accessing Claude models and Anthropic's AI serv
## Sample Configuration
```yaml
api_key: ${env.ANTHROPIC_API_KEY}
api_key: ${env.ANTHROPIC_API_KEY:=}
```

View file

@ -1,21 +0,0 @@
# remote::cerebras-openai-compat
## Description
Cerebras OpenAI-compatible provider for using Cerebras models with OpenAI API format.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | The Cerebras API key |
| `openai_compat_api_base` | `<class 'str'>` | No | https://api.cerebras.ai/v1 | The URL for the Cerebras API server |
## Sample Configuration
```yaml
openai_compat_api_base: https://api.cerebras.ai/v1
api_key: ${env.CEREBRAS_API_KEY}
```

View file

@ -15,7 +15,7 @@ Cerebras inference provider for running models on Cerebras Cloud platform.
```yaml
base_url: https://api.cerebras.ai
api_key: ${env.CEREBRAS_API_KEY}
api_key: ${env.CEREBRAS_API_KEY:=}
```

View file

@ -14,8 +14,8 @@ Databricks inference provider for running models on Databricks' unified analytic
## Sample Configuration
```yaml
url: ${env.DATABRICKS_URL}
api_token: ${env.DATABRICKS_API_TOKEN}
url: ${env.DATABRICKS_URL:=}
api_token: ${env.DATABRICKS_API_TOKEN:=}
```

View file

@ -1,21 +0,0 @@
# remote::fireworks-openai-compat
## Description
Fireworks AI OpenAI-compatible provider for using Fireworks models with OpenAI API format.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | The Fireworks API key |
| `openai_compat_api_base` | `<class 'str'>` | No | https://api.fireworks.ai/inference/v1 | The URL for the Fireworks API server |
## Sample Configuration
```yaml
openai_compat_api_base: https://api.fireworks.ai/inference/v1
api_key: ${env.FIREWORKS_API_KEY}
```

View file

@ -8,6 +8,7 @@ Fireworks AI inference provider for Llama models and other AI models on the Fire
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `allowed_models` | `list[str \| None` | No | | List of models that should be registered with the model registry. If None, all models are allowed. |
| `url` | `<class 'str'>` | No | https://api.fireworks.ai/inference/v1 | The URL for the Fireworks server |
| `api_key` | `pydantic.types.SecretStr \| None` | No | | The Fireworks.ai API Key |
@ -15,7 +16,7 @@ Fireworks AI inference provider for Llama models and other AI models on the Fire
```yaml
url: https://api.fireworks.ai/inference/v1
api_key: ${env.FIREWORKS_API_KEY}
api_key: ${env.FIREWORKS_API_KEY:=}
```

View file

@ -13,7 +13,7 @@ Google Gemini inference provider for accessing Gemini models and Google's AI ser
## Sample Configuration
```yaml
api_key: ${env.GEMINI_API_KEY}
api_key: ${env.GEMINI_API_KEY:=}
```

View file

@ -1,21 +0,0 @@
# remote::groq-openai-compat
## Description
Groq OpenAI-compatible provider for using Groq models with OpenAI API format.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | The Groq API key |
| `openai_compat_api_base` | `<class 'str'>` | No | https://api.groq.com/openai/v1 | The URL for the Groq API server |
## Sample Configuration
```yaml
openai_compat_api_base: https://api.groq.com/openai/v1
api_key: ${env.GROQ_API_KEY}
```

View file

@ -15,7 +15,7 @@ Groq inference provider for ultra-fast inference using Groq's LPU technology.
```yaml
url: https://api.groq.com
api_key: ${env.GROQ_API_KEY}
api_key: ${env.GROQ_API_KEY:=}
```

View file

@ -8,7 +8,7 @@ HuggingFace Inference Endpoints provider for dedicated model serving.
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `endpoint_name` | `<class 'str'>` | No | PydanticUndefined | The name of the Hugging Face Inference Endpoint in the format of '{namespace}/{endpoint_name}' (e.g. 'my-cool-org/meta-llama-3-1-8b-instruct-rce'). Namespace is optional and will default to the user account if not provided. |
| `endpoint_name` | `<class 'str'>` | No | | The name of the Hugging Face Inference Endpoint in the format of '{namespace}/{endpoint_name}' (e.g. 'my-cool-org/meta-llama-3-1-8b-instruct-rce'). Namespace is optional and will default to the user account if not provided. |
| `api_token` | `pydantic.types.SecretStr \| None` | No | | Your Hugging Face user access token (will default to locally saved token if not provided) |
## Sample Configuration

View file

@ -8,7 +8,7 @@ HuggingFace Inference API serverless provider for on-demand model inference.
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `huggingface_repo` | `<class 'str'>` | No | PydanticUndefined | The model ID of the model on the Hugging Face Hub (e.g. 'meta-llama/Meta-Llama-3.1-70B-Instruct') |
| `huggingface_repo` | `<class 'str'>` | No | | The model ID of the model on the Hugging Face Hub (e.g. 'meta-llama/Meta-Llama-3.1-70B-Instruct') |
| `api_token` | `pydantic.types.SecretStr \| None` | No | | Your Hugging Face user access token (will default to locally saved token if not provided) |
## Sample Configuration

View file

@ -9,6 +9,7 @@ Ollama inference provider for running local models through the Ollama runtime.
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `<class 'str'>` | No | http://localhost:11434 | |
| `refresh_models` | `<class 'bool'>` | No | False | Whether to refresh models periodically |
## Sample Configuration

View file

@ -9,11 +9,13 @@ OpenAI inference provider for accessing GPT models and other OpenAI services.
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | API key for OpenAI models |
| `base_url` | `<class 'str'>` | No | https://api.openai.com/v1 | Base URL for OpenAI API |
## Sample Configuration
```yaml
api_key: ${env.OPENAI_API_KEY}
api_key: ${env.OPENAI_API_KEY:=}
base_url: ${env.OPENAI_BASE_URL:=https://api.openai.com/v1}
```

View file

@ -15,7 +15,7 @@ SambaNova OpenAI-compatible provider for using SambaNova models with OpenAI API
```yaml
openai_compat_api_base: https://api.sambanova.ai/v1
api_key: ${env.SAMBANOVA_API_KEY}
api_key: ${env.SAMBANOVA_API_KEY:=}
```

View file

@ -15,7 +15,7 @@ SambaNova inference provider for running models on SambaNova's dataflow architec
```yaml
url: https://api.sambanova.ai/v1
api_key: ${env.SAMBANOVA_API_KEY}
api_key: ${env.SAMBANOVA_API_KEY:=}
```

View file

@ -8,12 +8,12 @@ Text Generation Inference (TGI) provider for HuggingFace model serving.
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `<class 'str'>` | No | PydanticUndefined | The URL for the TGI serving endpoint |
| `url` | `<class 'str'>` | No | | The URL for the TGI serving endpoint |
## Sample Configuration
```yaml
url: ${env.TGI_URL}
url: ${env.TGI_URL:=}
```

View file

@ -1,21 +0,0 @@
# remote::together-openai-compat
## Description
Together AI OpenAI-compatible provider for using Together models with OpenAI API format.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | The Together API key |
| `openai_compat_api_base` | `<class 'str'>` | No | https://api.together.xyz/v1 | The URL for the Together API server |
## Sample Configuration
```yaml
openai_compat_api_base: https://api.together.xyz/v1
api_key: ${env.TOGETHER_API_KEY}
```

View file

@ -8,6 +8,7 @@ Together AI inference provider for open-source models and collaborative AI devel
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `allowed_models` | `list[str] \| None` | No | | List of models that should be registered with the model registry. If None, all models are allowed. |
| `url` | `<class 'str'>` | No | https://api.together.xyz/v1 | The URL for the Together AI server |
| `api_key` | `pydantic.types.SecretStr \| None` | No | | The Together AI API Key |
@ -15,7 +16,7 @@ Together AI inference provider for open-source models and collaborative AI devel
```yaml
url: https://api.together.xyz/v1
api_key: ${env.TOGETHER_API_KEY}
api_key: ${env.TOGETHER_API_KEY:=}
```

View file

@ -12,11 +12,12 @@ Remote vLLM inference provider for connecting to vLLM servers.
| `max_tokens` | `<class 'int'>` | No | 4096 | Maximum number of tokens to generate. |
| `api_token` | `str \| None` | No | fake | The API token |
| `tls_verify` | `bool \| str` | No | True | Whether to verify TLS certificates. Can be a boolean or a path to a CA certificate file. |
| `refresh_models` | `<class 'bool'>` | No | False | Whether to refresh models periodically |
## Sample Configuration
```yaml
url: ${env.VLLM_URL}
url: ${env.VLLM_URL:=}
max_tokens: ${env.VLLM_MAX_TOKENS:=4096}
api_token: ${env.VLLM_API_TOKEN:=fake}
tls_verify: ${env.VLLM_TLS_VERIFY:=true}

View file

@ -1,14 +1,14 @@
# OpenAI API Compatibility
## OpenAI API Compatibility
## Server path
### Server path
Llama Stack exposes an OpenAI-compatible API endpoint at `/v1/openai/v1`. For a Llama Stack server running locally on port `8321`, the full URL to the OpenAI-compatible API endpoint is `http://localhost:8321/v1/openai/v1`.
## Clients
### Clients
You should be able to use any client that speaks OpenAI APIs with Llama Stack. We regularly test with the official Llama Stack clients as well as OpenAI's official Python client.
### Llama Stack Client
#### Llama Stack Client
When using the Llama Stack client, set the `base_url` to the root of your Llama Stack server. It will automatically route OpenAI-compatible requests to the right server endpoint for you.
@ -18,7 +18,7 @@ from llama_stack_client import LlamaStackClient
client = LlamaStackClient(base_url="http://localhost:8321")
```
### OpenAI Client
#### OpenAI Client
When using an OpenAI client, set the `base_url` to the `/v1/openai/v1` path on your Llama Stack server.
@ -30,9 +30,9 @@ client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")
Regardless of the client you choose, the following code examples should all work the same.
## APIs implemented
### APIs implemented
### Models
#### Models
Many of the APIs require you to pass in a model parameter. To see the list of models available in your Llama Stack server:
@ -40,13 +40,13 @@ Many of the APIs require you to pass in a model parameter. To see the list of mo
models = client.models.list()
```
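A minimal sketch of inspecting the result, assuming the OpenAI Python client from the earlier example (the attribute names follow that client's model objects):

```python
# Print the IDs of the models registered with the Llama Stack server.
# `client` is the OpenAI client created above with base_url=".../v1/openai/v1".
for model in client.models.list():
    print(model.id)
```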
### Responses
#### Responses
:::{note}
The Responses API implementation is still in active development. While it is quite usable, there are still unimplemented parts of the API. We'd love feedback on any use-cases you try that do not work to help prioritize the pieces left to implement. Please open issues in the [meta-llama/llama-stack](https://github.com/meta-llama/llama-stack) GitHub repository with details of anything that does not work.
:::
#### Simple inference
##### Simple inference
Request:
@ -66,7 +66,7 @@ Syntax whispers secrets sweet
Code's gentle silence
```
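A minimal sketch of a request that produces output like the haiku above, assuming the OpenAI Python client; the model ID is illustrative rather than taken from this document:

```python
# Illustrative Responses API call; the model ID is a placeholder.
response = client.responses.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    input="Write a haiku about coding.",
)
print(response.output_text)  # aggregated text output of the response
```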
#### Structured Output
##### Structured Output
Request:
@ -106,9 +106,9 @@ Example output:
{ "participants": ["Alice", "Bob"] }
```
### Chat Completions
#### Chat Completions
#### Simple inference
##### Simple inference
Request:
@ -129,7 +129,7 @@ Logic flows like a river
Code's gentle beauty
```
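Similarly, a sketch of a Chat Completions request that would yield output like the haiku above (OpenAI Python client assumed, illustrative model ID):

```python
# Illustrative Chat Completions call; the model ID is a placeholder.
chat = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about coding."}],
)
print(chat.choices[0].message.content)
```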
#### Structured Output
##### Structured Output
Request:
@ -170,9 +170,9 @@ Example output:
{ "participants": ["Alice", "Bob"] }
```
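A sketch of structured output via Chat Completions, assuming the OpenAI Python client's `response_format` with a JSON schema; the schema, prompt, and model ID are illustrative:

```python
import json

# Illustrative structured-output request; schema and model ID are placeholders.
completion = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "Alice and Bob are joining the meeting. List the participants."}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "participants",
            "schema": {
                "type": "object",
                "properties": {"participants": {"type": "array", "items": {"type": "string"}}},
                "required": ["participants"],
            },
        },
    },
)
print(json.loads(completion.choices[0].message.content))  # e.g. {"participants": ["Alice", "Bob"]}
```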
### Completions
#### Completions
#### Simple inference
##### Simple inference
Request:

View file

@ -1,7 +1,15 @@
# Post_Training Providers
# Post_Training
## Overview
This section contains documentation for all available providers for the **post_training** API.
- [inline::huggingface](inline_huggingface.md)
- [inline::torchtune](inline_torchtune.md)
- [remote::nvidia](remote_nvidia.md)
## Providers
```{toctree}
:maxdepth: 1
inline_huggingface
inline_torchtune
remote_nvidia
```

View file

@ -24,6 +24,10 @@ HuggingFace-based post-training provider for fine-tuning models using the Huggin
| `weight_decay` | `<class 'float'>` | No | 0.01 | |
| `dataloader_num_workers` | `<class 'int'>` | No | 4 | |
| `dataloader_pin_memory` | `<class 'bool'>` | No | True | |
| `dpo_beta` | `<class 'float'>` | No | 0.1 | |
| `use_reference_model` | `<class 'bool'>` | No | True | |
| `dpo_loss_type` | `Literal['sigmoid', 'hinge', 'ipo', 'kto_pair']` | No | sigmoid | |
| `dpo_output_dir` | `<class 'str'>` | No | | |
## Sample Configuration
@ -31,6 +35,7 @@ HuggingFace-based post-training provider for fine-tuning models using the Huggin
checkpoint_format: huggingface
distributed_backend: null
device: cpu
dpo_output_dir: ~/.llama/dummy/dpo_output
```

View file

@ -1,10 +1,18 @@
# Safety Providers
# Safety
## Overview
This section contains documentation for all available providers for the **safety** API.
- [inline::code-scanner](inline_code-scanner.md)
- [inline::llama-guard](inline_llama-guard.md)
- [inline::prompt-guard](inline_prompt-guard.md)
- [remote::bedrock](remote_bedrock.md)
- [remote::nvidia](remote_nvidia.md)
- [remote::sambanova](remote_sambanova.md)
## Providers
```{toctree}
:maxdepth: 1
inline_code-scanner
inline_llama-guard
inline_prompt-guard
remote_bedrock
remote_nvidia
remote_sambanova
```

View file

@ -15,7 +15,7 @@ SambaNova's safety provider for content moderation and safety filtering.
```yaml
url: https://api.sambanova.ai/v1
api_key: ${env.SAMBANOVA_API_KEY}
api_key: ${env.SAMBANOVA_API_KEY:=}
```

View file

@ -1,7 +1,15 @@
# Scoring Providers
# Scoring
## Overview
This section contains documentation for all available providers for the **scoring** API.
- [inline::basic](inline_basic.md)
- [inline::braintrust](inline_braintrust.md)
- [inline::llm-as-judge](inline_llm-as-judge.md)
## Providers
```{toctree}
:maxdepth: 1
inline_basic
inline_braintrust
inline_llm-as-judge
```

View file

@ -1,5 +1,13 @@
# Telemetry Providers
# Telemetry
## Overview
This section contains documentation for all available providers for the **telemetry** API.
- [inline::meta-reference](inline_meta-reference.md)
## Providers
```{toctree}
:maxdepth: 1
inline_meta-reference
```

View file

@ -1,10 +1,18 @@
# Tool_Runtime Providers
# Tool_Runtime
## Overview
This section contains documentation for all available providers for the **tool_runtime** API.
- [inline::rag-runtime](inline_rag-runtime.md)
- [remote::bing-search](remote_bing-search.md)
- [remote::brave-search](remote_brave-search.md)
- [remote::model-context-protocol](remote_model-context-protocol.md)
- [remote::tavily-search](remote_tavily-search.md)
- [remote::wolfram-alpha](remote_wolfram-alpha.md)
## Providers
```{toctree}
:maxdepth: 1
inline_rag-runtime
remote_bing-search
remote_brave-search
remote_model-context-protocol
remote_tavily-search
remote_wolfram-alpha
```

View file

@ -1,17 +1,17 @@
# Vector_Io Providers
## Providers
This section contains documentation for all available providers for the **vector_io** API.
```{toctree}
:maxdepth: 1
- [inline::chromadb](inline_chromadb.md)
- [inline::faiss](inline_faiss.md)
- [inline::meta-reference](inline_meta-reference.md)
- [inline::milvus](inline_milvus.md)
- [inline::qdrant](inline_qdrant.md)
- [inline::sqlite-vec](inline_sqlite-vec.md)
- [inline::sqlite_vec](inline_sqlite_vec.md)
- [remote::chromadb](remote_chromadb.md)
- [remote::milvus](remote_milvus.md)
- [remote::opengauss](remote_opengauss.md)
- [remote::pgvector](remote_pgvector.md)
- [remote::qdrant](remote_qdrant.md)
- [remote::weaviate](remote_weaviate.md)
inline_chromadb
inline_faiss
inline_meta-reference
inline_milvus
inline_qdrant
inline_sqlite-vec
remote_chromadb
remote_milvus
remote_opengauss
remote_pgvector
remote_qdrant
remote_weaviate

View file

@ -41,12 +41,16 @@ See [Chroma's documentation](https://docs.trychroma.com/docs/overview/introducti
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `db_path` | `<class 'str'>` | No | PydanticUndefined | |
| `db_path` | `<class 'str'>` | No | | |
| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend |
## Sample Configuration
```yaml
db_path: ${env.CHROMADB_PATH}
kvstore:
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/chroma_inline_registry.db
```

View file

@ -10,7 +10,7 @@ Please refer to the remote provider documentation.
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `db_path` | `<class 'str'>` | No | PydanticUndefined | |
| `db_path` | `<class 'str'>` | No | | |
| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend (SQLite only for now) |
| `consistency_level` | `<class 'str'>` | No | Strong | The consistency level of the Milvus server |

View file

@ -50,12 +50,16 @@ See the [Qdrant documentation](https://qdrant.tech/documentation/) for more deta
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `path` | `<class 'str'>` | No | PydanticUndefined | |
| `path` | `<class 'str'>` | No | | |
| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | |
## Sample Configuration
```yaml
path: ${env.QDRANT_PATH:=~/.llama/~/.llama/dummy}/qdrant.db
kvstore:
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/qdrant_registry.db
```

View file

@ -205,7 +205,7 @@ See [sqlite-vec's GitHub repo](https://github.com/asg017/sqlite-vec/tree/main) f
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `db_path` | `<class 'str'>` | No | PydanticUndefined | Path to the SQLite database file |
| `db_path` | `<class 'str'>` | No | | Path to the SQLite database file |
| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend (SQLite only for now) |
## Sample Configuration

View file

@ -10,7 +10,7 @@ Please refer to the sqlite-vec provider documentation.
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `db_path` | `<class 'str'>` | No | PydanticUndefined | Path to the SQLite database file |
| `db_path` | `<class 'str'>` | No | | Path to the SQLite database file |
| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend (SQLite only for now) |
## Sample Configuration

View file

@ -40,12 +40,16 @@ See [Chroma's documentation](https://docs.trychroma.com/docs/overview/introducti
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `str \| None` | No | PydanticUndefined | |
| `url` | `str \| None` | No | | |
| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend |
## Sample Configuration
```yaml
url: ${env.CHROMADB_URL}
kvstore:
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/chroma_remote_registry.db
```

View file

@ -111,10 +111,10 @@ For more details on TLS configuration, refer to the [TLS setup guide](https://mi
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `uri` | `<class 'str'>` | No | PydanticUndefined | The URI of the Milvus server |
| `token` | `str \| None` | No | PydanticUndefined | The token of the Milvus server |
| `uri` | `<class 'str'>` | No | | The URI of the Milvus server |
| `token` | `str \| None` | No | | The token of the Milvus server |
| `consistency_level` | `<class 'str'>` | No | Strong | The consistency level of the Milvus server |
| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig, annotation=NoneType, required=False, default='sqlite', discriminator='type'` | No | | Config for KV store backend (SQLite only for now) |
| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend |
| `config` | `dict` | No | {} | This configuration allows additional fields to be passed through to the underlying Milvus client. See the [Milvus](https://milvus.io/docs/install-overview.md) documentation for more details about Milvus in general. |
> **Note**: This configuration class accepts additional fields beyond those listed above. You can pass any additional configuration options that will be forwarded to the underlying provider.
@ -124,6 +124,9 @@ For more details on TLS configuration, refer to the [TLS setup guide](https://mi
```yaml
uri: ${env.MILVUS_ENDPOINT}
token: ${env.MILVUS_TOKEN}
kvstore:
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/milvus_remote_registry.db
```

View file

@ -17,7 +17,7 @@ That means you'll get fast and efficient vector retrieval.
To use PGVector in your Llama Stack project, follow these steps:
1. Install the necessary dependencies.
2. Configure your Llama Stack project to use Faiss.
2. Configure your Llama Stack project to use pgvector (e.g. `remote::pgvector`).
3. Start storing and querying vectors.
## Installation
@ -40,6 +40,7 @@ See [PGVector's documentation](https://github.com/pgvector/pgvector) for more de
| `db` | `str \| None` | No | postgres | |
| `user` | `str \| None` | No | postgres | |
| `password` | `str \| None` | No | mysecretpassword | |
| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig, annotation=NoneType, required=False, default='sqlite', discriminator='type'` | No | | Config for KV store backend (SQLite only for now) |
## Sample Configuration
@ -49,6 +50,9 @@ port: ${env.PGVECTOR_PORT:=5432}
db: ${env.PGVECTOR_DB}
user: ${env.PGVECTOR_USER}
password: ${env.PGVECTOR_PASSWORD}
kvstore:
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/pgvector_registry.db
```
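Once the provider is configured, a vector database can be registered against it from the client. A minimal sketch, assuming the `llama_stack_client` Python API; the IDs and embedding model below are illustrative placeholders:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Register a vector DB backed by the pgvector provider configured above.
# All identifiers here are placeholders.
client.vector_dbs.register(
    vector_db_id="my-documents",
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
    provider_id="pgvector",
)
```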

View file

@ -20,11 +20,15 @@ Please refer to the inline provider documentation.
| `prefix` | `str \| None` | No | | |
| `timeout` | `int \| None` | No | | |
| `host` | `str \| None` | No | | |
| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | |
## Sample Configuration
```yaml
api_key: ${env.QDRANT_API_KEY}
api_key: ${env.QDRANT_API_KEY:=}
kvstore:
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/qdrant_registry.db
```

View file

@ -33,10 +33,22 @@ To install Weaviate see the [Weaviate quickstart documentation](https://weaviate
See [Weaviate's documentation](https://weaviate.io/developers/weaviate) for more details about Weaviate in general.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `weaviate_api_key` | `str \| None` | No | | The API key for the Weaviate instance |
| `weaviate_cluster_url` | `str \| None` | No | localhost:8080 | The URL of the Weaviate cluster |
| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig, annotation=NoneType, required=False, default='sqlite', discriminator='type'` | No | | Config for KV store backend (SQLite only for now) |
## Sample Configuration
```yaml
{}
weaviate_api_key: null
weaviate_cluster_url: ${env.WEAVIATE_CLUSTER_URL:=localhost:8080}
kvstore:
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/weaviate_registry.db
```

Some files were not shown because too many files have changed in this diff.