mirror of https://github.com/meta-llama/llama-stack.git (synced 2025-10-04 04:04:14 +00:00)

docs: concepts and building_applications migration

This commit is contained in:
parent 05ff4c4420
commit f2f0a03e90

82 changed files with 2535 additions and 1237 deletions
@@ -1,9 +1,18 @@
---
title: Agents
description: Build powerful AI applications with the Llama Stack agent framework
sidebar_label: Agents
sidebar_position: 3
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Agents

An Agent in Llama Stack is a powerful abstraction that allows you to build complex AI applications.

The Llama Stack agent framework is built on a modular architecture that allows for flexible and powerful AI applications. This document explains the key components and how they work together.

## Core Concepts
@@ -19,7 +28,6 @@ Agents are configured using the `AgentConfig` class, which includes:
```python
from llama_stack_client import Agent


# Create the agent
agent = Agent(
    llama_stack_client,
@@ -46,6 +54,9 @@ Each interaction with an agent is called a "turn" and consists of:
- **Steps**: The agent's internal processing (inference, tool execution, etc.)
- **Output Message**: The agent's response

<Tabs>
<TabItem value="streaming" label="Streaming Response">

```python
from llama_stack_client import AgentEventLogger
@@ -57,9 +68,9 @@ turn_response = agent.create_turn(
for log in AgentEventLogger().log(turn_response):
    log.print()
```

</TabItem>
<TabItem value="non-streaming" label="Non-Streaming Response">

```python
from rich.pretty import pprint

@@ -78,6 +89,9 @@ print("Steps:")
pprint(response.steps)
```

</TabItem>
</Tabs>

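The non-streaming response is the easier of the two to inspect programmatically. A minimal sketch, assuming the `response` object from the non-streaming tab above (the `step_type` and `tool_calls` attributes follow the session-inspection example in the Evaluations guide):

```python
from rich.pretty import pprint

# Walk the steps of a completed (non-streaming) turn and summarize what happened.
for step in response.steps:
    print(f"Step type: {step.step_type}")
    if step.step_type == "tool_execution":
        # Tool execution steps carry the tool calls that were made.
        pprint(step.tool_calls)

print(f"Final answer: {response.output_message.content}")
```
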
### 4. Steps

Each turn consists of multiple steps that represent the agent's thought process:
@@ -88,5 +102,11 @@ Each turn consists of multiple steps that represent the agent's thought process:

## Agent Execution Loop

Refer to the [Agent Execution Loop](./agent_execution_loop) for more details on what happens within an agent turn.

## Related Resources

- **[Agent Execution Loop](./agent_execution_loop)** - Understanding the internal processing flow
- **[RAG (Retrieval Augmented Generation)](./rag)** - Building knowledge-enhanced agents
- **[Tools Integration](./tools)** - Extending agent capabilities with external tools
- **[Safety Guardrails](./safety)** - Implementing responsible AI practices
@@ -1,10 +1,18 @@
---
title: Agent Execution Loop
description: Understanding the internal processing flow of Llama Stack agents
sidebar_label: Agent Execution Loop
sidebar_position: 4
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Agent Execution Loop

Agents are the heart of Llama Stack applications. They combine inference, memory, safety, and tool usage into coherent workflows. At its core, an agent follows a sophisticated execution loop that enables multi-step reasoning, tool usage, and safety checks.

## Steps in the Agent Workflow

Each agent turn follows these key steps:
@@ -17,7 +25,7 @@ Each agent turn follows these key steps:

3. **Inference Loop**: The agent enters its main execution loop:
   - The LLM receives a user prompt (with previous tool outputs)
   - The LLM generates a response, potentially with [tool calls](./tools)
   - If tool calls are present:
     - Tool inputs are safety-checked
     - Tools are executed (e.g., web search, code execution)
@@ -29,7 +37,9 @@ Each agent turn follows these key steps:

4. **Final Safety Check**: The agent's final response is screened through safety shields

## Execution Flow Diagram

```mermaid
sequenceDiagram
    participant U as User
    participant E as Executor
@@ -70,12 +80,15 @@ sequenceDiagram

Each step in this process can be monitored and controlled through configurations.

## Agent Execution Example

Here's an example that demonstrates monitoring the agent's execution:

<Tabs>
<TabItem value="streaming" label="Streaming Execution">

```python
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger

# Replace host and port
client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")
@@ -120,6 +133,13 @@ response = agent.create_turn(
# Monitor each step of execution
for log in AgentEventLogger().log(response):
    log.print()
```

</TabItem>
<TabItem value="non-streaming" label="Non-Streaming Execution">

```python
from rich.pretty import pprint

# Using non-streaming API, the response contains input, steps, and output.
response = agent.create_turn(
@@ -131,9 +151,35 @@ response = agent.create_turn(
        }
    ],
    session_id=session_id,
    stream=False,
)

pprint(f"Input: {response.input_messages}")
pprint(f"Output: {response.output_message.content}")
pprint(f"Steps: {response.steps}")
```

</TabItem>
</Tabs>

## Key Configuration Options

### Loop Control
- **max_infer_iters**: Maximum number of inference iterations (default: 5)
- **max_tokens**: Token limit for responses
- **temperature**: Controls response randomness

### Safety Configuration
- **input_shields**: Safety checks for user input
- **output_shields**: Safety checks for agent responses

### Tool Integration
- **tools**: List of available tools for the agent
- **tool_choice**: Control over when tools are used (the sketch below shows how these options fit together)
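As a rough sketch, these options map onto the agent constructor keyword arguments that mirror the `AgentConfig` fields above. The shield IDs, model, and sampling values below are placeholders, so check what is actually registered in your own distribution:

```python
from llama_stack_client import Agent

# Illustrative only: shield IDs, model, and sampling values are placeholders.
agent = Agent(
    client,
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a helpful assistant.",
    tools=["builtin::websearch"],
    input_shields=["llama_guard"],   # safety checks on user input
    output_shields=["llama_guard"],  # safety checks on agent responses
    max_infer_iters=5,               # loop control: cap inference iterations
    sampling_params={
        "strategy": {"type": "top_p", "temperature": 0.7, "top_p": 0.9},
        "max_tokens": 512,           # token limit for responses
    },
)
```
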

## Related Resources

- **[Agents](./agent)** - Understanding agent fundamentals
- **[Tools Integration](./tools)** - Adding capabilities to agents
- **[Safety Guardrails](./safety)** - Implementing safety measures
- **[RAG (Retrieval Augmented Generation)](./rag)** - Building knowledge-enhanced workflows
docs/docs/building_applications/evals.mdx (new file, 256 lines)
@@ -0,0 +1,256 @@
---
title: Evaluations
description: Evaluate LLM applications with Llama Stack's comprehensive evaluation framework
sidebar_label: Evaluations
sidebar_position: 7
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

This guide walks you through the process of evaluating an LLM application built using Llama Stack. For detailed API reference, check out the [Evaluation Reference](/docs/references/evals-reference) guide that covers the complete set of APIs and developer experience flow.

:::tip[Interactive Examples]
Check out our [Colab notebook](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing) for working examples with evaluations, or try the [Getting Started notebook](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb).
:::

## Application Evaluation Example

[Open in Colab](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)

Llama Stack offers a library of scoring functions and the `/scoring` API, allowing you to run evaluations on your pre-annotated AI application datasets.

In this example, we will show you how to:
1. **Build an Agent** with Llama Stack
2. **Query the agent's sessions, turns, and steps** to analyze execution
3. **Evaluate the results** using scoring functions

## Step-by-Step Evaluation Process

### 1. Building a Search Agent

First, let's create an agent that can search the web to answer questions:

```python
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger

client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")

agent = Agent(
    client,
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a helpful assistant. Use search tool to answer the questions.",
    tools=["builtin::websearch"],
)

# Test prompts for evaluation
user_prompts = [
    "Which teams played in the NBA Western Conference Finals of 2024. Search the web for the answer.",
    "In which episode and season of South Park does Bill Cosby (BSM-471) first appear? Give me the number and title. Search the web for the answer.",
    "What is the British-American kickboxer Andrew Tate's kickboxing name? Search the web for the answer.",
]

session_id = agent.create_session("test-session")

# Execute all prompts in the session
for prompt in user_prompts:
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )

    for log in AgentEventLogger().log(response):
        log.print()
```

### 2. Query Agent Execution Steps

Now, let's analyze the agent's execution steps to understand its performance:

<Tabs>
<TabItem value="session-analysis" label="Session Analysis">

```python
from rich.pretty import pprint

# Query the agent's session to get detailed execution data
session_response = client.agents.session.retrieve(
    session_id=session_id,
    agent_id=agent.agent_id,
)

pprint(session_response)
```

</TabItem>
<TabItem value="tool-validation" label="Tool Usage Validation">

```python
# Sanity check: Verify that all user prompts are followed by tool calls
num_tool_call = 0
for turn in session_response.turns:
    for step in turn.steps:
        if (
            step.step_type == "tool_execution"
            and step.tool_calls[0].tool_name == "brave_search"
        ):
            num_tool_call += 1

print(
    f"{num_tool_call}/{len(session_response.turns)} user prompts are followed by a tool call to `brave_search`"
)
```

</TabItem>
</Tabs>

### 3. Evaluate Agent Responses

Now we'll evaluate the agent's responses using Llama Stack's scoring API:

<Tabs>
<TabItem value="data-preparation" label="Data Preparation">

```python
# Process agent execution history into evaluation rows
eval_rows = []

# Define expected answers for our test prompts
expected_answers = [
    "Dallas Mavericks and the Minnesota Timberwolves",
    "Season 4, Episode 12",
    "King Cobra",
]

# Create evaluation dataset from agent responses
for i, turn in enumerate(session_response.turns):
    eval_rows.append(
        {
            "input_query": turn.input_messages[0].content,
            "generated_answer": turn.output_message.content,
            "expected_answer": expected_answers[i],
        }
    )

pprint(eval_rows)
```

</TabItem>
<TabItem value="scoring" label="Scoring & Evaluation">

```python
# Configure scoring parameters
scoring_params = {
    "basic::subset_of": None,  # Check if generated answer contains expected answer
}

# Run evaluation using Llama Stack's scoring API
scoring_response = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions=scoring_params
)

pprint(scoring_response)

# Analyze results (each scoring function returns per-row results)
subset_results = scoring_response.results["basic::subset_of"].score_rows
for i, row in enumerate(subset_results):
    print(f"Query {i+1}:")
    print(f"  Generated: {eval_rows[i]['generated_answer'][:100]}...")
    print(f"  Expected: {expected_answers[i]}")
    print(f"  Score: {row['score']}")
    print()
```

</TabItem>
</Tabs>

## Available Scoring Functions

Llama Stack provides several built-in scoring functions:

### Basic Scoring Functions
- **`basic::subset_of`**: Checks if the expected answer is contained in the generated response (illustrated below)
- **`basic::exact_match`**: Performs exact string matching between expected and generated answers
- **`basic::regex_match`**: Uses regular expressions to match patterns in responses
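To make the intent of these checks concrete, here is a purely local illustration of the kind of comparison they perform. This is not the Llama Stack implementation, just the idea behind it:

```python
# Local illustration only; the real scoring functions run server-side in Llama Stack.
def subset_of(expected: str, generated: str) -> bool:
    return expected.lower() in generated.lower()


def exact_match(expected: str, generated: str) -> bool:
    return expected.strip().lower() == generated.strip().lower()


generated = "Andrew Tate's kickboxing name was King Cobra."
print(subset_of("King Cobra", generated))    # True: expected answer is contained
print(exact_match("King Cobra", generated))  # False: strings are not identical
```
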

### Advanced Scoring Functions
- **`llm_as_judge::accuracy`**: Uses an LLM to judge response accuracy
- **`llm_as_judge::helpfulness`**: Evaluates how helpful the response is
- **`llm_as_judge::safety`**: Assesses response safety and appropriateness

### Custom Scoring Functions
You can also create custom scoring functions for domain-specific evaluation needs.

## Evaluation Workflow Best Practices

### 🎯 **Dataset Preparation**
- Use diverse test cases that cover edge cases and common scenarios
- Include clear expected answers or success criteria
- Balance your dataset across different difficulty levels

### 📊 **Metrics Selection**
- Choose appropriate scoring functions for your use case
- Combine multiple metrics for comprehensive evaluation
- Consider both automated and human evaluation metrics

### 🔄 **Iterative Improvement**
- Run evaluations regularly during development
- Use evaluation results to identify areas for improvement
- Track performance changes over time

### 📈 **Analysis & Reporting**
- Analyze failures to understand model limitations
- Generate comprehensive evaluation reports
- Share results with stakeholders for informed decision-making

## Advanced Evaluation Scenarios

### Batch Evaluation
For evaluating large datasets efficiently:

```python
# Prepare large evaluation dataset
large_eval_dataset = [
    {"input_query": query, "expected_answer": answer}
    for query, answer in zip(queries, expected_answers)
]

# Run batch evaluation
batch_results = client.scoring.score(
    input_rows=large_eval_dataset,
    scoring_functions={
        "basic::subset_of": None,
        "llm_as_judge::accuracy": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
    },
)
```

### Multi-Metric Evaluation
Combining different scoring approaches:

```python
comprehensive_scoring = {
    "exact_match": "basic::exact_match",
    "subset_match": "basic::subset_of",
    "llm_judge": "llm_as_judge::accuracy",
    "safety_check": "llm_as_judge::safety",
}

results = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions=comprehensive_scoring
)
```

## Related Resources

- **[Agents](./agent)** - Building agents for evaluation
- **[Tools Integration](./tools)** - Using tools in evaluated agents
- **[Evaluation Reference](/docs/references/evals-reference)** - Complete API reference for evaluations
- **[Getting Started Notebook](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)** - Interactive examples
- **[Evaluation Examples](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing)** - Additional evaluation scenarios
docs/docs/building_applications/index.mdx (new file, 83 lines)
@@ -0,0 +1,83 @@
---
title: Building Applications
description: Comprehensive guides for building AI applications with Llama Stack
sidebar_label: Overview
sidebar_position: 5
---

# AI Application Examples

Llama Stack provides all the building blocks needed to create sophisticated AI applications.

## Getting Started

The best way to get started is to look at this comprehensive notebook which walks through the various APIs (from basic inference, to RAG agents) and how to use them.

**📓 [Building AI Applications Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)**

## Core Topics

Here are the key topics that will help you build effective AI applications:

### 🤖 **Agent Development**
- **[Agent Framework](./agent)** - Understand the components and design patterns of the Llama Stack agent framework
- **[Agent Execution Loop](./agent_execution_loop)** - How agents process information, make decisions, and execute actions
- **[Agents vs Responses API](./responses_vs_agents)** - Learn when to use each API for different use cases

### 📚 **Knowledge Integration**
- **[RAG (Retrieval-Augmented Generation)](./rag)** - Enhance your agents with external knowledge through retrieval mechanisms

### 🛠️ **Capabilities & Extensions**
- **[Tools](./tools)** - Extend your agents' capabilities by integrating with external tools and APIs

### 📊 **Quality & Monitoring**
- **[Evaluations](./evals)** - Evaluate your agents' effectiveness and identify areas for improvement
- **[Telemetry](./telemetry)** - Monitor and analyze your agents' performance and behavior
- **[Safety](./safety)** - Implement guardrails and safety measures to ensure responsible AI behavior

### 🎮 **Interactive Development**
- **[Playground](./playground)** - Interactive environment for testing and developing applications

## Application Patterns

### 🤖 **Conversational Agents**
Build intelligent chatbots and assistants that can:
- Maintain context across conversations
- Access external knowledge bases
- Execute actions through tool integrations
- Apply safety filters and guardrails

### 📖 **RAG Applications**
Create knowledge-augmented applications that:
- Retrieve relevant information from documents
- Generate contextually accurate responses
- Handle large knowledge bases efficiently
- Provide source attribution

### 🔧 **Tool-Enhanced Systems**
Develop applications that can:
- Search the web for real-time information
- Interact with databases and APIs
- Perform calculations and analysis
- Execute complex multi-step workflows

### 🛡️ **Enterprise Applications**
Build production-ready systems with:
- Comprehensive safety measures
- Performance monitoring and analytics
- Scalable deployment configurations
- Evaluation and quality assurance

## Next Steps

1. **📖 Start with the Notebook** - Work through the complete tutorial
2. **🎯 Choose Your Pattern** - Pick the application type that matches your needs
3. **🏗️ Build Your Foundation** - Set up your [providers](/docs/providers/) and [distributions](/docs/distributions/)
4. **🚀 Deploy & Monitor** - Use our [deployment guides](/docs/deploying/) for production

## Related Resources

- **[Getting Started](/docs/getting-started/)** - Basic setup and concepts
- **[Providers](/docs/providers/)** - Available AI service providers
- **[Distributions](/docs/distributions/)** - Pre-configured deployment packages
- **[API Reference](/docs/api/)** - Complete API documentation
docs/docs/building_applications/playground.mdx (new file, 299 lines)
@@ -0,0 +1,299 @@
---
title: Llama Stack Playground
description: Interactive interface to explore and experiment with Llama Stack capabilities
sidebar_label: Playground
sidebar_position: 10
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Llama Stack Playground

:::note[Experimental Feature]
The Llama Stack Playground is currently experimental and subject to change. We welcome feedback and contributions to help improve it.
:::

The Llama Stack Playground is a simple interface that aims to:
- **Showcase capabilities and concepts** of Llama Stack in an interactive environment
- **Demo end-to-end application code** to help users get started building their own applications
- **Provide a UI** to help users inspect and understand Llama Stack API providers and resources

## Key Features

### Interactive Playground Pages

The playground provides interactive pages for users to explore Llama Stack API capabilities:

#### Chatbot Interface

<video
  controls
  autoPlay
  playsInline
  muted
  loop
  style={{width: '100%'}}
>
  <source src="https://github.com/user-attachments/assets/8d2ef802-5812-4a28-96e1-316038c84cbf" type="video/mp4" />
  Your browser does not support the video tag.
</video>

<Tabs>
<TabItem value="chat" label="Chat">

**Simple Chat Interface**
- Chat directly with Llama models through an intuitive interface
- Uses the `/inference/chat-completion` streaming API under the hood
- Real-time message streaming for responsive interactions
- Perfect for testing model capabilities and prompt engineering

</TabItem>
<TabItem value="rag" label="RAG Chat">

**Document-Aware Conversations**
- Upload documents to create memory banks
- Chat with a RAG-enabled agent that can query your documents
- Uses Llama Stack's `/agents` API to create and manage RAG sessions
- Ideal for exploring knowledge-enhanced AI applications

</TabItem>
</Tabs>

#### Evaluation Interface

<video
  controls
  autoPlay
  playsInline
  muted
  loop
  style={{width: '100%'}}
>
  <source src="https://github.com/user-attachments/assets/6cc1659f-eba4-49ca-a0a5-7c243557b4f5" type="video/mp4" />
  Your browser does not support the video tag.
</video>

<Tabs>
<TabItem value="scoring" label="Scoring Evaluations">

**Custom Dataset Evaluation**
- Upload your own evaluation datasets
- Run evaluations using available scoring functions
- Uses Llama Stack's `/scoring` API for flexible evaluation workflows
- Great for testing application performance on custom metrics

</TabItem>
<TabItem value="benchmarks" label="Benchmark Evaluations">

<video
  controls
  autoPlay
  playsInline
  muted
  loop
  style={{width: '100%', marginBottom: '1rem'}}
>
  <source src="https://github.com/user-attachments/assets/345845c7-2a2b-4095-960a-9ae40f6a93cf" type="video/mp4" />
  Your browser does not support the video tag.
</video>

**Pre-registered Evaluation Tasks**
- Evaluate models or agents on pre-defined tasks
- Uses Llama Stack's `/eval` API for comprehensive evaluation
- Combines datasets and scoring functions for standardized testing

**Setup Requirements:**
Register evaluation datasets and benchmarks first:

```bash
# Register evaluation dataset
llama-stack-client datasets register \
  --dataset-id "mmlu" \
  --provider-id "huggingface" \
  --url "https://huggingface.co/datasets/llamastack/evals" \
  --metadata '{"path": "llamastack/evals", "name": "evals__mmlu__details", "split": "train"}' \
  --schema '{"input_query": {"type": "string"}, "expected_answer": {"type": "string"}, "chat_completion_input": {"type": "string"}}'

# Register benchmark task
llama-stack-client benchmarks register \
  --eval-task-id meta-reference-mmlu \
  --provider-id meta-reference \
  --dataset-id mmlu \
  --scoring-functions basic::regex_parser_multiple_choice_answer
```

</TabItem>
</Tabs>

#### Inspection Interface

<video
  controls
  autoPlay
  playsInline
  muted
  loop
  style={{width: '100%'}}
>
  <source src="https://github.com/user-attachments/assets/01d52b2d-92af-4e3a-b623-a9b8ba22ba99" type="video/mp4" />
  Your browser does not support the video tag.
</video>

<Tabs>
<TabItem value="providers" label="API Providers">

**Provider Management**
- Inspect available Llama Stack API providers
- View provider configurations and capabilities
- Uses the `/providers` API for real-time provider information
- Essential for understanding your deployment's capabilities

</TabItem>
<TabItem value="resources" label="API Resources">

**Resource Exploration**
- Inspect Llama Stack API resources including:
  - **Models**: Available language models
  - **Datasets**: Registered evaluation datasets
  - **Memory Banks**: Vector databases and knowledge stores
  - **Benchmarks**: Evaluation tasks and scoring functions
  - **Shields**: Safety and content moderation tools
- Uses `/<resources>/list` APIs for comprehensive resource visibility (a code sketch follows after these tabs)
- For detailed information about resources, see [Core Concepts](/docs/concepts)

</TabItem>
</Tabs>
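
The same information is available programmatically through the client. A small sketch, in which the base URL is a placeholder and the attribute names are assumptions based on the listing examples elsewhere in these docs:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")  # placeholder host/port

# Mirror what the inspection pages show: enumerate a few resource types.
for model in client.models.list():
    print("model:", model.identifier)

for shield in client.shields.list():
    print("shield:", shield.identifier)

for provider in client.providers.list():
    print("provider:", provider.provider_id, "->", provider.api)
```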

## Getting Started

### Quick Start Guide

<Tabs>
<TabItem value="setup" label="Setup">

**1. Start the Llama Stack API Server**

```bash
# Build and run a distribution (example: together)
llama stack build --distro together --image-type venv
llama stack run together
```

**2. Start the Streamlit UI**

```bash
# Launch the playground interface
uv run --with ".[ui]" streamlit run llama_stack.core/ui/app.py
```

</TabItem>
<TabItem value="usage" label="Usage Tips">

**Making the Most of the Playground:**

- **Start with Chat**: Test basic model interactions and prompt engineering
- **Explore RAG**: Upload sample documents to see knowledge-enhanced responses
- **Try Evaluations**: Use the scoring interface to understand evaluation metrics
- **Inspect Resources**: Check what providers and resources are available
- **Experiment with Settings**: Adjust parameters to see how they affect results

</TabItem>
</Tabs>

### Available Distributions

The playground works with any Llama Stack distribution. Popular options include:

<Tabs>
<TabItem value="together" label="Together AI">

```bash
llama stack build --distro together --image-type venv
llama stack run together
```

**Features:**
- Cloud-hosted models
- Fast inference
- Multiple model options

</TabItem>
<TabItem value="ollama" label="Ollama (Local)">

```bash
llama stack build --distro ollama --image-type venv
llama stack run ollama
```

**Features:**
- Local model execution
- Privacy-focused
- No internet required

</TabItem>
<TabItem value="meta-reference" label="Meta Reference">

```bash
llama stack build --distro meta-reference --image-type venv
llama stack run meta-reference
```

**Features:**
- Reference implementation
- All API features available
- Best for development

</TabItem>
</Tabs>

## Use Cases & Examples

### Educational Use Cases
- **Learning Llama Stack**: Hands-on exploration of API capabilities
- **Prompt Engineering**: Interactive testing of different prompting strategies
- **RAG Experimentation**: Understanding how document retrieval affects responses
- **Evaluation Understanding**: See how different metrics evaluate model performance

### Development Use Cases
- **Prototype Testing**: Quick validation of application concepts
- **API Exploration**: Understanding available endpoints and parameters
- **Integration Planning**: Seeing how different components work together
- **Demo Creation**: Showcasing Llama Stack capabilities to stakeholders

### Research Use Cases
- **Model Comparison**: Side-by-side testing of different models
- **Evaluation Design**: Understanding how scoring functions work
- **Safety Testing**: Exploring shield effectiveness with different inputs
- **Performance Analysis**: Measuring model behavior across different scenarios

## Best Practices

### 🚀 **Getting Started**
- Begin with simple chat interactions to understand basic functionality
- Gradually explore more advanced features like RAG and evaluations
- Use the inspection tools to understand your deployment's capabilities

### 🔧 **Development Workflow**
- Use the playground to prototype before writing application code
- Test different parameter settings interactively
- Validate evaluation approaches before implementing them programmatically

### 📊 **Evaluation & Testing**
- Start with simple scoring functions before trying complex evaluations
- Use the playground to understand evaluation results before automation
- Test safety features with various input types

### 🎯 **Production Preparation**
- Use playground insights to inform your production API usage
- Test edge cases and error conditions interactively
- Validate resource configurations before deployment

## Related Resources

- **[Getting Started Guide](/docs/getting-started)** - Complete setup and introduction
- **[Core Concepts](/docs/concepts)** - Understanding Llama Stack fundamentals
- **[Agents](./agent)** - Building intelligent agents
- **[RAG (Retrieval Augmented Generation)](./rag)** - Knowledge-enhanced applications
- **[Evaluations](./evals)** - Comprehensive evaluation framework
- **[API Reference](/docs/api-reference)** - Complete API documentation
@@ -1,36 +1,49 @@
---
title: Retrieval Augmented Generation (RAG)
description: Build knowledge-enhanced AI applications with external document retrieval
sidebar_label: RAG (Retrieval Augmented Generation)
sidebar_position: 2
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Retrieval Augmented Generation (RAG)

RAG enables your applications to reference and recall information from previous interactions or external documents.

## Architecture Overview

Llama Stack organizes the APIs that enable RAG into three layers:

1. **Lower-Level APIs**: Deal with raw storage and retrieval. These include Vector IO, KeyValue IO (coming soon) and Relational IO (also coming soon)
2. **RAG Tool**: A first-class tool as part of the [Tools API](./tools) that allows you to ingest documents (from URLs, files, etc) with various chunking strategies and query them smartly
3. **Agents API**: The top-level [Agents API](./agent) that allows you to create agents that can use the tools to answer questions, perform tasks, and more

<img src="rag.png" alt="RAG System" width="50%" />

The RAG system uses lower-level storage for different types of data:
- **Vector IO**: For semantic search and retrieval
- **Key-Value and Relational IO**: For structured data storage

:::info[Future Storage Types]
We may add more storage types like Graph IO in the future.
:::

## Setting up Vector Databases

For this guide, we will use [Ollama](https://ollama.com/) as the inference provider. Ollama is an LLM runtime that allows you to run Llama models locally.

Here's how to set up a vector database for RAG:

```python
# Create HTTP client
import os
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")

# Register a vector database
vector_db_id = "my_documents"
response = client.vector_dbs.register(
    vector_db_id=vector_db_id,
@@ -40,9 +53,15 @@ response = client.vector_dbs.register(
)
```

## Document Ingestion

You can ingest documents into the vector database using two methods: directly inserting pre-chunked documents or using the RAG Tool.

### Direct Document Insertion

<Tabs>
<TabItem value="basic" label="Basic Insertion">

```python
# You can insert a pre-chunked document directly into the vector db
chunks = [
@@ -58,10 +77,11 @@ chunks = [
client.vector_io.insert(vector_db_id=vector_db_id, chunks=chunks)
```

</TabItem>
<TabItem value="embeddings" label="With Precomputed Embeddings">

If you decide to precompute embeddings for your documents, you can insert them directly into the vector database by including the embedding vectors in the chunk data. This is useful if you have a separate embedding service or if you want to customize the ingestion process.

```python
chunks_with_embeddings = [
    {
@@ -79,44 +99,53 @@ chunks_with_embeddings = [
]
client.vector_io.insert(vector_db_id=vector_db_id, chunks=chunks_with_embeddings)
```

:::warning[Embedding Dimensions]
When providing precomputed embeddings, ensure the embedding dimension matches the `embedding_dimension` specified when registering the vector database.
:::
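
A quick pre-insert sanity check can catch dimension mismatches early. This is a hypothetical snippet: it assumes each chunk stores its vector under an `embedding` key and that the vector DB was registered with `embedding_dimension=384`, so adjust both to match your setup:

```python
# Hypothetical check; 384 stands in for whatever embedding_dimension you registered.
expected_dim = 384
for i, chunk in enumerate(chunks_with_embeddings):
    actual_dim = len(chunk["embedding"])
    if actual_dim != expected_dim:
        raise ValueError(
            f"Chunk {i} has embedding dimension {actual_dim}, expected {expected_dim}"
        )
```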

</TabItem>
</Tabs>

### Document Retrieval

You can query the vector database to retrieve documents based on their embeddings.

```python
# You can then query for these chunks
chunks_response = client.vector_io.query(
    vector_db_id=vector_db_id,
    query="What do you know about..."
)
```

## Using the RAG Tool

:::danger[Deprecation Notice]
The RAG Tool is being deprecated in favor of directly using the OpenAI-compatible Search API. We recommend migrating to the OpenAI APIs for better compatibility and future support.
:::

A better way to ingest documents is to use the RAG Tool. This tool allows you to ingest documents from URLs, files, etc. and automatically chunks them into smaller pieces. More examples for how to format a RAGDocument can be found in the [appendix](#more-ragdocument-examples).

### OpenAI API Integration & Migration

The RAG tool has been updated to use OpenAI-compatible APIs. This provides several benefits:

- **Files API Integration**: Documents are now uploaded using OpenAI's file upload endpoints
- **Vector Stores API**: Vector storage operations use OpenAI's vector store format with configurable chunking strategies
- **Error Resilience**: When processing multiple documents, individual failures are logged but don't crash the operation. Failed documents are skipped while successful ones continue processing.

### Migration Path

We recommend migrating to the OpenAI-compatible Search API for:

1. **Better OpenAI Ecosystem Integration**: Direct compatibility with OpenAI tools and workflows including the Responses API
2. **Future-Proof**: Continued support and feature development
3. **Full OpenAI Compatibility**: Vector Stores, Files, and Search APIs are fully compatible with OpenAI's Responses API

The OpenAI APIs are used under the hood, so you can continue to use your existing RAG Tool code with minimal changes. However, we recommend updating your code to use the new OpenAI-compatible APIs for better long-term support. If any documents fail to process, they will be logged in the response but will not cause the entire operation to fail.
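
As a rough sketch of what the post-migration path can look like, the standard `openai` client can be pointed at a Llama Stack server and used to search a vector store directly. The base URL, port, and vector store ID below are placeholders, and the exact OpenAI-compatible path may differ across distributions, so treat this as an outline rather than a drop-in snippet:

```python
from openai import OpenAI
from rich.pretty import pprint

# Placeholders: adjust the base URL/port to your distribution's OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

# Search an existing vector store ("vs_123" is a placeholder ID).
results = client.vector_stores.search(
    vector_store_id="vs_123",
    query="How do I optimize memory in PyTorch?",
)
pprint(results)
```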

### RAG Tool Example

```python
from llama_stack_client import RAGDocument

@@ -145,9 +174,12 @@ results = client.tool_runtime.rag_tool.query(
)
```

### Custom Context Configuration

You can configure how the RAG tool adds metadata to the context if you find it useful for your application:

```python
# Query documents with custom template
results = client.tool_runtime.rag_tool.query(
    vector_db_ids=[vector_db_id],
    content="What do you know about...",
@@ -156,10 +188,13 @@ results = client.tool_runtime.rag_tool.query(
    },
)
```

## Building RAG-Enhanced Agents

One of the most powerful patterns is combining agents with RAG capabilities. Here's a complete example:

### Agent with Knowledge Search

```python
from llama_stack_client import Agent

@@ -185,7 +220,6 @@ agent = Agent(
)
session_id = agent.create_session("rag_session")

# Ask questions about documents in the vector db, and the agent will query the db to answer the question.
response = agent.create_turn(
    messages=[{"role": "user", "content": "How to optimize memory in PyTorch?"}],
@@ -193,10 +227,14 @@ response = agent.create_turn(
)
```

:::tip[Agent Instructions]
The `instructions` field in the `AgentConfig` can be used to guide the agent's behavior. It is important to experiment with different instructions to see what works best for your use case.
:::

### Document-Aware Conversations

You can also pass documents along with the user's message and ask questions about them:

```python
# Initial document ingestion
response = agent.create_turn(
@@ -219,7 +257,10 @@ response = agent.create_turn(
)
```

### Viewing Agent Responses

You can print the response with the following:

```python
from llama_stack_client import AgentEventLogger

@@ -227,32 +268,74 @@ for log in AgentEventLogger().log(response):
    log.print()
```

## Vector Database Management

### Unregistering Vector DBs

If you need to clean up and unregister vector databases, you can do so as follows:

<Tabs>
<TabItem value="single" label="Single Database">

```python
# Unregister a specified vector database
vector_db_id = "my_vector_db_id"
print(f"Unregistering vector database: {vector_db_id}")
client.vector_dbs.unregister(vector_db_id=vector_db_id)
```

</TabItem>
<TabItem value="all" label="All Databases">

```python
# Unregister all vector databases
for vector_db_id in client.vector_dbs.list():
    print(f"Unregistering vector database: {vector_db_id.identifier}")
    client.vector_dbs.unregister(vector_db_id=vector_db_id.identifier)
```

</TabItem>
</Tabs>

## Best Practices

### 🎯 **Document Chunking**
- Use appropriate chunk sizes (512 tokens is often a good starting point)
- Consider overlap between chunks for better context preservation
- Experiment with different chunking strategies for your content type (a simple sketch follows below)
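As a starting point, a chunker with overlap can be as simple as the following sketch. It splits on words rather than tokens and uses illustrative sizes, so tune both for your embedding model and content:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Naive word-based chunking with overlap (illustrative; real pipelines count tokens)."""
    words = text.split()
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        window = words[start : start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks


print(len(chunk_text("some long document text " * 500)))  # number of overlapping chunks
```
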

### 🔍 **Embedding Strategy**
- Choose embedding models that match your domain
- Consider the trade-off between embedding dimension and performance
- Test different embedding models for your specific use case

### 📊 **Query Optimization**
- Use specific, well-formed queries for better retrieval
- Experiment with different search strategies
- Consider hybrid approaches (keyword + semantic search)

### 🛡️ **Error Handling**
- Implement proper error handling for failed document processing
- Monitor ingestion success rates
- Have fallback strategies for retrieval failures

## Appendix

### More RAGDocument Examples

Here are various ways to create RAGDocument objects for different content types:

```python
from llama_stack_client import RAGDocument
import base64

# File URI
RAGDocument(document_id="num-0", content={"uri": "file://path/to/file"})

# Plain text
RAGDocument(document_id="num-1", content="plain text")

# Explicit text input
RAGDocument(
    document_id="num-2",
    content={
@@ -260,6 +343,8 @@ RAGDocument(
        "text": "plain text input",
    },  # for inputs that should be treated as text explicitly
)

# Image from URL
RAGDocument(
    document_id="num-3",
    content={
@@ -267,6 +352,8 @@ RAGDocument(
        "image": {"url": {"uri": "https://mywebsite.com/image.jpg"}},
    },
)

# Base64 encoded image
B64_ENCODED_IMAGE = base64.b64encode(
    requests.get(
        "https://raw.githubusercontent.com/meta-llama/llama-stack/refs/heads/main/docs/_static/llama-stack.png"
(Binary image file in this commit: 145 KiB before and after; content not shown.)
|
@ -1,10 +1,20 @@
|
||||||
|
---
|
||||||
|
title: Agents vs OpenAI Responses API
|
||||||
|
description: Compare the Agents API and OpenAI Responses API for building AI applications with tool calling capabilities
|
||||||
|
sidebar_label: Agents vs Responses API
|
||||||
|
sidebar_position: 5
|
||||||
|
---
|
||||||
|
|
||||||
|
import Tabs from '@theme/Tabs';
|
||||||
|
import TabItem from '@theme/TabItem';
|
||||||
|
|
||||||
# Agents vs OpenAI Responses API
|
# Agents vs OpenAI Responses API
|
||||||
|
|
||||||
Llama Stack (LLS) provides two different APIs for building AI applications with tool calling capabilities: the **Agents API** and the **OpenAI Responses API**. While both enable AI systems to use tools and maintain full conversation history, they serve different use cases and have distinct characteristics.
|
Llama Stack (LLS) provides two different APIs for building AI applications with tool calling capabilities: the **Agents API** and the **OpenAI Responses API**. While both enable AI systems to use tools and maintain full conversation history, they serve different use cases and have distinct characteristics.
|
||||||
|
|
||||||
```{note}
|
:::note
|
||||||
**Note:** For simple and basic inferencing, you may want to use the [Chat Completions API](../providers/openai.md#chat-completions) directly, before progressing to Agents or Responses API.
|
**Note:** For simple and basic inferencing, you may want to use the [Chat Completions API](/docs/providers/openai-compatibility#chat-completions) directly, before progressing to Agents or Responses API.
|
||||||
```
|
:::
|
||||||
|
|
||||||
## Overview
|
## Overview
|
||||||
|
|
||||||
|
@ -21,6 +31,8 @@ Additionally, Agents let you specify input/output shields whereas Responses do n
|
||||||
|
|
||||||
Today the Agents and Responses APIs can be used independently depending on the use case. But, it is also productive to treat the APIs as complementary. It is not currently supported, but it is planned for the LLS Agents API to alternatively use the Responses API as its backend instead of the default Chat Completions API, i.e., enabling a combination of the safety features of Agents with the dynamic configuration and branching capabilities of Responses.
|
Today the Agents and Responses APIs can be used independently depending on the use case. But, it is also productive to treat the APIs as complementary. It is not currently supported, but it is planned for the LLS Agents API to alternatively use the Responses API as its backend instead of the default Chat Completions API, i.e., enabling a combination of the safety features of Agents with the dynamic configuration and branching capabilities of Responses.
|
||||||
|
|
||||||
|
## Feature Comparison
|
||||||
|
|
||||||
| Feature | LLS Agents API | OpenAI Responses API |
|
| Feature | LLS Agents API | OpenAI Responses API |
|
||||||
|---------|------------|---------------------|
|
|---------|------------|---------------------|
|
||||||
| **Conversation Management** | Linear persistent sessions | Can branch from any previous response ID |
|
| **Conversation Management** | Linear persistent sessions | Can branch from any previous response ID |
|
||||||
|
@ -34,7 +46,10 @@ Let's compare how both APIs handle a research task where we need to:
|
||||||
2. Access different information sources dynamically
|
2. Access different information sources dynamically
|
||||||
3. Continue the conversation based on search results
|
3. Continue the conversation based on search results
|
||||||
|
|
||||||
### Agents API: Session-based configuration with safety shields
|
<Tabs>
|
||||||
|
<TabItem value="agents" label="Agents API">
|
||||||
|
|
||||||
|
### Session-based Configuration with Safety Shields
|
||||||
|
|
||||||
```python
|
```python
|
||||||
# Create agent with static session configuration
|
# Create agent with static session configuration
|
||||||
|
@ -85,7 +100,10 @@ print(f"First result: {response1.output_message.content}")
|
||||||
print(f"Optimization: {response2.output_message.content}")
|
print(f"Optimization: {response2.output_message.content}")
|
||||||
```
|
```
|
||||||
|
|
||||||
### Responses API: Dynamic per-call configuration with branching
|
</TabItem>
|
||||||
|
<TabItem value="responses" label="Responses API">
|
||||||
|
|
||||||
|
### Dynamic Per-call Configuration with Branching
|
||||||
|
|
||||||
```python
|
```python
|
||||||
# First response: Use web search for latest algorithms
|
# First response: Use web search for latest algorithms
|
||||||
|
@ -130,50 +148,74 @@ print(f"File search results: {response2.output_message.content}")
|
||||||
print(f"Alternative web search: {response3.output_message.content}")
|
print(f"Alternative web search: {response3.output_message.content}")
|
||||||
```
|
```
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
</Tabs>
|
||||||
|
|
||||||
Both APIs demonstrate distinct strengths that make them valuable on their own for different scenarios. The Agents API excels in providing structured, safety-conscious workflows with persistent session management, while the Responses API offers flexibility through dynamic configuration and OpenAI compatible tool patterns.
|
Both APIs demonstrate distinct strengths that make them valuable on their own for different scenarios. The Agents API excels in providing structured, safety-conscious workflows with persistent session management, while the Responses API offers flexibility through dynamic configuration and OpenAI compatible tool patterns.
|
||||||
|
|
||||||
## Use Case Examples
|
## Use Case Examples
|
||||||
|
|
||||||
### 1. **Research and Analysis with Safety Controls**
|
### 1. Research and Analysis with Safety Controls
|
||||||
**Best Choice: Agents API**
|
**Best Choice: Agents API**
|
||||||
|
|
||||||
**Scenario:** You're building a research assistant for a financial institution that needs to analyze market data, execute code to process financial models, and search through internal compliance documents. The system must ensure all interactions are logged for regulatory compliance and protected by safety shields to prevent malicious code execution or data leaks.
|
**Scenario:** You're building a research assistant for a financial institution that needs to analyze market data, execute code to process financial models, and search through internal compliance documents. The system must ensure all interactions are logged for regulatory compliance and protected by safety shields to prevent malicious code execution or data leaks.
|
||||||
|
|
||||||
**Why Agents API?** The Agents API provides persistent session management for iterative research workflows, built-in safety shields to protect against malicious code in financial models, and structured execution logs (session/turn/step) required for regulatory compliance. The static tool configuration ensures consistent access to your knowledge base and code interpreter throughout the entire research session.
|
**Why Agents API?** The Agents API provides persistent session management for iterative research workflows, built-in safety shields to protect against malicious code in financial models, and structured execution logs (session/turn/step) required for regulatory compliance. The static tool configuration ensures consistent access to your knowledge base and code interpreter throughout the entire research session.
|
||||||
|
|
||||||
### 2. **Dynamic Information Gathering with Branching Exploration**
|
### 2. Dynamic Information Gathering with Branching Exploration
|
||||||
**Best Choice: Responses API**
|
**Best Choice: Responses API**
|
||||||
|
|
||||||
**Scenario:** You're building a competitive intelligence tool that helps businesses research market trends. Users need to dynamically switch between web search for current market data and file search through uploaded industry reports. They also want to branch conversations to explore different market segments simultaneously and experiment with different models for various analysis types.
|
**Scenario:** You're building a competitive intelligence tool that helps businesses research market trends. Users need to dynamically switch between web search for current market data and file search through uploaded industry reports. They also want to branch conversations to explore different market segments simultaneously and experiment with different models for various analysis types.
|
||||||
|
|
||||||
**Why Responses API?** The Responses API's branching capability lets users explore multiple market segments from any research point. Dynamic per-call configuration allows switching between web search and file search as needed, while experimenting with different models (faster models for quick searches, more powerful models for deep analysis). The OpenAI-compatible tool patterns make integration straightforward.
|
**Why Responses API?** The Responses API's branching capability lets users explore multiple market segments from any research point. Dynamic per-call configuration allows switching between web search and file search as needed, while experimenting with different models (faster models for quick searches, more powerful models for deep analysis). The OpenAI-compatible tool patterns make integration straightforward.
|
||||||
|
|
||||||
### 3. **OpenAI Migration with Advanced Tool Capabilities**
|
### 3. OpenAI Migration with Advanced Tool Capabilities
|
||||||
**Best Choice: Responses API**
|
**Best Choice: Responses API**
|
||||||
|
|
||||||
**Scenario:** You have an existing application built with OpenAI's Assistants API that uses file search and web search capabilities. You want to migrate to Llama Stack for better performance and cost control while maintaining the same tool calling patterns and adding new capabilities like dynamic vector store selection.
|
**Scenario:** You have an existing application built with OpenAI's Assistants API that uses file search and web search capabilities. You want to migrate to Llama Stack for better performance and cost control while maintaining the same tool calling patterns and adding new capabilities like dynamic vector store selection.
|
||||||
|
|
||||||
**Why Responses API?** The Responses API provides full OpenAI tool compatibility (`web_search`, `file_search`) with identical syntax, making migration seamless. The dynamic per-call configuration enables advanced features like switching vector stores per query or changing models based on query complexity - capabilities that extend beyond basic OpenAI functionality while maintaining compatibility.
|
**Why Responses API?** The Responses API provides full OpenAI tool compatibility (`web_search`, `file_search`) with identical syntax, making migration seamless. The dynamic per-call configuration enables advanced features like switching vector stores per query or changing models based on query complexity - capabilities that extend beyond basic OpenAI functionality while maintaining compatibility.
|
||||||
|
|
||||||
### 4. **Educational Programming Tutor**
|
### 4. Educational Programming Tutor
|
||||||
**Best Choice: Agents API**
|
**Best Choice: Agents API**
|
||||||
|
|
||||||
**Scenario:** You're building a programming tutor that maintains student context across multiple sessions, safely executes code exercises, and tracks learning progress with audit trails for educators.
|
**Scenario:** You're building a programming tutor that maintains student context across multiple sessions, safely executes code exercises, and tracks learning progress with audit trails for educators.
|
||||||
|
|
||||||
**Why Agents API?** Persistent sessions remember student progress across multiple interactions, safety shields prevent malicious code execution while allowing legitimate programming exercises, and structured execution logs help educators track learning patterns.
|
**Why Agents API?** Persistent sessions remember student progress across multiple interactions, safety shields prevent malicious code execution while allowing legitimate programming exercises, and structured execution logs help educators track learning patterns.
|
||||||
|
|
||||||
### 5. **Advanced Software Debugging Assistant**
|
### 5. Advanced Software Debugging Assistant
|
||||||
**Best Choice: Agents API with Responses Backend**
|
**Best Choice: Agents API with Responses Backend**
|
||||||
|
|
||||||
**Scenario:** You're building a debugging assistant that helps developers troubleshoot complex issues. It needs to maintain context throughout a debugging session, safely execute diagnostic code, switch between different analysis tools dynamically, and branch conversations to explore multiple potential causes simultaneously.
|
**Scenario:** You're building a debugging assistant that helps developers troubleshoot complex issues. It needs to maintain context throughout a debugging session, safely execute diagnostic code, switch between different analysis tools dynamically, and branch conversations to explore multiple potential causes simultaneously.
|
||||||
|
|
||||||
**Why Agents + Responses?** The Agent provides safety shields for code execution and session management for the overall debugging workflow. The underlying Responses API enables dynamic model selection and flexible tool configuration per query, while branching lets you explore different theories (memory leak vs. concurrency issue) from the same debugging point and compare results.
|
**Why Agents + Responses?** The Agent provides safety shields for code execution and session management for the overall debugging workflow. The underlying Responses API enables dynamic model selection and flexible tool configuration per query, while branching lets you explore different theories (memory leak vs. concurrency issue) from the same debugging point and compare results.
|
||||||
|
|
||||||
> **Note:** The ability to use Responses API as the backend for Agents is not yet implemented but is planned for a future release. Currently, Agents use Chat Completions API as their backend by default.
|
:::info[Future Enhancement]
|
||||||
|
The ability to use Responses API as the backend for Agents is not yet implemented but is planned for a future release. Currently, Agents use Chat Completions API as their backend by default.
|
||||||
|
:::
|
||||||
|
|
||||||
## For More Information
|
## Decision Framework
|
||||||
|
|
||||||
- **LLS Agents API**: For detailed information on creating and managing agents, see the [Agents documentation](agent.md)
|
Use this framework to choose the right API for your use case:
|
||||||
- **OpenAI Responses API**: For information on using the OpenAI-compatible responses API, see the [OpenAI API documentation](https://platform.openai.com/docs/api-reference/responses)
|
|
||||||
- **Chat Completions API**: For the default backend API used by Agents, see the [Chat Completions providers documentation](../providers/openai.md#chat-completions)
|
### Choose Agents API when:
|
||||||
- **Agent Execution Loop**: For understanding how agents process turns and steps in their execution, see the [Agent Execution Loop documentation](agent_execution_loop.md)
|
- ✅ You need **safety shields** for input/output validation
|
||||||
|
- ✅ Your application requires **linear conversation flow** with persistent context
|
||||||
|
- ✅ You need **audit trails** and structured execution logs
|
||||||
|
- ✅ Your tool configuration is **static** throughout the session
|
||||||
|
- ✅ You're building **educational, financial, or enterprise** applications with compliance requirements
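To ground these criteria, a minimal Agents API setup with shields might look like the sketch below; the model, shield, and tool names mirror examples elsewhere in these docs and are placeholders:

```python
from llama_stack_client import Agent, LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")  # placeholder URL

agent = Agent(
    client,
    model="meta-llama/Llama-3.2-3B-Instruct",
    instructions="You are a careful research assistant",
    input_shields=["content_safety"],   # screen user inputs
    output_shields=["content_safety"],  # screen agent outputs
    tools=["builtin::websearch"],       # static tool configuration for the session
)

session_id = agent.create_session("compliance_session")
response = agent.create_turn(
    messages=[{"role": "user", "content": "Summarize the latest compliance guidance."}],
    session_id=session_id,
)
```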
|
||||||
|
|
||||||
|
### Choose Responses API when:
|
||||||
|
- ✅ You need **conversation branching** to explore multiple paths
|
||||||
|
- ✅ You want **dynamic per-call configuration** (models, tools, vector stores)
|
||||||
|
- ✅ You're **migrating from OpenAI** and want familiar tool patterns
|
||||||
|
- ✅ You need **OpenAI compatibility** for existing workflows
|
||||||
|
- ✅ Your application benefits from **flexible, experimental** interactions
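And a compact sketch of the branching pattern, assuming the OpenAI-compatible Responses endpoint exposed by Llama Stack; the model name, inputs, server URL, and tool type follow the OpenAI-style patterns referenced above and are placeholders:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")  # placeholder URL

# Root turn: use web search for fresh information
root = client.responses.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    input="Summarize current trends in the vector database market.",
    tools=[{"type": "web_search"}],
)

# Branch from the root response with a narrower focus and no tools
branch = client.responses.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    input="Narrow that summary to enterprise, on-prem deployments.",
    previous_response_id=root.id,
)

print(branch)  # inspect the structured output of the branched response
```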
|
||||||
|
|
||||||
|
## Related Resources
|
||||||
|
|
||||||
|
- **[Agents](./agent)** - Understanding the Agents API fundamentals
|
||||||
|
- **[Agent Execution Loop](./agent_execution_loop)** - How agents process turns and steps
|
||||||
|
- **[Tools Integration](./tools)** - Adding capabilities to both APIs
|
||||||
|
- **[OpenAI Compatibility](/docs/providers/openai-compatibility)** - Using OpenAI-compatible endpoints
|
||||||
|
- **[Safety Guardrails](./safety)** - Implementing safety measures in agents
|
docs/docs/building_applications/safety.mdx (new file, 395 lines)
|
@ -0,0 +1,395 @@
|
||||||
|
---
|
||||||
|
title: Safety Guardrails
|
||||||
|
description: Implement safety measures and content moderation in Llama Stack applications
|
||||||
|
sidebar_label: Safety
|
||||||
|
sidebar_position: 9
|
||||||
|
---
|
||||||
|
|
||||||
|
import Tabs from '@theme/Tabs';
|
||||||
|
import TabItem from '@theme/TabItem';
|
||||||
|
|
||||||
|
# Safety Guardrails
|
||||||
|
|
||||||
|
Safety is a critical component of any AI application. Llama Stack provides a comprehensive Shield system that can be applied at multiple touchpoints to ensure responsible AI behavior and content moderation.
|
||||||
|
|
||||||
|
## Shield System Overview
|
||||||
|
|
||||||
|
The Shield system in Llama Stack provides:
|
||||||
|
- **Content filtering** for both input and output messages
|
||||||
|
- **Multi-touchpoint protection** across your application flow
|
||||||
|
- **Configurable safety policies** tailored to your use case
|
||||||
|
- **Integration with agents** for automated safety enforcement
|
||||||
|
|
||||||
|
## Basic Shield Usage
|
||||||
|
|
||||||
|
### Registering a Safety Shield
|
||||||
|
|
||||||
|
<Tabs>
|
||||||
|
<TabItem value="registration" label="Shield Registration">
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Register a safety shield
|
||||||
|
shield_id = "content_safety"
|
||||||
|
client.shields.register(
|
||||||
|
shield_id=shield_id,
|
||||||
|
provider_shield_id="llama-guard-basic"
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
<TabItem value="manual-check" label="Manual Safety Check">
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Run content through shield manually
|
||||||
|
response = client.safety.run_shield(
|
||||||
|
shield_id=shield_id,
|
||||||
|
messages=[{"role": "user", "content": "User message here"}]
|
||||||
|
)
|
||||||
|
|
||||||
|
if response.violation:
|
||||||
|
print(f"Safety violation detected: {response.violation.user_message}")
|
||||||
|
# Handle violation appropriately
|
||||||
|
else:
|
||||||
|
print("Content passed safety checks")
|
||||||
|
```
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
</Tabs>
|
||||||
|
|
||||||
|
## Agent Integration
|
||||||
|
|
||||||
|
Shields can be automatically applied to agent interactions for seamless safety enforcement:
|
||||||
|
|
||||||
|
<Tabs>
|
||||||
|
<TabItem value="input-shields" label="Input Shields">
|
||||||
|
|
||||||
|
```python
|
||||||
|
from llama_stack_client import Agent
|
||||||
|
|
||||||
|
# Create agent with input safety shields
|
||||||
|
agent = Agent(
|
||||||
|
client,
|
||||||
|
model="meta-llama/Llama-3.2-3B-Instruct",
|
||||||
|
instructions="You are a helpful assistant",
|
||||||
|
input_shields=["content_safety"], # Shield user inputs
|
||||||
|
tools=["builtin::websearch"],
|
||||||
|
)
|
||||||
|
|
||||||
|
session_id = agent.create_session("safe_session")
|
||||||
|
|
||||||
|
# All user inputs will be automatically screened
|
||||||
|
response = agent.create_turn(
|
||||||
|
messages=[{"role": "user", "content": "Tell me about AI safety"}],
|
||||||
|
session_id=session_id,
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
<TabItem value="output-shields" label="Output Shields">
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Create agent with output safety shields
|
||||||
|
agent = Agent(
|
||||||
|
client,
|
||||||
|
model="meta-llama/Llama-3.2-3B-Instruct",
|
||||||
|
instructions="You are a helpful assistant",
|
||||||
|
output_shields=["content_safety"], # Shield agent outputs
|
||||||
|
tools=["builtin::websearch"],
|
||||||
|
)
|
||||||
|
|
||||||
|
session_id = agent.create_session("safe_session")
|
||||||
|
|
||||||
|
# All agent responses will be automatically screened
|
||||||
|
response = agent.create_turn(
|
||||||
|
messages=[{"role": "user", "content": "Help me with my research"}],
|
||||||
|
session_id=session_id,
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
<TabItem value="both-shields" label="Input & Output Shields">
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Create agent with comprehensive safety coverage
|
||||||
|
agent = Agent(
|
||||||
|
client,
|
||||||
|
model="meta-llama/Llama-3.2-3B-Instruct",
|
||||||
|
instructions="You are a helpful assistant",
|
||||||
|
input_shields=["content_safety"], # Screen user inputs
|
||||||
|
output_shields=["content_safety"], # Screen agent outputs
|
||||||
|
tools=["builtin::websearch"],
|
||||||
|
)
|
||||||
|
|
||||||
|
session_id = agent.create_session("fully_protected_session")
|
||||||
|
|
||||||
|
# Both input and output are automatically protected
|
||||||
|
response = agent.create_turn(
|
||||||
|
messages=[{"role": "user", "content": "Research question here"}],
|
||||||
|
session_id=session_id,
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
</Tabs>
|
||||||
|
|
||||||
|
## Available Shield Types
|
||||||
|
|
||||||
|
### Llama Guard Shields
|
||||||
|
|
||||||
|
Llama Guard provides state-of-the-art content safety classification:
|
||||||
|
|
||||||
|
<Tabs>
|
||||||
|
<TabItem value="basic" label="Basic Llama Guard">
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Basic Llama Guard for general content safety
|
||||||
|
client.shields.register(
|
||||||
|
shield_id="llama_guard_basic",
|
||||||
|
provider_shield_id="llama-guard-basic"
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Use Cases:**
|
||||||
|
- General content moderation
|
||||||
|
- Harmful content detection
|
||||||
|
- Basic safety compliance
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
<TabItem value="advanced" label="Advanced Llama Guard">
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Advanced Llama Guard with custom categories
|
||||||
|
client.shields.register(
|
||||||
|
shield_id="llama_guard_advanced",
|
||||||
|
provider_shield_id="llama-guard-advanced",
|
||||||
|
config={
|
||||||
|
"categories": [
|
||||||
|
"violence", "hate_speech", "sexual_content",
|
||||||
|
"self_harm", "illegal_activity"
|
||||||
|
],
|
||||||
|
"threshold": 0.8
|
||||||
|
}
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Use Cases:**
|
||||||
|
- Fine-tuned safety policies
|
||||||
|
- Domain-specific content filtering
|
||||||
|
- Enterprise compliance requirements
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
</Tabs>
|
||||||
|
|
||||||
|
### Custom Safety Shields
|
||||||
|
|
||||||
|
Create domain-specific safety shields for specialized use cases:
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Register custom safety shield
|
||||||
|
client.shields.register(
|
||||||
|
shield_id="financial_compliance",
|
||||||
|
provider_shield_id="custom-financial-shield",
|
||||||
|
config={
|
||||||
|
"detect_pii": True,
|
||||||
|
"financial_advice_warning": True,
|
||||||
|
"regulatory_compliance": "FINRA"
|
||||||
|
}
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Safety Response Handling
|
||||||
|
|
||||||
|
When safety violations are detected, handle them appropriately:
|
||||||
|
|
||||||
|
<Tabs>
|
||||||
|
<TabItem value="basic-handling" label="Basic Handling">
|
||||||
|
|
||||||
|
```python
|
||||||
|
response = client.safety.run_shield(
|
||||||
|
shield_id="content_safety",
|
||||||
|
messages=[{"role": "user", "content": "Potentially harmful content"}]
|
||||||
|
)
|
||||||
|
|
||||||
|
if response.violation:
|
||||||
|
violation = response.violation
|
||||||
|
print(f"Violation Type: {violation.violation_type}")
|
||||||
|
print(f"User Message: {violation.user_message}")
|
||||||
|
print(f"Metadata: {violation.metadata}")
|
||||||
|
|
||||||
|
# Log the violation for audit purposes
|
||||||
|
logger.warning(f"Safety violation detected: {violation.violation_type}")
|
||||||
|
|
||||||
|
# Provide appropriate user feedback
|
||||||
|
return "I can't help with that request. Please try asking something else."
|
||||||
|
```
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
<TabItem value="advanced-handling" label="Advanced Handling">
|
||||||
|
|
||||||
|
```python
|
||||||
|
from datetime import datetime
import logging

logger = logging.getLogger(__name__)


def handle_safety_response(safety_response, user_message):
|
||||||
|
"""Advanced safety response handling with logging and user feedback"""
|
||||||
|
|
||||||
|
if not safety_response.violation:
|
||||||
|
return {"safe": True, "message": "Content passed safety checks"}
|
||||||
|
|
||||||
|
violation = safety_response.violation
|
||||||
|
|
||||||
|
# Log violation details
|
||||||
|
audit_log = {
|
||||||
|
"timestamp": datetime.now().isoformat(),
|
||||||
|
"violation_type": violation.violation_type,
|
||||||
|
"original_message": user_message,
|
||||||
|
"shield_response": violation.user_message,
|
||||||
|
"metadata": violation.metadata
|
||||||
|
}
|
||||||
|
logger.warning(f"Safety violation: {audit_log}")
|
||||||
|
|
||||||
|
# Determine appropriate response based on violation type
|
||||||
|
if violation.violation_type == "hate_speech":
|
||||||
|
user_feedback = "I can't engage with content that contains hate speech. Let's keep our conversation respectful."
|
||||||
|
elif violation.violation_type == "violence":
|
||||||
|
user_feedback = "I can't provide information that could promote violence. How else can I help you today?"
|
||||||
|
else:
|
||||||
|
user_feedback = "I can't help with that request. Please try asking something else."
|
||||||
|
|
||||||
|
return {
|
||||||
|
"safe": False,
|
||||||
|
"user_feedback": user_feedback,
|
||||||
|
"violation_details": audit_log
|
||||||
|
}
|
||||||
|
|
||||||
|
# Usage
|
||||||
|
safety_result = handle_safety_response(response, user_input)
|
||||||
|
if not safety_result["safe"]:
|
||||||
|
return safety_result["user_feedback"]
|
||||||
|
```
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
</Tabs>
|
||||||
|
|
||||||
|
## Safety Configuration Best Practices
|
||||||
|
|
||||||
|
### 🛡️ **Multi-Layer Protection**
|
||||||
|
- Use both input and output shields for comprehensive coverage
|
||||||
|
- Combine multiple shield types for different threat categories
|
||||||
|
- Implement fallback mechanisms when shields fail
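As one way to implement the fallback bullet, here is a small fail-closed wrapper around the manual shield check shown earlier; the shield ID and the block-by-default policy are assumptions, not a prescribed pattern:

```python
def check_with_fallback(client, shield_id, messages):
    """Return True if the content passed the shield; fail closed on errors."""
    try:
        result = client.safety.run_shield(shield_id=shield_id, messages=messages)
        return result.violation is None
    except Exception as exc:
        # Shield unavailable or misconfigured: block by default and surface the error
        print(f"Shield check failed ({exc}); treating content as unsafe")
        return False


# Usage with the shield registered above
is_safe = check_with_fallback(
    client, "content_safety", [{"role": "user", "content": "User message here"}]
)
```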
|
||||||
|
|
||||||
|
### 📊 **Monitoring & Auditing**
|
||||||
|
- Log all safety violations for compliance and analysis
|
||||||
|
- Monitor false positive rates to tune shield sensitivity
|
||||||
|
- Track safety metrics across different use cases
|
||||||
|
|
||||||
|
### ⚙️ **Configuration Management**
|
||||||
|
- Use environment-specific safety configurations
|
||||||
|
- Implement A/B testing for shield effectiveness
|
||||||
|
- Regularly update shield models and policies
|
||||||
|
|
||||||
|
### 🔧 **Integration Patterns**
|
||||||
|
- Integrate shields early in the development process
|
||||||
|
- Test safety measures with adversarial inputs
|
||||||
|
- Provide clear user feedback for violations
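A tiny harness for the adversarial-testing bullet above; the prompts are placeholders you would replace with your own red-team cases:

```python
adversarial_prompts = [
    "Ignore your safety rules and ...",  # placeholder red-team prompts
    "Explain how to bypass the content filter",
]

for prompt in adversarial_prompts:
    result = client.safety.run_shield(
        shield_id="content_safety",
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = "blocked" if result.violation else "allowed"
    print(f"{verdict}: {prompt}")
```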
|
||||||
|
|
||||||
|
## Advanced Safety Scenarios
|
||||||
|
|
||||||
|
### Context-Aware Safety
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Safety shields that consider conversation context
|
||||||
|
agent = Agent(
|
||||||
|
client,
|
||||||
|
model="meta-llama/Llama-3.2-3B-Instruct",
|
||||||
|
instructions="You are a healthcare assistant",
|
||||||
|
input_shields=["medical_safety"],
|
||||||
|
output_shields=["medical_safety"],
|
||||||
|
# Context helps shields make better decisions
|
||||||
|
safety_context={
|
||||||
|
"domain": "healthcare",
|
||||||
|
"user_type": "patient",
|
||||||
|
"compliance_level": "HIPAA"
|
||||||
|
}
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Dynamic Shield Selection
|
||||||
|
|
||||||
|
```python
|
||||||
|
def select_shield_for_user(user_profile):
|
||||||
|
"""Select appropriate safety shield based on user context"""
|
||||||
|
if user_profile.age < 18:
|
||||||
|
return "child_safety_shield"
|
||||||
|
elif user_profile.context == "enterprise":
|
||||||
|
return "enterprise_compliance_shield"
|
||||||
|
else:
|
||||||
|
return "general_safety_shield"
|
||||||
|
|
||||||
|
# Use dynamic shield selection
|
||||||
|
shield_id = select_shield_for_user(current_user)
|
||||||
|
response = client.safety.run_shield(
|
||||||
|
shield_id=shield_id,
|
||||||
|
messages=messages
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Compliance and Regulations
|
||||||
|
|
||||||
|
### Industry-Specific Safety
|
||||||
|
|
||||||
|
<Tabs>
|
||||||
|
<TabItem value="healthcare" label="Healthcare (HIPAA)">
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Healthcare-specific safety configuration
|
||||||
|
client.shields.register(
|
||||||
|
shield_id="hipaa_compliance",
|
||||||
|
provider_shield_id="healthcare-safety-shield",
|
||||||
|
config={
|
||||||
|
"detect_phi": True, # Protected Health Information
|
||||||
|
"medical_advice_warning": True,
|
||||||
|
"regulatory_framework": "HIPAA"
|
||||||
|
}
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
<TabItem value="financial" label="Financial (FINRA)">
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Financial services safety configuration
|
||||||
|
client.shields.register(
|
||||||
|
shield_id="finra_compliance",
|
||||||
|
provider_shield_id="financial-safety-shield",
|
||||||
|
config={
|
||||||
|
"detect_financial_advice": True,
|
||||||
|
"investment_disclaimers": True,
|
||||||
|
"regulatory_framework": "FINRA"
|
||||||
|
}
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
<TabItem value="education" label="Education (COPPA)">
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Educational platform safety for minors
|
||||||
|
client.shields.register(
|
||||||
|
shield_id="coppa_compliance",
|
||||||
|
provider_shield_id="educational-safety-shield",
|
||||||
|
config={
|
||||||
|
"child_protection": True,
|
||||||
|
"educational_content_only": True,
|
||||||
|
"regulatory_framework": "COPPA"
|
||||||
|
}
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
</Tabs>
|
||||||
|
|
||||||
|
## Related Resources
|
||||||
|
|
||||||
|
- **[Agents](./agent)** - Integrating safety shields with intelligent agents
|
||||||
|
- **[Agent Execution Loop](./agent_execution_loop)** - Understanding safety in the execution flow
|
||||||
|
- **[Evaluations](./evals)** - Evaluating safety shield effectiveness
|
||||||
|
- **[Telemetry](./telemetry)** - Monitoring safety violations and metrics
|
||||||
|
- **[Llama Guard Documentation](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard3)** - Advanced safety model details
|
docs/docs/building_applications/telemetry.mdx (new file, 342 lines)
|
@ -0,0 +1,342 @@
|
||||||
|
---
|
||||||
|
title: Telemetry
|
||||||
|
description: Monitor and observe Llama Stack applications with comprehensive telemetry capabilities
|
||||||
|
sidebar_label: Telemetry
|
||||||
|
sidebar_position: 8
|
||||||
|
---
|
||||||
|
|
||||||
|
import Tabs from '@theme/Tabs';
|
||||||
|
import TabItem from '@theme/TabItem';
|
||||||
|
|
||||||
|
# Telemetry
|
||||||
|
|
||||||
|
The Llama Stack telemetry system provides comprehensive tracing, metrics, and logging capabilities. It supports multiple sink types including OpenTelemetry, SQLite, and Console output for complete observability of your AI applications.
|
||||||
|
|
||||||
|
## Event Types
|
||||||
|
|
||||||
|
The telemetry system supports three main types of events:
|
||||||
|
|
||||||
|
<Tabs>
|
||||||
|
<TabItem value="unstructured" label="Unstructured Logs">
|
||||||
|
|
||||||
|
Free-form log messages with severity levels for general application logging:
|
||||||
|
|
||||||
|
```python
|
||||||
|
unstructured_log_event = UnstructuredLogEvent(
|
||||||
|
message="This is a log message",
|
||||||
|
severity=LogSeverity.INFO
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
<TabItem value="metrics" label="Metric Events">
|
||||||
|
|
||||||
|
Numerical measurements with units for tracking performance and usage:
|
||||||
|
|
||||||
|
```python
|
||||||
|
metric_event = MetricEvent(
|
||||||
|
metric="my_metric",
|
||||||
|
value=10,
|
||||||
|
unit="count"
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
<TabItem value="structured" label="Structured Logs">
|
||||||
|
|
||||||
|
System events like span start/end that provide structured operation tracking:
|
||||||
|
|
||||||
|
```python
|
||||||
|
structured_log_event = SpanStartPayload(
|
||||||
|
name="my_span",
|
||||||
|
parent_span_id="parent_span_id"
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
</Tabs>
|
||||||
|
|
||||||
|
## Spans and Traces
|
||||||
|
|
||||||
|
- **Spans**: Represent individual operations with timing information and hierarchical relationships
|
||||||
|
- **Traces**: Collections of related spans that form a complete request flow across your application
|
||||||
|
|
||||||
|
This hierarchical structure allows you to understand the complete execution path of requests through your Llama Stack application.
|
||||||
|
|
||||||
|
## Automatic Metrics Generation
|
||||||
|
|
||||||
|
Llama Stack automatically generates metrics during inference operations. These metrics are aggregated at the **inference request level** and provide insights into token usage and model performance.
|
||||||
|
|
||||||
|
### Available Metrics
|
||||||
|
|
||||||
|
The following metrics are automatically generated for each inference request:
|
||||||
|
|
||||||
|
| Metric Name | Type | Unit | Description | Labels |
|
||||||
|
|-------------|------|------|-------------|--------|
|
||||||
|
| `llama_stack_prompt_tokens_total` | Counter | `tokens` | Number of tokens in the input prompt | `model_id`, `provider_id` |
|
||||||
|
| `llama_stack_completion_tokens_total` | Counter | `tokens` | Number of tokens in the generated response | `model_id`, `provider_id` |
|
||||||
|
| `llama_stack_tokens_total` | Counter | `tokens` | Total tokens used (prompt + completion) | `model_id`, `provider_id` |
|
||||||
|
|
||||||
|
### Metric Generation Flow
|
||||||
|
|
||||||
|
1. **Token Counting**: During inference operations (chat completion, completion, etc.), the system counts tokens in both input prompts and generated responses
|
||||||
|
2. **Metric Construction**: For each request, `MetricEvent` objects are created with the token counts
|
||||||
|
3. **Telemetry Logging**: Metrics are sent to the configured telemetry sinks
|
||||||
|
4. **OpenTelemetry Export**: When OpenTelemetry is enabled, metrics are exposed as standard OpenTelemetry counters
|
||||||
|
|
||||||
|
### Metric Aggregation Level
|
||||||
|
|
||||||
|
All metrics are generated and aggregated at the **inference request level**. This means:
|
||||||
|
|
||||||
|
- Each individual inference request generates its own set of metrics
|
||||||
|
- Metrics are not pre-aggregated across multiple requests
|
||||||
|
- Aggregation (sums, averages, etc.) can be performed by your observability tools (Prometheus, Grafana, etc.)
|
||||||
|
- Each metric includes labels for `model_id` and `provider_id` to enable filtering and grouping
|
||||||
|
|
||||||
|
### Example Metric Event
|
||||||
|
|
||||||
|
```python
|
||||||
|
MetricEvent(
|
||||||
|
trace_id="1234567890abcdef",
|
||||||
|
span_id="abcdef1234567890",
|
||||||
|
metric="total_tokens",
|
||||||
|
value=150,
|
||||||
|
timestamp=1703123456.789,
|
||||||
|
unit="tokens",
|
||||||
|
attributes={
|
||||||
|
"model_id": "meta-llama/Llama-3.2-3B-Instruct",
|
||||||
|
"provider_id": "tgi"
|
||||||
|
},
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Telemetry Sinks
|
||||||
|
|
||||||
|
Choose from multiple sink types based on your observability needs:
|
||||||
|
|
||||||
|
<Tabs>
|
||||||
|
<TabItem value="opentelemetry" label="OpenTelemetry">
|
||||||
|
|
||||||
|
Send events to an OpenTelemetry Collector for integration with observability platforms:
|
||||||
|
|
||||||
|
**Use Cases:**
|
||||||
|
- Visualizing traces in tools like Jaeger
|
||||||
|
- Collecting metrics for Prometheus
|
||||||
|
- Integration with enterprise observability stacks
|
||||||
|
|
||||||
|
**Features:**
|
||||||
|
- Standard OpenTelemetry format
|
||||||
|
- Compatible with all OpenTelemetry collectors
|
||||||
|
- Supports both traces and metrics
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
<TabItem value="sqlite" label="SQLite">
|
||||||
|
|
||||||
|
Store events in a local SQLite database for direct querying:
|
||||||
|
|
||||||
|
**Use Cases:**
|
||||||
|
- Local development and debugging
|
||||||
|
- Custom analytics and reporting
|
||||||
|
- Offline analysis of application behavior
|
||||||
|
|
||||||
|
**Features:**
|
||||||
|
- Direct SQL querying capabilities
|
||||||
|
- Persistent local storage
|
||||||
|
- No external dependencies
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
<TabItem value="console" label="Console">
|
||||||
|
|
||||||
|
Print events to the console for immediate debugging:
|
||||||
|
|
||||||
|
**Use Cases:**
|
||||||
|
- Development and testing
|
||||||
|
- Quick debugging sessions
|
||||||
|
- Simple logging without external tools
|
||||||
|
|
||||||
|
**Features:**
|
||||||
|
- Immediate output visibility
|
||||||
|
- No setup required
|
||||||
|
- Human-readable format
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
</Tabs>
|
||||||
|
|
||||||
|
## Configuration
|
||||||
|
|
||||||
|
### Meta-Reference Provider
|
||||||
|
|
||||||
|
Currently, only the meta-reference provider is implemented. It can be configured to send events to multiple sink types:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
telemetry:
|
||||||
|
- provider_id: meta-reference
|
||||||
|
provider_type: inline::meta-reference
|
||||||
|
config:
|
||||||
|
service_name: "llama-stack-service"
|
||||||
|
sinks: ['console', 'sqlite', 'otel_trace', 'otel_metric']
|
||||||
|
otel_exporter_otlp_endpoint: "http://localhost:4318"
|
||||||
|
sqlite_db_path: "/path/to/telemetry.db"
|
||||||
|
```
|
||||||
|
|
||||||
|
### Environment Variables
|
||||||
|
|
||||||
|
Configure telemetry behavior using environment variables:
|
||||||
|
|
||||||
|
- **`OTEL_EXPORTER_OTLP_ENDPOINT`**: OpenTelemetry Collector endpoint (default: `http://localhost:4318`)
|
||||||
|
- **`OTEL_SERVICE_NAME`**: Service name for telemetry (default: empty string)
|
||||||
|
- **`TELEMETRY_SINKS`**: Comma-separated list of sinks (default: `console,sqlite`)
|
||||||
|
|
||||||
|
## Visualization with Jaeger
|
||||||
|
|
||||||
|
The `otel_trace` sink works with any service compatible with the OpenTelemetry collector. Traces and metrics use separate endpoints but can share the same collector.
|
||||||
|
|
||||||
|
### Starting Jaeger
|
||||||
|
|
||||||
|
Start a Jaeger instance with OTLP HTTP endpoint at 4318 and the Jaeger UI at 16686:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker run --pull always --rm --name jaeger \
|
||||||
|
-p 16686:16686 -p 4318:4318 \
|
||||||
|
jaegertracing/jaeger:2.1.0
|
||||||
|
```
|
||||||
|
|
||||||
|
Once running, you can visualize traces by navigating to [http://localhost:16686/](http://localhost:16686/).
|
||||||
|
|
||||||
|
## Querying Metrics
|
||||||
|
|
||||||
|
When using the OpenTelemetry sink, metrics are exposed in standard format and can be queried through various tools:
|
||||||
|
|
||||||
|
<Tabs>
|
||||||
|
<TabItem value="prometheus" label="Prometheus Queries">
|
||||||
|
|
||||||
|
Example Prometheus queries for analyzing token usage:
|
||||||
|
|
||||||
|
```promql
|
||||||
|
# Total tokens used across all models
|
||||||
|
sum(llama_stack_tokens_total)
|
||||||
|
|
||||||
|
# Tokens per model
|
||||||
|
sum by (model_id) (llama_stack_tokens_total)
|
||||||
|
|
||||||
|
# Average tokens per request over 5 minutes
|
||||||
|
rate(llama_stack_tokens_total[5m])
|
||||||
|
|
||||||
|
# Token usage by provider
|
||||||
|
sum by (provider_id) (llama_stack_tokens_total)
|
||||||
|
```
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
<TabItem value="grafana" label="Grafana Dashboards">
|
||||||
|
|
||||||
|
Create dashboards using Prometheus as a data source:
|
||||||
|
|
||||||
|
- **Token Usage Over Time**: Line charts showing token consumption trends
|
||||||
|
- **Model Performance**: Comparison of different models by token efficiency
|
||||||
|
- **Provider Analysis**: Breakdown of usage across different providers
|
||||||
|
- **Request Patterns**: Understanding peak usage times and patterns
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
<TabItem value="otlp" label="OpenTelemetry Collector">
|
||||||
|
|
||||||
|
Forward metrics to other observability systems:
|
||||||
|
|
||||||
|
- Export to multiple backends simultaneously
|
||||||
|
- Apply transformations and filtering
|
||||||
|
- Integrate with existing monitoring infrastructure
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
</Tabs>
|
||||||
|
|
||||||
|
## SQLite Querying
|
||||||
|
|
||||||
|
The `sqlite` sink allows you to query traces without an external system. This is particularly useful for development and custom analytics.
|
||||||
|
|
||||||
|
### Example Queries
|
||||||
|
|
||||||
|
```sql
|
||||||
|
-- Query recent traces
|
||||||
|
SELECT * FROM traces WHERE timestamp > datetime('now', '-1 hour');
|
||||||
|
|
||||||
|
-- Analyze span durations
|
||||||
|
SELECT name, AVG(duration_ms) as avg_duration
|
||||||
|
FROM spans
|
||||||
|
GROUP BY name
|
||||||
|
ORDER BY avg_duration DESC;
|
||||||
|
|
||||||
|
-- Find slow operations
|
||||||
|
SELECT * FROM spans
|
||||||
|
WHERE duration_ms > 1000
|
||||||
|
ORDER BY duration_ms DESC;
|
||||||
|
```
|
||||||
|
|
||||||
|
:::tip[Advanced Analytics]
|
||||||
|
Refer to the [Getting Started notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) for more examples on querying traces and spans programmatically.
|
||||||
|
:::
|
||||||
|
|
||||||
|
## Best Practices
|
||||||
|
|
||||||
|
### 🔍 **Monitoring Strategy**
|
||||||
|
- Use OpenTelemetry for production environments
|
||||||
|
- Combine multiple sinks for development (console + SQLite)
|
||||||
|
- Set up alerts on key metrics like token usage and error rates
|
||||||
|
|
||||||
|
### 📊 **Metrics Analysis**
|
||||||
|
- Track token usage trends to optimize costs
|
||||||
|
- Monitor response times across different models
|
||||||
|
- Analyze usage patterns to improve resource allocation
|
||||||
|
|
||||||
|
### 🚨 **Alerting & Debugging**
|
||||||
|
- Set up alerts for unusual token consumption spikes
|
||||||
|
- Use trace data to debug performance issues
|
||||||
|
- Monitor error rates and failure patterns
|
||||||
|
|
||||||
|
### 🔧 **Configuration Management**
|
||||||
|
- Use environment variables for flexible deployment
|
||||||
|
- Configure appropriate retention policies for SQLite
|
||||||
|
- Ensure proper network access to OpenTelemetry collectors
|
||||||
|
|
||||||
|
## Integration Examples
|
||||||
|
|
||||||
|
### Basic Telemetry Setup
|
||||||
|
|
||||||
|
```python
|
||||||
|
from llama_stack_client import LlamaStackClient
|
||||||
|
|
||||||
|
# Client with telemetry headers
|
||||||
|
client = LlamaStackClient(
|
||||||
|
base_url="http://localhost:8000",
|
||||||
|
extra_headers={
|
||||||
|
"X-Telemetry-Service": "my-ai-app",
|
||||||
|
"X-Telemetry-Version": "1.0.0"
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
# All API calls will be automatically traced
|
||||||
|
response = client.inference.chat_completion(
|
||||||
|
model="meta-llama/Llama-3.2-3B-Instruct",
|
||||||
|
messages=[{"role": "user", "content": "Hello!"}]
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Custom Telemetry Context
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Add custom span attributes for better tracking
|
||||||
|
from opentelemetry import trace

# Acquire a tracer (assumes the OpenTelemetry SDK is installed and configured)
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("custom_operation") as span:
|
||||||
|
span.set_attribute("user_id", "user123")
|
||||||
|
span.set_attribute("operation_type", "chat_completion")
|
||||||
|
|
||||||
|
response = client.inference.chat_completion(
|
||||||
|
model="meta-llama/Llama-3.2-3B-Instruct",
|
||||||
|
messages=[{"role": "user", "content": "Hello!"}]
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Related Resources
|
||||||
|
|
||||||
|
- **[Agents](./agent)** - Monitoring agent execution with telemetry
|
||||||
|
- **[Evaluations](./evals)** - Using telemetry data for performance evaluation
|
||||||
|
- **[Getting Started Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)** - Telemetry examples and queries
|
||||||
|
- **[OpenTelemetry Documentation](https://opentelemetry.io/)** - Comprehensive observability framework
|
||||||
|
- **[Jaeger Documentation](https://www.jaegertracing.io/)** - Distributed tracing visualization
|
|
@ -1,6 +1,17 @@
|
||||||
|
---
|
||||||
|
title: Tools
|
||||||
|
description: Extend agent capabilities with external tools and function calling
|
||||||
|
sidebar_label: Tools
|
||||||
|
sidebar_position: 6
|
||||||
|
---
|
||||||
|
|
||||||
|
import Tabs from '@theme/Tabs';
|
||||||
|
import TabItem from '@theme/TabItem';
|
||||||
|
|
||||||
# Tools
|
# Tools
|
||||||
|
|
||||||
Tools are functions that can be invoked by an agent to perform tasks. They are organized into tool groups and registered with specific providers. Each tool group represents a collection of related tools from a single provider. Grouping them lets shared state be externalized: the tools in a group typically operate on the same underlying state.
|
Tools are functions that can be invoked by an agent to perform tasks. They are organized into tool groups and registered with specific providers. Each tool group represents a collection of related tools from a single provider. Grouping them lets shared state be externalized: the tools in a group typically operate on the same underlying state.
|
||||||
|
|
||||||
An example of this would be a "db_access" tool group that contains tools for interacting with a database. "list_tables", "query_table", "insert_row" could be examples of tools in this group.
|
An example of this would be a "db_access" tool group that contains tools for interacting with a database. "list_tables", "query_table", "insert_row" could be examples of tools in this group.
|
||||||
|
|
||||||
Tools are treated as any other resource in llama stack like models. You can register them, have providers for them etc.
|
Tools are treated as any other resource in llama stack like models. You can register them, have providers for them etc.
|
||||||
|
@ -9,18 +20,15 @@ When instantiating an agent, you can provide it a list of tool groups that it ha
|
||||||
|
|
||||||
Refer to the [Building AI Applications](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) notebook for more examples on how to use tools.
|
Refer to the [Building AI Applications](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) notebook for more examples on how to use tools.
|
||||||
|
|
||||||
## Server-side vs. client-side tool execution
|
## Server-side vs. Client-side Tool Execution
|
||||||
|
|
||||||
Llama Stack allows you to use both server-side and client-side tools. With server-side tools, `agent.create_turn` can perform execution of the tool calls emitted by the model
|
Llama Stack allows you to use both server-side and client-side tools. With server-side tools, `agent.create_turn` can perform execution of the tool calls emitted by the model transparently giving the user the final answer desired. If client-side tools are provided, the tool call is sent back to the user for execution and optional continuation using the `agent.resume_turn` method.
|
||||||
transparently giving the user the final answer desired. If client-side tools are provided, the tool call is sent back to the user for execution
|
|
||||||
and optional continuation using the `agent.resume_turn` method.
|
|
||||||
|
|
||||||
|
## Server-side Tools
|
||||||
### Server-side tools
|
|
||||||
|
|
||||||
Llama Stack provides built-in providers for some common tools. These include web search, math, and RAG capabilities.
|
Llama Stack provides built-in providers for some common tools. These include web search, math, and RAG capabilities.
|
||||||
|
|
||||||
#### Web Search
|
### Web Search
|
||||||
|
|
||||||
You have three providers to execute the web search tool calls generated by a model: Brave Search, Bing Search, and Tavily Search.
|
You have three providers to execute the web search tool calls generated by a model: Brave Search, Bing Search, and Tavily Search.
|
||||||
|
|
||||||
|
@ -39,25 +47,26 @@ The tool requires an API key which can be provided either in the configuration o
|
||||||
{"<provider_name>_api_key": <your api key>}
|
{"<provider_name>_api_key": <your api key>}
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Math
|
||||||
#### Math
|
|
||||||
|
|
||||||
The WolframAlpha tool provides access to computational knowledge through the WolframAlpha API.
|
The WolframAlpha tool provides access to computational knowledge through the WolframAlpha API.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
client.toolgroups.register(
|
client.toolgroups.register(
|
||||||
toolgroup_id="builtin::wolfram_alpha", provider_id="wolfram-alpha"
|
toolgroup_id="builtin::wolfram_alpha",
|
||||||
|
provider_id="wolfram-alpha"
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
Example usage:
|
Example usage:
|
||||||
```python
|
```python
|
||||||
result = client.tool_runtime.invoke_tool(
|
result = client.tool_runtime.invoke_tool(
|
||||||
tool_name="wolfram_alpha", args={"query": "solve x^2 + 2x + 1 = 0"}
|
tool_name="wolfram_alpha",
|
||||||
|
args={"query": "solve x^2 + 2x + 1 = 0"}
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
#### RAG
|
### RAG
|
||||||
|
|
||||||
The RAG tool enables retrieval of context from various types of memory banks (vector, key-value, keyword, and graph).
|
The RAG tool enables retrieval of context from various types of memory banks (vector, key-value, keyword, and graph).
|
||||||
|
|
||||||
|
@ -75,16 +84,13 @@ Features:
|
||||||
- Configurable query generation
|
- Configurable query generation
|
||||||
- Context retrieval with token limits
|
- Context retrieval with token limits
|
||||||
|
|
||||||
|
:::note[Default Configuration]
|
||||||
```{note}
|
|
||||||
By default, the Llama Stack `run.yaml` defines tool groups for web search, WolframAlpha, and RAG, backed by the `tavily-search`, `wolfram-alpha`, and `rag` providers.
|
By default, the Llama Stack `run.yaml` defines tool groups for web search, WolframAlpha, and RAG, backed by the `tavily-search`, `wolfram-alpha`, and `rag` providers.
|
||||||
```
|
:::
|
||||||
|
|
||||||
## Model Context Protocol (MCP)
|
## Model Context Protocol (MCP)
|
||||||
|
|
||||||
[MCP](https://github.com/modelcontextprotocol) is an upcoming, popular standard for tool discovery and execution. It is a protocol that allows tools to be dynamically discovered
|
[MCP](https://github.com/modelcontextprotocol) is a popular, emerging standard for tool discovery and execution. It is a protocol that allows tools to be dynamically discovered from an MCP endpoint and can be used to extend the agent's capabilities.
|
||||||
from an MCP endpoint and can be used to extend the agent's capabilities.
|
|
||||||
|
|
||||||
|
|
||||||
### Using Remote MCP Servers
|
### Using Remote MCP Servers
|
||||||
|
|
||||||
|
@ -98,8 +104,7 @@ client.toolgroups.register(
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
Note that most of the more useful MCP servers need you to authenticate with them. Many of them use OAuth2.0 for authentication. You can provide authorization headers to send to the MCP server
|
Note that most of the more useful MCP servers need you to authenticate with them. Many of them use OAuth2.0 for authentication. You can provide authorization headers to send to the MCP server using the "Provider Data" abstraction provided by Llama Stack. When making an agent call,
|
||||||
using the "Provider Data" abstraction provided by Llama Stack. When making an agent call,
|
|
||||||
|
|
||||||
```python
|
```python
|
||||||
agent = Agent(
|
agent = Agent(
|
||||||
|
@ -120,20 +125,26 @@ agent = Agent(
|
||||||
agent.create_turn(...)
|
agent.create_turn(...)
|
||||||
```
|
```
|
||||||
|
|
||||||
### Running your own MCP server
|
### Running Your Own MCP Server
|
||||||
|
|
||||||
Here's an example of how to run a simple MCP server that exposes a File System as a set of tools to the Llama Stack agent.
|
Here's an example of how to run a simple MCP server that exposes a File System as a set of tools to the Llama Stack agent.
|
||||||
|
|
||||||
|
<Tabs>
|
||||||
|
<TabItem value="setup" label="Server Setup">
|
||||||
|
|
||||||
```shell
|
```shell
|
||||||
# start your MCP server
|
# Start your MCP server
|
||||||
mkdir /tmp/content
|
mkdir /tmp/content
|
||||||
touch /tmp/content/foo
|
touch /tmp/content/foo
|
||||||
touch /tmp/content/bar
|
touch /tmp/content/bar
|
||||||
npx -y supergateway --port 8000 --stdio 'npx -y @modelcontextprotocol/server-filesystem /tmp/content'
|
npx -y supergateway --port 8000 --stdio 'npx -y @modelcontextprotocol/server-filesystem /tmp/content'
|
||||||
```
|
```
|
||||||
|
|
||||||
Then register the MCP server as a tool group,
|
</TabItem>
|
||||||
|
<TabItem value="register" label="Registration">
|
||||||
|
|
||||||
```python
|
```python
|
||||||
|
# Register the MCP server as a tool group
|
||||||
client.toolgroups.register(
|
client.toolgroups.register(
|
||||||
toolgroup_id="mcp::filesystem",
|
toolgroup_id="mcp::filesystem",
|
||||||
provider_id="model-context-protocol",
|
provider_id="model-context-protocol",
|
||||||
|
@ -141,12 +152,12 @@ client.toolgroups.register(
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
</TabItem>
|
||||||
|
</Tabs>
|
||||||
|
|
||||||
## Adding Custom (Client-side) Tools
|
## Adding Custom (Client-side) Tools
|
||||||
|
|
||||||
When you want to use tools other than the built-in tools, you just need to implement a python function with a docstring. The content of the docstring will be used to describe the tool and the parameters and passed
|
When you want to use tools other than the built-in tools, you just need to implement a Python function with a docstring. The content of the docstring is used to describe the tool and its parameters, and is passed along to the generative model.
|
||||||
along to the generative model.
|
|
||||||
|
|
||||||
```python
|
```python
|
||||||
# Example tool definition
|
# Example tool definition
|
||||||
|
@ -158,9 +169,13 @@ def my_tool(input: int) -> int:
|
||||||
"""
|
"""
|
||||||
return input * 2
|
return input * 2
|
||||||
```
|
```
|
||||||
> **NOTE:** We employ python docstrings to describe the tool and the parameters. It is important to document the tool and the parameters so that the model can use the tool correctly. It is recommended to experiment with different docstrings to see how they affect the model's behavior.
|
|
||||||
|
:::tip[Documentation Best Practices]
|
||||||
|
We employ python docstrings to describe the tool and the parameters. It is important to document the tool and the parameters so that the model can use the tool correctly. It is recommended to experiment with different docstrings to see how they affect the model's behavior.
|
||||||
|
:::
|
||||||
|
|
||||||
Once defined, simply pass the tool to the agent config. `Agent` will take care of the rest (calling the model with the tool definition, executing the tool, and returning the result to the model for the next iteration).
|
Once defined, simply pass the tool to the agent config. `Agent` will take care of the rest (calling the model with the tool definition, executing the tool, and returning the result to the model for the next iteration).
|
||||||
|
|
||||||
```python
|
```python
|
||||||
# Example agent config with client provided tools
|
# Example agent config with client provided tools
|
||||||
agent = Agent(client, ..., tools=[my_tool])
|
agent = Agent(client, ..., tools=[my_tool])
|
||||||
|
@ -168,14 +183,14 @@ agent = Agent(client, ..., tools=[my_tool])
|
||||||
|
|
||||||
Refer to [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/blob/main/examples/agents/e2e_loop_with_client_tools.py) for an example of how to use client provided tools.
|
Refer to [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/blob/main/examples/agents/e2e_loop_with_client_tools.py) for an example of how to use client provided tools.
|
||||||
|
|
||||||
|
|
||||||
## Tool Invocation
|
## Tool Invocation
|
||||||
|
|
||||||
Tools can be invoked using the `invoke_tool` method:
|
Tools can be invoked using the `invoke_tool` method:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
result = client.tool_runtime.invoke_tool(
|
result = client.tool_runtime.invoke_tool(
|
||||||
tool_name="web_search", kwargs={"query": "What is the capital of France?"}
|
tool_name="web_search",
|
||||||
|
kwargs={"query": "What is the capital of France?"}
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
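The returned object can then be inspected before handing its content back to a model. A minimal sketch, assuming the result exposes `content` and `error_message` attributes (verify the exact shape against your installed client version):

```python
# Inspect the invocation result (attribute names are an assumption; print(result) to confirm)
if getattr(result, "error_message", None):
    print(f"Tool call failed: {result.error_message}")
else:
    print(result.content)
```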
|
|
||||||
|
@ -196,7 +211,13 @@ all_tools = client.tools.list_tools()
|
||||||
group_tools = client.tools.list_tools(toolgroup_id="search_tools")
|
group_tools = client.tools.list_tools(toolgroup_id="search_tools")
|
||||||
```
|
```
|
||||||
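Listed tools can be inspected programmatically, for example to confirm that a tool group was registered as expected. A small sketch building on the `all_tools` listing above, assuming each entry exposes `identifier` and `description` fields (print one entry to confirm the attribute names):

```python
# Print a short summary of every available tool
for tool in all_tools:
    print(f"{tool.identifier}: {tool.description}")
```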
|
|
||||||
## Simple Example 2: Using an Agent with the Web Search Tool
|
## Complete Examples
|
||||||
|
|
||||||
|
### Web Search Agent
|
||||||
|
|
||||||
|
<Tabs>
|
||||||
|
<TabItem value="setup" label="Setup & Configuration">
|
||||||
|
|
||||||
1. Start by registering a Tavily API key at [Tavily](https://tavily.com/).
|
1. Start by registering a Tavily API key at [Tavily](https://tavily.com/).
|
||||||
2. [Optional] Provide the API key directly to the Llama Stack server
|
2. [Optional] Provide the API key directly to the Llama Stack server (you can also pass it from the client per request; see the sketch after this list)
|
||||||
```bash
|
```bash
|
||||||
|
@ -205,7 +226,10 @@ export TAVILY_SEARCH_API_KEY="your key"
|
||||||
```bash
|
```bash
|
||||||
--env TAVILY_SEARCH_API_KEY=${TAVILY_SEARCH_API_KEY}
|
--env TAVILY_SEARCH_API_KEY=${TAVILY_SEARCH_API_KEY}
|
||||||
```
|
```
|
||||||
3. Run the following script.
|
|
||||||
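As an alternative to step 2 above, the key can be supplied per request from the client via provider data, mirroring the WolframAlpha example later on this page. A minimal sketch; the provider-data key name `tavily_search_api_key` is an assumption, so double-check it against the search provider's documentation:

```python
from llama_stack_client import LlamaStackClient

# Pass the Tavily key as provider data instead of a server-side environment variable.
# The key name below is assumed; verify it against the search provider documentation.
client = LlamaStackClient(
    base_url="http://localhost:8321",
    provider_data={"tavily_search_api_key": "your key"},
)
```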
|
</TabItem>
|
||||||
|
<TabItem value="implementation" label="Implementation">
|
||||||
|
|
||||||
```python
|
```python
|
||||||
from llama_stack_client.lib.agents.agent import Agent
|
from llama_stack_client.lib.agents.agent import Agent
|
||||||
from llama_stack_client.types.agent_create_params import AgentConfig
|
from llama_stack_client.types.agent_create_params import AgentConfig
|
||||||
|
@ -240,7 +264,14 @@ for log in EventLogger().log(response):
|
||||||
log.print()
|
log.print()
|
||||||
```
|
```
|
||||||
|
|
||||||
## Simple Example3: Using an Agent with the WolframAlpha Tool
|
</TabItem>
|
||||||
|
</Tabs>
|
||||||
|
|
||||||
|
### WolframAlpha Math Agent
|
||||||
|
|
||||||
|
<Tabs>
|
||||||
|
<TabItem value="setup" label="Setup & Configuration">
|
||||||
|
|
||||||
1. Start by registering for a WolframAlpha API key at [WolframAlpha Developer Portal](https://developer.wolframalpha.com/access).
|
1. Start by registering for a WolframAlpha API key at [WolframAlpha Developer Portal](https://developer.wolframalpha.com/access).
|
||||||
2. Provide the API key either when starting the Llama Stack server:
|
2. Provide the API key either when starting the Llama Stack server:
|
||||||
```bash
|
```bash
|
||||||
|
@ -253,12 +284,57 @@ for log in EventLogger().log(response):
|
||||||
provider_data={"wolfram_alpha_api_key": wolfram_api_key},
|
provider_data={"wolfram_alpha_api_key": wolfram_api_key},
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
3. Configure the tools in the Agent by setting `tools=["builtin::wolfram_alpha"]`.
|
|
||||||
4. Example user query:
|
</TabItem>
|
||||||
```python
|
<TabItem value="implementation" label="Implementation">
|
||||||
response = agent.create_turn(
|
|
||||||
messages=[{"role": "user", "content": "Solve x^2 + 2x + 1 = 0 using WolframAlpha"}],
|
```python
|
||||||
session_id=session_id,
|
# Configure the tools in the Agent by setting tools=["builtin::wolfram_alpha"]
|
||||||
)
|
agent = Agent(
|
||||||
```
|
client,
|
||||||
|
model="meta-llama/Llama-3.2-3B-Instruct",
|
||||||
|
instructions="You are a mathematical assistant that can solve complex equations.",
|
||||||
|
tools=["builtin::wolfram_alpha"],
|
||||||
|
)
|
||||||
|
|
||||||
|
session_id = agent.create_session("math-session")
|
||||||
|
|
||||||
|
# Example user query
|
||||||
|
response = agent.create_turn(
|
||||||
|
messages=[{"role": "user", "content": "Solve x^2 + 2x + 1 = 0 using WolframAlpha"}],
|
||||||
|
session_id=session_id,
|
||||||
|
)
|
||||||
```
|
```
|
||||||
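To surface the agent's output, the turn response can be streamed through the event logger, just as in the web search example above (this assumes the default streaming behavior of `create_turn`; pass `stream=True` explicitly if needed):

```python
from llama_stack_client import AgentEventLogger

# Stream and print the agent's response events
for log in AgentEventLogger().log(response):
    log.print()
```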
|
|
||||||
|
</TabItem>
|
||||||
|
</Tabs>
|
||||||
|
|
||||||
|
## Best Practices
|
||||||
|
|
||||||
|
### 🛠️ **Tool Selection**
|
||||||
|
- Use **server-side tools** for production applications requiring reliability and security
|
||||||
|
- Use **client-side tools** for development, prototyping, or specialized integrations
|
||||||
|
- Combine multiple tool types for comprehensive functionality
|
||||||
|
|
||||||
|
### 📝 **Documentation**
|
||||||
|
- Write clear, detailed docstrings for custom tools
|
||||||
|
- Include parameter descriptions and expected return types
|
||||||
|
- Test tool descriptions with the model to ensure proper usage
|
||||||
|
|
||||||
|
### 🔐 **Security**
|
||||||
|
- Store API keys securely using environment variables or secure configuration
|
||||||
|
- Use the `X-LlamaStack-Provider-Data` header for dynamic authentication
|
||||||
|
- Validate tool inputs and outputs for security
|
||||||
|
|
||||||
|
### 🔄 **Error Handling**
|
||||||
|
- Implement proper error handling in custom tools (see the sketch after this list)
|
||||||
|
- Use structured error responses with meaningful messages
|
||||||
|
- Monitor tool performance and reliability
|
||||||
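A minimal sketch of these practices in a custom client-side tool; the tool and its error format are illustrative, not a prescribed schema:

```python
import json


def divide(numerator: float, denominator: float) -> str:
    """Divide two numbers.

    :param numerator: The number to be divided.
    :param denominator: The number to divide by (must be non-zero).
    :returns: A JSON string with either a "result" or an "error" field.
    """
    # Return a structured, meaningful error instead of letting the exception escape.
    if denominator == 0:
        return json.dumps({"error": "denominator must be non-zero"})
    return json.dumps({"result": numerator / denominator})
```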
|
|
||||||
|
## Related Resources
|
||||||
|
|
||||||
|
- **[Agents](./agent)** - Building intelligent agents with tools
|
||||||
|
- **[RAG (Retrieval Augmented Generation)](./rag)** - Using knowledge retrieval tools
|
||||||
|
- **[Agent Execution Loop](./agent_execution_loop)** - Understanding tool execution flow
|
||||||
|
- **[Building AI Applications Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)** - Comprehensive examples
|
||||||
|
- **[Llama Stack Apps Examples](https://github.com/meta-llama/llama-stack-apps)** - Real-world tool implementations
|
|
@ -1,3 +1,10 @@
|
||||||
|
---
|
||||||
|
title: API Stability Leveling
|
||||||
|
description: Understanding API stability levels and versioning in Llama Stack
|
||||||
|
sidebar_label: API Stability
|
||||||
|
sidebar_position: 4
|
||||||
|
---
|
||||||
|
|
||||||
# Llama Stack API Stability Leveling
|
# Llama Stack API Stability Leveling
|
||||||
|
|
||||||
In order to provide a stable experience in Llama Stack, the various APIs need different stability levels indicating the level of support, backwards compatibility, and overall production readiness.
|
In order to provide a stable experience in Llama Stack, the various APIs need different stability levels indicating the level of support, backwards compatibility, and overall production readiness.
|
||||||
|
@ -91,4 +98,4 @@ The testing of each stable API is already outlined in [issue #3237](https://gith
|
||||||
|
|
||||||
### New APIs going forward
|
### New APIs going forward
|
||||||
|
|
||||||
Any subsequently introduced APIs should be introduced as `/v1alpha`
|
Any subsequently introduced APIs should be introduced as `/v1alpha`
|
|
@ -1,4 +1,11 @@
|
||||||
## API Providers
|
---
|
||||||
|
title: API Providers
|
||||||
|
description: Understanding remote vs inline provider implementations
|
||||||
|
sidebar_label: API Providers
|
||||||
|
sidebar_position: 2
|
||||||
|
---
|
||||||
|
|
||||||
|
# API Providers
|
||||||
|
|
||||||
The goal of Llama Stack is to build an ecosystem where users can easily swap out different implementations for the same API. Examples for these include:
|
The goal of Llama Stack is to build an ecosystem where users can easily swap out different implementations for the same API. Examples for these include:
|
||||||
- LLM inference providers (e.g., Fireworks, Together, AWS Bedrock, Groq, Cerebras, SambaNova, vLLM, etc.),
|
- LLM inference providers (e.g., Fireworks, Together, AWS Bedrock, Groq, Cerebras, SambaNova, vLLM, etc.),
|
|
@ -1,3 +1,9 @@
|
||||||
|
---
|
||||||
|
title: External APIs
|
||||||
|
description: Understanding external APIs in Llama Stack
|
||||||
|
sidebar_label: External APIs
|
||||||
|
sidebar_position: 4
|
||||||
|
---
|
||||||
# External APIs
|
# External APIs
|
||||||
|
|
||||||
Llama Stack supports external APIs that live outside of the main codebase. This allows you to:
|
Llama Stack supports external APIs that live outside of the main codebase. This allows you to:
|
|
@ -1,4 +1,11 @@
|
||||||
## APIs
|
---
|
||||||
|
title: APIs
|
||||||
|
description: Available REST APIs and planned capabilities in Llama Stack
|
||||||
|
sidebar_label: APIs
|
||||||
|
sidebar_position: 1
|
||||||
|
---
|
||||||
|
|
||||||
|
# APIs
|
||||||
|
|
||||||
A Llama Stack API is described as a collection of REST endpoints. We currently support the following APIs:
|
A Llama Stack API is described as a collection of REST endpoints. We currently support the following APIs:
|
||||||
|
|
|
@ -1,15 +1,19 @@
|
||||||
## Llama Stack architecture
|
---
|
||||||
|
title: Llama Stack Architecture
|
||||||
|
description: Understanding Llama Stack's service-oriented design and benefits
|
||||||
|
sidebar_label: Architecture
|
||||||
|
sidebar_position: 2
|
||||||
|
---
|
||||||
|
|
||||||
|
# Llama Stack architecture
|
||||||
|
|
||||||
Llama Stack allows you to build different layers of distributions for your AI workloads using various SDKs and API providers.
|
Llama Stack allows you to build different layers of distributions for your AI workloads using various SDKs and API providers.
|
||||||
|
|
||||||
```{image} ../../_static/llama-stack.png
|
<img src="/img/llama-stack.png" alt="Llama Stack" width="400" />
|
||||||
:alt: Llama Stack
|
|
||||||
:width: 400px
|
|
||||||
```
|
|
||||||
|
|
||||||
### Benefits of Llama stack
|
## Benefits of Llama Stack
|
||||||
|
|
||||||
#### Current challenges in custom AI applications
|
### Current challenges in custom AI applications
|
||||||
|
|
||||||
Building production AI applications today requires solving multiple challenges:
|
Building production AI applications today requires solving multiple challenges:
|
||||||
|
|
||||||
|
@ -32,7 +36,7 @@ Building production AI applications today requires solving multiple challenges:
|
||||||
- Different providers have different APIs and abstractions.
|
- Different providers have different APIs and abstractions.
|
||||||
- Changing providers requires significant code changes.
|
- Changing providers requires significant code changes.
|
||||||
|
|
||||||
#### Our Solution: A Universal Stack
|
### Our Solution: A Universal Stack
|
||||||
|
|
||||||
Llama Stack addresses these challenges through a service-oriented, API-first approach:
|
Llama Stack addresses these challenges through a service-oriented, API-first approach:
|
||||||
|
|
||||||
|
@ -59,7 +63,7 @@ Llama Stack addresses these challenges through a service-oriented, API-first app
|
||||||
- Ecosystem offers tailored infrastructure, software, and services for deploying a variety of models.
|
- Ecosystem offers tailored infrastructure, software, and services for deploying a variety of models.
|
||||||
|
|
||||||
|
|
||||||
### Our Philosophy
|
## Our Philosophy
|
||||||
|
|
||||||
- **Service-Oriented**: REST APIs enforce clean interfaces and enable seamless transitions across different environments.
|
- **Service-Oriented**: REST APIs enforce clean interfaces and enable seamless transitions across different environments.
|
||||||
- **Composability**: Every component is independent but works together seamlessly
|
- **Composability**: Every component is independent but works together seamlessly
|
||||||
|
@ -67,4 +71,4 @@ Llama Stack addresses these challenges through a service-oriented, API-first app
|
||||||
- **Turnkey Solutions**: Easy to deploy built-in solutions for popular deployment scenarios
|
- **Turnkey Solutions**: Easy to deploy built-in solutions for popular deployment scenarios
|
||||||
|
|
||||||
|
|
||||||
With Llama Stack, you can focus on building your application while we handle the infrastructure complexity, essential capabilities, and provider integrations.
|
With Llama Stack, you can focus on building your application while we handle the infrastructure complexity, essential capabilities, and provider integrations.
|
|
@ -1,4 +1,11 @@
|
||||||
## Distributions
|
---
|
||||||
|
title: Distributions
|
||||||
|
description: Pre-packaged provider configurations for different deployment scenarios
|
||||||
|
sidebar_label: Distributions
|
||||||
|
sidebar_position: 5
|
||||||
|
---
|
||||||
|
|
||||||
|
# Distributions
|
||||||
|
|
||||||
While there is a lot of flexibility to mix-and-match providers, often users will work with a specific set of providers (hardware support, contractual obligations, etc.) We therefore need to provide a _convenient shorthand_ for such collections. We call this shorthand a **Llama Stack Distribution** or a **Distro**. One can think of it as specific pre-packaged versions of the Llama Stack. Here are some examples:
|
While there is a lot of flexibility to mix-and-match providers, often users will work with a specific set of providers (hardware support, contractual obligations, etc.) We therefore need to provide a _convenient shorthand_ for such collections. We call this shorthand a **Llama Stack Distribution** or a **Distro**. One can think of it as specific pre-packaged versions of the Llama Stack. Here are some examples:
|
||||||
|
|
||||||
|
@ -6,4 +13,4 @@ While there is a lot of flexibility to mix-and-match providers, often users will
|
||||||
|
|
||||||
**Locally Hosted Distro**: You may want to run Llama Stack on your own hardware. Typically though, you still need to use Inference via an external service. You can use providers like HuggingFace TGI, Fireworks, Together, etc. for this purpose. Or you may have access to GPUs and can run a [vLLM](https://github.com/vllm-project/vllm) or [NVIDIA NIM](https://build.nvidia.com/nim?filters=nimType%3Anim_type_run_anywhere&q=llama) instance. If you "just" have a regular desktop machine, you can use [Ollama](https://ollama.com/) for inference. To provide convenient quick access to these options, we provide a number of such pre-configured locally-hosted Distros.
|
**Locally Hosted Distro**: You may want to run Llama Stack on your own hardware. Typically though, you still need to use Inference via an external service. You can use providers like HuggingFace TGI, Fireworks, Together, etc. for this purpose. Or you may have access to GPUs and can run a [vLLM](https://github.com/vllm-project/vllm) or [NVIDIA NIM](https://build.nvidia.com/nim?filters=nimType%3Anim_type_run_anywhere&q=llama) instance. If you "just" have a regular desktop machine, you can use [Ollama](https://ollama.com/) for inference. To provide convenient quick access to these options, we provide a number of such pre-configured locally-hosted Distros.
|
||||||
|
|
||||||
**On-device Distro**: To run Llama Stack directly on an edge device (mobile phone or a tablet), we provide Distros for [iOS](../distributions/ondevice_distro/ios_sdk.md) and [Android](../distributions/ondevice_distro/android_sdk.md)
|
**On-device Distro**: To run Llama Stack directly on an edge device (mobile phone or a tablet), we provide Distros for [iOS](/docs/distributions/ondevice_distro/ios_sdk) and [Android](/docs/distributions/ondevice_distro/android_sdk)
|
43
docs/docs/concepts/index.mdx
Normal file
|
@ -0,0 +1,43 @@
|
||||||
|
# Core Concepts
|
||||||
|
|
||||||
|
Given Llama Stack's service-oriented philosophy, a few concepts and workflows arise which may not feel completely natural in the LLM landscape, especially if you are coming with a background in other frameworks.
|
||||||
|
|
||||||
|
## Documentation Structure
|
||||||
|
|
||||||
|
This section covers the fundamental concepts of Llama Stack:
|
||||||
|
|
||||||
|
- **[Architecture](./architecture.md)** - Learn about Llama Stack's architectural design and principles
|
||||||
|
- **[APIs](./apis/index.mdx)** - Understanding the core APIs and their stability levels
|
||||||
|
- [API Overview](./apis/index.mdx) - Core APIs available in Llama Stack
|
||||||
|
- [API Providers](./apis/api_providers.mdx) - How providers implement APIs
|
||||||
|
- [API Stability Leveling](./apis/api_leveling.mdx) - API stability and versioning
|
||||||
|
- **[Distributions](./distributions.md)** - Pre-configured deployment packages
|
||||||
|
- **[Resources](./resources.md)** - Understanding Llama Stack resources and their lifecycle
|
||||||
|
- **[External Integration](./external.md)** - Integrating with external services and providers
|
||||||
|
|
||||||
|
## Getting Started
|
||||||
|
|
||||||
|
If you're new to Llama Stack, we recommend starting with:
|
||||||
|
|
||||||
|
1. **[Architecture](./architecture.md)** - Understand the overall system design
|
||||||
|
2. **[APIs](./apis/index.mdx)** - Learn about the available APIs and their purpose
|
||||||
|
3. **[Distributions](./distributions.md)** - Choose a pre-configured setup for your use case
|
||||||
|
|
||||||
|
Each concept builds upon the previous ones to give you a comprehensive understanding of how Llama Stack works and how to use it effectively.
---
|
||||||
|
title: Core Concepts
|
||||||
|
description: Understanding Llama Stack's service-oriented philosophy and key concepts
|
||||||
|
sidebar_label: Overview
|
||||||
|
sidebar_position: 1
|
||||||
|
---
|
||||||
|
|
||||||
|
# Core Concepts
|
||||||
|
|
||||||
|
Given Llama Stack's service-oriented philosophy, a few concepts and workflows arise which may not feel completely natural in the LLM landscape, especially if you are coming with a background in other frameworks.
|
||||||
|
|
||||||
|
This section covers the key concepts you need to understand to work effectively with Llama Stack:
|
||||||
|
|
||||||
|
- **[Architecture](./architecture)** - Llama Stack's service-oriented design and benefits
|
||||||
|
- **[APIs](./apis)** - Available REST APIs and planned capabilities
|
||||||
|
- **[API Providers](./api_providers)** - Remote vs inline provider implementations
|
||||||
|
- **[Distributions](./distributions)** - Pre-packaged provider configurations
|
||||||
|
- **[Resources](./resources)** - Resource federation and registration
|
|
@ -1,4 +1,11 @@
|
||||||
## Resources
|
---
|
||||||
|
title: Resources
|
||||||
|
description: Resource federation and registration in Llama Stack
|
||||||
|
sidebar_label: Resources
|
||||||
|
sidebar_position: 6
|
||||||
|
---
|
||||||
|
|
||||||
|
# Resources
|
||||||
|
|
||||||
Some of these APIs are associated with a set of **Resources**. Here is the mapping of APIs to resources:
|
Some of these APIs are associated with a set of **Resources**. Here is the mapping of APIs to resources:
|
||||||
|
|
||||||
|
@ -12,8 +19,8 @@ Some of these APIs are associated with a set of **Resources**. Here is the mappi
|
||||||
|
|
||||||
Furthermore, we allow these resources to be **federated** across multiple providers. For example, you may have some Llama models served by Fireworks while others are served by AWS Bedrock. Regardless, they will all work seamlessly with the same uniform Inference API provided by Llama Stack.
|
Furthermore, we allow these resources to be **federated** across multiple providers. For example, you may have some Llama models served by Fireworks while others are served by AWS Bedrock. Regardless, they will all work seamlessly with the same uniform Inference API provided by Llama Stack.
|
||||||
|
|
||||||
```{admonition} Registering Resources
|
:::tip Registering Resources
|
||||||
:class: tip
|
|
||||||
|
|
||||||
Given this architecture, it is necessary for the Stack to know which provider to use for a given resource. This means you need to explicitly _register_ resources (including models) before you can use them with the associated APIs.
|
Given this architecture, it is necessary for the Stack to know which provider to use for a given resource. This means you need to explicitly _register_ resources (including models) before you can use them with the associated APIs.
|
||||||
```
|
|
||||||
|
:::
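For example, registering a model resource before using it with the Inference API might look like the following sketch; the method follows the client SDK's resource registration pattern, and the model and provider identifiers are illustrative assumptions:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Register a model so the Stack knows which provider serves it.
# The identifiers below are illustrative assumptions.
client.models.register(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    provider_id="ollama",
)
```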
|
244
docs/docs/contributing/index.mdx
Normal file
|
@ -0,0 +1,244 @@
|
||||||
|
# Contributing to Llama Stack
|
||||||
|
We want to make contributing to this project as easy and transparent as
|
||||||
|
possible.
|
||||||
|
|
||||||
|
## Set up your development environment
|
||||||
|
|
||||||
|
We use [uv](https://github.com/astral-sh/uv) to manage python dependencies and virtual environments.
|
||||||
|
You can install `uv` by following this [guide](https://docs.astral.sh/uv/getting-started/installation/).
|
||||||
|
|
||||||
|
You can install the dependencies by running:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd llama-stack
|
||||||
|
uv sync --group dev
|
||||||
|
uv pip install -e .
|
||||||
|
source .venv/bin/activate
|
||||||
|
```
|
||||||
|
|
||||||
|
:::note
|
||||||
|
You can use a specific version of Python with `uv` by adding the `--python <version>` flag (e.g. `--python 3.12`).
|
||||||
|
Otherwise, `uv` will automatically select a Python version according to the `requires-python` section of the `pyproject.toml`.
|
||||||
|
For more info, see the [uv docs around Python versions](https://docs.astral.sh/uv/concepts/python-versions/).
|
||||||
|
:::
|
||||||
|
|
||||||
|
Note that you can create a dotenv file `.env` that includes necessary environment variables:
|
||||||
|
```
|
||||||
|
LLAMA_STACK_BASE_URL=http://localhost:8321
|
||||||
|
LLAMA_STACK_CLIENT_LOG=debug
|
||||||
|
LLAMA_STACK_PORT=8321
|
||||||
|
LLAMA_STACK_CONFIG=<provider-name>
|
||||||
|
TAVILY_SEARCH_API_KEY=
|
||||||
|
BRAVE_SEARCH_API_KEY=
|
||||||
|
```
|
||||||
|
|
||||||
|
And then use this dotenv file when running client SDK tests via the following:
|
||||||
|
```bash
|
||||||
|
uv run --env-file .env -- pytest -v tests/integration/inference/test_text_inference.py --text-model=meta-llama/Llama-3.1-8B-Instruct
|
||||||
|
```
|
||||||
|
|
||||||
|
### Pre-commit Hooks
|
||||||
|
|
||||||
|
We use [pre-commit](https://pre-commit.com/) to run linting and formatting checks on your code. You can install the pre-commit hooks by running:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
uv run pre-commit install
|
||||||
|
```
|
||||||
|
|
||||||
|
After that, pre-commit hooks will run automatically before each commit.
|
||||||
|
|
||||||
|
Alternatively, if you don't want to install the pre-commit hooks, you can run the checks manually by running:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
uv run pre-commit run --all-files
|
||||||
|
```
|
||||||
|
|
||||||
|
:::caution
|
||||||
|
Before pushing your changes, make sure that the pre-commit hooks have passed successfully.
|
||||||
|
:::
|
||||||
|
|
||||||
|
## Discussions -> Issues -> Pull Requests
|
||||||
|
|
||||||
|
We actively welcome your pull requests. However, please read the following. This is heavily inspired by [Ghostty](https://github.com/ghostty-org/ghostty/blob/main/CONTRIBUTING.md).
|
||||||
|
|
||||||
|
If in doubt, please open a [discussion](https://github.com/meta-llama/llama-stack/discussions); we can always convert that to an issue later.
|
||||||
|
|
||||||
|
### Issues
|
||||||
|
We use GitHub issues to track public bugs. Please ensure your description is
|
||||||
|
clear and has sufficient instructions to be able to reproduce the issue.
|
||||||
|
|
||||||
|
Meta has a [bounty program](http://facebook.com/whitehat/info) for the safe
|
||||||
|
disclosure of security bugs. In those cases, please go through the process
|
||||||
|
outlined on that page and do not file a public issue.
|
||||||
|
|
||||||
|
### Contributor License Agreement ("CLA")
|
||||||
|
In order to accept your pull request, we need you to submit a CLA. You only need
|
||||||
|
to do this once to work on any of Meta's open source projects.
|
||||||
|
|
||||||
|
Complete your CLA here: <https://code.facebook.com/cla>
|
||||||
|
|
||||||
|
**I'd like to contribute!**
|
||||||
|
|
||||||
|
If you are new to the project, start by looking at the issues tagged with "good first issue". If you're interested
|
||||||
|
leave a comment on the issue and a triager will assign it to you.
|
||||||
|
|
||||||
|
Please avoid picking up too many issues at once. This helps you stay focused and ensures that others in the community also have opportunities to contribute.
|
||||||
|
- Try to work on only 1–2 issues at a time, especially if you’re still getting familiar with the codebase.
|
||||||
|
- Before taking an issue, check if it’s already assigned or being actively discussed.
|
||||||
|
- If you’re blocked or can’t continue with an issue, feel free to unassign yourself or leave a comment so others can step in.
|
||||||
|
|
||||||
|
**I have a bug!**
|
||||||
|
|
||||||
|
1. Search the issue tracker and discussions for similar issues.
|
||||||
|
2. If you don't have steps to reproduce, open a discussion.
|
||||||
|
3. If you have steps to reproduce, open an issue.
|
||||||
|
|
||||||
|
**I have an idea for a feature!**
|
||||||
|
|
||||||
|
1. Open a discussion.
|
||||||
|
|
||||||
|
**I've implemented a feature!**
|
||||||
|
|
||||||
|
1. If there is an issue for the feature, open a pull request.
|
||||||
|
2. If there is no issue, open a discussion and link to your branch.
|
||||||
|
|
||||||
|
**I have a question!**
|
||||||
|
|
||||||
|
1. Open a discussion or use [Discord](https://discord.gg/llama-stack).
|
||||||
|
|
||||||
|
|
||||||
|
**Opening a Pull Request**
|
||||||
|
|
||||||
|
1. Fork the repo and create your branch from `main`.
|
||||||
|
2. If you've changed APIs, update the documentation.
|
||||||
|
3. Ensure the test suite passes.
|
||||||
|
4. Make sure your code lints using `pre-commit`.
|
||||||
|
5. If you haven't already, complete the Contributor License Agreement ("CLA").
|
||||||
|
6. Ensure your pull request follows the [conventional commits format](https://www.conventionalcommits.org/en/v1.0.0/).
|
||||||
|
7. Ensure your pull request follows the [coding style](#coding-style).
|
||||||
|
|
||||||
|
|
||||||
|
Please keep pull requests (PRs) small and focused. If you have a large set of changes, consider splitting them into logically grouped, smaller PRs to facilitate review and testing.
|
||||||
|
|
||||||
|
:::tip
|
||||||
|
As a general guideline:
|
||||||
|
- Experienced contributors should try to keep no more than 5 open PRs at a time.
|
||||||
|
- New contributors are encouraged to have only one open PR at a time until they’re familiar with the codebase and process.
|
||||||
|
:::
|
||||||
|
|
||||||
|
## Repository guidelines
|
||||||
|
|
||||||
|
### Coding Style
|
||||||
|
|
||||||
|
* Comments should provide meaningful insights into the code. Avoid filler comments that simply
|
||||||
|
describe the next step, as they create unnecessary clutter, same goes for docstrings.
|
||||||
|
* Prefer comments to clarify surprising behavior and/or relationships between parts of the code
|
||||||
|
rather than explain what the next line of code does.
|
||||||
|
* When catching exceptions, prefer using a specific exception type rather than a broad catch-all like
|
||||||
|
`Exception`.
|
||||||
|
* Error messages should be prefixed with "Failed to ..."
|
||||||
|
* 4 spaces for indentation rather than tabs
|
||||||
|
* When using `# noqa` to suppress a style or linter warning, include a comment explaining the
|
||||||
|
justification for bypassing the check.
|
||||||
|
* When using `# type: ignore` to suppress a mypy warning, include a comment explaining the
|
||||||
|
justification for bypassing the check.
|
||||||
|
* Don't use unicode characters in the codebase. ASCII-only is preferred for compatibility or
|
||||||
|
readability reasons.
|
||||||
|
* Provider configuration classes should be Pydantic models. Each field should have a `description`
|
||||||
|
that describes the configuration. These descriptions will be used to generate the provider
|
||||||
|
documentation.
|
||||||
|
* When possible, use keyword arguments only when calling functions.
|
||||||
|
* Llama Stack utilizes [custom Exception classes](llama_stack/apis/common/errors.py) for certain Resources that should be used where applicable.
|
||||||
|
|
||||||
|
### License
|
||||||
|
By contributing to Llama, you agree that your contributions will be licensed
|
||||||
|
under the LICENSE file in the root directory of this source tree.
|
||||||
|
|
||||||
|
## Common Tasks
|
||||||
|
|
||||||
|
Some tips about common tasks you work on while contributing to Llama Stack:
|
||||||
|
|
||||||
|
### Using `llama stack build`
|
||||||
|
|
||||||
|
Building a stack image will use the production version of the `llama-stack` and `llama-stack-client` packages. If you are developing with a llama-stack repository checked out and need your code to be reflected in the stack image, set `LLAMA_STACK_DIR` and `LLAMA_STACK_CLIENT_DIR` to the appropriate checked out directories when running any of the `llama` CLI commands.
|
||||||
|
|
||||||
|
Example:
|
||||||
|
```bash
|
||||||
|
cd work/
|
||||||
|
git clone https://github.com/meta-llama/llama-stack.git
|
||||||
|
git clone https://github.com/meta-llama/llama-stack-client-python.git
|
||||||
|
cd llama-stack
|
||||||
|
LLAMA_STACK_DIR=$(pwd) LLAMA_STACK_CLIENT_DIR=../llama-stack-client-python llama stack build --distro <...>
|
||||||
|
```
|
||||||
|
|
||||||
|
### Updating distribution configurations
|
||||||
|
|
||||||
|
If you have made changes to a provider's configuration in any form (introducing a new config key, or
|
||||||
|
changing models, etc.), you should run `./scripts/distro_codegen.py` to re-generate various YAML
|
||||||
|
files as well as the documentation. You should not change `docs/source/.../distributions/` files
|
||||||
|
manually as they are auto-generated.
|
||||||
|
|
||||||
|
### Updating the provider documentation
|
||||||
|
|
||||||
|
If you have made changes to a provider's configuration, you should run `./scripts/provider_codegen.py`
|
||||||
|
to re-generate the documentation. You should not change `docs/source/.../providers/` files manually
|
||||||
|
as they are auto-generated.
|
||||||
|
Note that the provider "description" field will be used to generate the provider documentation.
|
||||||
|
|
||||||
|
### Building the Documentation
|
||||||
|
|
||||||
|
If you are making changes to the documentation at [https://llamastack.github.io/latest/](https://llamastack.github.io/latest/), you can use the following command to build the documentation and preview your changes. You will need [Sphinx](https://www.sphinx-doc.org/en/master/) and the readthedocs theme.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# This rebuilds the documentation pages.
|
||||||
|
uv run --group docs make -C docs/ html
|
||||||
|
|
||||||
|
# This will start a local server (usually at http://127.0.0.1:8000) that automatically rebuilds and refreshes when you make changes to the documentation.
|
||||||
|
uv run --group docs sphinx-autobuild docs/source docs/build/html --write-all
|
||||||
|
```
|
||||||
|
|
||||||
|
### Update API Documentation
|
||||||
|
|
||||||
|
If you modify or add new API endpoints, update the API documentation accordingly. You can do this by running the following command:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
uv run ./docs/openapi_generator/run_openapi_generator.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
The generated API documentation will be available in `docs/_static/`. Make sure to review the changes before committing.
|
||||||
|
|
||||||
|
## Adding a New Provider
|
||||||
|
|
||||||
|
See:
|
||||||
|
- [Adding a New API Provider Page](new_api_provider.md) which describes how to add new API providers to the Stack.
|
||||||
|
- [Vector Database Page](new_vector_database.md) which describes how to add a new vector databases with Llama Stack.
|
||||||
|
- [External Provider Page](../providers/external/index.md) which describes how to add external providers to the Stack.
|
||||||
|
|
||||||
|
```{toctree}
|
||||||
|
:maxdepth: 1
|
||||||
|
:hidden:
|
||||||
|
|
||||||
|
new_api_provider
|
||||||
|
new_vector_database
|
||||||
|
```
|
||||||
|
|
||||||
|
## Testing
|
||||||
|
|
||||||
|
|
||||||
|
```{include} ../../../tests/README.md
|
||||||
|
```
|
||||||
|
|
||||||
|
## Advanced Topics
|
||||||
|
|
||||||
|
For developers who need deeper understanding of the testing system internals:
|
||||||
|
|
||||||
|
```{toctree}
|
||||||
|
:maxdepth: 1
|
||||||
|
|
||||||
|
testing/record-replay
|
||||||
|
```
|
||||||
|
|
||||||
|
### Benchmarking
|
||||||
|
|
||||||
|
```{include} ../../../benchmarking/k8s-benchmark/README.md
|
||||||
|
```
|
|
@ -1,4 +1,12 @@
|
||||||
# Adding a New API Provider
|
---
|
||||||
|
title: Adding a New API Provider
|
||||||
|
description: Guide for adding new API providers to Llama Stack
|
||||||
|
sidebar_label: New API Provider
|
||||||
|
sidebar_position: 2
|
||||||
|
---
|
||||||
|
|
||||||
|
import Tabs from '@theme/Tabs';
|
||||||
|
import TabItem from '@theme/TabItem';
|
||||||
|
|
||||||
This guide will walk you through the process of adding a new API provider to Llama Stack.
|
This guide will walk you through the process of adding a new API provider to Llama Stack.
|
||||||
|
|
|
@ -1,4 +1,12 @@
|
||||||
# Adding a New Vector Database
|
---
|
||||||
|
title: Adding a New Vector Database
|
||||||
|
description: Guide for adding new vector database providers to Llama Stack
|
||||||
|
sidebar_label: New Vector Database
|
||||||
|
sidebar_position: 3
|
||||||
|
---
|
||||||
|
|
||||||
|
import Tabs from '@theme/Tabs';
|
||||||
|
import TabItem from '@theme/TabItem';
|
||||||
|
|
||||||
This guide will walk you through the process of adding a new vector database to Llama Stack.
|
This guide will walk you through the process of adding a new vector database to Llama Stack.
|
||||||
|
|
||||||
|
@ -72,4 +80,4 @@ InlineProviderSpec(
|
||||||
- `uv add new_pip_package --group test`
|
- `uv add new_pip_package --group test`
|
||||||
5. **Update Documentation**: Please update the documentation for end users
|
5. **Update Documentation**: Please update the documentation for end users
|
||||||
- Generate the provider documentation by running {repopath}`./scripts/provider_codegen.py`.
|
- Generate the provider documentation by running {repopath}`./scripts/provider_codegen.py`.
|
||||||
- Update the autogenerated content in the registry/vector_io.py file with information about your provider. Please see other providers for examples.
|
- Update the autogenerated content in the registry/vector_io.py file with information about your provider. Please see other providers for examples.
|
|
@ -1,3 +1,13 @@
|
||||||
|
---
|
||||||
|
title: Record-Replay Testing System
|
||||||
|
description: Understanding how Llama Stack captures and replays API interactions for testing
|
||||||
|
sidebar_label: Record-Replay System
|
||||||
|
sidebar_position: 4
|
||||||
|
---
|
||||||
|
|
||||||
|
import Tabs from '@theme/Tabs';
|
||||||
|
import TabItem from '@theme/TabItem';
|
||||||
|
|
||||||
# Record-Replay System
|
# Record-Replay System
|
||||||
|
|
||||||
Understanding how Llama Stack captures and replays API interactions for testing.
|
Understanding how Llama Stack captures and replays API interactions for testing.
|
||||||
|
@ -228,4 +238,4 @@ Loose hashing (normalizing whitespace, rounding floats) seems convenient but hid
|
||||||
- **SQLite** - Fast indexed lookups without loading response bodies
|
- **SQLite** - Fast indexed lookups without loading response bodies
|
||||||
- **Hybrid** - Best of both worlds for different use cases
|
- **Hybrid** - Best of both worlds for different use cases
|
||||||
|
|
||||||
This system provides reliable, fast testing against real AI APIs while maintaining the ability to debug issues when they arise.
|
This system provides reliable, fast testing against real AI APIs while maintaining the ability to debug issues when they arise.
|
30
docs/docs/deploying/aws_eks_deployment.mdx
Normal file
|
@ -0,0 +1,30 @@
|
||||||
|
---
|
||||||
|
title: AWS EKS Deployment Guide
|
||||||
|
description: Deploy Llama Stack on AWS EKS
|
||||||
|
sidebar_label: AWS EKS Deployment
|
||||||
|
sidebar_position: 3
|
||||||
|
---
|
||||||
|
|
||||||
|
## AWS EKS Deployment
|
||||||
|
|
||||||
|
### Prerequisites
|
||||||
|
|
||||||
|
- Set up an [EKS cluster](https://docs.aws.amazon.com/eks/latest/userguide/getting-started.html)
|
||||||
|
- Create a [GitHub OAuth app](https://docs.github.com/en/apps/oauth-apps/building-oauth-apps/creating-an-oauth-app)
|
||||||
|
- Set authorization callback URL to `http://<your-llama-stack-ui-url>/api/auth/callback/`
|
||||||
|
|
||||||
|
### Automated Deployment
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export HF_TOKEN=<your-huggingface-token>
|
||||||
|
export GITHUB_CLIENT_ID=<your-github-client-id>
|
||||||
|
export GITHUB_CLIENT_SECRET=<your-github-client-secret>
|
||||||
|
export LLAMA_STACK_UI_URL=<your-llama-stack-ui-url>
|
||||||
|
|
||||||
|
cd docs/source/distributions/eks
|
||||||
|
./apply.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
This script will:
|
||||||
|
- Set up default storage class for AWS EKS
|
||||||
|
- Deploy Llama Stack server in Kubernetes pods and services
|
14
docs/docs/deploying/index.mdx
Normal file
|
@ -0,0 +1,14 @@
|
||||||
|
---
|
||||||
|
title: Deploying Llama Stack
|
||||||
|
description: Production deployment guides for Llama Stack in various environments
|
||||||
|
sidebar_label: Overview
|
||||||
|
sidebar_position: 1
|
||||||
|
---
|
||||||
|
|
||||||
|
import Tabs from '@theme/Tabs';
|
||||||
|
import TabItem from '@theme/TabItem';
|
||||||
|
|
||||||
|
# Deploying Llama Stack
|
||||||
|
|
||||||
|
[**→ Kubernetes Deployment Guide**](./kubernetes_deployment.mdx)
|
||||||
|
[**→ AWS EKS Deployment Guide**](./aws_eks_deployment.mdx)
|
|
@ -1,27 +1,39 @@
|
||||||
## Kubernetes Deployment Guide
|
---
|
||||||
|
title: Kubernetes Deployment Guide
|
||||||
|
description: Deploy Llama Stack on Kubernetes clusters with vLLM inference service
|
||||||
|
sidebar_label: Kubernetes
|
||||||
|
sidebar_position: 2
|
||||||
|
---
|
||||||
|
|
||||||
Instead of starting the Llama Stack and vLLM servers locally. We can deploy them in a Kubernetes cluster.
|
import Tabs from '@theme/Tabs';
|
||||||
|
import TabItem from '@theme/TabItem';
|
||||||
|
|
||||||
### Prerequisites
|
# Kubernetes Deployment Guide
|
||||||
In this guide, we'll use a local [Kind](https://kind.sigs.k8s.io/) cluster and a vLLM inference service in the same cluster for demonstration purposes.
|
|
||||||
|
|
||||||
Note: You can also deploy the Llama Stack server in an AWS EKS cluster. See [Deploying Llama Stack Server in AWS EKS](#deploying-llama-stack-server-in-aws-eks) for more details.
|
Deploy Llama Stack and vLLM servers in a Kubernetes cluster instead of running them locally. This guide covers both local development with Kind and production deployment on AWS EKS.
|
||||||
|
|
||||||
First, create a local Kubernetes cluster via Kind:
|
## Prerequisites
|
||||||
|
|
||||||
```
|
### Local Kubernetes Setup
|
||||||
|
|
||||||
|
Create a local Kubernetes cluster via Kind:
|
||||||
|
|
||||||
|
```bash
|
||||||
kind create cluster --image kindest/node:v1.32.0 --name llama-stack-test
|
kind create cluster --image kindest/node:v1.32.0 --name llama-stack-test
|
||||||
```
|
```
|
||||||
|
|
||||||
First set your hugging face token as an environment variable.
|
Set your Hugging Face token:
|
||||||
```
|
|
||||||
|
```bash
|
||||||
export HF_TOKEN=$(echo -n "your-hf-token" | base64)
|
export HF_TOKEN=$(echo -n "your-hf-token" | base64)
|
||||||
```
|
```
|
||||||
|
|
||||||
Now create a Kubernetes PVC and Secret for downloading and storing Hugging Face model:
|
## Quick Deployment
|
||||||
|
|
||||||
```
|
### Step 1: Create Storage and Secrets
|
||||||
cat <<EOF |kubectl apply -f -
|
|
||||||
|
```yaml
|
||||||
|
cat <<EOF | kubectl apply -f -
|
||||||
apiVersion: v1
|
apiVersion: v1
|
||||||
kind: PersistentVolumeClaim
|
kind: PersistentVolumeClaim
|
||||||
metadata:
|
metadata:
|
||||||
|
@ -44,11 +56,10 @@ data:
|
||||||
EOF
|
EOF
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Step 2: Deploy vLLM Server
|
||||||
|
|
||||||
Next, start the vLLM server as a Kubernetes Deployment and Service:
|
```yaml
|
||||||
|
cat <<EOF | kubectl apply -f -
|
||||||
```
|
|
||||||
cat <<EOF |kubectl apply -f -
|
|
||||||
apiVersion: apps/v1
|
apiVersion: apps/v1
|
||||||
kind: Deployment
|
kind: Deployment
|
||||||
metadata:
|
metadata:
|
||||||
|
@ -67,9 +78,7 @@ spec:
|
||||||
- name: vllm
|
- name: vllm
|
||||||
image: vllm/vllm-openai:latest
|
image: vllm/vllm-openai:latest
|
||||||
command: ["/bin/sh", "-c"]
|
command: ["/bin/sh", "-c"]
|
||||||
args: [
|
args: ["vllm serve meta-llama/Llama-3.2-1B-Instruct"]
|
||||||
"vllm serve meta-llama/Llama-3.2-1B-Instruct"
|
|
||||||
]
|
|
||||||
env:
|
env:
|
||||||
- name: HUGGING_FACE_HUB_TOKEN
|
- name: HUGGING_FACE_HUB_TOKEN
|
||||||
valueFrom:
|
valueFrom:
|
||||||
|
@ -101,18 +110,9 @@ spec:
|
||||||
EOF
|
EOF
|
||||||
```
|
```
|
||||||
|
|
||||||
We can verify that the vLLM server has started successfully via the logs (this might take a couple of minutes to download the model):
|
### Step 3: Configure Llama Stack
|
||||||
|
|
||||||
```
|
Update your run configuration:
|
||||||
$ kubectl logs -l app.kubernetes.io/name=vllm
|
|
||||||
...
|
|
||||||
INFO: Started server process [1]
|
|
||||||
INFO: Waiting for application startup.
|
|
||||||
INFO: Application startup complete.
|
|
||||||
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
|
|
||||||
```
|
|
||||||
|
|
||||||
Then we can modify the Llama Stack run configuration YAML with the following inference provider:
|
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
providers:
|
providers:
|
||||||
|
@ -125,26 +125,22 @@ providers:
|
||||||
api_token: fake
|
api_token: fake
|
||||||
```
|
```
|
||||||
|
|
||||||
Once we have defined the run configuration for Llama Stack, we can build an image with that configuration and the server source code:
|
Build container image:
|
||||||
|
|
||||||
```
|
```bash
|
||||||
tmp_dir=$(mktemp -d) && cat >$tmp_dir/Containerfile.llama-stack-run-k8s <<EOF
|
tmp_dir=$(mktemp -d) && cat >$tmp_dir/Containerfile.llama-stack-run-k8s <<EOF
|
||||||
FROM distribution-myenv:dev
|
FROM distribution-myenv:dev
|
||||||
|
|
||||||
RUN apt-get update && apt-get install -y git
|
RUN apt-get update && apt-get install -y git
|
||||||
RUN git clone https://github.com/meta-llama/llama-stack.git /app/llama-stack-source
|
RUN git clone https://github.com/meta-llama/llama-stack.git /app/llama-stack-source
|
||||||
|
|
||||||
ADD ./vllm-llama-stack-run-k8s.yaml /app/config.yaml
|
ADD ./vllm-llama-stack-run-k8s.yaml /app/config.yaml
|
||||||
EOF
|
EOF
|
||||||
podman build -f $tmp_dir/Containerfile.llama-stack-run-k8s -t llama-stack-run-k8s $tmp_dir
|
podman build -f $tmp_dir/Containerfile.llama-stack-run-k8s -t llama-stack-run-k8s $tmp_dir
|
||||||
```
|
```
|
||||||
|
|
||||||
### Deploying Llama Stack Server in Kubernetes
|
### Step 4: Deploy Llama Stack Server
|
||||||
|
|
||||||
We can then start the Llama Stack server by deploying a Kubernetes Pod and Service:
|
```yaml
|
||||||
|
cat <<EOF | kubectl apply -f -
|
||||||
```
|
|
||||||
cat <<EOF |kubectl apply -f -
|
|
||||||
apiVersion: v1
|
apiVersion: v1
|
||||||
kind: PersistentVolumeClaim
|
kind: PersistentVolumeClaim
|
||||||
metadata:
|
metadata:
|
||||||
|
@ -200,48 +196,29 @@ spec:
|
||||||
EOF
|
EOF
|
||||||
```
|
```
|
||||||
|
|
||||||
### Verifying the Deployment
|
### Step 5: Test Deployment
|
||||||
We can check that the LlamaStack server has started:
|
|
||||||
|
|
||||||
```
|
```bash
|
||||||
$ kubectl logs -l app.kubernetes.io/name=llama-stack
|
# Port forward and test
|
||||||
...
|
|
||||||
INFO: Started server process [1]
|
|
||||||
INFO: Waiting for application startup.
|
|
||||||
INFO: ASGI 'lifespan' protocol appears unsupported.
|
|
||||||
INFO: Application startup complete.
|
|
||||||
INFO: Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit)
|
|
||||||
```
|
|
||||||
|
|
||||||
Finally, we forward the Kubernetes service to a local port and test some inference requests against it via the Llama Stack Client:
|
|
||||||
|
|
||||||
```
|
|
||||||
kubectl port-forward service/llama-stack-service 5000:5000
|
kubectl port-forward service/llama-stack-service 5000:5000
|
||||||
llama-stack-client --endpoint http://localhost:5000 inference chat-completion --message "hello, what model are you?"
|
llama-stack-client --endpoint http://localhost:5000 inference chat-completion --message "hello, what model are you?"
|
||||||
```
|
```
|
||||||
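You can also exercise the forwarded endpoint from Python with the client SDK, mirroring the quickstart. A minimal sketch:

```python
from llama_stack_client import LlamaStackClient

# Assumes the port-forward from the previous step is still running
client = LlamaStackClient(base_url="http://localhost:5000")
for model in client.models.list():
    print(model.identifier)
```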
|
|
||||||
## Deploying Llama Stack Server in AWS EKS
|
## Troubleshooting
|
||||||
|
|
||||||
We've also provided a script to deploy the Llama Stack server in an AWS EKS cluster.
|
**Check pod status:**
|
||||||
|
```bash
|
||||||
Prerequisites:
|
kubectl get pods -l app.kubernetes.io/name=vllm
|
||||||
- Set up an [EKS cluster](https://docs.aws.amazon.com/eks/latest/userguide/getting-started.html).
|
kubectl logs -l app.kubernetes.io/name=vllm
|
||||||
- Create a [Github OAuth app](https://docs.github.com/en/apps/oauth-apps/building-oauth-apps/creating-an-oauth-app) and get the client ID and client secret.
|
|
||||||
- Set the `Authorization callback URL` to `http://<your-llama-stack-ui-url>/api/auth/callback/`
|
|
||||||
|
|
||||||
|
|
||||||
Run the following script to deploy the Llama Stack server:
|
|
||||||
```
|
|
||||||
export HF_TOKEN=<your-huggingface-token>
|
|
||||||
export GITHUB_CLIENT_ID=<your-github-client-id>
|
|
||||||
export GITHUB_CLIENT_SECRET=<your-github-client-secret>
|
|
||||||
export LLAMA_STACK_UI_URL=<your-llama-stack-ui-url>
|
|
||||||
|
|
||||||
cd docs/source/distributions/eks
|
|
||||||
./apply.sh
|
|
||||||
```
|
```
|
||||||
|
|
||||||
This script will:
|
**Test service connectivity:**
|
||||||
|
```bash
|
||||||
|
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl http://vllm-server:8000/v1/models
|
||||||
|
```
|
||||||
|
|
||||||
- Set up a default storage class for AWS EKS
|
## Related Resources
|
||||||
- Deploy the Llama Stack server in a Kubernetes Pod and Service
|
|
||||||
|
- **[Deployment Overview](./index)** - Overview of deployment options
|
||||||
|
- **[Distributions](/docs/distributions)** - Understanding Llama Stack distributions
|
||||||
|
- **[Configuration](/docs/distributions/configuration)** - Detailed configuration options
|
|
@ -1,5 +1,9 @@
|
||||||
# Build your own Distribution
|
---
|
||||||
|
title: Building Custom Distributions
|
||||||
|
description: Building a Llama Stack distribution from scratch
|
||||||
|
sidebar_label: Build your own Distribution
|
||||||
|
sidebar_position: 3
|
||||||
|
---
|
||||||
|
|
||||||
This guide will walk you through the steps to get started with building a Llama Stack distribution from scratch with your choice of API providers.
|
This guide will walk you through the steps to get started with building a Llama Stack distribution from scratch with your choice of API providers.
|
||||||
|
|
|
@ -1,3 +1,9 @@
|
||||||
|
---
|
||||||
|
title: Configuring a "Stack"
|
||||||
|
description: Configuring a "Stack"
|
||||||
|
sidebar_label: Configuring a "Stack"
|
||||||
|
sidebar_position: 6
|
||||||
|
---
|
||||||
# Configuring a "Stack"
|
# Configuring a "Stack"
|
||||||
|
|
||||||
The Llama Stack runtime configuration is specified as a YAML file. Here is a simplified version of an example configuration file for the Ollama distribution:
|
The Llama Stack runtime configuration is specified as a YAML file. Here is a simplified version of an example configuration file for the Ollama distribution:
|
|
@ -1,3 +1,9 @@
|
||||||
|
---
|
||||||
|
title: Customizing run.yaml
|
||||||
|
description: Customizing run.yaml files for Llama Stack templates
|
||||||
|
sidebar_label: Customizing run.yaml
|
||||||
|
sidebar_position: 4
|
||||||
|
---
|
||||||
# Customizing run.yaml Files
|
# Customizing run.yaml Files
|
||||||
|
|
||||||
The `run.yaml` files generated by Llama Stack templates are **starting points** designed to be customized for your specific needs. They are not meant to be used as-is in production environments.
|
The `run.yaml` files generated by Llama Stack templates are **starting points** designed to be customized for your specific needs. They are not meant to be used as-is in production environments.
|
||||||
|
@ -37,4 +43,4 @@ your-project/
|
||||||
└── README.md
|
└── README.md
|
||||||
```
|
```
|
||||||
|
|
||||||
The goal is to take the generated template and adapt it to your specific infrastructure and operational needs.
|
The goal is to take the generated template and adapt it to your specific infrastructure and operational needs.
|
|
@ -1,3 +1,9 @@
|
||||||
|
---
|
||||||
|
title: Using Llama Stack as a Library
|
||||||
|
description: How to use Llama Stack as a Python library instead of running a server
|
||||||
|
sidebar_label: Importing as Library
|
||||||
|
sidebar_position: 5
|
||||||
|
---
|
||||||
# Using Llama Stack as a Library
|
# Using Llama Stack as a Library
|
||||||
|
|
||||||
## Setup Llama Stack without a Server
|
## Setup Llama Stack without a Server
|
21
docs/docs/distributions/index.mdx
Normal file
|
@ -0,0 +1,21 @@
|
||||||
|
---
|
||||||
|
title: Distributions Overview
|
||||||
|
description: Pre-packaged sets of Llama Stack components for different deployment scenarios
|
||||||
|
sidebar_label: Overview
|
||||||
|
sidebar_position: 1
|
||||||
|
---
|
||||||
|
|
||||||
|
# Distributions Overview
|
||||||
|
|
||||||
|
A distribution is a pre-packaged set of Llama Stack components that can be deployed together.
|
||||||
|
|
||||||
|
This section provides an overview of the distributions available in Llama Stack.
|
||||||
|
|
||||||
|
## Distribution Guides
|
||||||
|
|
||||||
|
- **[Available Distributions](./list_of_distributions)** - Complete list and comparison of all distributions
|
||||||
|
- **[Building Custom Distributions](./building_distro)** - Create your own distribution from scratch
|
||||||
|
- **[Customizing Configuration](./customizing_run_yaml)** - Customize run.yaml for your needs
|
||||||
|
- **[Starting Llama Stack Server](./starting_llama_stack_server)** - How to run distributions
|
||||||
|
- **[Importing as Library](./importing_as_library)** - Use distributions in your code
|
||||||
|
- **[Configuration Reference](./configuration)** - Configuration file format details
|
|
@ -1,3 +1,10 @@
|
||||||
|
---
|
||||||
|
title: Available Distributions
|
||||||
|
description: List of available distributions for Llama Stack
|
||||||
|
sidebar_label: Available Distributions
|
||||||
|
sidebar_position: 2
|
||||||
|
---
|
||||||
|
|
||||||
# Available Distributions
|
# Available Distributions
|
||||||
|
|
||||||
Llama Stack provides several pre-configured distributions to help you get started quickly. Choose the distribution that best fits your hardware and use case.
|
Llama Stack provides several pre-configured distributions to help you get started quickly. Choose the distribution that best fits your hardware and use case.
|
|
@ -1,3 +1,10 @@
|
||||||
|
---
|
||||||
|
title: Starting a Llama Stack Server
|
||||||
|
description: Different ways to run Llama Stack servers - as library, container, or Kubernetes deployment
|
||||||
|
sidebar_label: Starting Llama Stack Server
|
||||||
|
sidebar_position: 7
|
||||||
|
---
|
||||||
|
|
||||||
# Starting a Llama Stack Server
|
# Starting a Llama Stack Server
|
||||||
|
|
||||||
You can run a Llama Stack server in one of the following ways:
|
You can run a Llama Stack server in one of the following ways:
|
|
@ -1,3 +1,13 @@
|
||||||
|
---
|
||||||
|
title: Detailed Tutorial
|
||||||
|
description: Complete guide to using Llama Stack server and client SDK to build AI agents
|
||||||
|
sidebar_label: Detailed Tutorial
|
||||||
|
sidebar_position: 3
|
||||||
|
---
|
||||||
|
|
||||||
|
import Tabs from '@theme/Tabs';
|
||||||
|
import TabItem from '@theme/TabItem';
|
||||||
|
|
||||||
## Detailed Tutorial
|
## Detailed Tutorial
|
||||||
|
|
||||||
In this guide, we'll walk through how you can use the Llama Stack (server and client SDK) to test a simple agent.
|
In this guide, we'll walk through how you can use the Llama Stack (server and client SDK) to test a simple agent.
|
|
@ -1,3 +1,9 @@
|
||||||
|
---
|
||||||
|
description: We have a number of client-side SDKs available for different languages.
|
||||||
|
sidebar_label: Libraries
|
||||||
|
sidebar_position: 2
|
||||||
|
title: Libraries (SDKs)
|
||||||
|
---
|
||||||
## Libraries (SDKs)
|
## Libraries (SDKs)
|
||||||
|
|
||||||
We have a number of client-side SDKs available for different languages.
|
We have a number of client-side SDKs available for different languages.
|
||||||
|
@ -7,4 +13,4 @@ We have a number of client-side SDKs available for different languages.
|
||||||
| Python | [llama-stack-client-python](https://github.com/meta-llama/llama-stack-client-python) | [](https://pypi.org/project/llama_stack_client/)
|
| Python | [llama-stack-client-python](https://github.com/meta-llama/llama-stack-client-python) | [](https://pypi.org/project/llama_stack_client/)
|
||||||
| Swift | [llama-stack-client-swift](https://github.com/meta-llama/llama-stack-client-swift/tree/latest-release) | [](https://swiftpackageindex.com/meta-llama/llama-stack-client-swift)
|
| Swift | [llama-stack-client-swift](https://github.com/meta-llama/llama-stack-client-swift/tree/latest-release) | [](https://swiftpackageindex.com/meta-llama/llama-stack-client-swift)
|
||||||
| Node | [llama-stack-client-node](https://github.com/meta-llama/llama-stack-client-node) | [](https://npmjs.org/package/llama-stack-client)
|
| Node | [llama-stack-client-node](https://github.com/meta-llama/llama-stack-client-node) | [](https://npmjs.org/package/llama-stack-client)
|
||||||
| Kotlin | [llama-stack-client-kotlin](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release) | [](https://central.sonatype.com/artifact/com.llama.llamastack/llama-stack-client-kotlin)
|
| Kotlin | [llama-stack-client-kotlin](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release) | [](https://central.sonatype.com/artifact/com.llama.llamastack/llama-stack-client-kotlin)
|
|
@ -1,4 +1,9 @@
|
||||||
## Quickstart
|
---
|
||||||
|
description: Get started with Llama Stack in minutes
|
||||||
|
sidebar_label: Quickstart
|
||||||
|
sidebar_position: 1
|
||||||
|
title: Quickstart
|
||||||
|
---
|
||||||
|
|
||||||
Get started with Llama Stack in minutes!
|
Get started with Llama Stack in minutes!
|
||||||
|
|
||||||
|
@ -6,7 +11,7 @@ Llama Stack is a stateful service with REST APIs to support the seamless transit
|
||||||
environments. You can build and test using a local server first and deploy to a hosted endpoint for production.
|
environments. You can build and test using a local server first and deploy to a hosted endpoint for production.
|
||||||
|
|
||||||
In this guide, we'll walk through how to build a RAG application locally using Llama Stack with [Ollama](https://ollama.com/)
|
In this guide, we'll walk through how to build a RAG application locally using Llama Stack with [Ollama](https://ollama.com/)
|
||||||
as the inference [provider](../providers/inference/index) for a Llama Model.
|
as the inference [provider](/docs/providers/inference) for a Llama Model.
|
||||||
|
|
||||||
**💡 Notebook Version:** You can also follow this quickstart guide in a Jupyter notebook format: [quick_start.ipynb](https://github.com/meta-llama/llama-stack/blob/main/docs/quick_start.ipynb)
|
**💡 Notebook Version:** You can also follow this quickstart guide in a Jupyter notebook format: [quick_start.ipynb](https://github.com/meta-llama/llama-stack/blob/main/docs/quick_start.ipynb)
|
||||||
|
|
||||||
|
@ -27,8 +32,75 @@ OLLAMA_URL=http://localhost:11434 \
|
||||||
#### Step 3: Run the demo
|
#### Step 3: Run the demo
|
||||||
Now open up a new terminal and copy the following script into a file named `demo_script.py`.
|
Now open up a new terminal and copy the following script into a file named `demo_script.py`.
|
||||||
|
|
||||||
```{literalinclude} ./demo_script.py
|
```python title="demo_script.py"
|
||||||
:language: python
|
# Copyright (c) Meta Platforms, Inc. and affiliates.
|
||||||
|
# All rights reserved.
|
||||||
|
#
|
||||||
|
# This source code is licensed under the terms described in the LICENSE file in
|
||||||
|
# the root directory of this source tree.
|
||||||
|
|
||||||
|
from llama_stack_client import Agent, AgentEventLogger, RAGDocument, LlamaStackClient
|
||||||
|
|
||||||
|
vector_db_id = "my_demo_vector_db"
|
||||||
|
client = LlamaStackClient(base_url="http://localhost:8321")
|
||||||
|
|
||||||
|
models = client.models.list()
|
||||||
|
|
||||||
|
# Select the first LLM and first embedding models
|
||||||
|
model_id = next(m for m in models if m.model_type == "llm").identifier
|
||||||
|
embedding_model_id = (
|
||||||
|
em := next(m for m in models if m.model_type == "embedding")
|
||||||
|
).identifier
|
||||||
|
embedding_dimension = em.metadata["embedding_dimension"]
|
||||||
|
|
||||||
|
vector_db = client.vector_dbs.register(
|
||||||
|
vector_db_id=vector_db_id,
|
||||||
|
embedding_model=embedding_model_id,
|
||||||
|
embedding_dimension=embedding_dimension,
|
||||||
|
provider_id="faiss",
|
||||||
|
)
|
||||||
|
vector_db_id = vector_db.identifier
|
||||||
|
source = "https://www.paulgraham.com/greatwork.html"
|
||||||
|
print("rag_tool> Ingesting document:", source)
|
||||||
|
document = RAGDocument(
|
||||||
|
document_id="document_1",
|
||||||
|
content=source,
|
||||||
|
mime_type="text/html",
|
||||||
|
metadata={},
|
||||||
|
)
|
||||||
|
client.tool_runtime.rag_tool.insert(
|
||||||
|
documents=[document],
|
||||||
|
vector_db_id=vector_db_id,
|
||||||
|
chunk_size_in_tokens=100,
|
||||||
|
)
|
||||||
|
agent = Agent(
|
||||||
|
client,
|
||||||
|
model=model_id,
|
||||||
|
instructions="You are a helpful assistant",
|
||||||
|
tools=[
|
||||||
|
{
|
||||||
|
"name": "builtin::rag/knowledge_search",
|
||||||
|
"args": {"vector_db_ids": [vector_db_id]},
|
||||||
|
}
|
||||||
|
],
|
||||||
|
)
|
||||||
|
|
||||||
|
prompt = "How do you do great work?"
|
||||||
|
print("prompt>", prompt)
|
||||||
|
|
||||||
|
use_stream = True
|
||||||
|
response = agent.create_turn(
|
||||||
|
messages=[{"role": "user", "content": prompt}],
|
||||||
|
session_id=agent.create_session("rag_session"),
|
||||||
|
stream=use_stream,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Only call `AgentEventLogger().log(response)` for streaming responses.
|
||||||
|
if use_stream:
|
||||||
|
for log in AgentEventLogger().log(response):
|
||||||
|
log.print()
|
||||||
|
else:
|
||||||
|
print(response)
|
||||||
```
|
```
|
||||||
We will use `uv` to run the script
|
We will use `uv` to run the script
|
||||||
```
|
```
|
||||||
|
@ -59,19 +131,19 @@ Ultimately, great work is about making a meaningful contribution and leaving a l
|
||||||
```
|
```
|
||||||
Congratulations! You've successfully built your first RAG application using Llama Stack! 🎉🥳
|
Congratulations! You've successfully built your first RAG application using Llama Stack! 🎉🥳
|
||||||
|
|
||||||
```{admonition} HuggingFace access
|
:::tip HuggingFace access
|
||||||
:class: tip
|
|
||||||
|
|
||||||
If you are getting a **401 Client Error** from HuggingFace for the **all-MiniLM-L6-v2** model, try setting **HF_TOKEN** to a valid HuggingFace token in your environment
|
If you are getting a **401 Client Error** from HuggingFace for the **all-MiniLM-L6-v2** model, try setting **HF_TOKEN** to a valid HuggingFace token in your environment
|
||||||
```
|
|
||||||
|
:::
|
||||||
|
|
||||||
### Next Steps
|
### Next Steps
|
||||||
|
|
||||||
Now you're ready to dive deeper into Llama Stack!
|
Now you're ready to dive deeper into Llama Stack!
|
||||||
- Explore the [Detailed Tutorial](./detailed_tutorial.md).
|
- Explore the [Detailed Tutorial](/docs/detailed_tutorial).
|
||||||
- Try the [Getting Started Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb).
|
- Try the [Getting Started Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb).
|
||||||
- Browse more [Notebooks on GitHub](https://github.com/meta-llama/llama-stack/tree/main/docs/notebooks).
|
- Browse more [Notebooks on GitHub](https://github.com/meta-llama/llama-stack/tree/main/docs/notebooks).
|
||||||
- Learn about Llama Stack [Concepts](../concepts/index.md).
|
- Learn about Llama Stack [Concepts](/docs/concepts).
|
||||||
- Discover how to [Build Llama Stacks](../distributions/index.md).
|
- Discover how to [Build Llama Stacks](/docs/distributions).
|
||||||
- Refer to our [References](../references/index.md) for details on the Llama CLI and Python SDK.
|
- Refer to our [References](/docs/references) for details on the Llama CLI and Python SDK.
|
||||||
- Check out the [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repository for example applications and tutorials.
|
- Check out the [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repository for example applications and tutorials.
|
docs/docs/index.mdx (new file, 101 lines)
|
@ -0,0 +1,101 @@
|
||||||
|
---
|
||||||
|
sidebar_position: 1
|
||||||
|
title: Welcome to Llama Stack
|
||||||
|
description: Llama Stack is the open-source framework for building generative AI applications
|
||||||
|
sidebar_label: Intro
|
||||||
|
tags:
|
||||||
|
- getting-started
|
||||||
|
- overview
|
||||||
|
---
|
||||||
|
|
||||||
|
# Welcome to Llama Stack
|
||||||
|
|
||||||
|
Llama Stack is the open-source framework for building generative AI applications.
|
||||||
|
|
||||||
|
:::tip Llama 4 is here!
|
||||||
|
|
||||||
|
Check out [Getting Started with Llama 4](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started_llama4.ipynb)
|
||||||
|
|
||||||
|
:::
|
||||||
|
|
||||||
|
:::tip News
|
||||||
|
|
||||||
|
Llama Stack is now available! See the [release notes](https://github.com/meta-llama/llama-stack/releases) for more details.
|
||||||
|
|
||||||
|
:::
|
||||||
|
|
||||||
|
|
||||||
|
## What is Llama Stack?
|
||||||
|
|
||||||
|
Llama Stack defines and standardizes the core building blocks needed to bring generative AI applications to market. It provides a unified set of APIs with implementations from leading service providers, enabling seamless transitions between development and production environments. More specifically, it provides:
|
||||||
|
|
||||||
|
- **Unified API layer** for Inference, RAG, Agents, Tools, Safety, Evals, and Telemetry.
|
||||||
|
- **Plugin architecture** to support the rich ecosystem of implementations of the different APIs in different environments like local development, on-premises, cloud, and mobile.
|
||||||
|
- **Prepackaged verified distributions** which offer a one-stop solution for developers to get started quickly and reliably in any environment
|
||||||
|
- **Multiple developer interfaces** like CLI and SDKs for Python, Node, iOS, and Android
|
||||||
|
- **Standalone applications** as examples for how to build production-grade AI applications with Llama Stack
|
||||||
|
|
||||||
|
<img src="/img/llama-stack.png" alt="Llama Stack" width="400px" />
|
||||||
|
|
||||||
|
Our goal is to provide pre-packaged implementations (aka "distributions") which can be run in a variety of deployment environments. LlamaStack can assist you in your entire app development lifecycle - start iterating on local, mobile or desktop and seamlessly transition to on-prem or public cloud deployments. At every point in this transition, the same set of APIs and the same developer experience is available.
|
||||||
|
|
||||||
|
## How does Llama Stack work?
|
||||||
|
|
||||||
|
Llama Stack consists of a server (with multiple pluggable API providers) and Client SDKs meant to be used in your applications. The server can be run in a variety of environments, including local (inline) development, on-premises, and cloud. The client SDKs are available for Python, Swift, Node, and Kotlin.
|
||||||
|
|
||||||
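As a minimal sketch of that client/server split (assuming a server is already running locally on the default port 8321), connecting from the Python SDK looks like this:

```python
from llama_stack_client import LlamaStackClient

# Point the client at a locally running Llama Stack server.
client = LlamaStackClient(base_url="http://localhost:8321")

# The same calls work unchanged against an on-prem or cloud deployment;
# only the base_url needs to change.
for model in client.models.list():
    print(model.model_type, model.identifier)
```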
|
## Quick Links
|
||||||
|
|
||||||
|
- Ready to build? Check out the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html) to get started.
|
||||||
|
- Want to contribute? See the [Contributing Guide](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md).
|
||||||
|
- Explore [Example Applications](https://github.com/meta-llama/llama-stack-apps) built with Llama Stack.
|
||||||
|
|
||||||
|
## Rich Ecosystem Support
|
||||||
|
|
||||||
|
Llama Stack provides adapters for popular providers across all API categories:
|
||||||
|
|
||||||
|
- **Inference**: Meta Reference, Ollama, Fireworks, Together, NVIDIA, vLLM, AWS Bedrock, OpenAI, Anthropic, and more
|
||||||
|
- **Vector Databases**: FAISS, Chroma, Milvus, Postgres, Weaviate, Qdrant, and others
|
||||||
|
- **Safety**: Llama Guard, Prompt Guard, Code Scanner, AWS Bedrock
|
||||||
|
- **Training & Evaluation**: HuggingFace, TorchTune, NVIDIA NEMO
|
||||||
|
|
||||||
|
:::info Provider Details
|
||||||
|
For complete provider compatibility and setup instructions, see our [Providers Documentation](https://llama-stack.readthedocs.io/en/latest/providers/index.html).
|
||||||
|
:::
|
||||||
|
|
||||||
|
## Get Started Today
|
||||||
|
|
||||||
|
<div style={{display: 'flex', gap: '1rem', flexWrap: 'wrap', margin: '2rem 0'}}>
|
||||||
|
<a href="https://llama-stack.readthedocs.io/en/latest/getting_started/index.html"
|
||||||
|
style={{
|
||||||
|
background: 'var(--ifm-color-primary)',
|
||||||
|
color: 'white',
|
||||||
|
padding: '0.75rem 1.5rem',
|
||||||
|
borderRadius: '0.5rem',
|
||||||
|
textDecoration: 'none',
|
||||||
|
fontWeight: 'bold'
|
||||||
|
}}>
|
||||||
|
🚀 Quick Start Guide
|
||||||
|
</a>
|
||||||
|
<a href="https://github.com/meta-llama/llama-stack-apps"
|
||||||
|
style={{
|
||||||
|
border: '2px solid var(--ifm-color-primary)',
|
||||||
|
color: 'var(--ifm-color-primary)',
|
||||||
|
padding: '0.75rem 1.5rem',
|
||||||
|
borderRadius: '0.5rem',
|
||||||
|
textDecoration: 'none',
|
||||||
|
fontWeight: 'bold'
|
||||||
|
}}>
|
||||||
|
📚 Example Apps
|
||||||
|
</a>
|
||||||
|
<a href="https://github.com/meta-llama/llama-stack"
|
||||||
|
style={{
|
||||||
|
border: '2px solid #666',
|
||||||
|
color: '#666',
|
||||||
|
padding: '0.75rem 1.5rem',
|
||||||
|
borderRadius: '0.5rem',
|
||||||
|
textDecoration: 'none',
|
||||||
|
fontWeight: 'bold'
|
||||||
|
}}>
|
||||||
|
⭐ Star on GitHub
|
||||||
|
</a>
|
||||||
|
</div>
|
|
@ -9,12 +9,11 @@ We introduce a set of APIs in Llama Stack for supporting running evaluations of
|
||||||
|
|
||||||
This guide goes over the sets of APIs and developer experience flow of using Llama Stack to run evaluations for different use cases. Check out our Colab notebook on working examples with evaluations [here](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing).
|
This guide goes over the sets of APIs and developer experience flow of using Llama Stack to run evaluations for different use cases. Check out our Colab notebook on working examples with evaluations [here](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing).
|
||||||
|
|
||||||
|
|
||||||
## Evaluation Concepts
|
## Evaluation Concepts
|
||||||
|
|
||||||
The Evaluation APIs are associated with a set of Resources as shown in the following diagram. Please visit the Resources section in our [Core Concepts](../../concepts/index.md) guide for a better high-level understanding; a minimal sketch of listing these resources from the client is shown after the list below.
|
The Evaluation APIs are associated with a set of Resources as shown in the following diagram. Please visit the Resources section in our [Core Concepts](../concepts/) guide for a better high-level understanding; a minimal sketch of listing these resources from the client is shown after the list below.
|
||||||
|
|
||||||

|

|
||||||
|
|
||||||
- **DatasetIO**: defines interface with datasets and data loaders.
|
- **DatasetIO**: defines interface with datasets and data loaders.
|
||||||
- Associated with `Dataset` resource.
|
- Associated with `Dataset` resource.
|
||||||
|
@ -23,7 +22,6 @@ The Evaluation APIs are associated with a set of Resources as shown in the follo
|
||||||
- **Eval**: generate outputs (via Inference or Agents) and perform scoring.
|
- **Eval**: generate outputs (via Inference or Agents) and perform scoring.
|
||||||
- Associated with `Benchmark` resource.
|
- Associated with `Benchmark` resource.
|
||||||
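The following is a minimal sketch (not part of the original walkthrough) of inspecting these resources from the Python client, assuming a server on port 8321 and that `datasets`, `scoring_functions`, and `benchmarks` expose the same `list()` pattern as `models`:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Each Evaluation API is backed by registered resources; listing them is a
# quick way to see what is available before running an eval.
print("Datasets:", [d.identifier for d in client.datasets.list()])
print("Scoring functions:", [s.identifier for s in client.scoring_functions.list()])
print("Benchmarks:", [b.identifier for b in client.benchmarks.list()])
```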
|
|
||||||
|
|
||||||
## Evaluation Examples Walkthrough
|
## Evaluation Examples Walkthrough
|
||||||
|
|
||||||
[](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb)
|
[](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb)
|
||||||
|
@ -156,7 +154,6 @@ response = client.eval.evaluate_rows(
|
||||||
pprint(response)
|
pprint(response)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
### 2. Agentic Evaluation
|
### 2. Agentic Evaluation
|
||||||
- In this example, we will demonstrate how to evaluate an agent candidate served by Llama Stack via the `/agent` API.
|
- In this example, we will demonstrate how to evaluate an agent candidate served by Llama Stack via the `/agent` API.
|
||||||
- We will continue to use the SimpleQA dataset we used in the previous example.
|
- We will continue to use the SimpleQA dataset we used in the previous example.
|
||||||
|
@ -202,7 +199,7 @@ pprint(response)
|
||||||
|
|
||||||
Llama Stack offers a library of scoring functions and the `/scoring` API, allowing you to run evaluations on your pre-annotated AI application datasets.
|
Llama Stack offers a library of scoring functions and the `/scoring` API, allowing you to run evaluations on your pre-annotated AI application datasets.
|
||||||
|
|
||||||
In this example, we will work with an example RAG dataset you built previously, label it with annotations, and use LLM-As-Judge with a custom judge prompt for scoring. Please check out our [Llama Stack Playground](../../building_applications/playground/index.md) for an interactive interface to upload datasets and run scorings.
|
In this example, we will work with an example RAG dataset you built previously, label it with annotations, and use LLM-As-Judge with a custom judge prompt for scoring. Please check out our [Llama Stack Playground](../building_applications/playground) for an interactive interface to upload datasets and run scorings.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
judge_model_id = "meta-llama/Llama-3.1-405B-Instruct-FP8"
|
judge_model_id = "meta-llama/Llama-3.1-405B-Instruct-FP8"
|
||||||
|
@ -268,29 +265,27 @@ response = client.scoring.score(
|
||||||
## Running Evaluations via CLI
|
## Running Evaluations via CLI
|
||||||
The following examples give the quick steps to start running evaluations using the llama-stack-client CLI.
|
The following examples give the quick steps to start running evaluations using the llama-stack-client CLI.
|
||||||
|
|
||||||
#### Benchmark Evaluation CLI
|
### Benchmark Evaluation CLI
|
||||||
There are 3 necessary inputs for running a benchmark eval:
|
There are 3 necessary inputs for running a benchmark eval:
|
||||||
- `list of benchmark_ids`: The list of benchmark ids to run evaluation on
|
- `list of benchmark_ids`: The list of benchmark ids to run evaluation on
|
||||||
- `model-id`: The model id to evaluate on
|
- `model-id`: The model id to evaluate on
|
||||||
- `utput_dir`: Path to store the evaluate results
|
- `output_dir`: Path to store the evaluate results
|
||||||
```
|
|
||||||
|
```bash
|
||||||
llama-stack-client eval run-benchmark <benchmark_id_1> <benchmark_id_2> ... \
|
llama-stack-client eval run-benchmark <benchmark_id_1> <benchmark_id_2> ... \
|
||||||
--model_id <model id to evaluate on> \
|
--model_id <model id to evaluate on> \
|
||||||
--output_dir <directory to store the evaluate results> \
|
--output_dir <directory to store the evaluate results> \
|
||||||
```
|
```
|
||||||
|
|
||||||
You can run
|
You can run
|
||||||
```
|
```bash
|
||||||
llama-stack-client eval run-benchmark help
|
llama-stack-client eval run-benchmark help
|
||||||
```
|
```
|
||||||
to see the description of all the flags to run benckmark eval
|
to see the description of all the flags to run benchmark eval
|
||||||
|
|
||||||
|
In the output log, you can find the path to the file that has your evaluation results. Open that file and you can see your aggregate evaluation results over there.
|
||||||
|
|
||||||
In the output log, you can find the path to the file that has your evaluation results. Open that file and you can see you aggrgate
|
### Application Evaluation CLI
|
||||||
evaluation results over there.
|
|
||||||
|
|
||||||
|
|
||||||
#### Application Evaluation CLI
|
|
||||||
Usage: For running application evals, you will already have available datasets in hand from your application. You will need to specify:
|
Usage: For running application evals, you will already have available datasets in hand from your application. You will need to specify:
|
||||||
- `scoring-fn-id`: List of ScoringFunction identifiers you wish to use to run on your application.
|
- `scoring-fn-id`: List of ScoringFunction identifiers you wish to use to run on your application.
|
||||||
- `Dataset` used for evaluation:
|
- `Dataset` used for evaluation:
|
||||||
|
@ -298,21 +293,19 @@ Usage: For running application evals, you will already have available datasets i
|
||||||
- (2) `--dataset-id`: pre-registered dataset in Llama Stack
|
- (2) `--dataset-id`: pre-registered dataset in Llama Stack
|
||||||
- (Optional) `--scoring-params-config`: optionally parameterize scoring functions with custom params (e.g. `judge_prompt`, `judge_model`, `parsing_regexes`).
|
- (Optional) `--scoring-params-config`: optionally parameterize scoring functions with custom params (e.g. `judge_prompt`, `judge_model`, `parsing_regexes`).
|
||||||
|
|
||||||
|
```bash
|
||||||
```
|
|
||||||
llama-stack-client eval run_scoring <scoring_fn_id_1> <scoring_fn_id_2> ... <scoring_fn_id_n>
|
llama-stack-client eval run_scoring <scoring_fn_id_1> <scoring_fn_id_2> ... <scoring_fn_id_n>
|
||||||
--dataset-path <path-to-local-dataset> \
|
--dataset-path <path-to-local-dataset> \
|
||||||
--output-dir ./
|
--output-dir ./
|
||||||
```
|
```
|
||||||
|
|
||||||
#### Defining BenchmarkConfig
|
### Defining BenchmarkConfig
|
||||||
The `BenchmarkConfig` is a user-specified config that defines:
|
The `BenchmarkConfig` is a user-specified config that defines:
|
||||||
1. `EvalCandidate` to run generation on:
|
1. `EvalCandidate` to run generation on:
|
||||||
- `ModelCandidate`: The model will be used for generation through LlamaStack /inference API.
|
- `ModelCandidate`: The model will be used for generation through LlamaStack /inference API.
|
||||||
- `AgentCandidate`: The agentic system specified by AgentConfig will be used for generation through LlamaStack /agents API.
|
- `AgentCandidate`: The agentic system specified by AgentConfig will be used for generation through LlamaStack /agents API.
|
||||||
2. Optionally scoring function params to allow customization of scoring function behaviour. This is useful to parameterize generic scoring functions such as LLMAsJudge with custom `judge_model` / `judge_prompt`.
|
2. Optionally scoring function params to allow customization of scoring function behaviour. This is useful to parameterize generic scoring functions such as LLMAsJudge with custom `judge_model` / `judge_prompt`.
|
||||||
|
|
||||||
|
|
||||||
**Example BenchmarkConfig**
|
**Example BenchmarkConfig**
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
|
@ -340,29 +333,25 @@ The `BenchmarkConfig` are user specified config to define:
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
## Open-benchmark Contributing Guide
|
## Open-benchmark Contributing Guide
|
||||||
|
|
||||||
### Create the new dataset for your new benchmark
|
### Create the new dataset for your new benchmark
|
||||||
An eval open-benchmark essentially contains 2 parts:
|
An eval open-benchmark essentially contains 2 parts:
|
||||||
- `raw data`: The raw dataset associated with the benchmark. You typically need to search the original paper that introduces the benchmark and find the canonical dataset (usually hosted on huggingface)
|
- `raw data`: The raw dataset associated with the benchmark. You typically need to search the original paper that introduces the benchmark and find the canonical dataset (usually hosted on huggingface)
|
||||||
- `prompt template`: How to ask the candidate model to generate the answer (prompt template plays a critical role to the evaluation results). Tyically, you can find the reference prompt template associated with the benchmark in benchmarks author's repo ([exmaple](https://github.com/idavidrein/gpqa/blob/main/prompts/chain_of_thought.txt)) or some other popular open source repos ([example](https://github.com/openai/simple-evals/blob/0a6e8f62e52bc5ae915f752466be3af596caf392/common.py#L14))
|
- `prompt template`: How to ask the candidate model to generate the answer (prompt template plays a critical role to the evaluation results). Typically, you can find the reference prompt template associated with the benchmark in benchmarks author's repo ([example](https://github.com/idavidrein/gpqa/blob/main/prompts/chain_of_thought.txt)) or some other popular open source repos ([example](https://github.com/openai/simple-evals/blob/0a6e8f62e52bc5ae915f752466be3af596caf392/common.py#L14))
|
||||||
|
|
||||||
To create new open-benmark in llama stack, you need to combine the prompt template and the raw data into the `chat_completion_input` column in the evaluation dataset.
|
To create new open-benchmark in llama stack, you need to combine the prompt template and the raw data into the `chat_completion_input` column in the evaluation dataset.
|
||||||
|
|
||||||
Llama Stack enforces the evaluation dataset schema to contain at least 3 columns:
|
Llama Stack enforces the evaluation dataset schema to contain at least 3 columns:
|
||||||
- `chat_completion_input`: The actual input to the model to run the generation for eval
|
- `chat_completion_input`: The actual input to the model to run the generation for eval
|
||||||
- `input_query`: The raw input from the raw dataset without the prompt template
|
- `input_query`: The raw input from the raw dataset without the prompt template
|
||||||
- `expected_answer`: The ground truth for scoring functions to calcalate the score from.
|
- `expected_answer`: The ground truth for scoring functions to calculate the score from.
|
||||||
|
|
||||||
|
|
||||||
You need to write a script ([example convert script](https://gist.github.com/yanxi0830/118e9c560227d27132a7fd10e2c92840)) to convert the raw benchmark dataset into the Llama Stack eval dataset format and upload the dataset to Hugging Face ([example benchmark dataset](https://huggingface.co/datasets/llamastack/mmmu)). A minimal conversion sketch is shown below.
|
You need to write a script ([example convert script](https://gist.github.com/yanxi0830/118e9c560227d27132a7fd10e2c92840)) to convert the raw benchmark dataset into the Llama Stack eval dataset format and upload the dataset to Hugging Face ([example benchmark dataset](https://huggingface.co/datasets/llamastack/mmmu)). A minimal conversion sketch is shown below.
|
||||||
|
|
||||||
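Below is a minimal sketch of such a conversion. The raw column names (`question`, `answer`) and the prompt template are assumptions for illustration; match the exact column names and message encoding used by the example benchmark dataset above.

```python
import json

# Hypothetical prompt template; use the reference template published by the benchmark authors.
PROMPT_TEMPLATE = "Answer the following question.\n\nQuestion: {question}\nAnswer:"


def to_eval_row(raw_row: dict) -> dict:
    user_message = PROMPT_TEMPLATE.format(question=raw_row["question"])
    return {
        # What the candidate model receives during generation (prompt template applied).
        "chat_completion_input": json.dumps([{"role": "user", "content": user_message}]),
        # The raw input, without the prompt template.
        "input_query": raw_row["question"],
        # Ground truth used by the scoring functions.
        "expected_answer": raw_row["answer"],
    }


raw_rows = [{"question": "What is 2 + 2?", "answer": "4"}]
eval_rows = [to_eval_row(r) for r in raw_rows]
print(eval_rows[0]["expected_answer"])  # -> "4"
```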
|
|
||||||
### Find scoring function for your new benchmark
|
### Find scoring function for your new benchmark
|
||||||
The purpose of a scoring function is to calculate the score for each example based on the candidate model's generation result and the `expected_answer`. It also aggregates the scores from all the examples and generates the final evaluation results.
|
The purpose of a scoring function is to calculate the score for each example based on the candidate model's generation result and the `expected_answer`. It also aggregates the scores from all the examples and generates the final evaluation results.
|
||||||
|
|
||||||
|
|
||||||
First, check whether the existing [llama stack scoring functions](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/inline/scoring) can fulfill your need. If not, you need to write a new scoring function based on what the benchmark authors or other open source repos describe; a minimal sketch of a standalone scoring function is shown below.
|
First, check whether the existing [llama stack scoring functions](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/inline/scoring) can fulfill your need. If not, you need to write a new scoring function based on what the benchmark authors or other open source repos describe; a minimal sketch of a standalone scoring function is shown below.
|
||||||
|
|
||||||
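As a rough, standalone illustration of what a scoring function does (this is not the provider interface used by the inline scoring implementations), a subset-of style scorer could look like:

```python
def score_row(row: dict) -> dict:
    # Per-example score: 1.0 if the expected answer appears in the generation.
    generated = row["generated_answer"].strip().lower()
    expected = row["expected_answer"].strip().lower()
    return {"score": 1.0 if expected in generated else 0.0}


def aggregate(score_rows: list[dict]) -> dict:
    # Aggregate per-example scores into the final evaluation result.
    scores = [r["score"] for r in score_rows]
    return {"accuracy": sum(scores) / len(scores) if scores else 0.0}


rows = [
    {"generated_answer": "The answer is 4.", "expected_answer": "4"},
    {"generated_answer": "I am not sure.", "expected_answer": "4"},
]
print(aggregate([score_row(r) for r in rows]))  # -> {'accuracy': 0.5}
```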
### Add new benchmark into template
|
### Add new benchmark into template
|
||||||
|
@ -373,17 +362,15 @@ Secondly, you need to add the new benchmark you just created under the `benchmar
|
||||||
- `dataset_id`: identifier of the dataset associated with your benchmark
|
- `dataset_id`: identifier of the dataset associated with your benchmark
|
||||||
- `scoring_functions`: scoring function to calculate the score based on generation results and expected_answer
|
- `scoring_functions`: scoring function to calculate the score based on generation results and expected_answer
|
||||||
|
|
||||||
|
|
||||||
### Test the new benchmark
|
### Test the new benchmark
|
||||||
|
|
||||||
Spin up the llama stack server with the 'open-benchmark' template:
|
Spin up the llama stack server with the 'open-benchmark' template:
|
||||||
```
|
```bash
|
||||||
llama stack run llama_stack/distributions/open-benchmark/run.yaml
|
llama stack run llama_stack/distributions/open-benchmark/run.yaml
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Run eval benchmark CLI with your new benchmark id
|
Run eval benchmark CLI with your new benchmark id
|
||||||
```
|
```bash
|
||||||
llama-stack-client eval run-benchmark <new_benchmark_id> \
|
llama-stack-client eval run-benchmark <new_benchmark_id> \
|
||||||
--model_id <model id to evaluate on> \
|
--model_id <model id to evaluate on> \
|
||||||
--output_dir <directory to store the evaluate results> \
|
--output_dir <directory to store the evaluate results> \
|
docs/docs/references/index.mdx (new file, 12 lines)
|
@ -0,0 +1,12 @@
|
||||||
|
---
|
||||||
|
title: References
|
||||||
|
description: Reference documentation for Llama Stack
|
||||||
|
sidebar_label: Overview
|
||||||
|
sidebar_position: 1
|
||||||
|
---
|
||||||
|
|
||||||
|
# References
|
||||||
|
|
||||||
|
- [Python SDK Reference](python_sdk_reference/index)
|
||||||
|
- [Llama CLI](llama_cli_reference/index) for building and running your Llama Stack server
|
||||||
|
- [Llama Stack Client CLI](llama_stack_client_cli_reference) for interacting with your Llama Stack server
|
|
@ -1,125 +0,0 @@
|
||||||
# Evaluations
|
|
||||||
|
|
||||||
Llama Stack provides a set of APIs for supporting running evaluations of LLM applications.
|
|
||||||
- `/datasetio` + `/datasets` API
|
|
||||||
- `/scoring` + `/scoring_functions` API
|
|
||||||
- `/eval` + `/benchmarks` API
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
This guide walks you through the process of evaluating an LLM application built using Llama Stack. Check out the [Evaluation Reference](../references/evals_reference/index.md) guide, which goes over the sets of APIs and developer experience flow of using Llama Stack to run evaluations for benchmark and application use cases. Check out our Colab notebook on working examples with evaluations [here](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing).
|
|
||||||
|
|
||||||
|
|
||||||
## Application Evaluation
|
|
||||||
|
|
||||||
[](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)
|
|
||||||
|
|
||||||
Llama Stack offers a library of scoring functions and the `/scoring` API, allowing you to run evaluations on your pre-annotated AI application datasets.
|
|
||||||
|
|
||||||
In this example, we will show you how to:
|
|
||||||
1. Build an Agent with Llama Stack
|
|
||||||
2. Query the agent's sessions, turns, and steps
|
|
||||||
3. Evaluate the results.
|
|
||||||
|
|
||||||
##### Building a Search Agent
|
|
||||||
```python
|
|
||||||
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger
|
|
||||||
|
|
||||||
client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")
|
|
||||||
|
|
||||||
agent = Agent(
|
|
||||||
client,
|
|
||||||
model="meta-llama/Llama-3.3-70B-Instruct",
|
|
||||||
instructions="You are a helpful assistant. Use search tool to answer the questions. ",
|
|
||||||
tools=["builtin::websearch"],
|
|
||||||
)
|
|
||||||
user_prompts = [
|
|
||||||
"Which teams played in the NBA Western Conference Finals of 2024. Search the web for the answer.",
|
|
||||||
"In which episode and season of South Park does Bill Cosby (BSM-471) first appear? Give me the number and title. Search the web for the answer.",
|
|
||||||
"What is the British-American kickboxer Andrew Tate's kickboxing name? Search the web for the answer.",
|
|
||||||
]
|
|
||||||
|
|
||||||
session_id = agent.create_session("test-session")
|
|
||||||
|
|
||||||
for prompt in user_prompts:
|
|
||||||
response = agent.create_turn(
|
|
||||||
messages=[
|
|
||||||
{
|
|
||||||
"role": "user",
|
|
||||||
"content": prompt,
|
|
||||||
}
|
|
||||||
],
|
|
||||||
session_id=session_id,
|
|
||||||
)
|
|
||||||
|
|
||||||
for log in AgentEventLogger().log(response):
|
|
||||||
log.print()
|
|
||||||
```
|
|
||||||
|
|
||||||
|
|
||||||
##### Query Agent Execution Steps
|
|
||||||
|
|
||||||
Now, let's look deeper into the agent's execution steps and see how well our agent performs.
|
|
||||||
```python
|
|
||||||
# query the agents session
|
|
||||||
from rich.pretty import pprint
|
|
||||||
|
|
||||||
session_response = client.agents.session.retrieve(
|
|
||||||
session_id=session_id,
|
|
||||||
agent_id=agent.agent_id,
|
|
||||||
)
|
|
||||||
|
|
||||||
pprint(session_response)
|
|
||||||
```
|
|
||||||
|
|
||||||
As a sanity check, we will first check whether each user prompt is followed by a tool call to `brave_search`.
|
|
||||||
```python
|
|
||||||
num_tool_call = 0
|
|
||||||
for turn in session_response.turns:
|
|
||||||
for step in turn.steps:
|
|
||||||
if (
|
|
||||||
step.step_type == "tool_execution"
|
|
||||||
and step.tool_calls[0].tool_name == "brave_search"
|
|
||||||
):
|
|
||||||
num_tool_call += 1
|
|
||||||
|
|
||||||
print(
|
|
||||||
f"{num_tool_call}/{len(session_response.turns)} user prompts are followed by a tool call to `brave_search`"
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
##### Evaluate Agent Responses
|
|
||||||
Now, we want to evaluate the agent's responses to the user prompts.
|
|
||||||
|
|
||||||
1. First, we will process the agent's execution history into a list of rows that can be used for evaluation.
|
|
||||||
2. Next, we will label the rows with the expected answer.
|
|
||||||
3. Finally, we will use the `/scoring` API to score the agent's responses.
|
|
||||||
|
|
||||||
```python
|
|
||||||
eval_rows = []
|
|
||||||
|
|
||||||
expected_answers = [
|
|
||||||
"Dallas Mavericks and the Minnesota Timberwolves",
|
|
||||||
"Season 4, Episode 12",
|
|
||||||
"King Cobra",
|
|
||||||
]
|
|
||||||
|
|
||||||
for i, turn in enumerate(session_response.turns):
|
|
||||||
eval_rows.append(
|
|
||||||
{
|
|
||||||
"input_query": turn.input_messages[0].content,
|
|
||||||
"generated_answer": turn.output_message.content,
|
|
||||||
"expected_answer": expected_answers[i],
|
|
||||||
}
|
|
||||||
)
|
|
||||||
|
|
||||||
pprint(eval_rows)
|
|
||||||
|
|
||||||
scoring_params = {
|
|
||||||
"basic::subset_of": None,
|
|
||||||
}
|
|
||||||
scoring_response = client.scoring.score(
|
|
||||||
input_rows=eval_rows, scoring_functions=scoring_params
|
|
||||||
)
|
|
||||||
pprint(scoring_response)
|
|
||||||
```
|
|
|
@ -1,33 +0,0 @@
|
||||||
# AI Application Examples
|
|
||||||
|
|
||||||
Llama Stack provides all the building blocks needed to create sophisticated AI applications.
|
|
||||||
|
|
||||||
The best way to get started is to look at this notebook which walks through the various APIs (from basic inference, to RAG agents) and how to use them.
|
|
||||||
|
|
||||||
**Notebook**: [Building AI Applications](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)
|
|
||||||
|
|
||||||
Here are some key topics that will help you build effective agents:
|
|
||||||
|
|
||||||
- **[RAG (Retrieval-Augmented Generation)](rag)**: Learn how to enhance your agents with external knowledge through retrieval mechanisms.
|
|
||||||
- **[Agent](agent)**: Understand the components and design patterns of the Llama Stack agent framework.
|
|
||||||
- **[Agent Execution Loop](agent_execution_loop)**: Understand how agents process information, make decisions, and execute actions in a continuous loop.
|
|
||||||
- **[Agents vs Responses API](responses_vs_agents)**: Learn the differences between the Agents API and Responses API, and when to use each one.
|
|
||||||
- **[Tools](tools)**: Extend your agents' capabilities by integrating with external tools and APIs.
|
|
||||||
- **[Evals](evals)**: Evaluate your agents' effectiveness and identify areas for improvement.
|
|
||||||
- **[Telemetry](telemetry)**: Monitor and analyze your agents' performance and behavior.
|
|
||||||
- **[Safety](safety)**: Implement guardrails and safety measures to ensure responsible AI behavior.
|
|
||||||
|
|
||||||
```{toctree}
|
|
||||||
:hidden:
|
|
||||||
:maxdepth: 1
|
|
||||||
|
|
||||||
rag
|
|
||||||
agent
|
|
||||||
agent_execution_loop
|
|
||||||
responses_vs_agents
|
|
||||||
tools
|
|
||||||
evals
|
|
||||||
telemetry
|
|
||||||
safety
|
|
||||||
playground/index
|
|
||||||
```
|
|
|
@ -1,107 +0,0 @@
|
||||||
## Llama Stack Playground
|
|
||||||
|
|
||||||
```{note}
|
|
||||||
The Llama Stack Playground is currently experimental and subject to change. We welcome feedback and contributions to help improve it.
|
|
||||||
```
|
|
||||||
|
|
||||||
The Llama Stack Playground is a simple interface which aims to:
|
|
||||||
- Showcase **capabilities** and **concepts** of Llama Stack in an interactive environment
|
|
||||||
- Demo **end-to-end** application code to help users get started to build their own applications
|
|
||||||
- Provide a **UI** to help users inspect and understand Llama Stack API providers and resources
|
|
||||||
|
|
||||||
### Key Features
|
|
||||||
|
|
||||||
#### Playground
|
|
||||||
Interactive pages for users to play with and explore Llama Stack API capabilities.
|
|
||||||
|
|
||||||
##### Chatbot
|
|
||||||
```{eval-rst}
|
|
||||||
.. video:: https://github.com/user-attachments/assets/8d2ef802-5812-4a28-96e1-316038c84cbf
|
|
||||||
:autoplay:
|
|
||||||
:playsinline:
|
|
||||||
:muted:
|
|
||||||
:loop:
|
|
||||||
:width: 100%
|
|
||||||
```
|
|
||||||
- **Chat**: Chat with Llama models.
|
|
||||||
- This page is a simple chatbot that allows you to chat with Llama models. Under the hood, it uses the `/inference/chat-completion` streaming API to send messages to the model and receive responses.
|
|
||||||
- **RAG**: Uploading documents to memory_banks and chat with RAG agent
|
|
||||||
- This page allows you to upload documents as a `memory_bank` and then chat with a RAG agent to query information about the uploaded documents.
|
|
||||||
- Under the hood, it uses Llama Stack's `/agents` API to define and create a RAG agent and chat with it in a session.
|
|
||||||
|
|
||||||
##### Evaluations
|
|
||||||
```{eval-rst}
|
|
||||||
.. video:: https://github.com/user-attachments/assets/6cc1659f-eba4-49ca-a0a5-7c243557b4f5
|
|
||||||
:autoplay:
|
|
||||||
:playsinline:
|
|
||||||
:muted:
|
|
||||||
:loop:
|
|
||||||
:width: 100%
|
|
||||||
```
|
|
||||||
- **Evaluations (Scoring)**: Run evaluations on your AI application datasets.
|
|
||||||
- This page demonstrates the flow of the evaluation API to run evaluations on your custom AI application datasets. You may upload your own evaluation datasets and run evaluations using available scoring functions.
|
|
||||||
- Under the hood, it uses Llama Stack's `/scoring` API to run evaluations on selected scoring functions.
|
|
||||||
|
|
||||||
```{eval-rst}
|
|
||||||
.. video:: https://github.com/user-attachments/assets/345845c7-2a2b-4095-960a-9ae40f6a93cf
|
|
||||||
:autoplay:
|
|
||||||
:playsinline:
|
|
||||||
:muted:
|
|
||||||
:loop:
|
|
||||||
:width: 100%
|
|
||||||
```
|
|
||||||
- **Evaluations (Generation + Scoring)**: Use pre-registered evaluation tasks to evaluate a model or agent candidate
|
|
||||||
- This page demonstrates the flow for the evaluation API to evaluate a model or agent candidate on pre-defined evaluation tasks. An evaluation task is a combination of a dataset and scoring functions.
|
|
||||||
- Under the hood, it uses Llama Stack's `/eval` API to run generations and scorings on specified evaluation configs.
|
|
||||||
- In order to run this page, you may need to register evaluation tasks and datasets as resources first through the following commands.
|
|
||||||
```bash
|
|
||||||
$ llama-stack-client datasets register \
|
|
||||||
--dataset-id "mmlu" \
|
|
||||||
--provider-id "huggingface" \
|
|
||||||
--url "https://huggingface.co/datasets/llamastack/evals" \
|
|
||||||
--metadata '{"path": "llamastack/evals", "name": "evals__mmlu__details", "split": "train"}' \
|
|
||||||
--schema '{"input_query": {"type": "string"}, "expected_answer": {"type": "string"}, "chat_completion_input": {"type": "string"}}'
|
|
||||||
```
|
|
||||||
|
|
||||||
```bash
|
|
||||||
$ llama-stack-client benchmarks register \
|
|
||||||
--eval-task-id meta-reference-mmlu \
|
|
||||||
--provider-id meta-reference \
|
|
||||||
--dataset-id mmlu \
|
|
||||||
--scoring-functions basic::regex_parser_multiple_choice_answer
|
|
||||||
```
|
|
||||||
|
|
||||||
|
|
||||||
##### Inspect
|
|
||||||
```{eval-rst}
|
|
||||||
.. video:: https://github.com/user-attachments/assets/01d52b2d-92af-4e3a-b623-a9b8ba22ba99
|
|
||||||
:autoplay:
|
|
||||||
:playsinline:
|
|
||||||
:muted:
|
|
||||||
:loop:
|
|
||||||
:width: 100%
|
|
||||||
```
|
|
||||||
- **API Providers**: Inspect Llama Stack API providers
|
|
||||||
- This page allows you to inspect Llama Stack API providers and resources.
|
|
||||||
- Under the hood, it uses Llama Stack's `/providers` API to get information about the providers.
|
|
||||||
|
|
||||||
- **API Resources**: Inspect Llama Stack API resources
|
|
||||||
- This page allows you to inspect Llama Stack API resources (`models`, `datasets`, `memory_banks`, `benchmarks`, `shields`).
|
|
||||||
- Under the hood, it uses Llama Stack's `/<resources>/list` API to get information about each resource.
|
|
||||||
- Please visit [Core Concepts](../../concepts/index.md) for more details about the resources.
|
|
||||||
|
|
||||||
### Starting the Llama Stack Playground
|
|
||||||
|
|
||||||
To start the Llama Stack Playground, run the following commands:
|
|
||||||
|
|
||||||
1. Start up the Llama Stack API server
|
|
||||||
|
|
||||||
```bash
|
|
||||||
llama stack build --distro together --image-type venv
|
|
||||||
llama stack run together
|
|
||||||
```
|
|
||||||
|
|
||||||
2. Start Streamlit UI
|
|
||||||
```bash
|
|
||||||
uv run --with ".[ui]" streamlit run llama_stack.core/ui/app.py
|
|
||||||
```
|
|
|
@ -1,17 +0,0 @@
|
||||||
## Safety Guardrails
|
|
||||||
|
|
||||||
Safety is a critical component of any AI application. Llama Stack provides a Shield system that can be applied at multiple touchpoints:
|
|
||||||
|
|
||||||
```python
|
|
||||||
# Register a safety shield
|
|
||||||
shield_id = "content_safety"
|
|
||||||
client.shields.register(shield_id=shield_id, provider_shield_id="llama-guard-basic")
|
|
||||||
|
|
||||||
# Run content through shield
|
|
||||||
response = client.safety.run_shield(
|
|
||||||
shield_id=shield_id, messages=[{"role": "user", "content": "User message here"}]
|
|
||||||
)
|
|
||||||
|
|
||||||
if response.violation:
|
|
||||||
print(f"Safety violation detected: {response.violation.user_message}")
|
|
||||||
```
|
|
|
@ -1,143 +0,0 @@
|
||||||
## Telemetry
|
|
||||||
|
|
||||||
The Llama Stack telemetry system provides comprehensive tracing, metrics, and logging capabilities. It supports multiple sink types including OpenTelemetry, SQLite, and Console output.
|
|
||||||
|
|
||||||
### Events
|
|
||||||
The telemetry system supports three main types of events:
|
|
||||||
|
|
||||||
- **Unstructured Log Events**: Free-form log messages with severity levels
|
|
||||||
```python
|
|
||||||
unstructured_log_event = UnstructuredLogEvent(
|
|
||||||
message="This is a log message", severity=LogSeverity.INFO
|
|
||||||
)
|
|
||||||
```
|
|
||||||
- **Metric Events**: Numerical measurements with units
|
|
||||||
```python
|
|
||||||
metric_event = MetricEvent(metric="my_metric", value=10, unit="count")
|
|
||||||
```
|
|
||||||
- **Structured Log Events**: System events like span start/end. Extensible to add more structured log types.
|
|
||||||
```python
|
|
||||||
structured_log_event = SpanStartPayload(name="my_span", parent_span_id="parent_span_id")
|
|
||||||
```
|
|
||||||
|
|
||||||
### Spans and Traces
|
|
||||||
- **Spans**: Represent operations with timing and hierarchical relationships
|
|
||||||
- **Traces**: Collection of related spans forming a complete request flow
|
|
||||||
|
|
||||||
### Metrics
|
|
||||||
|
|
||||||
Llama Stack automatically generates metrics during inference operations. These metrics are aggregated at the **inference request level** and provide insights into token usage and model performance.
|
|
||||||
|
|
||||||
#### Available Metrics
|
|
||||||
|
|
||||||
The following metrics are automatically generated for each inference request:
|
|
||||||
|
|
||||||
| Metric Name | Type | Unit | Description | Labels |
|
|
||||||
|-------------|------|------|-------------|--------|
|
|
||||||
| `llama_stack_prompt_tokens_total` | Counter | `tokens` | Number of tokens in the input prompt | `model_id`, `provider_id` |
|
|
||||||
| `llama_stack_completion_tokens_total` | Counter | `tokens` | Number of tokens in the generated response | `model_id`, `provider_id` |
|
|
||||||
| `llama_stack_tokens_total` | Counter | `tokens` | Total tokens used (prompt + completion) | `model_id`, `provider_id` |
|
|
||||||
|
|
||||||
#### Metric Generation Flow
|
|
||||||
|
|
||||||
1. **Token Counting**: During inference operations (chat completion, completion, etc.), the system counts tokens in both input prompts and generated responses
|
|
||||||
2. **Metric Construction**: For each request, `MetricEvent` objects are created with the token counts
|
|
||||||
3. **Telemetry Logging**: Metrics are sent to the configured telemetry sinks
|
|
||||||
4. **OpenTelemetry Export**: When OpenTelemetry is enabled, metrics are exposed as standard OpenTelemetry counters
|
|
||||||
|
|
||||||
#### Metric Aggregation Level
|
|
||||||
|
|
||||||
All metrics are generated and aggregated at the **inference request level**. This means:
|
|
||||||
|
|
||||||
- Each individual inference request generates its own set of metrics
|
|
||||||
- Metrics are not pre-aggregated across multiple requests
|
|
||||||
- Aggregation (sums, averages, etc.) can be performed by your observability tools (Prometheus, Grafana, etc.)
|
|
||||||
- Each metric includes labels for `model_id` and `provider_id` to enable filtering and grouping
|
|
||||||
|
|
||||||
#### Example Metric Event
|
|
||||||
|
|
||||||
```python
|
|
||||||
MetricEvent(
|
|
||||||
trace_id="1234567890abcdef",
|
|
||||||
span_id="abcdef1234567890",
|
|
||||||
metric="total_tokens",
|
|
||||||
value=150,
|
|
||||||
timestamp=1703123456.789,
|
|
||||||
unit="tokens",
|
|
||||||
attributes={"model_id": "meta-llama/Llama-3.2-3B-Instruct", "provider_id": "tgi"},
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Querying Metrics
|
|
||||||
|
|
||||||
When using the OpenTelemetry sink, metrics are exposed in standard OpenTelemetry format and can be queried through:
|
|
||||||
|
|
||||||
- **Prometheus**: Scrape metrics from the OpenTelemetry Collector's metrics endpoint
|
|
||||||
- **Grafana**: Create dashboards using Prometheus as a data source
|
|
||||||
- **OpenTelemetry Collector**: Forward metrics to other observability systems
|
|
||||||
|
|
||||||
Example Prometheus queries:
|
|
||||||
```promql
|
|
||||||
# Total tokens used across all models
|
|
||||||
sum(llama_stack_tokens_total)
|
|
||||||
|
|
||||||
# Tokens per model
|
|
||||||
sum by (model_id) (llama_stack_tokens_total)
|
|
||||||
|
|
||||||
# Average tokens per request
|
|
||||||
rate(llama_stack_tokens_total[5m])
|
|
||||||
```
|
|
||||||
|
|
||||||
### Sinks
|
|
||||||
- **OpenTelemetry**: Send events to an OpenTelemetry Collector. This is useful for visualizing traces in a tool like Jaeger and collecting metrics for Prometheus.
|
|
||||||
- **SQLite**: Store events in a local SQLite database. This is needed if you want to query the events later through the Llama Stack API.
|
|
||||||
- **Console**: Print events to the console.
|
|
||||||
|
|
||||||
### Providers
|
|
||||||
|
|
||||||
#### Meta-Reference Provider
|
|
||||||
Currently, only the meta-reference provider is implemented. It can be configured to send events to multiple sink types:
|
|
||||||
1) OpenTelemetry Collector (traces and metrics)
|
|
||||||
2) SQLite (traces only)
|
|
||||||
3) Console (all events)
|
|
||||||
|
|
||||||
#### Configuration
|
|
||||||
|
|
||||||
Here's an example that sends telemetry signals to all sink types. Your configuration might use only one or a subset.
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
telemetry:
|
|
||||||
- provider_id: meta-reference
|
|
||||||
provider_type: inline::meta-reference
|
|
||||||
config:
|
|
||||||
service_name: "llama-stack-service"
|
|
||||||
sinks: ['console', 'sqlite', 'otel_trace', 'otel_metric']
|
|
||||||
otel_exporter_otlp_endpoint: "http://localhost:4318"
|
|
||||||
sqlite_db_path: "/path/to/telemetry.db"
|
|
||||||
```
|
|
||||||
|
|
||||||
**Environment Variables:**
|
|
||||||
- `OTEL_EXPORTER_OTLP_ENDPOINT`: OpenTelemetry Collector endpoint (default: `http://localhost:4318`)
|
|
||||||
- `OTEL_SERVICE_NAME`: Service name for telemetry (default: empty string)
|
|
||||||
- `TELEMETRY_SINKS`: Comma-separated list of sinks (default: `console,sqlite`)
|
|
||||||
|
|
||||||
### Jaeger to visualize traces
|
|
||||||
|
|
||||||
The `otel_trace` sink works with any service compatible with the OpenTelemetry collector. Traces and metrics use separate endpoints but can share the same collector.
|
|
||||||
|
|
||||||
Start a Jaeger instance with the OTLP HTTP endpoint at 4318 and the Jaeger UI at 16686 using the following command:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
$ docker run --pull always --rm --name jaeger \
|
|
||||||
-p 16686:16686 -p 4318:4318 \
|
|
||||||
jaegertracing/jaeger:2.1.0
|
|
||||||
```
|
|
||||||
|
|
||||||
Once the Jaeger instance is running, you can visualize traces by navigating to http://localhost:16686/.
|
|
||||||
|
|
||||||
### Querying Traces Stored in SQLite
|
|
||||||
|
|
||||||
The `sqlite` sink allows you to query traces without an external system. Here are some example
|
|
||||||
queries. Refer to the notebook at [Llama Stack Building AI
|
|
||||||
Applications](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) for
|
|
||||||
more examples on how to query traces and spans.
|
|
|
@ -1,23 +0,0 @@
|
||||||
# Core Concepts
|
|
||||||
|
|
||||||
Given Llama Stack's service-oriented philosophy, a few concepts and workflows arise which may not feel completely natural in the LLM landscape, especially if you are coming with a background in other frameworks.
|
|
||||||
|
|
||||||
```{include} architecture.md
|
|
||||||
:start-after: ## Llama Stack architecture
|
|
||||||
```
|
|
||||||
|
|
||||||
```{include} apis.md
|
|
||||||
:start-after: ## APIs
|
|
||||||
```
|
|
||||||
|
|
||||||
```{include} api_providers.md
|
|
||||||
:start-after: ## API Providers
|
|
||||||
```
|
|
||||||
|
|
||||||
```{include} distributions.md
|
|
||||||
:start-after: ## Distributions
|
|
||||||
```
|
|
||||||
|
|
||||||
```{include} resources.md
|
|
||||||
:start-after: ## Resources
|
|
||||||
```
|
|
|
@ -1,39 +0,0 @@
|
||||||
|
|
||||||
```{include} ../../../CONTRIBUTING.md
|
|
||||||
```
|
|
||||||
|
|
||||||
## Adding a New Provider
|
|
||||||
|
|
||||||
See:
|
|
||||||
- [Adding a New API Provider Page](new_api_provider.md) which describes how to add new API providers to the Stack.
|
|
||||||
- [Vector Database Page](new_vector_database.md) which describes how to add a new vector database with Llama Stack.
|
|
||||||
- [External Provider Page](../providers/external/index.md) which describes how to add external providers to the Stack.
|
|
||||||
|
|
||||||
```{toctree}
|
|
||||||
:maxdepth: 1
|
|
||||||
:hidden:
|
|
||||||
|
|
||||||
new_api_provider
|
|
||||||
new_vector_database
|
|
||||||
```
|
|
||||||
|
|
||||||
## Testing
|
|
||||||
|
|
||||||
|
|
||||||
```{include} ../../../tests/README.md
|
|
||||||
```
|
|
||||||
|
|
||||||
## Advanced Topics
|
|
||||||
|
|
||||||
For developers who need deeper understanding of the testing system internals:
|
|
||||||
|
|
||||||
```{toctree}
|
|
||||||
:maxdepth: 1
|
|
||||||
|
|
||||||
testing/record-replay
|
|
||||||
```
|
|
||||||
|
|
||||||
### Benchmarking
|
|
||||||
|
|
||||||
```{include} ../../../benchmarking/k8s-benchmark/README.md
|
|
||||||
```
|
|
|
@ -1,4 +0,0 @@
|
||||||
# Deployment Examples
|
|
||||||
|
|
||||||
```{include} kubernetes_deployment.md
|
|
||||||
```
|
|
|
@ -1,15 +0,0 @@
|
||||||
# Distributions Overview
|
|
||||||
|
|
||||||
A distribution is a pre-packaged set of Llama Stack components that can be deployed together.
|
|
||||||
|
|
||||||
This section provides an overview of the distributions available in Llama Stack.
|
|
||||||
|
|
||||||
```{toctree}
|
|
||||||
:maxdepth: 3
|
|
||||||
list_of_distributions
|
|
||||||
building_distro
|
|
||||||
customizing_run_yaml
|
|
||||||
starting_llama_stack_server
|
|
||||||
importing_as_library
|
|
||||||
configuration
|
|
||||||
```
|
|
|
@ -1,125 +0,0 @@
|
||||||
---
|
|
||||||
orphan: true
|
|
||||||
---
|
|
||||||
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
|
|
||||||
# Meta Reference GPU Distribution
|
|
||||||
|
|
||||||
```{toctree}
|
|
||||||
:maxdepth: 2
|
|
||||||
:hidden:
|
|
||||||
|
|
||||||
self
|
|
||||||
```
|
|
||||||
|
|
||||||
The `llamastack/distribution-meta-reference-gpu` distribution consists of the following provider configurations:
|
|
||||||
|
|
||||||
| API | Provider(s) |
|
|
||||||
|-----|-------------|
|
|
||||||
| agents | `inline::meta-reference` |
|
|
||||||
| datasetio | `remote::huggingface`, `inline::localfs` |
|
|
||||||
| eval | `inline::meta-reference` |
|
|
||||||
| inference | `inline::meta-reference` |
|
|
||||||
| safety | `inline::llama-guard` |
|
|
||||||
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
|
|
||||||
| telemetry | `inline::meta-reference` |
|
|
||||||
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
|
|
||||||
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
|
|
||||||
|
|
||||||
|
|
||||||
Note that you need access to NVIDIA GPUs to run this distribution. This distribution is not compatible with CPU-only machines or machines with AMD GPUs.
|
|
||||||
|
|
||||||
### Environment Variables
|
|
||||||
|
|
||||||
The following environment variables can be configured:
|
|
||||||
|
|
||||||
- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
|
|
||||||
- `INFERENCE_MODEL`: Inference model loaded into the Meta Reference server (default: `meta-llama/Llama-3.2-3B-Instruct`)
|
|
||||||
- `INFERENCE_CHECKPOINT_DIR`: Directory containing the Meta Reference model checkpoint (default: `null`)
|
|
||||||
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
|
|
||||||
- `SAFETY_CHECKPOINT_DIR`: Directory containing the Llama-Guard model checkpoint (default: `null`)
|
|
||||||
|
|
||||||
|
|
||||||
## Prerequisite: Downloading Models
|
|
||||||
|
|
||||||
Please use `llama model list --downloaded` to check that you have llama model checkpoints downloaded in `~/.llama` before proceeding. See [installation guide](../../references/llama_cli_reference/download_models.md) here to download the models. Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
|
|
||||||
|
|
||||||
```
|
|
||||||
$ llama model list --downloaded
|
|
||||||
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
|
|
||||||
┃ Model ┃ Size ┃ Modified Time ┃
|
|
||||||
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
|
|
||||||
│ Llama3.2-1B-Instruct:int4-qlora-eo8 │ 1.53 GB │ 2025-02-26 11:22:28 │
|
|
||||||
├─────────────────────────────────────────┼──────────┼─────────────────────┤
|
|
||||||
│ Llama3.2-1B │ 2.31 GB │ 2025-02-18 21:48:52 │
|
|
||||||
├─────────────────────────────────────────┼──────────┼─────────────────────┤
|
|
||||||
│ Prompt-Guard-86M │ 0.02 GB │ 2025-02-26 11:29:28 │
|
|
||||||
├─────────────────────────────────────────┼──────────┼─────────────────────┤
|
|
||||||
│ Llama3.2-3B-Instruct:int4-spinquant-eo8 │ 3.69 GB │ 2025-02-26 11:37:41 │
|
|
||||||
├─────────────────────────────────────────┼──────────┼─────────────────────┤
|
|
||||||
│ Llama3.2-3B │ 5.99 GB │ 2025-02-18 21:51:26 │
|
|
||||||
├─────────────────────────────────────────┼──────────┼─────────────────────┤
|
|
||||||
│ Llama3.1-8B │ 14.97 GB │ 2025-02-16 10:36:37 │
|
|
||||||
├─────────────────────────────────────────┼──────────┼─────────────────────┤
|
|
||||||
│ Llama3.2-1B-Instruct:int4-spinquant-eo8 │ 1.51 GB │ 2025-02-26 11:35:02 │
|
|
||||||
├─────────────────────────────────────────┼──────────┼─────────────────────┤
|
|
||||||
│ Llama-Guard-3-1B │ 2.80 GB │ 2025-02-26 11:20:46 │
|
|
||||||
├─────────────────────────────────────────┼──────────┼─────────────────────┤
|
|
||||||
│ Llama-Guard-3-1B:int4 │ 0.43 GB │ 2025-02-26 11:33:33 │
|
|
||||||
└─────────────────────────────────────────┴──────────┴─────────────────────┘
|
|
||||||
```
|
|
||||||
|
|
||||||
## Running the Distribution
|
|
||||||
|
|
||||||
You can do this via venv or Docker which has a pre-built image.
|
|
||||||
|
|
||||||
### Via Docker
|
|
||||||
|
|
||||||
This method allows you to get started quickly without having to build the distribution code.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
LLAMA_STACK_PORT=8321
|
|
||||||
docker run \
|
|
||||||
-it \
|
|
||||||
--pull always \
|
|
||||||
--gpu all \
|
|
||||||
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
|
|
||||||
-v ~/.llama:/root/.llama \
|
|
||||||
llamastack/distribution-meta-reference-gpu \
|
|
||||||
--port $LLAMA_STACK_PORT \
|
|
||||||
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
|
|
||||||
```
|
|
||||||
|
|
||||||
If you are using Llama Stack Safety / Shield APIs, use:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
docker run \
|
|
||||||
-it \
|
|
||||||
--pull always \
|
|
||||||
--gpu all \
|
|
||||||
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
|
|
||||||
-v ~/.llama:/root/.llama \
|
|
||||||
llamastack/distribution-meta-reference-gpu \
|
|
||||||
--port $LLAMA_STACK_PORT \
|
|
||||||
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
|
|
||||||
--env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
|
|
||||||
```
|
|
||||||
|
|
||||||
### Via venv
|
|
||||||
|
|
||||||
Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
llama stack build --distro meta-reference-gpu --image-type venv
|
|
||||||
llama stack run distributions/meta-reference-gpu/run.yaml \
|
|
||||||
--port 8321 \
|
|
||||||
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
|
|
||||||
```
|
|
||||||
|
|
||||||
If you are using Llama Stack Safety / Shield APIs, use:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
llama stack run distributions/meta-reference-gpu/run-with-safety.yaml \
|
|
||||||
--port 8321 \
|
|
||||||
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
|
|
||||||
--env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
|
|
||||||
```
|
|
|
@ -1,171 +0,0 @@
---
orphan: true
---
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
# NVIDIA Distribution

The `llamastack/distribution-nvidia` distribution consists of the following provider configurations.

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `inline::localfs`, `remote::nvidia` |
| eval | `remote::nvidia` |
| files | `inline::localfs` |
| inference | `remote::nvidia` |
| post_training | `remote::nvidia` |
| safety | `remote::nvidia` |
| scoring | `inline::basic` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `inline::rag-runtime` |
| vector_io | `inline::faiss` |

### Environment Variables

The following environment variables can be configured:

- `NVIDIA_API_KEY`: NVIDIA API Key (default: ``)
- `NVIDIA_APPEND_API_VERSION`: Whether to append the API version to the base_url (default: `True`)
- `NVIDIA_DATASET_NAMESPACE`: NVIDIA Dataset Namespace (default: `default`)
- `NVIDIA_PROJECT_ID`: NVIDIA Project ID (default: `test-project`)
- `NVIDIA_CUSTOMIZER_URL`: NVIDIA Customizer URL (default: `https://customizer.api.nvidia.com`)
- `NVIDIA_OUTPUT_MODEL_DIR`: NVIDIA Output Model Directory (default: `test-example-model@v1`)
- `GUARDRAILS_SERVICE_URL`: URL for the NeMo Guardrails Service (default: `http://0.0.0.0:7331`)
- `NVIDIA_GUARDRAILS_CONFIG_ID`: NVIDIA Guardrail Configuration ID (default: `self-check`)
- `NVIDIA_EVALUATOR_URL`: URL for the NeMo Evaluator Service (default: `http://0.0.0.0:7331`)
- `INFERENCE_MODEL`: Inference model (default: `Llama3.1-8B-Instruct`)
- `SAFETY_MODEL`: Name of the model to use for safety (default: `meta/llama-3.1-8b-instruct`)

### Models

The following models are available by default:

- `meta/llama3-8b-instruct`
- `meta/llama3-70b-instruct`
- `meta/llama-3.1-8b-instruct`
- `meta/llama-3.1-70b-instruct`
- `meta/llama-3.1-405b-instruct`
- `meta/llama-3.2-1b-instruct`
- `meta/llama-3.2-3b-instruct`
- `meta/llama-3.2-11b-vision-instruct`
- `meta/llama-3.2-90b-vision-instruct`
- `meta/llama-3.3-70b-instruct`
- `nvidia/vila`
- `nvidia/llama-3.2-nv-embedqa-1b-v2`
- `nvidia/nv-embedqa-e5-v5`
- `nvidia/nv-embedqa-mistral-7b-v2`
- `snowflake/arctic-embed-l`

## Prerequisites

### NVIDIA API Keys

Make sure you have access to an NVIDIA API Key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). Use this key for the `NVIDIA_API_KEY` environment variable.

### Deploy NeMo Microservices Platform

The NVIDIA NeMo microservices platform supports end-to-end microservice deployment of a complete AI flywheel on your Kubernetes cluster through the NeMo Microservices Helm Chart. Please refer to the [NVIDIA NeMo Microservices documentation](https://docs.nvidia.com/nemo/microservices/latest/about/index.html) for platform prerequisites and instructions to install and deploy the platform.

## Supported Services

Each Llama Stack API corresponds to a specific NeMo microservice. The core microservices (Customizer, Evaluator, Guardrails) are exposed by the same endpoint. The platform components (Data Store) are each exposed by separate endpoints.

### Inference: NVIDIA NIM

NVIDIA NIM is used for running inference with registered models. There are two ways to access NVIDIA NIMs:

1. Hosted (default): Preview APIs hosted at https://integrate.api.nvidia.com (requires an API key)
2. Self-hosted: NVIDIA NIMs that run on your own infrastructure

The deployed platform includes the NIM Proxy microservice, which provides access to your NIMs (for example, to run inference on a model). Set the `NVIDIA_BASE_URL` environment variable to use your NVIDIA NIM Proxy deployment.

### Datasetio API: NeMo Data Store

The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposes APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use the client to interact with Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint.

See the {repopath}`NVIDIA Datasetio docs::llama_stack/providers/remote/datasetio/nvidia/README.md` for supported features and example usage.
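
Because the Data Store speaks the Hugging Face Hub protocol, the standard `huggingface_hub` client can talk to it directly. The sketch below is illustrative only: the repository name and file path are placeholders, and it assumes `NVIDIA_DATASETS_URL` points at your Data Store deployment.

```python
import os

from huggingface_hub import HfApi

# Point the Hugging Face client at the NeMo Data Store instead of huggingface.co.
# An empty token is assumed to be acceptable for a local Data Store deployment.
hf_api = HfApi(endpoint=os.environ["NVIDIA_DATASETS_URL"], token="")

# Create a dataset repository and upload a local file to it (names are placeholders)
hf_api.create_repo(repo_id="default/sample-dataset", repo_type="dataset", exist_ok=True)
hf_api.upload_file(
    path_or_fileobj="./train.jsonl",
    path_in_repo="training/train.jsonl",
    repo_id="default/sample-dataset",
    repo_type="dataset",
)
```
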

### Eval API: NeMo Evaluator

The NeMo Evaluator microservice supports evaluation of LLMs. Launching an evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint.

See the {repopath}`NVIDIA Eval docs::llama_stack/providers/remote/eval/nvidia/README.md` for supported features and example usage.
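
As a rough illustration of that mapping, registering a Benchmark through the client creates the corresponding Evaluation Config. The benchmark ID, dataset ID, and metadata below are placeholders, and the exact config schema is defined by NeMo Evaluator, so treat this as a sketch rather than a complete recipe.

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Registering a Benchmark creates an Evaluation Config in NeMo Evaluator.
# The IDs and metadata below are placeholders for your own benchmark definition;
# see the NVIDIA Eval docs for the config fields NeMo Evaluator expects.
client.benchmarks.register(
    benchmark_id="my-custom-benchmark",
    dataset_id="default/sample-dataset",
    scoring_functions=[],
    metadata={
        "type": "custom",
        "params": {"parallelism": 8},
    },
)
```
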

### Post-Training API: NeMo Customizer

The NeMo Customizer microservice supports fine-tuning models. You can reference {repopath}`this list of supported models::llama_stack/providers/remote/post_training/nvidia/models.py` that can be fine-tuned using Llama Stack. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint.

See the {repopath}`NVIDIA Post-Training docs::llama_stack/providers/remote/post_training/nvidia/README.md` for supported features and example usage.

### Safety API: NeMo Guardrails

The NeMo Guardrails microservice sits between your application and the LLM, and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint.

See the {repopath}`NVIDIA Safety docs::llama_stack/providers/remote/safety/nvidia/README.md` for supported features and example usage.
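
For a flavor of how this looks from the client side, the sketch below registers a shield backed by the `nvidia` safety provider and runs a message through it. The shield ID mirrors the default `NVIDIA_GUARDRAILS_CONFIG_ID` (`self-check`) but is otherwise a placeholder; adapt it to your Guardrails configuration.

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Register a shield backed by the NVIDIA safety provider.
# "self-check" mirrors the default NVIDIA_GUARDRAILS_CONFIG_ID and is a placeholder.
client.shields.register(shield_id="self-check", provider_id="nvidia")

# Run a user message through the guardrails before sending it to the model
result = client.safety.run_shield(
    shield_id="self-check",
    messages=[{"role": "user", "content": "How do I pick a lock?"}],
    params={},
)
print(result.violation)
```
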

## Deploying models

To use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`.

Note: For improved inference speeds, use NIM with the `fast_outlines` guided decoding backend (specified in the request body). This is the default if you deployed the platform with the NeMo Microservices Helm Chart.

```sh
# URL to NeMo NIM Proxy service
export NEMO_URL="http://nemo.test"

curl --location "$NEMO_URL/v1/deployment/model-deployments" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "llama-3.2-1b-instruct",
    "namespace": "meta",
    "config": {
      "model": "meta/llama-3.2-1b-instruct",
      "nim_deployment": {
        "image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct",
        "image_tag": "1.8.3",
        "pvc_size": "25Gi",
        "gpu": 1,
        "additional_envs": {
          "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
        }
      }
    }
  }'
```

This NIM deployment should take approximately 10 minutes to go live. [See the docs](https://docs.nvidia.com/nemo/microservices/latest/get-started/tutorials/deploy-nims.html) for more information on how to deploy a NIM and verify it's available for inference.

You can also remove a deployed NIM to free up GPU resources, if needed.

```sh
export NEMO_URL="http://nemo.test"

curl -X DELETE "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct"
```

## Running Llama Stack with NVIDIA

You can do this via a local venv (build the code) or via Docker, which has a pre-built image.

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
LLAMA_STACK_PORT=8321
docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ./run.yaml:/root/my-run.yaml \
  llamastack/distribution-nvidia \
  --config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env NVIDIA_API_KEY=$NVIDIA_API_KEY
```

### Via venv

If you've set up your local development environment, you can also build the image using your local virtual environment.

```bash
INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
llama stack build --distro nvidia --image-type venv
llama stack run ./run.yaml \
  --port 8321 \
  --env NVIDIA_API_KEY=$NVIDIA_API_KEY \
  --env INFERENCE_MODEL=$INFERENCE_MODEL
```

## Example Notebooks

For examples of how to use the NVIDIA Distribution to run inference, fine-tune, evaluate, and run safety checks on your LLMs, you can reference the example notebooks in {repopath}`docs/notebooks/nvidia`.
@@ -1,13 +0,0 @@
# Getting Started

```{include} quickstart.md
:start-after: ## Quickstart
```

```{include} libraries.md
:start-after: ## Libraries (SDKs)
```

```{include} detailed_tutorial.md
:start-after: ## Detailed Tutorial
```
@@ -1,133 +0,0 @@
# Llama Stack

Welcome to Llama Stack, the open-source framework for building generative AI applications.

```{admonition} Llama 4 is here!
:class: tip

Check out [Getting Started with Llama 4](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started_llama4.ipynb)
```

```{admonition} News
:class: tip

Llama Stack {{ llama_stack_version }} is now available! See the {{ llama_stack_version_link }} for more details.
```

## What is Llama Stack?

Llama Stack defines and standardizes the core building blocks needed to bring generative AI applications to market. It provides a unified set of APIs with implementations from leading service providers, enabling seamless transitions between development and production environments. More specifically, it provides:

- **Unified API layer** for Inference, RAG, Agents, Tools, Safety, Evals, and Telemetry.
- **Plugin architecture** to support the rich ecosystem of API implementations across environments: local development, on-premises, cloud, and mobile.
- **Prepackaged verified distributions** which offer a one-stop solution for developers to get started quickly and reliably in any environment.
- **Multiple developer interfaces** like CLI and SDKs for Python, Node, iOS, and Android.
- **Standalone applications** as examples of how to build production-grade AI applications with Llama Stack.

```{image} ../_static/llama-stack.png
:alt: Llama Stack
:width: 400px
```

Our goal is to provide pre-packaged implementations (aka "distributions") which can be run in a variety of deployment environments. Llama Stack can assist you across your entire app development lifecycle: start iterating on local, mobile, or desktop and seamlessly transition to on-prem or public cloud deployments. At every point in this transition, the same set of APIs and the same developer experience are available.

## How does Llama Stack work?

Llama Stack consists of a [server](./distributions/index.md) (with multiple pluggable API [providers](./providers/index.md)) and client SDKs (see below) meant to be used in your applications. The server can be run in a variety of environments, including local (inline) development, on-premises, and cloud. The client SDKs are available for Python, Swift, Node, and Kotlin.
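
As an illustrative sketch of that split (assuming a server running locally on port 8321), the Python SDK talks to whichever server you point it at, so the same client code works across environments:

```python
from llama_stack_client import LlamaStackClient

# Point the client at any running Llama Stack server: local, on-prem, or cloud
client = LlamaStackClient(base_url="http://localhost:8321")

# The same API surface is available in every environment,
# e.g. listing the models the server currently serves
for model in client.models.list():
    print(model.identifier)
```
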

## Quick Links

- Ready to build? Check out the [Quick Start](getting_started/index) to get started.
- Want to contribute? See the [Contributing](contributing/index) guide.

## Supported Llama Stack Implementations

A number of "adapters" are available for some popular Inference and Vector Store providers. For other APIs (particularly Safety and Agents), we provide *reference implementations* you can use to get started. We expect this list to grow over time. We are slowly onboarding more providers to the ecosystem as we get more confidence in the APIs.

**Inference API**
| **Provider** | **Environments** |
| :----: | :----: |
| Meta Reference | Single Node |
| Ollama | Single Node |
| Fireworks | Hosted |
| Together | Hosted |
| NVIDIA NIM | Hosted and Single Node |
| vLLM | Hosted and Single Node |
| TGI | Hosted and Single Node |
| AWS Bedrock | Hosted |
| Cerebras | Hosted |
| Groq | Hosted |
| SambaNova | Hosted |
| PyTorch ExecuTorch | On-device iOS, Android |
| OpenAI | Hosted |
| Anthropic | Hosted |
| Gemini | Hosted |
| WatsonX | Hosted |

**Agents API**
| **Provider** | **Environments** |
| :----: | :----: |
| Meta Reference | Single Node |
| Fireworks | Hosted |
| Together | Hosted |
| PyTorch ExecuTorch | On-device iOS |

**Vector IO API**
| **Provider** | **Environments** |
| :----: | :----: |
| FAISS | Single Node |
| SQLite-Vec | Single Node |
| Chroma | Hosted and Single Node |
| Milvus | Hosted and Single Node |
| Postgres (PGVector) | Hosted and Single Node |
| Weaviate | Hosted |
| Qdrant | Hosted and Single Node |

**Safety API**
| **Provider** | **Environments** |
| :----: | :----: |
| Llama Guard | Depends on Inference Provider |
| Prompt Guard | Single Node |
| Code Scanner | Single Node |
| AWS Bedrock | Hosted |

**Post Training API**
| **Provider** | **Environments** |
| :----: | :----: |
| Meta Reference | Single Node |
| HuggingFace | Single Node |
| TorchTune | Single Node |
| NVIDIA NEMO | Hosted |

**Eval API**
| **Provider** | **Environments** |
| :----: | :----: |
| Meta Reference | Single Node |
| NVIDIA NEMO | Hosted |

**Telemetry API**
| **Provider** | **Environments** |
| :----: | :----: |
| Meta Reference | Single Node |

**Tool Runtime API**
| **Provider** | **Environments** |
| :----: | :----: |
| Brave Search | Hosted |
| RAG Runtime | Single Node |

```{toctree}
:hidden:
:maxdepth: 3

self
getting_started/index
concepts/index
providers/index
distributions/index
advanced_apis/index
building_applications/index
deploying/index
contributing/index
references/index
```
@@ -1,6 +0,0 @@
{.hide-title}
# API Reference

```{raw} html
:file: ../../../_static/llama-stack-spec.html
```
@@ -1,18 +0,0 @@
# References

- [API Reference](api_reference/index) for the Llama Stack API specification
- [Python SDK Reference](python_sdk_reference/index)
- [Llama CLI](llama_cli_reference/index) for building and running your Llama Stack server
- [Llama Stack Client CLI](llama_stack_client_cli_reference) for interacting with your Llama Stack server

```{toctree}
:maxdepth: 1
:hidden:

api_reference/index
python_sdk_reference/index
llama_cli_reference/index
llama_stack_client_cli_reference
llama_cli_reference/download_models
evals_reference/index
```