Mirror of https://github.com/meta-llama/llama-stack.git (synced 2025-10-08 13:00:52 +00:00)
docs: concepts and building_applications migration (#3534)
# What does this PR do?

Migrates the remaining documentation sections to the new documentation format.

## Test Plan

Partial migration.
This commit is contained in: parent 05ff4c4420 · commit c71ce8df61
82 changed files with 2535 additions and 1237 deletions
docs/docs/building_applications/agent.mdx (new file, 112 lines)
@@ -0,0 +1,112 @@
|
|||
---
|
||||
title: Agents
|
||||
description: Build powerful AI applications with the Llama Stack agent framework
|
||||
sidebar_label: Agents
|
||||
sidebar_position: 3
|
||||
---
|
||||
|
||||
import Tabs from '@theme/Tabs';
|
||||
import TabItem from '@theme/TabItem';
|
||||
|
||||
# Agents
|
||||
|
||||
An Agent in Llama Stack is a powerful abstraction that allows you to build complex AI applications.
|
||||
|
||||
The Llama Stack agent framework is built on a modular architecture that supports flexible and powerful AI applications. This document explains the key components and how they work together.
|
||||
|
||||
## Core Concepts
|
||||
|
||||
### 1. Agent Configuration
|
||||
|
||||
Agents are configured using the `AgentConfig` class, which includes:
|
||||
|
||||
- **Model**: The underlying LLM to power the agent
|
||||
- **Instructions**: System prompt that defines the agent's behavior
|
||||
- **Tools**: Capabilities the agent can use to interact with external systems
|
||||
- **Safety Shields**: Guardrails to ensure responsible AI behavior
|
||||
|
||||
```python
|
||||
from llama_stack_client import Agent
|
||||
|
||||
# Create the agent
|
||||
agent = Agent(
|
||||
llama_stack_client,
|
||||
model="meta-llama/Llama-3-70b-chat",
|
||||
instructions="You are a helpful assistant that can use tools to answer questions.",
|
||||
tools=["builtin::code_interpreter", "builtin::rag/knowledge_search"],
|
||||
)
|
||||
```
|
||||
|
||||
### 2. Sessions
|
||||
|
||||
Agents maintain state through sessions, which represent a conversation thread:
|
||||
|
||||
```python
|
||||
# Create a session
|
||||
session_id = agent.create_session(session_name="My conversation")
|
||||
```
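Because sessions persist on the Llama Stack server, you can retrieve a session later to inspect its conversation history. The following is a minimal sketch; it assumes the `llama_stack_client` and `agent` from the configuration example above and uses the same session retrieval call shown in the [evaluation guide](./evals):

```python
# Retrieve the session later to inspect its turns
# (sketch; assumes `llama_stack_client` and `agent` from the example above)
session = llama_stack_client.agents.session.retrieve(
    agent_id=agent.agent_id,
    session_id=session_id,
)
print(f"Number of turns so far: {len(session.turns)}")
```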
|
||||
|
||||
### 3. Turns
|
||||
|
||||
Each interaction with an agent is called a "turn" and consists of:
|
||||
|
||||
- **Input Messages**: What the user sends to the agent
|
||||
- **Steps**: The agent's internal processing (inference, tool execution, etc.)
|
||||
- **Output Message**: The agent's response
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="streaming" label="Streaming Response">
|
||||
|
||||
```python
|
||||
from llama_stack_client import AgentEventLogger
|
||||
|
||||
# Create a turn with streaming response
|
||||
turn_response = agent.create_turn(
|
||||
session_id=session_id,
|
||||
messages=[{"role": "user", "content": "Tell me about Llama models"}],
|
||||
)
|
||||
for log in AgentEventLogger().log(turn_response):
|
||||
log.print()
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
<TabItem value="non-streaming" label="Non-Streaming Response">
|
||||
|
||||
```python
|
||||
from rich.pretty import pprint
|
||||
|
||||
# Non-streaming API
|
||||
response = agent.create_turn(
|
||||
session_id=session_id,
|
||||
messages=[{"role": "user", "content": "Tell me about Llama models"}],
|
||||
stream=False,
|
||||
)
|
||||
print("Inputs:")
|
||||
pprint(response.input_messages)
|
||||
print("Output:")
|
||||
pprint(response.output_message.content)
|
||||
print("Steps:")
|
||||
pprint(response.steps)
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
||||
### 4. Steps
|
||||
|
||||
Each turn consists of multiple steps that represent the agent's thought process:
|
||||
|
||||
- **Inference Steps**: The agent generating text responses
|
||||
- **Tool Execution Steps**: The agent using tools to gather information
|
||||
- **Shield Call Steps**: Safety checks being performed
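You can inspect these steps programmatically on a non-streaming turn. The sketch below assumes the `agent` and `session_id` from the earlier examples; `step_type` and `tool_calls` are the same fields used in the [evaluation guide](./evals):

```python
# Sketch: walk through the steps of a single turn
# (assumes `agent` and `session_id` from the examples above)
response = agent.create_turn(
    session_id=session_id,
    messages=[{"role": "user", "content": "Tell me about Llama models"}],
    stream=False,
)

for step in response.steps:
    # step_type is e.g. "inference", "tool_execution", or "shield_call"
    print(step.step_type)
    if step.step_type == "tool_execution":
        print(f"  tool called: {step.tool_calls[0].tool_name}")
```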
|
||||
|
||||
## Agent Execution Loop
|
||||
|
||||
Refer to the [Agent Execution Loop](./agent_execution_loop) for more details on what happens within an agent turn.
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **[Agent Execution Loop](./agent_execution_loop)** - Understanding the internal processing flow
|
||||
- **[RAG (Retrieval Augmented Generation)](./rag)** - Building knowledge-enhanced agents
|
||||
- **[Tools Integration](./tools)** - Extending agent capabilities with external tools
|
||||
- **[Safety Guardrails](./safety)** - Implementing responsible AI practices
|
docs/docs/building_applications/agent_execution_loop.mdx (new file, 185 lines)
@@ -0,0 +1,185 @@
|
|||
---
|
||||
title: Agent Execution Loop
|
||||
description: Understanding the internal processing flow of Llama Stack agents
|
||||
sidebar_label: Agent Execution Loop
|
||||
sidebar_position: 4
|
||||
---
|
||||
|
||||
import Tabs from '@theme/Tabs';
|
||||
import TabItem from '@theme/TabItem';
|
||||
|
||||
# Agent Execution Loop
|
||||
|
||||
Agents are the heart of Llama Stack applications. They combine inference, memory, safety, and tool usage into coherent workflows. At its core, an agent follows a sophisticated execution loop that enables multi-step reasoning, tool usage, and safety checks.
|
||||
|
||||
## Steps in the Agent Workflow
|
||||
|
||||
Each agent turn follows these key steps:
|
||||
|
||||
1. **Initial Safety Check**: The user's input is first screened through configured safety shields
|
||||
|
||||
2. **Context Retrieval**:
|
||||
- If RAG is enabled, the agent can choose to query relevant documents from memory banks. You can use the `instructions` field to steer the agent.
|
||||
- New documents are first inserted into the memory bank.
|
||||
- Retrieved context is provided to the LLM as a tool response in the message history.
|
||||
|
||||
3. **Inference Loop**: The agent enters its main execution loop:
|
||||
- The LLM receives a user prompt (with previous tool outputs)
|
||||
- The LLM generates a response, potentially with [tool calls](./tools)
|
||||
- If tool calls are present:
|
||||
- Tool inputs are safety-checked
|
||||
- Tools are executed (e.g., web search, code execution)
|
||||
- Tool responses are fed back to the LLM for synthesis
|
||||
- The loop continues until:
|
||||
- The LLM provides a final response without tool calls
|
||||
- Maximum iterations are reached
|
||||
- Token limit is exceeded
|
||||
|
||||
4. **Final Safety Check**: The agent's final response is screened through safety shields
|
||||
|
||||
## Execution Flow Diagram
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant U as User
|
||||
participant E as Executor
|
||||
participant M as Memory Bank
|
||||
participant L as LLM
|
||||
participant T as Tools
|
||||
participant S as Safety Shield
|
||||
|
||||
Note over U,S: Agent Turn Start
|
||||
U->>S: 1. Submit Prompt
|
||||
activate S
|
||||
S->>E: Input Safety Check
|
||||
deactivate S
|
||||
|
||||
loop Inference Loop
|
||||
E->>L: 2.1 Augment with Context
|
||||
L-->>E: 2.2 Response (with/without tool calls)
|
||||
|
||||
alt Has Tool Calls
|
||||
E->>S: Check Tool Input
|
||||
S->>T: 3.1 Execute Tool
|
||||
T-->>E: 3.2 Tool Response
|
||||
E->>L: 4.1 Tool Response
|
||||
L-->>E: 4.2 Synthesized Response
|
||||
end
|
||||
|
||||
opt Stop Conditions
|
||||
Note over E: Break if:
|
||||
Note over E: - No tool calls
|
||||
Note over E: - Max iterations reached
|
||||
Note over E: - Token limit exceeded
|
||||
end
|
||||
end
|
||||
|
||||
E->>S: Output Safety Check
|
||||
S->>U: 5. Final Response
|
||||
```
|
||||
|
||||
Each step in this process can be monitored and controlled through configurations.
|
||||
|
||||
## Agent Execution Example
|
||||
|
||||
Here's an example that demonstrates monitoring the agent's execution:
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="streaming" label="Streaming Execution">
|
||||
|
||||
```python
|
||||
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger
|
||||
|
||||
# Replace host and port
|
||||
client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")
|
||||
|
||||
agent = Agent(
|
||||
client,
|
||||
# Check with `llama-stack-client models list`
|
||||
model="Llama3.2-3B-Instruct",
|
||||
instructions="You are a helpful assistant",
|
||||
# Enable both RAG and tool usage
|
||||
tools=[
|
||||
{
|
||||
"name": "builtin::rag/knowledge_search",
|
||||
"args": {"vector_db_ids": ["my_docs"]},
|
||||
},
|
||||
"builtin::code_interpreter",
|
||||
],
|
||||
# Configure safety (optional)
|
||||
input_shields=["llama_guard"],
|
||||
output_shields=["llama_guard"],
|
||||
# Control the inference loop
|
||||
max_infer_iters=5,
|
||||
sampling_params={
|
||||
"strategy": {"type": "top_p", "temperature": 0.7, "top_p": 0.95},
|
||||
"max_tokens": 2048,
|
||||
},
|
||||
)
|
||||
session_id = agent.create_session("monitored_session")
|
||||
|
||||
# Stream the agent's execution steps
|
||||
response = agent.create_turn(
|
||||
messages=[{"role": "user", "content": "Analyze this code and run it"}],
|
||||
documents=[
|
||||
{
|
||||
"content": "https://raw.githubusercontent.com/example/code.py",
|
||||
"mime_type": "text/plain",
|
||||
}
|
||||
],
|
||||
session_id=session_id,
|
||||
)
|
||||
|
||||
# Monitor each step of execution
|
||||
for log in AgentEventLogger().log(response):
|
||||
log.print()
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
<TabItem value="non-streaming" label="Non-Streaming Execution">
|
||||
|
||||
```python
|
||||
from rich.pretty import pprint
|
||||
|
||||
# With the non-streaming API, the response contains input, steps, and output.
|
||||
response = agent.create_turn(
|
||||
messages=[{"role": "user", "content": "Analyze this code and run it"}],
|
||||
documents=[
|
||||
{
|
||||
"content": "https://raw.githubusercontent.com/example/code.py",
|
||||
"mime_type": "text/plain",
|
||||
}
|
||||
],
|
||||
session_id=session_id,
|
||||
stream=False,
|
||||
)
|
||||
|
||||
pprint(f"Input: {response.input_messages}")
|
||||
pprint(f"Output: {response.output_message.content}")
|
||||
pprint(f"Steps: {response.steps}")
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
||||
## Key Configuration Options
|
||||
|
||||
### Loop Control
|
||||
- **max_infer_iters**: Maximum number of inference iterations (default: 5)
|
||||
- **max_tokens**: Token limit for responses
|
||||
- **temperature**: Controls response randomness
|
||||
|
||||
### Safety Configuration
|
||||
- **input_shields**: Safety checks for user input
|
||||
- **output_shields**: Safety checks for agent responses
|
||||
|
||||
### Tool Integration
|
||||
- **tools**: List of available tools for the agent
|
||||
- **tool_choice**: Control over when tools are used
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **[Agents](./agent)** - Understanding agent fundamentals
|
||||
- **[Tools Integration](./tools)** - Adding capabilities to agents
|
||||
- **[Safety Guardrails](./safety)** - Implementing safety measures
|
||||
- **[RAG (Retrieval Augmented Generation)](./rag)** - Building knowledge-enhanced workflows
|
docs/docs/building_applications/evals.mdx (new file, 256 lines)
@@ -0,0 +1,256 @@
|
|||
---
|
||||
title: Evaluations
|
||||
description: Evaluate LLM applications with Llama Stack's comprehensive evaluation framework
|
||||
sidebar_label: Evaluations
|
||||
sidebar_position: 7
|
||||
---
|
||||
|
||||
import Tabs from '@theme/Tabs';
|
||||
import TabItem from '@theme/TabItem';
|
||||
|
||||
This guide walks you through the process of evaluating an LLM application built using Llama Stack. For detailed API reference, check out the [Evaluation Reference](/docs/references/evals-reference) guide that covers the complete set of APIs and developer experience flow.
|
||||
|
||||
:::tip[Interactive Examples]
|
||||
Check out our [Colab notebook](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing) for working examples with evaluations, or try the [Getting Started notebook](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb).
|
||||
:::
|
||||
|
||||
## Application Evaluation Example
|
||||
|
||||
[](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)
|
||||
|
||||
Llama Stack offers a library of scoring functions and the `/scoring` API, allowing you to run evaluations on your pre-annotated AI application datasets.
|
||||
|
||||
In this example, we will show you how to:
|
||||
1. **Build an Agent** with Llama Stack
|
||||
2. **Query the agent's sessions, turns, and steps** to analyze execution
|
||||
3. **Evaluate the results** using scoring functions
|
||||
|
||||
## Step-by-Step Evaluation Process
|
||||
|
||||
### 1. Building a Search Agent
|
||||
|
||||
First, let's create an agent that can search the web to answer questions:
|
||||
|
||||
```python
|
||||
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger
|
||||
|
||||
client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")
|
||||
|
||||
agent = Agent(
|
||||
client,
|
||||
model="meta-llama/Llama-3.3-70B-Instruct",
|
||||
instructions="You are a helpful assistant. Use search tool to answer the questions.",
|
||||
tools=["builtin::websearch"],
|
||||
)
|
||||
|
||||
# Test prompts for evaluation
|
||||
user_prompts = [
|
||||
"Which teams played in the NBA Western Conference Finals of 2024. Search the web for the answer.",
|
||||
"In which episode and season of South Park does Bill Cosby (BSM-471) first appear? Give me the number and title. Search the web for the answer.",
|
||||
"What is the British-American kickboxer Andrew Tate's kickboxing name? Search the web for the answer.",
|
||||
]
|
||||
|
||||
session_id = agent.create_session("test-session")
|
||||
|
||||
# Execute all prompts in the session
|
||||
for prompt in user_prompts:
|
||||
response = agent.create_turn(
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": prompt,
|
||||
}
|
||||
],
|
||||
session_id=session_id,
|
||||
)
|
||||
|
||||
for log in AgentEventLogger().log(response):
|
||||
log.print()
|
||||
```
|
||||
|
||||
### 2. Query Agent Execution Steps
|
||||
|
||||
Now, let's analyze the agent's execution steps to understand its performance:
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="session-analysis" label="Session Analysis">
|
||||
|
||||
```python
|
||||
from rich.pretty import pprint
|
||||
|
||||
# Query the agent's session to get detailed execution data
|
||||
session_response = client.agents.session.retrieve(
|
||||
session_id=session_id,
|
||||
agent_id=agent.agent_id,
|
||||
)
|
||||
|
||||
pprint(session_response)
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
<TabItem value="tool-validation" label="Tool Usage Validation">
|
||||
|
||||
```python
|
||||
# Sanity check: Verify that all user prompts are followed by tool calls
|
||||
num_tool_call = 0
|
||||
for turn in session_response.turns:
|
||||
for step in turn.steps:
|
||||
if (
|
||||
step.step_type == "tool_execution"
|
||||
and step.tool_calls[0].tool_name == "brave_search"
|
||||
):
|
||||
num_tool_call += 1
|
||||
|
||||
print(
|
||||
f"{num_tool_call}/{len(session_response.turns)} user prompts are followed by a tool call to `brave_search`"
|
||||
)
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
||||
### 3. Evaluate Agent Responses
|
||||
|
||||
Now we'll evaluate the agent's responses using Llama Stack's scoring API:
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="data-preparation" label="Data Preparation">
|
||||
|
||||
```python
|
||||
# Process agent execution history into evaluation rows
|
||||
eval_rows = []
|
||||
|
||||
# Define expected answers for our test prompts
|
||||
expected_answers = [
|
||||
"Dallas Mavericks and the Minnesota Timberwolves",
|
||||
"Season 4, Episode 12",
|
||||
"King Cobra",
|
||||
]
|
||||
|
||||
# Create evaluation dataset from agent responses
|
||||
for i, turn in enumerate(session_response.turns):
|
||||
eval_rows.append(
|
||||
{
|
||||
"input_query": turn.input_messages[0].content,
|
||||
"generated_answer": turn.output_message.content,
|
||||
"expected_answer": expected_answers[i],
|
||||
}
|
||||
)
|
||||
|
||||
pprint(eval_rows)
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
<TabItem value="scoring" label="Scoring & Evaluation">
|
||||
|
||||
```python
|
||||
# Configure scoring parameters
|
||||
scoring_params = {
|
||||
"basic::subset_of": None, # Check if generated answer contains expected answer
|
||||
}
|
||||
|
||||
# Run evaluation using Llama Stack's scoring API
|
||||
scoring_response = client.scoring.score(
|
||||
input_rows=eval_rows,
|
||||
scoring_functions=scoring_params
|
||||
)
|
||||
|
||||
pprint(scoring_response)
|
||||
|
||||
# Analyze results
|
||||
for i, result in enumerate(scoring_response.results):
|
||||
print(f"Query {i+1}: {result.score}")
|
||||
print(f" Generated: {eval_rows[i]['generated_answer'][:100]}...")
|
||||
print(f" Expected: {expected_answers[i]}")
|
||||
print(f" Score: {result.score}")
|
||||
print()
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
||||
## Available Scoring Functions
|
||||
|
||||
Llama Stack provides several built-in scoring functions:
|
||||
|
||||
### Basic Scoring Functions
|
||||
- **`basic::subset_of`**: Checks if the expected answer is contained in the generated response
|
||||
- **`basic::exact_match`**: Performs exact string matching between expected and generated answers
|
||||
- **`basic::regex_match`**: Uses regular expressions to match patterns in responses
|
||||
|
||||
### Advanced Scoring Functions
|
||||
- **`llm_as_judge::accuracy`**: Uses an LLM to judge response accuracy
|
||||
- **`llm_as_judge::helpfulness`**: Evaluates how helpful the response is
|
||||
- **`llm_as_judge::safety`**: Assesses response safety and appropriateness
|
||||
|
||||
### Custom Scoring Functions
|
||||
You can also create custom scoring functions for domain-specific evaluation needs.
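As a rough illustration, judge-based scoring functions can also be parameterized with your own judge model and prompt, following the same `scoring_functions` dictionary pattern used in the batch example below. The parameter names in this sketch (`judge_model`, `prompt_template`, `judge_score_regexes`) are assumptions; check your scoring provider's schema for the exact fields:

```python
# Hypothetical sketch: parameterizing an LLM-as-judge scoring function.
# Parameter names are assumptions, not a confirmed schema.
custom_scoring = {
    "llm_as_judge::accuracy": {
        "judge_model": "meta-llama/Llama-3.3-70B-Instruct",
        "prompt_template": "Compare the generated answer to the expected answer and reply with 'Score: <0-10>'.",
        "judge_score_regexes": [r"Score: (\d+)"],
    },
}

results = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions=custom_scoring,
)
```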
|
||||
|
||||
## Evaluation Workflow Best Practices
|
||||
|
||||
### 🎯 **Dataset Preparation**
|
||||
- Use diverse test cases that cover edge cases and common scenarios
|
||||
- Include clear expected answers or success criteria
|
||||
- Balance your dataset across different difficulty levels
|
||||
|
||||
### 📊 **Metrics Selection**
|
||||
- Choose appropriate scoring functions for your use case
|
||||
- Combine multiple metrics for comprehensive evaluation
|
||||
- Consider both automated and human evaluation metrics
|
||||
|
||||
### 🔄 **Iterative Improvement**
|
||||
- Run evaluations regularly during development
|
||||
- Use evaluation results to identify areas for improvement
|
||||
- Track performance changes over time
|
||||
|
||||
### 📈 **Analysis & Reporting**
|
||||
- Analyze failures to understand model limitations
|
||||
- Generate comprehensive evaluation reports
|
||||
- Share results with stakeholders for informed decision-making
|
||||
|
||||
## Advanced Evaluation Scenarios
|
||||
|
||||
### Batch Evaluation
|
||||
For evaluating large datasets efficiently:
|
||||
|
||||
```python
|
||||
# Prepare large evaluation dataset
|
||||
large_eval_dataset = [
|
||||
{"input_query": query, "expected_answer": answer}
|
||||
for query, answer in zip(queries, expected_answers)
|
||||
]
|
||||
|
||||
# Run batch evaluation
|
||||
batch_results = client.scoring.score(
|
||||
input_rows=large_eval_dataset,
|
||||
scoring_functions={
|
||||
"basic::subset_of": None,
|
||||
"llm_as_judge::accuracy": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
### Multi-Metric Evaluation
|
||||
Combining different scoring approaches:
|
||||
|
||||
```python
|
||||
comprehensive_scoring = {
    "basic::exact_match": None,
    "basic::subset_of": None,
    "llm_as_judge::accuracy": None,
    "llm_as_judge::safety": None,
}
|
||||
|
||||
results = client.scoring.score(
|
||||
input_rows=eval_rows,
|
||||
scoring_functions=comprehensive_scoring
|
||||
)
|
||||
```
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **[Agents](./agent)** - Building agents for evaluation
|
||||
- **[Tools Integration](./tools)** - Using tools in evaluated agents
|
||||
- **[Evaluation Reference](/docs/references/evals-reference)** - Complete API reference for evaluations
|
||||
- **[Getting Started Notebook](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)** - Interactive examples
|
||||
- **[Evaluation Examples](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing)** - Additional evaluation scenarios
|
docs/docs/building_applications/index.mdx (new file, 83 lines)
@@ -0,0 +1,83 @@
|
|||
---
|
||||
title: Building Applications
|
||||
description: Comprehensive guides for building AI applications with Llama Stack
|
||||
sidebar_label: Overview
|
||||
sidebar_position: 5
|
||||
---
|
||||
|
||||
# AI Application Examples
|
||||
|
||||
Llama Stack provides all the building blocks needed to create sophisticated AI applications.
|
||||
|
||||
## Getting Started
|
||||
|
||||
The best way to get started is to work through this comprehensive notebook, which walks through the various APIs (from basic inference to RAG agents) and how to use them.
|
||||
|
||||
**📓 [Building AI Applications Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)**
|
||||
|
||||
## Core Topics
|
||||
|
||||
Here are the key topics that will help you build effective AI applications:
|
||||
|
||||
### 🤖 **Agent Development**
|
||||
- **[Agent Framework](./agent)** - Understand the components and design patterns of the Llama Stack agent framework
|
||||
- **[Agent Execution Loop](./agent_execution_loop)** - How agents process information, make decisions, and execute actions
|
||||
- **[Agents vs Responses API](./responses_vs_agents)** - Learn when to use each API for different use cases
|
||||
|
||||
### 📚 **Knowledge Integration**
|
||||
- **[RAG (Retrieval-Augmented Generation)](./rag)** - Enhance your agents with external knowledge through retrieval mechanisms
|
||||
|
||||
### 🛠️ **Capabilities & Extensions**
|
||||
- **[Tools](./tools)** - Extend your agents' capabilities by integrating with external tools and APIs
|
||||
|
||||
### 📊 **Quality & Monitoring**
|
||||
- **[Evaluations](./evals)** - Evaluate your agents' effectiveness and identify areas for improvement
|
||||
- **[Telemetry](./telemetry)** - Monitor and analyze your agents' performance and behavior
|
||||
- **[Safety](./safety)** - Implement guardrails and safety measures to ensure responsible AI behavior
|
||||
|
||||
### 🎮 **Interactive Development**
|
||||
- **[Playground](./playground)** - Interactive environment for testing and developing applications
|
||||
|
||||
## Application Patterns
|
||||
|
||||
### 🤖 **Conversational Agents**
|
||||
Build intelligent chatbots and assistants that can:
|
||||
- Maintain context across conversations
|
||||
- Access external knowledge bases
|
||||
- Execute actions through tool integrations
|
||||
- Apply safety filters and guardrails
|
||||
|
||||
### 📖 **RAG Applications**
|
||||
Create knowledge-augmented applications that:
|
||||
- Retrieve relevant information from documents
|
||||
- Generate contextually accurate responses
|
||||
- Handle large knowledge bases efficiently
|
||||
- Provide source attribution
|
||||
|
||||
### 🔧 **Tool-Enhanced Systems**
|
||||
Develop applications that can:
|
||||
- Search the web for real-time information
|
||||
- Interact with databases and APIs
|
||||
- Perform calculations and analysis
|
||||
- Execute complex multi-step workflows
|
||||
|
||||
### 🛡️ **Enterprise Applications**
|
||||
Build production-ready systems with:
|
||||
- Comprehensive safety measures
|
||||
- Performance monitoring and analytics
|
||||
- Scalable deployment configurations
|
||||
- Evaluation and quality assurance
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. **📖 Start with the Notebook** - Work through the complete tutorial
|
||||
2. **🎯 Choose Your Pattern** - Pick the application type that matches your needs
|
||||
3. **🏗️ Build Your Foundation** - Set up your [providers](/docs/providers/) and [distributions](/docs/distributions/)
|
||||
4. **🚀 Deploy & Monitor** - Use our [deployment guides](/docs/deploying/) for production
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **[Getting Started](/docs/getting-started/)** - Basic setup and concepts
|
||||
- **[Providers](/docs/providers/)** - Available AI service providers
|
||||
- **[Distributions](/docs/distributions/)** - Pre-configured deployment packages
|
||||
- **[API Reference](/docs/api/)** - Complete API documentation
|
docs/docs/building_applications/playground.mdx (new file, 299 lines)
@@ -0,0 +1,299 @@
|
|||
---
|
||||
title: Llama Stack Playground
|
||||
description: Interactive interface to explore and experiment with Llama Stack capabilities
|
||||
sidebar_label: Playground
|
||||
sidebar_position: 10
|
||||
---
|
||||
|
||||
import Tabs from '@theme/Tabs';
|
||||
import TabItem from '@theme/TabItem';
|
||||
|
||||
# Llama Stack Playground
|
||||
|
||||
:::note[Experimental Feature]
|
||||
The Llama Stack Playground is currently experimental and subject to change. We welcome feedback and contributions to help improve it.
|
||||
:::
|
||||
|
||||
The Llama Stack Playground is a simple interface that aims to:
|
||||
- **Showcase capabilities and concepts** of Llama Stack in an interactive environment
|
||||
- **Demo end-to-end application code** to help users get started building their own applications
|
||||
- **Provide a UI** to help users inspect and understand Llama Stack API providers and resources
|
||||
|
||||
## Key Features
|
||||
|
||||
### Interactive Playground Pages
|
||||
|
||||
The playground provides interactive pages for users to explore Llama Stack API capabilities:
|
||||
|
||||
#### Chatbot Interface
|
||||
|
||||
<video
|
||||
controls
|
||||
autoPlay
|
||||
playsInline
|
||||
muted
|
||||
loop
|
||||
style={{width: '100%'}}
|
||||
>
|
||||
<source src="https://github.com/user-attachments/assets/8d2ef802-5812-4a28-96e1-316038c84cbf" type="video/mp4" />
|
||||
Your browser does not support the video tag.
|
||||
</video>
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="chat" label="Chat">
|
||||
|
||||
**Simple Chat Interface**
|
||||
- Chat directly with Llama models through an intuitive interface
|
||||
- Uses the `/inference/chat-completion` streaming API under the hood
|
||||
- Real-time message streaming for responsive interactions
|
||||
- Perfect for testing model capabilities and prompt engineering
|
||||
|
||||
</TabItem>
|
||||
<TabItem value="rag" label="RAG Chat">
|
||||
|
||||
**Document-Aware Conversations**
|
||||
- Upload documents to create memory banks
|
||||
- Chat with a RAG-enabled agent that can query your documents
|
||||
- Uses Llama Stack's `/agents` API to create and manage RAG sessions
|
||||
- Ideal for exploring knowledge-enhanced AI applications
|
||||
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
||||
#### Evaluation Interface
|
||||
|
||||
<video
|
||||
controls
|
||||
autoPlay
|
||||
playsInline
|
||||
muted
|
||||
loop
|
||||
style={{width: '100%'}}
|
||||
>
|
||||
<source src="https://github.com/user-attachments/assets/6cc1659f-eba4-49ca-a0a5-7c243557b4f5" type="video/mp4" />
|
||||
Your browser does not support the video tag.
|
||||
</video>
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="scoring" label="Scoring Evaluations">
|
||||
|
||||
**Custom Dataset Evaluation**
|
||||
- Upload your own evaluation datasets
|
||||
- Run evaluations using available scoring functions
|
||||
- Uses Llama Stack's `/scoring` API for flexible evaluation workflows
|
||||
- Great for testing application performance on custom metrics
|
||||
|
||||
</TabItem>
|
||||
<TabItem value="benchmarks" label="Benchmark Evaluations">
|
||||
|
||||
<video
|
||||
controls
|
||||
autoPlay
|
||||
playsInline
|
||||
muted
|
||||
loop
|
||||
style={{width: '100%', marginBottom: '1rem'}}
|
||||
>
|
||||
<source src="https://github.com/user-attachments/assets/345845c7-2a2b-4095-960a-9ae40f6a93cf" type="video/mp4" />
|
||||
Your browser does not support the video tag.
|
||||
</video>
|
||||
|
||||
**Pre-registered Evaluation Tasks**
|
||||
- Evaluate models or agents on pre-defined tasks
|
||||
- Uses Llama Stack's `/eval` API for comprehensive evaluation
|
||||
- Combines datasets and scoring functions for standardized testing
|
||||
|
||||
**Setup Requirements:**
|
||||
Register evaluation datasets and benchmarks first:
|
||||
|
||||
```bash
|
||||
# Register evaluation dataset
|
||||
llama-stack-client datasets register \
|
||||
--dataset-id "mmlu" \
|
||||
--provider-id "huggingface" \
|
||||
--url "https://huggingface.co/datasets/llamastack/evals" \
|
||||
--metadata '{"path": "llamastack/evals", "name": "evals__mmlu__details", "split": "train"}' \
|
||||
--schema '{"input_query": {"type": "string"}, "expected_answer": {"type": "string"}, "chat_completion_input": {"type": "string"}}'
|
||||
|
||||
# Register benchmark task
|
||||
llama-stack-client benchmarks register \
|
||||
--eval-task-id meta-reference-mmlu \
|
||||
--provider-id meta-reference \
|
||||
--dataset-id mmlu \
|
||||
--scoring-functions basic::regex_parser_multiple_choice_answer
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
||||
#### Inspection Interface
|
||||
|
||||
<video
|
||||
controls
|
||||
autoPlay
|
||||
playsInline
|
||||
muted
|
||||
loop
|
||||
style={{width: '100%'}}
|
||||
>
|
||||
<source src="https://github.com/user-attachments/assets/01d52b2d-92af-4e3a-b623-a9b8ba22ba99" type="video/mp4" />
|
||||
Your browser does not support the video tag.
|
||||
</video>
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="providers" label="API Providers">
|
||||
|
||||
**Provider Management**
|
||||
- Inspect available Llama Stack API providers
|
||||
- View provider configurations and capabilities
|
||||
- Uses the `/providers` API for real-time provider information
|
||||
- Essential for understanding your deployment's capabilities
|
||||
|
||||
</TabItem>
|
||||
<TabItem value="resources" label="API Resources">
|
||||
|
||||
**Resource Exploration**
|
||||
- Inspect Llama Stack API resources including:
|
||||
- **Models**: Available language models
|
||||
- **Datasets**: Registered evaluation datasets
|
||||
- **Memory Banks**: Vector databases and knowledge stores
|
||||
- **Benchmarks**: Evaluation tasks and scoring functions
|
||||
- **Shields**: Safety and content moderation tools
|
||||
- Uses `/<resources>/list` APIs for comprehensive resource visibility
|
||||
- For detailed information about resources, see [Core Concepts](/docs/concepts)
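The same list APIs are available from the Python client, which is handy when you want to script what the playground shows interactively. A minimal sketch, assuming a Llama Stack server running locally on the default port 8321:

```python
from llama_stack_client import LlamaStackClient

# Assumes a locally running Llama Stack server on the default port
client = LlamaStackClient(base_url="http://localhost:8321")

print(client.providers.list())   # available API providers
print(client.models.list())      # registered models
print(client.shields.list())     # registered safety shields
print(client.vector_dbs.list())  # registered vector databases
```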
|
||||
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
||||
## Getting Started
|
||||
|
||||
### Quick Start Guide
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="setup" label="Setup">
|
||||
|
||||
**1. Start the Llama Stack API Server**
|
||||
|
||||
```bash
|
||||
# Build and run a distribution (example: together)
|
||||
llama stack build --distro together --image-type venv
|
||||
llama stack run together
|
||||
```
|
||||
|
||||
**2. Start the Streamlit UI**
|
||||
|
||||
```bash
|
||||
# Launch the playground interface
|
||||
uv run --with ".[ui]" streamlit run llama_stack.core/ui/app.py
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
<TabItem value="usage" label="Usage Tips">
|
||||
|
||||
**Making the Most of the Playground:**
|
||||
|
||||
- **Start with Chat**: Test basic model interactions and prompt engineering
|
||||
- **Explore RAG**: Upload sample documents to see knowledge-enhanced responses
|
||||
- **Try Evaluations**: Use the scoring interface to understand evaluation metrics
|
||||
- **Inspect Resources**: Check what providers and resources are available
|
||||
- **Experiment with Settings**: Adjust parameters to see how they affect results
|
||||
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
||||
### Available Distributions
|
||||
|
||||
The playground works with any Llama Stack distribution. Popular options include:
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="together" label="Together AI">
|
||||
|
||||
```bash
|
||||
llama stack build --distro together --image-type venv
|
||||
llama stack run together
|
||||
```
|
||||
|
||||
**Features:**
|
||||
- Cloud-hosted models
|
||||
- Fast inference
|
||||
- Multiple model options
|
||||
|
||||
</TabItem>
|
||||
<TabItem value="ollama" label="Ollama (Local)">
|
||||
|
||||
```bash
|
||||
llama stack build --distro ollama --image-type venv
|
||||
llama stack run ollama
|
||||
```
|
||||
|
||||
**Features:**
|
||||
- Local model execution
|
||||
- Privacy-focused
|
||||
- No internet required
|
||||
|
||||
</TabItem>
|
||||
<TabItem value="meta-reference" label="Meta Reference">
|
||||
|
||||
```bash
|
||||
llama stack build --distro meta-reference --image-type venv
|
||||
llama stack run meta-reference
|
||||
```
|
||||
|
||||
**Features:**
|
||||
- Reference implementation
|
||||
- All API features available
|
||||
- Best for development
|
||||
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
||||
## Use Cases & Examples
|
||||
|
||||
### Educational Use Cases
|
||||
- **Learning Llama Stack**: Hands-on exploration of API capabilities
|
||||
- **Prompt Engineering**: Interactive testing of different prompting strategies
|
||||
- **RAG Experimentation**: Understanding how document retrieval affects responses
|
||||
- **Evaluation Understanding**: See how different metrics evaluate model performance
|
||||
|
||||
### Development Use Cases
|
||||
- **Prototype Testing**: Quick validation of application concepts
|
||||
- **API Exploration**: Understanding available endpoints and parameters
|
||||
- **Integration Planning**: Seeing how different components work together
|
||||
- **Demo Creation**: Showcasing Llama Stack capabilities to stakeholders
|
||||
|
||||
### Research Use Cases
|
||||
- **Model Comparison**: Side-by-side testing of different models
|
||||
- **Evaluation Design**: Understanding how scoring functions work
|
||||
- **Safety Testing**: Exploring shield effectiveness with different inputs
|
||||
- **Performance Analysis**: Measuring model behavior across different scenarios
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 🚀 **Getting Started**
|
||||
- Begin with simple chat interactions to understand basic functionality
|
||||
- Gradually explore more advanced features like RAG and evaluations
|
||||
- Use the inspection tools to understand your deployment's capabilities
|
||||
|
||||
### 🔧 **Development Workflow**
|
||||
- Use the playground to prototype before writing application code
|
||||
- Test different parameter settings interactively
|
||||
- Validate evaluation approaches before implementing them programmatically
|
||||
|
||||
### 📊 **Evaluation & Testing**
|
||||
- Start with simple scoring functions before trying complex evaluations
|
||||
- Use the playground to understand evaluation results before automation
|
||||
- Test safety features with various input types
|
||||
|
||||
### 🎯 **Production Preparation**
|
||||
- Use playground insights to inform your production API usage
|
||||
- Test edge cases and error conditions interactively
|
||||
- Validate resource configurations before deployment
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **[Getting Started Guide](/docs/getting-started)** - Complete setup and introduction
|
||||
- **[Core Concepts](/docs/concepts)** - Understanding Llama Stack fundamentals
|
||||
- **[Agents](./agent)** - Building intelligent agents
|
||||
- **[RAG (Retrieval Augmented Generation)](./rag)** - Knowledge-enhanced applications
|
||||
- **[Evaluations](./evals)** - Comprehensive evaluation framework
|
||||
- **[API Reference](/docs/api-reference)** - Complete API documentation
|
docs/docs/building_applications/rag.mdx (new file, 367 lines)
@@ -0,0 +1,367 @@
|
|||
---
|
||||
title: Retrieval Augmented Generation (RAG)
|
||||
description: Build knowledge-enhanced AI applications with external document retrieval
|
||||
sidebar_label: RAG (Retrieval Augmented Generation)
|
||||
sidebar_position: 2
|
||||
---
|
||||
|
||||
import Tabs from '@theme/Tabs';
|
||||
import TabItem from '@theme/TabItem';
|
||||
|
||||
# Retrieval Augmented Generation (RAG)
|
||||
|
||||
RAG enables your applications to reference and recall information from previous interactions or external documents.
|
||||
|
||||
## Architecture Overview
|
||||
|
||||
Llama Stack organizes the APIs that enable RAG into three layers:
|
||||
|
||||
1. **Lower-Level APIs**: Deal with raw storage and retrieval. These include Vector IO, KeyValue IO (coming soon), and Relational IO (also coming soon)
|
||||
2. **RAG Tool**: A first-class tool as part of the [Tools API](./tools) that allows you to ingest documents (from URLs, files, etc.) with various chunking strategies and query them smartly
|
||||
3. **Agents API**: The top-level [Agents API](./agent) that allows you to create agents that can use the tools to answer questions, perform tasks, and more
|
||||
|
||||

|
||||
|
||||
The RAG system uses lower-level storage for different types of data:
|
||||
- **Vector IO**: For semantic search and retrieval
|
||||
- **Key-Value and Relational IO**: For structured data storage
|
||||
|
||||
:::info[Future Storage Types]
|
||||
We may add more storage types like Graph IO in the future.
|
||||
:::
|
||||
|
||||
## Setting up Vector Databases
|
||||
|
||||
For this guide, we will use [Ollama](https://ollama.com/) as the inference provider. Ollama is an LLM runtime that allows you to run Llama models locally.
|
||||
|
||||
Here's how to set up a vector database for RAG:
|
||||
|
||||
```python
|
||||
# Create HTTP client
|
||||
import os
|
||||
from llama_stack_client import LlamaStackClient
|
||||
|
||||
client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")
|
||||
|
||||
# Register a vector database
|
||||
vector_db_id = "my_documents"
|
||||
response = client.vector_dbs.register(
|
||||
vector_db_id=vector_db_id,
|
||||
embedding_model="all-MiniLM-L6-v2",
|
||||
embedding_dimension=384,
|
||||
provider_id="faiss",
|
||||
)
|
||||
```
|
||||
|
||||
## Document Ingestion
|
||||
|
||||
You can ingest documents into the vector database using two methods: directly inserting pre-chunked documents or using the RAG Tool.
|
||||
|
||||
### Direct Document Insertion
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="basic" label="Basic Insertion">
|
||||
|
||||
```python
|
||||
# You can insert a pre-chunked document directly into the vector db
|
||||
chunks = [
|
||||
{
|
||||
"content": "Your document text here",
|
||||
"mime_type": "text/plain",
|
||||
"metadata": {
|
||||
"document_id": "doc1",
|
||||
"author": "Jane Doe",
|
||||
},
|
||||
},
|
||||
]
|
||||
client.vector_io.insert(vector_db_id=vector_db_id, chunks=chunks)
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
<TabItem value="embeddings" label="With Precomputed Embeddings">
|
||||
|
||||
If you decide to precompute embeddings for your documents, you can insert them directly into the vector database by including the embedding vectors in the chunk data. This is useful if you have a separate embedding service or if you want to customize the ingestion process.
|
||||
|
||||
```python
|
||||
chunks_with_embeddings = [
|
||||
{
|
||||
"content": "First chunk of text",
|
||||
"mime_type": "text/plain",
|
||||
"embedding": [0.1, 0.2, 0.3, ...], # Your precomputed embedding vector
|
||||
"metadata": {"document_id": "doc1", "section": "introduction"},
|
||||
},
|
||||
{
|
||||
"content": "Second chunk of text",
|
||||
"mime_type": "text/plain",
|
||||
"embedding": [0.2, 0.3, 0.4, ...], # Your precomputed embedding vector
|
||||
"metadata": {"document_id": "doc1", "section": "methodology"},
|
||||
},
|
||||
]
|
||||
client.vector_io.insert(vector_db_id=vector_db_id, chunks=chunks_with_embeddings)
|
||||
```
|
||||
|
||||
:::warning[Embedding Dimensions]
|
||||
When providing precomputed embeddings, ensure the embedding dimension matches the `embedding_dimension` specified when registering the vector database.
|
||||
:::
|
||||
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
||||
### Document Retrieval
|
||||
|
||||
You can query the vector database to retrieve documents based on their embeddings.
|
||||
|
||||
```python
|
||||
# You can then query for these chunks
|
||||
chunks_response = client.vector_io.query(
|
||||
vector_db_id=vector_db_id,
|
||||
query="What do you know about..."
|
||||
)
|
||||
```
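The query response contains the matching chunks together with their similarity scores. A minimal sketch for inspecting them (assuming the response exposes `chunks` and `scores` fields):

```python
# Sketch: inspect retrieved chunks and their similarity scores
# (assumes the query response exposes `chunks` and `scores`)
for chunk, score in zip(chunks_response.chunks, chunks_response.scores):
    print(f"score={score:.3f}  {str(chunk.content)[:100]}")
```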
|
||||
|
||||
## Using the RAG Tool
|
||||
|
||||
:::danger[Deprecation Notice]
|
||||
The RAG Tool is being deprecated in favor of directly using the OpenAI-compatible Search API. We recommend migrating to the OpenAI APIs for better compatibility and future support.
|
||||
:::
|
||||
|
||||
A better way to ingest documents is to use the RAG Tool. This tool ingests documents from URLs, files, and other sources, and automatically chunks them into smaller pieces. More examples of how to format a RAGDocument can be found in the [appendix](#more-ragdocument-examples).
|
||||
|
||||
### OpenAI API Integration & Migration
|
||||
|
||||
The RAG tool has been updated to use OpenAI-compatible APIs. This provides several benefits:
|
||||
|
||||
- **Files API Integration**: Documents are now uploaded using OpenAI's file upload endpoints
|
||||
- **Vector Stores API**: Vector storage operations use OpenAI's vector store format with configurable chunking strategies
|
||||
- **Error Resilience**: When processing multiple documents, individual failures are logged but don't crash the operation. Failed documents are skipped while successful ones continue processing.
|
||||
|
||||
### Migration Path
|
||||
|
||||
We recommend migrating to the OpenAI-compatible Search API for:
|
||||
|
||||
1. **Better OpenAI Ecosystem Integration**: Direct compatibility with OpenAI tools and workflows including the Responses API
|
||||
2. **Future-Proof**: Continued support and feature development
|
||||
3. **Full OpenAI Compatibility**: Vector Stores, Files, and Search APIs are fully compatible with OpenAI's Responses API
|
||||
|
||||
The OpenAI APIs are used under the hood, so you can continue to use your existing RAG Tool code with minimal changes. However, we recommend updating your code to use the new OpenAI-compatible APIs for better long-term support. If any documents fail to process, they will be logged in the response but will not cause the entire operation to fail.
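To make the migration concrete, here is a hypothetical sketch of the OpenAI-compatible flow. The method names (`client.files.create`, `client.vector_stores.create`, `client.vector_stores.files.create`, `client.vector_stores.search`) are assumptions based on the OpenAI API surface; check the current client reference for the exact names:

```python
# Hypothetical sketch of the OpenAI-compatible flow; method names are
# assumptions based on the OpenAI API surface, not a confirmed Llama Stack API.
vector_store = client.vector_stores.create(name="my_documents")

with open("memory_optimizations.rst", "rb") as f:
    uploaded = client.files.create(file=f, purpose="assistants")

# Attach the uploaded file to the vector store; chunking happens server-side
client.vector_stores.files.create(
    vector_store_id=vector_store.id,
    file_id=uploaded.id,
)

# Search the vector store
results = client.vector_stores.search(
    vector_store_id=vector_store.id,
    query="What do you know about PyTorch memory optimizations?",
)
```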
|
||||
|
||||
### RAG Tool Example
|
||||
|
||||
```python
|
||||
from llama_stack_client import RAGDocument
|
||||
|
||||
urls = ["memory_optimizations.rst", "chat.rst", "llama3.rst"]
|
||||
documents = [
|
||||
RAGDocument(
|
||||
document_id=f"num-{i}",
|
||||
content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
|
||||
mime_type="text/plain",
|
||||
metadata={},
|
||||
)
|
||||
for i, url in enumerate(urls)
|
||||
]
|
||||
|
||||
client.tool_runtime.rag_tool.insert(
|
||||
documents=documents,
|
||||
vector_db_id=vector_db_id,
|
||||
chunk_size_in_tokens=512,
|
||||
)
|
||||
|
||||
# Query documents
|
||||
results = client.tool_runtime.rag_tool.query(
|
||||
vector_db_ids=[vector_db_id],
|
||||
content="What do you know about...",
|
||||
)
|
||||
```
|
||||
|
||||
### Custom Context Configuration
|
||||
|
||||
You can configure how the RAG tool adds metadata to the context if you find it useful for your application:
|
||||
|
||||
```python
|
||||
# Query documents with custom template
|
||||
results = client.tool_runtime.rag_tool.query(
|
||||
vector_db_ids=[vector_db_id],
|
||||
content="What do you know about...",
|
||||
query_config={
|
||||
"chunk_template": "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n",
|
||||
},
|
||||
)
|
||||
```
|
||||
|
||||
## Building RAG-Enhanced Agents
|
||||
|
||||
One of the most powerful patterns is combining agents with RAG capabilities. Here's a complete example:
|
||||
|
||||
### Agent with Knowledge Search
|
||||
|
||||
```python
|
||||
from llama_stack_client import Agent
|
||||
|
||||
# Create agent with memory
|
||||
agent = Agent(
|
||||
client,
|
||||
model="meta-llama/Llama-3.3-70B-Instruct",
|
||||
instructions="You are a helpful assistant",
|
||||
tools=[
|
||||
{
|
||||
"name": "builtin::rag/knowledge_search",
|
||||
"args": {
|
||||
"vector_db_ids": [vector_db_id],
|
||||
# Defaults
|
||||
"query_config": {
|
||||
"chunk_size_in_tokens": 512,
|
||||
"chunk_overlap_in_tokens": 0,
|
||||
"chunk_template": "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n",
|
||||
},
|
||||
},
|
||||
}
|
||||
],
|
||||
)
|
||||
session_id = agent.create_session("rag_session")
|
||||
|
||||
# Ask questions about documents in the vector db, and the agent will query the db to answer the question.
|
||||
response = agent.create_turn(
|
||||
messages=[{"role": "user", "content": "How to optimize memory in PyTorch?"}],
|
||||
session_id=session_id,
|
||||
)
|
||||
```
|
||||
|
||||
:::tip[Agent Instructions]
|
||||
The `instructions` field in the `AgentConfig` can be used to guide the agent's behavior. It is important to experiment with different instructions to see what works best for your use case.
|
||||
:::
|
||||
|
||||
### Document-Aware Conversations
|
||||
|
||||
You can also pass documents along with the user's message and ask questions about them:
|
||||
|
||||
```python
|
||||
# Initial document ingestion
|
||||
response = agent.create_turn(
|
||||
messages=[
|
||||
{"role": "user", "content": "I am providing some documents for reference."}
|
||||
],
|
||||
documents=[
|
||||
{
|
||||
"content": "https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/memory_optimizations.rst",
|
||||
"mime_type": "text/plain",
|
||||
}
|
||||
],
|
||||
session_id=session_id,
|
||||
)
|
||||
|
||||
# Query with RAG
|
||||
response = agent.create_turn(
|
||||
messages=[{"role": "user", "content": "What are the key topics in the documents?"}],
|
||||
session_id=session_id,
|
||||
)
|
||||
```
|
||||
|
||||
### Viewing Agent Responses
|
||||
|
||||
You can print the response with the following:
|
||||
|
||||
```python
|
||||
from llama_stack_client import AgentEventLogger
|
||||
|
||||
for log in AgentEventLogger().log(response):
|
||||
log.print()
|
||||
```
|
||||
|
||||
## Vector Database Management
|
||||
|
||||
### Unregistering Vector DBs
|
||||
|
||||
If you need to clean up and unregister vector databases, you can do so as follows:
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="single" label="Single Database">
|
||||
|
||||
```python
|
||||
# Unregister a specified vector database
|
||||
vector_db_id = "my_vector_db_id"
|
||||
print(f"Unregistering vector database: {vector_db_id}")
|
||||
client.vector_dbs.unregister(vector_db_id=vector_db_id)
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
<TabItem value="all" label="All Databases">
|
||||
|
||||
```python
|
||||
# Unregister all vector databases
|
||||
for vector_db_id in client.vector_dbs.list():
|
||||
print(f"Unregistering vector database: {vector_db_id.identifier}")
|
||||
client.vector_dbs.unregister(vector_db_id=vector_db_id.identifier)
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 🎯 **Document Chunking**
|
||||
- Use appropriate chunk sizes (512 tokens is often a good starting point)
|
||||
- Consider overlap between chunks for better context preservation
|
||||
- Experiment with different chunking strategies for your content type
|
||||
|
||||
### 🔍 **Embedding Strategy**
|
||||
- Choose embedding models that match your domain
|
||||
- Consider the trade-off between embedding dimension and performance
|
||||
- Test different embedding models for your specific use case
|
||||
|
||||
### 📊 **Query Optimization**
|
||||
- Use specific, well-formed queries for better retrieval
|
||||
- Experiment with different search strategies
|
||||
- Consider hybrid approaches (keyword + semantic search)
|
||||
|
||||
### 🛡️ **Error Handling**
|
||||
- Implement proper error handling for failed document processing
|
||||
- Monitor ingestion success rates
|
||||
- Have fallback strategies for retrieval failures
|
||||
|
||||
## Appendix
|
||||
|
||||
### More RAGDocument Examples
|
||||
|
||||
Here are various ways to create RAGDocument objects for different content types:
|
||||
|
||||
```python
|
||||
from llama_stack_client import RAGDocument
|
||||
import base64
import requests  # used below to fetch the image that gets base64-encoded
|
||||
|
||||
# File URI
|
||||
RAGDocument(document_id="num-0", content={"uri": "file://path/to/file"})
|
||||
|
||||
# Plain text
|
||||
RAGDocument(document_id="num-1", content="plain text")
|
||||
|
||||
# Explicit text input
|
||||
RAGDocument(
|
||||
document_id="num-2",
|
||||
content={
|
||||
"type": "text",
|
||||
"text": "plain text input",
|
||||
}, # for inputs that should be treated as text explicitly
|
||||
)
|
||||
|
||||
# Image from URL
|
||||
RAGDocument(
|
||||
document_id="num-3",
|
||||
content={
|
||||
"type": "image",
|
||||
"image": {"url": {"uri": "https://mywebsite.com/image.jpg"}},
|
||||
},
|
||||
)
|
||||
|
||||
# Base64 encoded image
|
||||
B64_ENCODED_IMAGE = base64.b64encode(
|
||||
requests.get(
|
||||
"https://raw.githubusercontent.com/meta-llama/llama-stack/refs/heads/main/docs/_static/llama-stack.png"
|
||||
).content
|
||||
)
|
||||
RAGDocument(
|
||||
document_id="num-4",
|
||||
content={"type": "image", "image": {"data": B64_ENCODED_IMAGE}},
|
||||
)
|
||||
```
|
||||
For more strongly typed interactions, use the typed dicts found [here](https://github.com/meta-llama/llama-stack-client-python/blob/38cd91c9e396f2be0bec1ee96a19771582ba6f17/src/llama_stack_client/types/shared_params/document.py).
|
docs/docs/building_applications/rag.png (new binary file, 145 KiB — not shown)
docs/docs/building_applications/responses_vs_agents.mdx (new file, 221 lines)
@@ -0,0 +1,221 @@
|
|||
---
|
||||
title: Agents vs OpenAI Responses API
|
||||
description: Compare the Agents API and OpenAI Responses API for building AI applications with tool calling capabilities
|
||||
sidebar_label: Agents vs Responses API
|
||||
sidebar_position: 5
|
||||
---
|
||||
|
||||
import Tabs from '@theme/Tabs';
|
||||
import TabItem from '@theme/TabItem';
|
||||
|
||||
# Agents vs OpenAI Responses API
|
||||
|
||||
Llama Stack (LLS) provides two different APIs for building AI applications with tool calling capabilities: the **Agents API** and the **OpenAI Responses API**. While both enable AI systems to use tools and maintain full conversation history, they serve different use cases and have distinct characteristics.
|
||||
|
||||
:::note
|
||||
For simple and basic inference, you may want to use the [Chat Completions API](/docs/providers/openai-compatibility#chat-completions) directly before progressing to the Agents or Responses APIs.
|
||||
:::
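For reference, a minimal chat completion call looks like this (a sketch, assuming the OpenAI-compatible surface is exposed as `client.chat.completions`):

```python
# Sketch: plain chat completion, before reaching for Agents or Responses
# (assumes the OpenAI-compatible `client.chat.completions` surface)
response = client.chat.completions.create(
    model="Llama3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Explain quicksort in two sentences."}],
)
print(response.choices[0].message.content)
```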
|
||||
|
||||
## Overview
|
||||
|
||||
### LLS Agents API
|
||||
The Agents API is a full-featured, stateful system designed for complex, multi-turn conversations. It maintains conversation state through persistent sessions identified by a unique session ID. The API supports comprehensive agent lifecycle management, detailed execution tracking, and rich metadata about each interaction through a structured session/turn/step hierarchy. The API can orchestrate multiple tool calls within a single turn.
|
||||
|
||||
### OpenAI Responses API
|
||||
The OpenAI Responses API is a full-featured, stateful system designed for complex, multi-turn conversations, with direct compatibility with OpenAI's conversational patterns enhanced by Llama Stack's tool calling capabilities. It maintains conversation state by chaining responses through a `previous_response_id`, allowing interactions to branch or continue from any prior point. Each response can perform multiple tool calls within a single turn.
|
||||
|
||||
### Key Differences
|
||||
The LLS Agents API uses the Chat Completions API on the backend for inference as it's the industry standard for building AI applications and most LLM providers are compatible with this API. For a detailed comparison between Responses and Chat Completions, see [OpenAI's documentation](https://platform.openai.com/docs/guides/responses-vs-chat-completions).
|
||||
|
||||
Additionally, Agents let you specify input/output shields, whereas Responses do not (though support is planned). Agents use a linear conversation model referenced by a single session ID. Responses, on the other hand, support branching, where each response can serve as a fork point, and conversations are tracked by the latest response ID. Responses also let you dynamically choose the model, vector store, files, MCP servers, and more on each inference call, enabling more complex workflows. Agents require a static configuration for these components at the start of the session.
|
||||
|
||||
Today the Agents and Responses APIs can be used independently, depending on the use case. But it is also productive to treat the APIs as complementary. It is not currently supported, but it is planned for the LLS Agents API to optionally use the Responses API as its backend instead of the default Chat Completions API, enabling a combination of the safety features of Agents with the dynamic configuration and branching capabilities of Responses.
|
||||
|
||||
## Feature Comparison
|
||||
|
||||
| Feature | LLS Agents API | OpenAI Responses API |
|
||||
|---------|------------|---------------------|
|
||||
| **Conversation Management** | Linear persistent sessions | Can branch from any previous response ID |
|
||||
| **Input/Output Safety Shields** | Supported | Not yet supported |
|
||||
| **Per-call Flexibility** | Static per-session configuration | Dynamic per-call configuration |
|
||||
|
||||
## Use Case Example: Research with Multiple Search Methods
|
||||
|
||||
Let's compare how both APIs handle a research task where we need to:
|
||||
1. Search for current information and examples
|
||||
2. Access different information sources dynamically
|
||||
3. Continue the conversation based on search results
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="agents" label="Agents API">
|
||||
|
||||
### Session-based Configuration with Safety Shields
|
||||
|
||||
```python
|
||||
# Create agent with static session configuration
|
||||
agent = Agent(
|
||||
client,
|
||||
model="Llama3.2-3B-Instruct",
|
||||
instructions="You are a helpful coding assistant",
|
||||
tools=[
|
||||
{
|
||||
"name": "builtin::rag/knowledge_search",
|
||||
"args": {"vector_db_ids": ["code_docs"]},
|
||||
},
|
||||
"builtin::code_interpreter",
|
||||
],
|
||||
input_shields=["llama_guard"],
|
||||
output_shields=["llama_guard"],
|
||||
)
|
||||
|
||||
session_id = agent.create_session("code_session")
|
||||
|
||||
# First turn: Search and execute
|
||||
response1 = agent.create_turn(
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Find examples of sorting algorithms and run a bubble sort on [3,1,4,1,5]",
|
||||
},
|
||||
],
|
||||
session_id=session_id,
|
||||
)
|
||||
|
||||
# Continue conversation in same session
|
||||
response2 = agent.create_turn(
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Now optimize that code and test it with a larger dataset",
|
||||
},
|
||||
],
|
||||
session_id=session_id, # Same session, maintains full context
|
||||
)
|
||||
|
||||
# Agents API benefits:
|
||||
# ✅ Safety shields protect against malicious code execution
|
||||
# ✅ Session maintains context between code executions
|
||||
# ✅ Consistent tool configuration throughout conversation
|
||||
print(f"First result: {response1.output_message.content}")
|
||||
print(f"Optimization: {response2.output_message.content}")
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
<TabItem value="responses" label="Responses API">
|
||||
|
||||
### Dynamic Per-call Configuration with Branching
|
||||
|
||||
```python
|
||||
# First response: Use web search for latest algorithms
|
||||
response1 = client.responses.create(
|
||||
model="Llama3.2-3B-Instruct",
|
||||
input="Search for the latest efficient sorting algorithms and their performance comparisons",
|
||||
tools=[
|
||||
{
|
||||
"type": "web_search",
|
||||
},
|
||||
], # Web search for current information
|
||||
)
|
||||
|
||||
# Continue conversation: Switch to file search for local docs
|
||||
response2 = client.responses.create(
|
||||
model="Llama3.2-1B-Instruct", # Switch to faster model
|
||||
input="Now search my uploaded files for existing sorting implementations",
|
||||
tools=[
|
||||
{ # Using Responses API built-in tools
|
||||
"type": "file_search",
|
||||
"vector_store_ids": ["vs_abc123"], # Vector store containing uploaded files
|
||||
},
|
||||
],
|
||||
previous_response_id=response1.id,
|
||||
)
|
||||
|
||||
# Branch from first response: Try different search approach
|
||||
response3 = client.responses.create(
|
||||
model="Llama3.2-3B-Instruct",
|
||||
input="Instead, search the web for Python-specific sorting best practices",
|
||||
tools=[{"type": "web_search"}], # Different web search query
|
||||
previous_response_id=response1.id, # Branch from response1
|
||||
)
|
||||
|
||||
# Responses API benefits:
|
||||
# ✅ Dynamic tool switching (web search ↔ file search per call)
|
||||
# ✅ OpenAI-compatible tool patterns (web_search, file_search)
|
||||
# ✅ Branch conversations to explore different information sources
|
||||
# ✅ Model flexibility per search type
|
||||
print(f"Web search results: {response1.output_message.content}")
|
||||
print(f"File search results: {response2.output_message.content}")
|
||||
print(f"Alternative web search: {response3.output_message.content}")
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
||||
Both APIs have distinct strengths that make them valuable for different scenarios. The Agents API excels at structured, safety-conscious workflows with persistent session management, while the Responses API offers flexibility through dynamic configuration and OpenAI-compatible tool patterns.

## Use Case Examples

### 1. Research and Analysis with Safety Controls
**Best Choice: Agents API**

**Scenario:** You're building a research assistant for a financial institution that needs to analyze market data, execute code to process financial models, and search through internal compliance documents. The system must ensure all interactions are logged for regulatory compliance and protected by safety shields to prevent malicious code execution or data leaks.

**Why Agents API?** The Agents API provides persistent session management for iterative research workflows, built-in safety shields to protect against malicious code in financial models, and structured execution logs (session/turn/step) required for regulatory compliance. The static tool configuration ensures consistent access to your knowledge base and code interpreter throughout the entire research session.

### 2. Dynamic Information Gathering with Branching Exploration
**Best Choice: Responses API**

**Scenario:** You're building a competitive intelligence tool that helps businesses research market trends. Users need to dynamically switch between web search for current market data and file search through uploaded industry reports. They also want to branch conversations to explore different market segments simultaneously and experiment with different models for various analysis types.

**Why Responses API?** The Responses API's branching capability lets users explore multiple market segments from any research point. Dynamic per-call configuration allows switching between web search and file search as needed, while experimenting with different models (faster models for quick searches, more powerful models for deep analysis). The OpenAI-compatible tool patterns make integration straightforward.

### 3. OpenAI Migration with Advanced Tool Capabilities
**Best Choice: Responses API**

**Scenario:** You have an existing application built with OpenAI's Assistants API that uses file search and web search capabilities. You want to migrate to Llama Stack for better performance and cost control while maintaining the same tool calling patterns and adding new capabilities like dynamic vector store selection.

**Why Responses API?** The Responses API provides full OpenAI tool compatibility (`web_search`, `file_search`) with identical syntax, making migration seamless. The dynamic per-call configuration enables advanced features like switching vector stores per query or changing models based on query complexity - capabilities that extend beyond basic OpenAI functionality while maintaining compatibility.

### 4. Educational Programming Tutor
**Best Choice: Agents API**

**Scenario:** You're building a programming tutor that maintains student context across multiple sessions, safely executes code exercises, and tracks learning progress with audit trails for educators.

**Why Agents API?** Persistent sessions remember student progress across multiple interactions, safety shields prevent malicious code execution while allowing legitimate programming exercises, and structured execution logs help educators track learning patterns.

### 5. Advanced Software Debugging Assistant
**Best Choice: Agents API with Responses Backend**

**Scenario:** You're building a debugging assistant that helps developers troubleshoot complex issues. It needs to maintain context throughout a debugging session, safely execute diagnostic code, switch between different analysis tools dynamically, and branch conversations to explore multiple potential causes simultaneously.

**Why Agents + Responses?** The Agent provides safety shields for code execution and session management for the overall debugging workflow. The underlying Responses API enables dynamic model selection and flexible tool configuration per query, while branching lets you explore different theories (memory leak vs. concurrency issue) from the same debugging point and compare results.

:::info[Future Enhancement]
The ability to use the Responses API as the backend for Agents is not yet implemented, but is planned for a future release. Currently, Agents use the Chat Completions API as their backend by default.
:::

## Decision Framework

Use this framework to choose the right API for your use case:

### Choose Agents API when:
- ✅ You need **safety shields** for input/output validation
- ✅ Your application requires **linear conversation flow** with persistent context
- ✅ You need **audit trails** and structured execution logs
- ✅ Your tool configuration is **static** throughout the session
- ✅ You're building **educational, financial, or enterprise** applications with compliance requirements

### Choose Responses API when:
- ✅ You need **conversation branching** to explore multiple paths
- ✅ You want **dynamic per-call configuration** (models, tools, vector stores)
- ✅ You're **migrating from OpenAI** and want familiar tool patterns
- ✅ You need **OpenAI compatibility** for existing workflows
- ✅ Your application benefits from **flexible, experimental** interactions

## Related Resources

- **[Agents](./agent)** - Understanding the Agents API fundamentals
- **[Agent Execution Loop](./agent_execution_loop)** - How agents process turns and steps
- **[Tools Integration](./tools)** - Adding capabilities to both APIs
- **[OpenAI Compatibility](/docs/providers/openai-compatibility)** - Using OpenAI-compatible endpoints
- **[Safety Guardrails](./safety)** - Implementing safety measures in agents

395
docs/docs/building_applications/safety.mdx
Normal file

@@ -0,0 +1,395 @@

---
title: Safety Guardrails
description: Implement safety measures and content moderation in Llama Stack applications
sidebar_label: Safety
sidebar_position: 9
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Safety Guardrails

Safety is a critical component of any AI application. Llama Stack provides a comprehensive Shield system that can be applied at multiple touchpoints to ensure responsible AI behavior and content moderation.

## Shield System Overview

The Shield system in Llama Stack provides:
- **Content filtering** for both input and output messages
- **Multi-touchpoint protection** across your application flow
- **Configurable safety policies** tailored to your use case
- **Integration with agents** for automated safety enforcement

## Basic Shield Usage

### Registering a Safety Shield

<Tabs>
<TabItem value="registration" label="Shield Registration">

```python
# Register a safety shield
shield_id = "content_safety"
client.shields.register(
    shield_id=shield_id,
    provider_shield_id="llama-guard-basic"
)
```

</TabItem>
<TabItem value="manual-check" label="Manual Safety Check">

```python
# Run content through shield manually
response = client.safety.run_shield(
    shield_id=shield_id,
    messages=[{"role": "user", "content": "User message here"}]
)

if response.violation:
    print(f"Safety violation detected: {response.violation.user_message}")
    # Handle violation appropriately
else:
    print("Content passed safety checks")
```

</TabItem>
</Tabs>

## Agent Integration

Shields can be automatically applied to agent interactions for seamless safety enforcement:

<Tabs>
<TabItem value="input-shields" label="Input Shields">

```python
from llama_stack_client import Agent

# Create agent with input safety shields
agent = Agent(
    client,
    model="meta-llama/Llama-3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    input_shields=["content_safety"],  # Shield user inputs
    tools=["builtin::websearch"],
)

session_id = agent.create_session("safe_session")

# All user inputs will be automatically screened
response = agent.create_turn(
    messages=[{"role": "user", "content": "Tell me about AI safety"}],
    session_id=session_id,
)
```

</TabItem>
<TabItem value="output-shields" label="Output Shields">

```python
# Create agent with output safety shields
agent = Agent(
    client,
    model="meta-llama/Llama-3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    output_shields=["content_safety"],  # Shield agent outputs
    tools=["builtin::websearch"],
)

session_id = agent.create_session("safe_session")

# All agent responses will be automatically screened
response = agent.create_turn(
    messages=[{"role": "user", "content": "Help me with my research"}],
    session_id=session_id,
)
```

</TabItem>
<TabItem value="both-shields" label="Input & Output Shields">

```python
# Create agent with comprehensive safety coverage
agent = Agent(
    client,
    model="meta-llama/Llama-3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    input_shields=["content_safety"],  # Screen user inputs
    output_shields=["content_safety"],  # Screen agent outputs
    tools=["builtin::websearch"],
)

session_id = agent.create_session("fully_protected_session")

# Both input and output are automatically protected
response = agent.create_turn(
    messages=[{"role": "user", "content": "Research question here"}],
    session_id=session_id,
)
```

</TabItem>
</Tabs>

## Available Shield Types

### Llama Guard Shields

Llama Guard provides state-of-the-art content safety classification:

<Tabs>
<TabItem value="basic" label="Basic Llama Guard">

```python
# Basic Llama Guard for general content safety
client.shields.register(
    shield_id="llama_guard_basic",
    provider_shield_id="llama-guard-basic"
)
```

**Use Cases:**
- General content moderation
- Harmful content detection
- Basic safety compliance

</TabItem>
<TabItem value="advanced" label="Advanced Llama Guard">

```python
# Advanced Llama Guard with custom categories
client.shields.register(
    shield_id="llama_guard_advanced",
    provider_shield_id="llama-guard-advanced",
    config={
        "categories": [
            "violence", "hate_speech", "sexual_content",
            "self_harm", "illegal_activity"
        ],
        "threshold": 0.8
    }
)
```

**Use Cases:**
- Fine-tuned safety policies
- Domain-specific content filtering
- Enterprise compliance requirements

</TabItem>
</Tabs>

### Custom Safety Shields

Create domain-specific safety shields for specialized use cases:

```python
# Register custom safety shield
client.shields.register(
    shield_id="financial_compliance",
    provider_shield_id="custom-financial-shield",
    config={
        "detect_pii": True,
        "financial_advice_warning": True,
        "regulatory_compliance": "FINRA"
    }
)
```

## Safety Response Handling

When safety violations are detected, handle them appropriately:

<Tabs>
<TabItem value="basic-handling" label="Basic Handling">

```python
import logging

logger = logging.getLogger(__name__)

response = client.safety.run_shield(
    shield_id="content_safety",
    messages=[{"role": "user", "content": "Potentially harmful content"}]
)

if response.violation:
    violation = response.violation
    print(f"Violation Type: {violation.violation_type}")
    print(f"User Message: {violation.user_message}")
    print(f"Metadata: {violation.metadata}")

    # Log the violation for audit purposes
    logger.warning(f"Safety violation detected: {violation.violation_type}")

    # Provide appropriate user feedback (return this from your request handler)
    user_feedback = "I can't help with that request. Please try asking something else."
```

</TabItem>
<TabItem value="advanced-handling" label="Advanced Handling">

```python
import logging
from datetime import datetime

logger = logging.getLogger(__name__)


def handle_safety_response(safety_response, user_message):
    """Advanced safety response handling with logging and user feedback"""

    if not safety_response.violation:
        return {"safe": True, "message": "Content passed safety checks"}

    violation = safety_response.violation

    # Log violation details
    audit_log = {
        "timestamp": datetime.now().isoformat(),
        "violation_type": violation.violation_type,
        "original_message": user_message,
        "shield_response": violation.user_message,
        "metadata": violation.metadata
    }
    logger.warning(f"Safety violation: {audit_log}")

    # Determine appropriate response based on violation type
    if violation.violation_type == "hate_speech":
        user_feedback = "I can't engage with content that contains hate speech. Let's keep our conversation respectful."
    elif violation.violation_type == "violence":
        user_feedback = "I can't provide information that could promote violence. How else can I help you today?"
    else:
        user_feedback = "I can't help with that request. Please try asking something else."

    return {
        "safe": False,
        "user_feedback": user_feedback,
        "violation_details": audit_log
    }


# Usage (inside your request handler), with `response` from run_shield and the original user input
safety_result = handle_safety_response(response, user_input)
if not safety_result["safe"]:
    print(safety_result["user_feedback"])
```

</TabItem>
</Tabs>

## Safety Configuration Best Practices

### 🛡️ **Multi-Layer Protection**
- Use both input and output shields for comprehensive coverage
- Combine multiple shield types for different threat categories
- Implement fallback mechanisms for when shields fail (see the sketch below)

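One illustration of such a fallback is to treat an error from the safety check itself as a block (fail closed). This is only a sketch built around the `run_shield` call shown earlier; the wrapper function is illustrative and not part of the Llama Stack API:

```python
def content_is_safe(client, shield_id, messages):
    """Illustrative fallback wrapper: if the shield check errors out, block the content."""
    try:
        result = client.safety.run_shield(shield_id=shield_id, messages=messages)
        return not result.violation
    except Exception as exc:
        # Fail closed: if the shield is unavailable, err on the side of caution
        print(f"Shield check failed ({exc}); blocking content")
        return False
```
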
### 📊 **Monitoring & Auditing**
- Log all safety violations for compliance and analysis
- Monitor false positive rates to tune shield sensitivity
- Track safety metrics across different use cases

### ⚙️ **Configuration Management**
- Use environment-specific safety configurations
- Implement A/B testing for shield effectiveness
- Regularly update shield models and policies

### 🔧 **Integration Patterns**
- Integrate shields early in the development process
- Test safety measures with adversarial inputs
- Provide clear user feedback for violations

## Advanced Safety Scenarios

### Context-Aware Safety

```python
# Safety shields that consider conversation context
agent = Agent(
    client,
    model="meta-llama/Llama-3.2-3B-Instruct",
    instructions="You are a healthcare assistant",
    input_shields=["medical_safety"],
    output_shields=["medical_safety"],
    # Context helps shields make better decisions
    safety_context={
        "domain": "healthcare",
        "user_type": "patient",
        "compliance_level": "HIPAA"
    }
)
```

### Dynamic Shield Selection

```python
def select_shield_for_user(user_profile):
    """Select appropriate safety shield based on user context"""
    if user_profile.age < 18:
        return "child_safety_shield"
    elif user_profile.context == "enterprise":
        return "enterprise_compliance_shield"
    else:
        return "general_safety_shield"

# Use dynamic shield selection
shield_id = select_shield_for_user(current_user)
response = client.safety.run_shield(
    shield_id=shield_id,
    messages=messages
)
```

## Compliance and Regulations

### Industry-Specific Safety

<Tabs>
<TabItem value="healthcare" label="Healthcare (HIPAA)">

```python
# Healthcare-specific safety configuration
client.shields.register(
    shield_id="hipaa_compliance",
    provider_shield_id="healthcare-safety-shield",
    config={
        "detect_phi": True,  # Protected Health Information
        "medical_advice_warning": True,
        "regulatory_framework": "HIPAA"
    }
)
```

</TabItem>
<TabItem value="financial" label="Financial (FINRA)">

```python
# Financial services safety configuration
client.shields.register(
    shield_id="finra_compliance",
    provider_shield_id="financial-safety-shield",
    config={
        "detect_financial_advice": True,
        "investment_disclaimers": True,
        "regulatory_framework": "FINRA"
    }
)
```

</TabItem>
<TabItem value="education" label="Education (COPPA)">

```python
# Educational platform safety for minors
client.shields.register(
    shield_id="coppa_compliance",
    provider_shield_id="educational-safety-shield",
    config={
        "child_protection": True,
        "educational_content_only": True,
        "regulatory_framework": "COPPA"
    }
)
```

</TabItem>
</Tabs>

## Related Resources

- **[Agents](./agent)** - Integrating safety shields with intelligent agents
- **[Agent Execution Loop](./agent_execution_loop)** - Understanding safety in the execution flow
- **[Evaluations](./evals)** - Evaluating safety shield effectiveness
- **[Telemetry](./telemetry)** - Monitoring safety violations and metrics
- **[Llama Guard Documentation](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard3)** - Advanced safety model details

342
docs/docs/building_applications/telemetry.mdx
Normal file

@@ -0,0 +1,342 @@

---
title: Telemetry
description: Monitor and observe Llama Stack applications with comprehensive telemetry capabilities
sidebar_label: Telemetry
sidebar_position: 8
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Telemetry

The Llama Stack telemetry system provides comprehensive tracing, metrics, and logging capabilities. It supports multiple sink types including OpenTelemetry, SQLite, and Console output for complete observability of your AI applications.

## Event Types

The telemetry system supports three main types of events:

<Tabs>
<TabItem value="unstructured" label="Unstructured Logs">

Free-form log messages with severity levels for general application logging:

```python
unstructured_log_event = UnstructuredLogEvent(
    message="This is a log message",
    severity=LogSeverity.INFO
)
```

</TabItem>
<TabItem value="metrics" label="Metric Events">

Numerical measurements with units for tracking performance and usage:

```python
metric_event = MetricEvent(
    metric="my_metric",
    value=10,
    unit="count"
)
```

</TabItem>
<TabItem value="structured" label="Structured Logs">

System events like span start/end that provide structured operation tracking:

```python
structured_log_event = SpanStartPayload(
    name="my_span",
    parent_span_id="parent_span_id"
)
```

</TabItem>
</Tabs>

## Spans and Traces

- **Spans**: Represent individual operations with timing information and hierarchical relationships
- **Traces**: Collections of related spans that form a complete request flow across your application

This hierarchical structure allows you to understand the complete execution path of requests through your Llama Stack application.

## Automatic Metrics Generation

Llama Stack automatically generates metrics during inference operations. These metrics are aggregated at the **inference request level** and provide insights into token usage and model performance.

### Available Metrics

The following metrics are automatically generated for each inference request:

| Metric Name | Type | Unit | Description | Labels |
|-------------|------|------|-------------|--------|
| `llama_stack_prompt_tokens_total` | Counter | `tokens` | Number of tokens in the input prompt | `model_id`, `provider_id` |
| `llama_stack_completion_tokens_total` | Counter | `tokens` | Number of tokens in the generated response | `model_id`, `provider_id` |
| `llama_stack_tokens_total` | Counter | `tokens` | Total tokens used (prompt + completion) | `model_id`, `provider_id` |

### Metric Generation Flow

1. **Token Counting**: During inference operations (chat completion, completion, etc.), the system counts tokens in both input prompts and generated responses
2. **Metric Construction**: For each request, `MetricEvent` objects are created with the token counts
3. **Telemetry Logging**: Metrics are sent to the configured telemetry sinks
4. **OpenTelemetry Export**: When OpenTelemetry is enabled, metrics are exposed as standard OpenTelemetry counters

### Metric Aggregation Level

All metrics are generated and aggregated at the **inference request level**. This means:

- Each individual inference request generates its own set of metrics
- Metrics are not pre-aggregated across multiple requests
- Aggregation (sums, averages, etc.) can be performed by your observability tools (Prometheus, Grafana, etc.)
- Each metric includes labels for `model_id` and `provider_id` to enable filtering and grouping

### Example Metric Event

```python
MetricEvent(
    trace_id="1234567890abcdef",
    span_id="abcdef1234567890",
    metric="total_tokens",
    value=150,
    timestamp=1703123456.789,
    unit="tokens",
    attributes={
        "model_id": "meta-llama/Llama-3.2-3B-Instruct",
        "provider_id": "tgi"
    },
)
```

## Telemetry Sinks

Choose from multiple sink types based on your observability needs:

<Tabs>
<TabItem value="opentelemetry" label="OpenTelemetry">

Send events to an OpenTelemetry Collector for integration with observability platforms:

**Use Cases:**
- Visualizing traces in tools like Jaeger
- Collecting metrics for Prometheus
- Integration with enterprise observability stacks

**Features:**
- Standard OpenTelemetry format
- Compatible with all OpenTelemetry collectors
- Supports both traces and metrics

</TabItem>
<TabItem value="sqlite" label="SQLite">

Store events in a local SQLite database for direct querying:

**Use Cases:**
- Local development and debugging
- Custom analytics and reporting
- Offline analysis of application behavior

**Features:**
- Direct SQL querying capabilities
- Persistent local storage
- No external dependencies

</TabItem>
<TabItem value="console" label="Console">

Print events to the console for immediate debugging:

**Use Cases:**
- Development and testing
- Quick debugging sessions
- Simple logging without external tools

**Features:**
- Immediate output visibility
- No setup required
- Human-readable format

</TabItem>
</Tabs>

## Configuration

### Meta-Reference Provider

Currently, only the meta-reference provider is implemented. It can be configured to send events to multiple sink types:

```yaml
telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      service_name: "llama-stack-service"
      sinks: ['console', 'sqlite', 'otel_trace', 'otel_metric']
      otel_exporter_otlp_endpoint: "http://localhost:4318"
      sqlite_db_path: "/path/to/telemetry.db"
```

### Environment Variables

Configure telemetry behavior using the following environment variables (an example export is shown below):

- **`OTEL_EXPORTER_OTLP_ENDPOINT`**: OpenTelemetry Collector endpoint (default: `http://localhost:4318`)
- **`OTEL_SERVICE_NAME`**: Service name for telemetry (default: empty string)
- **`TELEMETRY_SINKS`**: Comma-separated list of sinks (default: `console,sqlite`)

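For example, a local development setup might export these variables before starting the Llama Stack server; the values here are illustrative and should match your collector and sink configuration:

```bash
# Illustrative values; adjust the endpoint and sinks for your deployment
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_SERVICE_NAME="llama-stack-service"
export TELEMETRY_SINKS="console,sqlite,otel_trace,otel_metric"
```
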
## Visualization with Jaeger

The `otel_trace` sink works with any service compatible with the OpenTelemetry collector. Traces and metrics use separate endpoints but can share the same collector.

### Starting Jaeger

Start a Jaeger instance with the OTLP HTTP endpoint at 4318 and the Jaeger UI at 16686:

```bash
docker run --pull always --rm --name jaeger \
  -p 16686:16686 -p 4318:4318 \
  jaegertracing/jaeger:2.1.0
```

Once running, you can visualize traces by navigating to [http://localhost:16686/](http://localhost:16686/).

## Querying Metrics

When using the OpenTelemetry sink, metrics are exposed in standard format and can be queried through various tools:

<Tabs>
<TabItem value="prometheus" label="Prometheus Queries">

Example Prometheus queries for analyzing token usage:

```promql
# Total tokens used across all models
sum(llama_stack_tokens_total)

# Tokens per model
sum by (model_id) (llama_stack_tokens_total)

# Per-second token consumption rate over the last 5 minutes
rate(llama_stack_tokens_total[5m])

# Token usage by provider
sum by (provider_id) (llama_stack_tokens_total)
```

</TabItem>
<TabItem value="grafana" label="Grafana Dashboards">

Create dashboards using Prometheus as a data source:

- **Token Usage Over Time**: Line charts showing token consumption trends
- **Model Performance**: Comparison of different models by token efficiency
- **Provider Analysis**: Breakdown of usage across different providers
- **Request Patterns**: Understanding peak usage times and patterns

</TabItem>
<TabItem value="otlp" label="OpenTelemetry Collector">

Forward metrics to other observability systems:

- Export to multiple backends simultaneously
- Apply transformations and filtering
- Integrate with existing monitoring infrastructure

</TabItem>
</Tabs>

## SQLite Querying

The `sqlite` sink allows you to query traces without an external system. This is particularly useful for development and custom analytics.

### Example Queries

```sql
-- Query recent traces
SELECT * FROM traces WHERE timestamp > datetime('now', '-1 hour');

-- Analyze span durations
SELECT name, AVG(duration_ms) as avg_duration
FROM spans
GROUP BY name
ORDER BY avg_duration DESC;

-- Find slow operations
SELECT * FROM spans
WHERE duration_ms > 1000
ORDER BY duration_ms DESC;
```

:::tip[Advanced Analytics]
Refer to the [Getting Started notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) for more examples on querying traces and spans programmatically.
:::

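You can also read the same database directly with Python's standard library. This is a minimal sketch that assumes the `sqlite_db_path` from the configuration above and the `spans` table used in the SQL examples:

```python
import sqlite3

# Path from the telemetry provider config (sqlite_db_path)
conn = sqlite3.connect("/path/to/telemetry.db")
conn.row_factory = sqlite3.Row

# Mirror the "find slow operations" query above
rows = conn.execute(
    "SELECT name, duration_ms FROM spans ORDER BY duration_ms DESC LIMIT 10"
).fetchall()
for row in rows:
    print(row["name"], row["duration_ms"])

conn.close()
```
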
## Best Practices

### 🔍 **Monitoring Strategy**
- Use OpenTelemetry for production environments
- Combine multiple sinks for development (console + SQLite)
- Set up alerts on key metrics like token usage and error rates

### 📊 **Metrics Analysis**
- Track token usage trends to optimize costs
- Monitor response times across different models
- Analyze usage patterns to improve resource allocation

### 🚨 **Alerting & Debugging**
- Set up alerts for unusual token consumption spikes
- Use trace data to debug performance issues
- Monitor error rates and failure patterns

### 🔧 **Configuration Management**
- Use environment variables for flexible deployment
- Configure appropriate retention policies for SQLite
- Ensure proper network access to OpenTelemetry collectors

## Integration Examples

### Basic Telemetry Setup

```python
from llama_stack_client import LlamaStackClient

# Client with telemetry headers
client = LlamaStackClient(
    base_url="http://localhost:8000",
    extra_headers={
        "X-Telemetry-Service": "my-ai-app",
        "X-Telemetry-Version": "1.0.0"
    }
)

# All API calls will be automatically traced
response = client.inference.chat_completion(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

### Custom Telemetry Context

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Add custom span attributes for better tracking
with tracer.start_as_current_span("custom_operation") as span:
    span.set_attribute("user_id", "user123")
    span.set_attribute("operation_type", "chat_completion")

    response = client.inference.chat_completion(
        model="meta-llama/Llama-3.2-3B-Instruct",
        messages=[{"role": "user", "content": "Hello!"}]
    )
```

## Related Resources

- **[Agents](./agent)** - Monitoring agent execution with telemetry
- **[Evaluations](./evals)** - Using telemetry data for performance evaluation
- **[Getting Started Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)** - Telemetry examples and queries
- **[OpenTelemetry Documentation](https://opentelemetry.io/)** - Comprehensive observability framework
- **[Jaeger Documentation](https://www.jaegertracing.io/)** - Distributed tracing visualization

340
docs/docs/building_applications/tools.mdx
Normal file

@@ -0,0 +1,340 @@

---
title: Tools
description: Extend agent capabilities with external tools and function calling
sidebar_label: Tools
sidebar_position: 6
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Tools

Tools are functions that can be invoked by an agent to perform tasks. They are organized into tool groups and registered with specific providers. Each tool group represents a collection of related tools from a single provider. Grouping related tools also lets state be externalized: the tools in a group typically operate on the same underlying state.

An example of this would be a "db_access" tool group that contains tools for interacting with a database. "list_tables", "query_table", "insert_row" could be examples of tools in this group.

Tools are treated like any other resource in Llama Stack, such as models. You can register them, have providers for them, and so on.

When instantiating an agent, you can provide it a list of tool groups that it has access to. The agent gets the corresponding tool definitions for the specified tool groups and passes them along to the model.

Refer to the [Building AI Applications](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) notebook for more examples on how to use tools.

## Server-side vs. Client-side Tool Execution

Llama Stack allows you to use both server-side and client-side tools. With server-side tools, `agent.create_turn` transparently executes the tool calls emitted by the model and returns the final answer to the user. If client-side tools are provided, the tool call is sent back to the caller for execution and optional continuation via the `agent.resume_turn` method.

## Server-side Tools

Llama Stack provides built-in providers for some common tools. These include web search, math, and RAG capabilities.

### Web Search

You have three providers to execute the web search tool calls generated by a model: Brave Search, Bing Search, and Tavily Search.

To indicate that the web search tool calls should be executed by brave-search, you can point the "builtin::websearch" toolgroup to the "brave-search" provider.

```python
client.toolgroups.register(
    toolgroup_id="builtin::websearch",
    provider_id="brave-search",
    args={"max_results": 5},
)
```

The tool requires an API key, which can be provided either in the configuration or through the request header `X-LlamaStack-Provider-Data`. The format of the header is:
```
{"<provider_name>_api_key": <your api key>}
```

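As a sketch of what that header looks like in practice (the key name follows the `<provider_name>_api_key` pattern above; Tavily is used purely as an illustration):

```python
import json

# Illustrative: attach the provider API key to a request via the provider-data header
headers = {
    "X-LlamaStack-Provider-Data": json.dumps({"tavily_search_api_key": "<your api key>"})
}
```

Alternatively, the complete web search example later on this page passes the same key via the client-side `provider_data` parameter.
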
### Math

The WolframAlpha tool provides access to computational knowledge through the WolframAlpha API.

```python
client.toolgroups.register(
    toolgroup_id="builtin::wolfram_alpha",
    provider_id="wolfram-alpha"
)
```

Example usage:
```python
result = client.tool_runtime.invoke_tool(
    tool_name="wolfram_alpha",
    args={"query": "solve x^2 + 2x + 1 = 0"}
)
```

### RAG

The RAG tool enables retrieval of context from various types of memory banks (vector, key-value, keyword, and graph).

```python
# Register Memory tool group
client.toolgroups.register(
    toolgroup_id="builtin::rag",
    provider_id="faiss",
    args={"max_chunks": 5, "max_tokens_in_context": 4096},
)
```

Features:
- Support for multiple memory bank types
- Configurable query generation
- Context retrieval with token limits

:::note[Default Configuration]
By default, the Llama Stack `run.yaml` defines toolgroups for web search, WolframAlpha, and RAG, provided by the tavily-search, wolfram-alpha, and rag providers respectively.
:::

## Model Context Protocol (MCP)

[MCP](https://github.com/modelcontextprotocol) is an emerging, increasingly popular standard for tool discovery and execution. It is a protocol that allows tools to be dynamically discovered from an MCP endpoint and can be used to extend the agent's capabilities.

### Using Remote MCP Servers

You can find some popular remote MCP servers [here](https://github.com/jaw9c/awesome-remote-mcp-servers). You can register them as toolgroups in the same way as local providers.

```python
client.toolgroups.register(
    toolgroup_id="mcp::deepwiki",
    provider_id="model-context-protocol",
    mcp_endpoint=URL(uri="https://mcp.deepwiki.com/sse"),
)
```

Note that most of the more useful MCP servers need you to authenticate with them, and many of them use OAuth2.0 for authentication. You can provide authorization headers to send to the MCP server using the "Provider Data" abstraction provided by Llama Stack. When making an agent call, pass the headers like this:

```python
agent = Agent(
    ...,
    tools=["mcp::deepwiki"],
    extra_headers={
        "X-LlamaStack-Provider-Data": json.dumps(
            {
                "mcp_headers": {
                    "http://mcp.deepwiki.com/sse": {
                        "Authorization": "Bearer <your_access_token>",
                    },
                },
            }
        ),
    },
)
agent.create_turn(...)
```

### Running Your Own MCP Server

Here's an example of how to run a simple MCP server that exposes a filesystem as a set of tools to the Llama Stack agent.

<Tabs>
<TabItem value="setup" label="Server Setup">

```shell
# Start your MCP server
mkdir /tmp/content
touch /tmp/content/foo
touch /tmp/content/bar
npx -y supergateway --port 8000 --stdio 'npx -y @modelcontextprotocol/server-filesystem /tmp/content'
```

</TabItem>
<TabItem value="register" label="Registration">

```python
# Register the MCP server as a tool group
client.toolgroups.register(
    toolgroup_id="mcp::filesystem",
    provider_id="model-context-protocol",
    mcp_endpoint=URL(uri="http://localhost:8000/sse"),
)
```

</TabItem>
</Tabs>

## Adding Custom (Client-side) Tools

When you want to use tools other than the built-in tools, you just need to implement a Python function with a docstring. The content of the docstring is used to describe the tool and its parameters, and is passed along to the generative model.

```python
# Example tool definition
def my_tool(input: int) -> int:
    """
    Runs my awesome tool.

    :param input: some int parameter
    """
    return input * 2
```

:::tip[Documentation Best Practices]
We employ Python docstrings to describe the tool and its parameters. It is important to document the tool and the parameters so that the model can use the tool correctly. It is recommended to experiment with different docstrings to see how they affect the model's behavior.
:::

Once defined, simply pass the tool to the agent config. `Agent` will take care of the rest (calling the model with the tool definition, executing the tool, and returning the result to the model for the next iteration).

```python
# Example agent config with client provided tools
agent = Agent(client, ..., tools=[my_tool])
```

Refer to [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/blob/main/examples/agents/e2e_loop_with_client_tools.py) for an example of how to use client provided tools.

## Tool Invocation

Tools can be invoked using the `invoke_tool` method:

```python
result = client.tool_runtime.invoke_tool(
    tool_name="web_search",
    kwargs={"query": "What is the capital of France?"}
)
```

The result contains the following fields (a short handling sketch follows):
- `content`: The tool's output
- `error_message`: Optional error message if the tool failed
- `error_code`: Optional error code if the tool failed

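A minimal sketch of checking these fields, assuming the `result` object from the call above exposes them as attributes:

```python
# Inspect the invocation result (attribute-style access is assumed here)
if result.error_message:
    print(f"Tool failed ({result.error_code}): {result.error_message}")
else:
    print(f"Tool output: {result.content}")
```
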
## Listing Available Tools

You can list all available tools or filter by tool group:

```python
# List all tools
all_tools = client.tools.list_tools()

# List tools in a specific group
group_tools = client.tools.list_tools(toolgroup_id="search_tools")
```

## Complete Examples

### Web Search Agent

<Tabs>
<TabItem value="setup" label="Setup & Configuration">

1. Start by registering a Tavily API key at [Tavily](https://tavily.com/).
2. [Optional] Provide the API key directly to the Llama Stack server when starting it:
   ```bash
   export TAVILY_SEARCH_API_KEY="your key"
   ```
   ```bash
   --env TAVILY_SEARCH_API_KEY=${TAVILY_SEARCH_API_KEY}
   ```

</TabItem>
<TabItem value="implementation" label="Implementation">

```python
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.types.agent_create_params import AgentConfig
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="http://localhost:8321",
    provider_data={
        "tavily_search_api_key": "your_TAVILY_SEARCH_API_KEY"
    },  # Set this from the client side. No need to provide it if it has already been configured on the Llama Stack server.
)

agent = Agent(
    client,
    model="meta-llama/Llama-3.2-3B-Instruct",
    instructions=(
        "You are a web search assistant; you must use the websearch tool to look up the most current and precise information available."
    ),
    tools=["builtin::websearch"],
)

session_id = agent.create_session("websearch-session")

response = agent.create_turn(
    messages=[
        {"role": "user", "content": "How did the USA perform in the last Olympics?"}
    ],
    session_id=session_id,
)
for log in EventLogger().log(response):
    log.print()
```

</TabItem>
</Tabs>

### WolframAlpha Math Agent

<Tabs>
<TabItem value="setup" label="Setup & Configuration">

1. Start by registering for a WolframAlpha API key at [WolframAlpha Developer Portal](https://developer.wolframalpha.com/access).
2. Provide the API key either when starting the Llama Stack server:
   ```bash
   --env WOLFRAM_ALPHA_API_KEY=${WOLFRAM_ALPHA_API_KEY}
   ```
   or from the client side:
   ```python
   client = LlamaStackClient(
       base_url="http://localhost:8321",
       provider_data={"wolfram_alpha_api_key": wolfram_api_key},
   )
   ```

</TabItem>
<TabItem value="implementation" label="Implementation">

```python
# Configure the tools in the Agent by setting tools=["builtin::wolfram_alpha"]
agent = Agent(
    client,
    model="meta-llama/Llama-3.2-3B-Instruct",
    instructions="You are a mathematical assistant that can solve complex equations.",
    tools=["builtin::wolfram_alpha"],
)

session_id = agent.create_session("math-session")

# Example user query
response = agent.create_turn(
    messages=[{"role": "user", "content": "Solve x^2 + 2x + 1 = 0 using WolframAlpha"}],
    session_id=session_id,
)
```

</TabItem>
</Tabs>

## Best Practices

### 🛠️ **Tool Selection**
- Use **server-side tools** for production applications requiring reliability and security
- Use **client-side tools** for development, prototyping, or specialized integrations
- Combine multiple tool types for comprehensive functionality

### 📝 **Documentation**
- Write clear, detailed docstrings for custom tools
- Include parameter descriptions and expected return types
- Test tool descriptions with the model to ensure proper usage

### 🔐 **Security**
- Store API keys securely using environment variables or secure configuration
- Use the `X-LlamaStack-Provider-Data` header for dynamic authentication
- Validate tool inputs and outputs for security

### 🔄 **Error Handling**
- Implement proper error handling in custom tools
- Use structured error responses with meaningful messages
- Monitor tool performance and reliability

## Related Resources

- **[Agents](./agent)** - Building intelligent agents with tools
- **[RAG (Retrieval Augmented Generation)](./rag)** - Using knowledge retrieval tools
- **[Agent Execution Loop](./agent_execution_loop)** - Understanding tool execution flow
- **[Building AI Applications Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)** - Comprehensive examples
- **[Llama Stack Apps Examples](https://github.com/meta-llama/llama-stack-apps)** - Real-world tool implementations