docs: update Agent documentation (#1333)

Summary:
- [new] Agent concepts (session, turn)
- [new] how to write custom tools
- [new] non-streaming API and how to get outputs
- [update] remaining `memory` -> `rag` rename
- [new] note importance of `instructions`

Test Plan:
read
ehhuang committed 2025-03-01 22:34:52 -08:00
parent 46b0a404e8
commit 52977e56a8
6 changed files with 170 additions and 64 deletions


@ -0,0 +1,91 @@
# Llama Stack Agent Framework
The Llama Stack agent framework is built on a modular architecture that allows for flexible and powerful AI applications. This document explains the key components and how they work together.
## Core Concepts
### 1. Agent Configuration
Agents are configured using the `AgentConfig` class, which includes:
- **Model**: The underlying LLM to power the agent
- **Instructions**: System prompt that defines the agent's behavior
- **Tools**: Capabilities the agent can use to interact with external systems
- **Safety Shields**: Guardrails to ensure responsible AI behavior
```python
from llama_stack_client.types.agent_create_params import AgentConfig
from llama_stack_client.lib.agents.agent import Agent
# Configure an agent
agent_config = AgentConfig(
    model="meta-llama/Llama-3-70b-chat",
    instructions="You are a helpful assistant that can use tools to answer questions.",
    toolgroups=["builtin::code_interpreter", "builtin::rag/knowledge_search"],
)
# Create the agent
agent = Agent(llama_stack_client, agent_config)
```
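The safety shields mentioned in the list above can be attached through the same config. A minimal sketch, reusing the shield fields that appear in the sample agent config later in this commit and assuming a shield named `llama_guard` has already been registered:

```python
# Sketch: the same AgentConfig with guardrails attached.
# The shield name "llama_guard" is an assumption; use whatever shield
# is registered with your Llama Stack server.
guarded_config = AgentConfig(
    model="meta-llama/Llama-3-70b-chat",
    instructions="You are a helpful assistant that can use tools to answer questions.",
    toolgroups=["builtin::code_interpreter", "builtin::rag/knowledge_search"],
    input_shields=["llama_guard"],  # checked before the model sees user input
    output_shields=["llama_guard"],  # checked before the response is returned
    enable_session_persistence=False,
)

guarded_agent = Agent(llama_stack_client, guarded_config)
```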
### 2. Sessions
Agents maintain state through sessions, which represent a conversation thread:
```python
# Create a session
session_id = agent.create_session(session_name="My conversation")
```
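Because each session carries its own history, one agent can serve several independent conversation threads side by side. A small sketch (the session names are hypothetical):

```python
# Sketch: two independent conversation threads on the same agent.
# Turns created in one session do not see the history of the other.
support_session_id = agent.create_session(session_name="Customer support")
research_session_id = agent.create_session(session_name="Research notes")
```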
### 3. Turns
Each interaction with an agent is called a "turn" and consists of:
- **Input Messages**: What the user sends to the agent
- **Steps**: The agent's internal processing (inference, tool execution, etc.)
- **Output Message**: The agent's response
```python
from llama_stack_client.lib.agents.event_logger import EventLogger
# Create a turn with streaming response
turn_response = agent.create_turn(
    session_id=session_id,
    messages=[{"role": "user", "content": "Tell me about Llama models"}],
)

for log in EventLogger().log(turn_response):
    log.print()
```
### Non-Streaming
```python
from rich.pretty import pprint
# Non-streaming API
response = agent.create_turn(
    session_id=session_id,
    messages=[{"role": "user", "content": "Tell me about Llama models"}],
    stream=False,
)
print("Inputs:")
pprint(response.input_messages)
print("Output:")
pprint(response.output_message.content)
print("Steps:")
pprint(response.steps)
```
### 4. Steps
Each turn consists of multiple steps that represent the agent's thought process:
- **Inference Steps**: The agent generating text responses
- **Tool Execution Steps**: The agent using tools to gather information
- **Shield Call Steps**: Safety checks being performed
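With the non-streaming API shown above, these steps are available on `response.steps`. A rough sketch of inspecting them; the `step_type` strings follow the step names used in this document, but the per-step attribute names are assumptions to verify against your client version:

```python
# Sketch: walk the steps of a completed (non-streaming) turn.
# Attribute names (model_response, tool_calls, tool_responses, violation)
# are assumptions; check your llama_stack_client version.
for step in response.steps:
    if step.step_type == "inference":
        print("Model response:", step.model_response)
    elif step.step_type == "tool_execution":
        print("Tool calls:", step.tool_calls)
        print("Tool responses:", step.tool_responses)
    elif step.step_type == "shield_call":
        print("Violation (if any):", step.violation)
```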
## Agent Execution Loop
Refer to the [Agent Execution Loop](agent_execution_loop) for more details on what happens within an agent turn.


@ -13,7 +13,7 @@ Each agent turn follows these key steps:
3. **Inference Loop**: The agent enters its main execution loop:
   - The LLM receives a user prompt (with previous tool outputs)
   - The LLM generates a response, potentially with [tool calls](tools)
   - If tool calls are present:
     - Tool inputs are safety-checked
     - Tools are executed (e.g., web search, code execution)
@ -68,6 +68,7 @@ Each step in this process can be monitored and controlled through configurations
```python
from llama_stack_client.lib.agents.event_logger import EventLogger
from rich.pretty import pprint

agent_config = AgentConfig(
    model="Llama3.2-3B-Instruct",
@ -108,14 +109,21 @@ response = agent.create_turn(
# Monitor each step of execution
for log in EventLogger().log(response):
    log.print()

# Using non-streaming API, the response contains input, steps, and output.
response = agent.create_turn(
    messages=[{"role": "user", "content": "Analyze this code and run it"}],
    attachments=[
        {
            "content": "https://raw.githubusercontent.com/example/code.py",
            "mime_type": "text/plain",
        }
    ],
    session_id=session_id,
)
pprint(f"Input: {response.input_messages}")
pprint(f"Output: {response.output_message.content}")
pprint(f"Steps: {response.steps}")
```


@ -149,7 +149,6 @@ agent_config = {
        }
    ],
    "tool_choice": "auto",
    "input_shields": [],
    "output_shields": [],
    "enable_session_persistence": False,


@ -8,22 +8,24 @@ The best way to get started is to look at this notebook which walks through the
Here are some key topics that will help you build effective agents:

- **[Agent](agent)**: Understand the components and design patterns of the Llama Stack agent framework.
- **[Agent Execution Loop](agent_execution_loop)**: Understand how agents process information, make decisions, and execute actions in a continuous loop.
- **[RAG (Retrieval-Augmented Generation)](rag)**: Learn how to enhance your agents with external knowledge through retrieval mechanisms.
- **[Tools](tools)**: Extend your agents' capabilities by integrating with external tools and APIs.
- **[Evals](evals)**: Evaluate your agents' effectiveness and identify areas for improvement.
- **[Telemetry](telemetry)**: Monitor and analyze your agents' performance and behavior.
- **[Safety](safety)**: Implement guardrails and safety measures to ensure responsible AI behavior.

```{toctree}
:hidden:
:maxdepth: 1

agent
agent_execution_loop
rag
tools
telemetry
evals
advanced_agent_patterns
safety
```


@ -1,8 +1,8 @@
## Using Retrieval Augmented Generation (RAG)

RAG enables your applications to reference and recall information from previous interactions or external documents.

Llama Stack organizes the APIs that enable RAG into three layers:
- the lowermost APIs deal with raw storage and retrieval. These include Vector IO, KeyValue IO (coming soon) and Relational IO (also coming soon).
- next is the "RAG Tool", a first-class tool as part of the Tools API that allows you to ingest documents (from URLs, files, etc.) with various chunking strategies and query them smartly.
- finally, it all comes together with the top-level "Agents" API that allows you to create agents that can use the tools to answer questions, perform tasks, and more.
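To make the layering concrete, here is a rough sketch of the two lower layers used directly, without an agent. The vector database id, embedding model, and exact parameter names are illustrative assumptions; check them against the current client:

```python
# Sketch of the lower RAG layers; identifiers and parameters are illustrative.
from llama_stack_client.types import Document

# Layer 1: raw storage - register a vector database to hold document chunks.
client.vector_dbs.register(
    vector_db_id="my_documents",
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
)

# Layer 2: the RAG tool - ingest documents with a chunking strategy...
client.tool_runtime.rag_tool.insert(
    documents=[
        Document(
            document_id="doc-1",
            content="https://example.com/guide.txt",
            mime_type="text/plain",
            metadata={},
        )
    ],
    vector_db_id="my_documents",
    chunk_size_in_tokens=512,
)

# ...and query them directly.
results = client.tool_runtime.rag_tool.query(
    content="How do I optimize memory usage?",
    vector_db_ids=["my_documents"],
)
```

The Agents API configured below layers on top of this: the agent decides when to query and feeds the retrieved chunks back into the model.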
@ -86,7 +86,7 @@ from llama_stack_client.lib.agents.agent import Agent
# Configure agent with memory
agent_config = AgentConfig(
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a helpful assistant",
    enable_session_persistence=False,
    toolgroups=[
@ -102,6 +102,19 @@ agent_config = AgentConfig(
agent = Agent(client, agent_config)
session_id = agent.create_session("rag_session")

# Ask questions about documents in the vector db, and the agent will query the db to answer the question.
response = agent.create_turn(
    messages=[{"role": "user", "content": "How to optimize memory in PyTorch?"}],
    session_id=session_id,
)
```

> **NOTE:** the `instructions` field in the `AgentConfig` can be used to guide the agent's behavior. It is important to experiment with different instructions to see what works best for your use case.

You can also pass documents along with the user's message and ask questions about them.

```python
# Initial document ingestion
response = agent.create_turn(
    messages=[


@ -83,15 +83,15 @@ result = client.tool_runtime.invoke_tool(
)
```

#### RAG

The RAG tool enables retrieval of context from various types of memory banks (vector, key-value, keyword, and graph).

```python
# Register Memory tool group
client.toolgroups.register(
    toolgroup_id="builtin::rag",
    provider_id="faiss",
    args={"max_chunks": 5, "max_tokens_in_context": 4096},
)
```
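Once registered, an agent can reference the toolgroup by name and point it at a specific vector database. A sketch; the dict form with `name`/`args` mirrors the RAG example elsewhere in these docs, but treat the exact keys and ids as assumptions:

```python
# Sketch: give an agent the registered RAG toolgroup, scoped to one vector db.
# The "name"/"args" keys and the vector db id are assumptions.
agent_config = AgentConfig(
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a helpful assistant",
    toolgroups=[
        {
            "name": "builtin::rag/knowledge_search",
            "args": {"vector_db_ids": ["my_documents"]},
        }
    ],
)
```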
@ -102,7 +102,7 @@ Features:
- Context retrieval with token limits

> **Note:** By default, llama stack run.yaml defines toolgroups for web search, code interpreter and rag, that are provided by tavily-search, code-interpreter and rag providers.

## Model Context Protocol (MCP) Tools
@ -125,50 +125,43 @@ MCP tools require:
- Tools are discovered dynamically from the endpoint

## Adding Custom Tools

When you want to use tools other than the built-in tools, you can implement a Python function and decorate it with `@client_tool`.

To define a custom tool, you need to use the `@client_tool` decorator.

```python
from llama_stack_client.lib.agents.client_tool import client_tool


# Example tool definition
@client_tool
def my_tool(input: int) -> int:
    """
    Runs my awesome tool.

    :param input: some int parameter
    """
    return input * 2
```

> **NOTE:** We employ Python docstrings to describe the tool and the parameters. It is important to document the tool and the parameters so that the model can use the tool correctly. It is recommended to experiment with different docstrings to see how they affect the model's behavior.

Once defined, simply pass the tool to the agent config. `Agent` will take care of the rest (calling the model with the tool definition, executing the tool, and returning the result to the model for the next iteration).

```python
# Example agent config with client provided tools
client_tools = [
    my_tool,
]

agent_config = AgentConfig(
    ...,
    client_tools=[client_tool.get_tool_definition() for client_tool in client_tools],
)
agent = Agent(client, agent_config, client_tools)
```

Refer to [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/blob/main/examples/agents/e2e_loop_with_client_tools.py) for an example of how to use client provided tools.
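For completeness, a short sketch of a turn that can exercise the custom tool; the prompt and session name are made up, and the model decides on its own whether to call `my_tool`:

```python
# Sketch: a turn in which the model may decide to call my_tool.
session_id = agent.create_session(session_name="custom-tool-demo")

response = agent.create_turn(
    session_id=session_id,
    messages=[{"role": "user", "content": "Use my_tool to double the number 21."}],
    stream=False,
)
print(response.output_message.content)

# Tool activity shows up in the turn's steps.
for step in response.steps:
    print(step.step_type)
```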
## Tool Invocation