docs: update Agent documentation (#1333)

Summary:
- [new] Agent concepts (session, turn)
- [new] how to write custom tools
- [new] non-streaming API and how to get outputs
- [update] remaining `memory` -> `rag` rename
- [new] note importance of `instructions`

Test Plan:
read
ehhuang committed 2025-03-01 22:34:52 -08:00
parent 46b0a404e8
commit 52977e56a8
6 changed files with 170 additions and 64 deletions


@ -0,0 +1,91 @@
# Llama Stack Agent Framework
The Llama Stack agent framework is built on a modular architecture that allows for flexible and powerful AI applications. This document explains the key components and how they work together.
## Core Concepts
### 1. Agent Configuration
Agents are configured using the `AgentConfig` class, which includes:
- **Model**: The underlying LLM to power the agent
- **Instructions**: System prompt that defines the agent's behavior
- **Tools**: Capabilities the agent can use to interact with external systems
- **Safety Shields**: Guardrails to ensure responsible AI behavior
```python
from llama_stack_client.types.agent_create_params import AgentConfig
from llama_stack_client.lib.agents.agent import Agent
# Configure an agent
agent_config = AgentConfig(
    model="meta-llama/Llama-3-70b-chat",
    instructions="You are a helpful assistant that can use tools to answer questions.",
    toolgroups=["builtin::code_interpreter", "builtin::rag/knowledge_search"],
)
# Create the agent
agent = Agent(llama_stack_client, agent_config)
```
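The safety shields mentioned in the list above can be attached through the same config. A minimal sketch, reusing the shield fields that appear in the sample agent config later in this commit and assuming a shield named `llama_guard` has already been registered:

```python
# Sketch: the same AgentConfig with guardrails attached.
# The shield name "llama_guard" is an assumption; use whatever shield
# is registered with your Llama Stack server.
guarded_config = AgentConfig(
    model="meta-llama/Llama-3-70b-chat",
    instructions="You are a helpful assistant that can use tools to answer questions.",
    toolgroups=["builtin::code_interpreter", "builtin::rag/knowledge_search"],
    input_shields=["llama_guard"],  # checked before the model sees user input
    output_shields=["llama_guard"],  # checked before the response is returned
    enable_session_persistence=False,
)

guarded_agent = Agent(llama_stack_client, guarded_config)
```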
### 2. Sessions
Agents maintain state through sessions, which represent a conversation thread:
```python
# Create a session
session_id = agent.create_session(session_name="My conversation")
```
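Because each session carries its own history, one agent can serve several independent conversation threads side by side. A small sketch (the session names are hypothetical):

```python
# Sketch: two independent conversation threads on the same agent.
# Turns created in one session do not see the history of the other.
support_session_id = agent.create_session(session_name="Customer support")
research_session_id = agent.create_session(session_name="Research notes")
```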
### 3. Turns
Each interaction with an agent is called a "turn" and consists of:
- **Input Messages**: What the user sends to the agent
- **Steps**: The agent's internal processing (inference, tool execution, etc.)
- **Output Message**: The agent's response
```python
from llama_stack_client.lib.agents.event_logger import EventLogger
# Create a turn with streaming response
turn_response = agent.create_turn(
    session_id=session_id,
    messages=[{"role": "user", "content": "Tell me about Llama models"}],
)

for log in EventLogger().log(turn_response):
    log.print()
```
### Non-Streaming
```python
from rich.pretty import pprint
# Non-streaming API
response = agent.create_turn(
    session_id=session_id,
    messages=[{"role": "user", "content": "Tell me about Llama models"}],
    stream=False,
)
print("Inputs:")
pprint(response.input_messages)
print("Output:")
pprint(response.output_message.content)
print("Steps:")
pprint(response.steps)
```
### 4. Steps
Each turn consists of multiple steps that represent the agent's thought process:
- **Inference Steps**: The agent generating text responses
- **Tool Execution Steps**: The agent using tools to gather information
- **Shield Call Steps**: Safety checks being performed
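With the non-streaming API shown above, these steps are available on `response.steps`. A rough sketch of inspecting them; the `step_type` strings follow the step names used in this document, but the per-step attribute names are assumptions to verify against your client version:

```python
# Sketch: walk the steps of a completed (non-streaming) turn.
# Attribute names (model_response, tool_calls, tool_responses, violation)
# are assumptions; check your llama_stack_client version.
for step in response.steps:
    if step.step_type == "inference":
        print("Model response:", step.model_response)
    elif step.step_type == "tool_execution":
        print("Tool calls:", step.tool_calls)
        print("Tool responses:", step.tool_responses)
    elif step.step_type == "shield_call":
        print("Violation (if any):", step.violation)
```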
## Agent Execution Loop
Refer to the [Agent Execution Loop](agent_execution_loop) for more details on what happens within an agent turn.


@ -13,7 +13,7 @@ Each agent turn follows these key steps:
3. **Inference Loop**: The agent enters its main execution loop:
   - The LLM receives a user prompt (with previous tool outputs)
   - The LLM generates a response, potentially with [tool calls](tools)
   - If tool calls are present:
     - Tool inputs are safety-checked
     - Tools are executed (e.g., web search, code execution)
@ -68,6 +68,7 @@ Each step in this process can be monitored and controlled through configurations
```python
from llama_stack_client.lib.agents.event_logger import EventLogger
from rich.pretty import pprint

agent_config = AgentConfig(
    model="Llama3.2-3B-Instruct",
@ -108,14 +109,21 @@ response = agent.create_turn(
# Monitor each step of execution
for log in EventLogger().log(response):
    log.print()

# Using non-streaming API, the response contains input, steps, and output.
response = agent.create_turn(
    messages=[{"role": "user", "content": "Analyze this code and run it"}],
    attachments=[
        {
            "content": "https://raw.githubusercontent.com/example/code.py",
            "mime_type": "text/plain",
        }
    ],
    session_id=session_id,
)
pprint(f"Input: {response.input_messages}")
pprint(f"Output: {response.output_message.content}")
pprint(f"Steps: {response.steps}")
```


@ -149,7 +149,6 @@ agent_config = {
        }
    ],
    "tool_choice": "auto",
    "input_shields": [],
    "output_shields": [],
    "enable_session_persistence": False,


@ -8,22 +8,24 @@ The best way to get started is to look at this notebook which walks through the
Here are some key topics that will help you build effective agents:

- **[Agent](agent)**: Understand the components and design patterns of the Llama Stack agent framework.
- **[Agent Execution Loop](agent_execution_loop)**: Understand how agents process information, make decisions, and execute actions in a continuous loop.
- **[RAG (Retrieval-Augmented Generation)](rag)**: Learn how to enhance your agents with external knowledge through retrieval mechanisms.
- **[Tools](tools)**: Extend your agents' capabilities by integrating with external tools and APIs.
- **[Evals](evals)**: Evaluate your agents' effectiveness and identify areas for improvement.
- **[Telemetry](telemetry)**: Monitor and analyze your agents' performance and behavior.
- **[Safety](safety)**: Implement guardrails and safety measures to ensure responsible AI behavior.

```{toctree}
:hidden:
:maxdepth: 1

agent
agent_execution_loop
rag
tools
telemetry
evals
advanced_agent_patterns
safety
```


@ -1,8 +1,8 @@
## Using Retrieval Augmented Generation (RAG)

RAG enables your applications to reference and recall information from previous interactions or external documents.

Llama Stack organizes the APIs that enable RAG into three layers:
- the lowermost APIs deal with raw storage and retrieval. These include Vector IO, KeyValue IO (coming soon) and Relational IO (also coming soon).
- next is the "RAG Tool", a first-class tool as part of the Tools API that allows you to ingest documents (from URLs, files, etc.) with various chunking strategies and query them smartly.
- finally, it all comes together with the top-level "Agents" API that allows you to create agents that can use the tools to answer questions, perform tasks, and more.
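To make the layering concrete, here is a rough sketch of the two lower layers used directly, without an agent. The vector database id, embedding model, and exact parameter names are illustrative assumptions; check them against the current client:

```python
# Sketch of the lower RAG layers; identifiers and parameters are illustrative.
from llama_stack_client.types import Document

# Layer 1: raw storage - register a vector database to hold document chunks.
client.vector_dbs.register(
    vector_db_id="my_documents",
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
)

# Layer 2: the RAG tool - ingest documents with a chunking strategy...
client.tool_runtime.rag_tool.insert(
    documents=[
        Document(
            document_id="doc-1",
            content="https://example.com/guide.txt",
            mime_type="text/plain",
            metadata={},
        )
    ],
    vector_db_id="my_documents",
    chunk_size_in_tokens=512,
)

# ...and query them directly.
results = client.tool_runtime.rag_tool.query(
    content="How do I optimize memory usage?",
    vector_db_ids=["my_documents"],
)
```

The Agents API configured below layers on top of this: the agent decides when to query and feeds the retrieved chunks back into the model.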
@ -86,7 +86,7 @@ from llama_stack_client.lib.agents.agent import Agent
# Configure agent with memory
agent_config = AgentConfig(
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a helpful assistant",
    enable_session_persistence=False,
    toolgroups=[
@ -102,6 +102,19 @@ agent_config = AgentConfig(
agent = Agent(client, agent_config)
session_id = agent.create_session("rag_session")

# Ask questions about documents in the vector db, and the agent will query the db to answer the question.
response = agent.create_turn(
    messages=[{"role": "user", "content": "How to optimize memory in PyTorch?"}],
    session_id=session_id,
)
```

> **NOTE:** the `instructions` field in the `AgentConfig` can be used to guide the agent's behavior. It is important to experiment with different instructions to see what works best for your use case.

You can also pass documents along with the user's message and ask questions about them.

```python
# Initial document ingestion
response = agent.create_turn(
    messages=[


@ -83,15 +83,15 @@ result = client.tool_runtime.invoke_tool(
)
```

#### RAG

The RAG tool enables retrieval of context from various types of memory banks (vector, key-value, keyword, and graph).

```python
# Register Memory tool group
client.toolgroups.register(
    toolgroup_id="builtin::rag",
    provider_id="faiss",
    args={"max_chunks": 5, "max_tokens_in_context": 4096},
)
```
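Once registered, an agent can reference the toolgroup by name and point it at a specific vector database. A sketch; the dict form with `name`/`args` mirrors the RAG example elsewhere in these docs, but treat the exact keys and ids as assumptions:

```python
# Sketch: give an agent the registered RAG toolgroup, scoped to one vector db.
# The "name"/"args" keys and the vector db id are assumptions.
agent_config = AgentConfig(
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a helpful assistant",
    toolgroups=[
        {
            "name": "builtin::rag/knowledge_search",
            "args": {"vector_db_ids": ["my_documents"]},
        }
    ],
)
```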
@ -102,7 +102,7 @@ Features:
- Context retrieval with token limits

> **Note:** By default, llama stack run.yaml defines toolgroups for web search, code interpreter and rag, that are provided by tavily-search, code-interpreter and rag providers.

## Model Context Protocol (MCP) Tools
@ -125,50 +125,43 @@ MCP tools require:
- Tools are discovered dynamically from the endpoint

## Adding Custom Tools

When you want to use tools other than the built-in tools, you can implement a Python function and decorate it with `@client_tool`.

To define a custom tool, you need to use the `@client_tool` decorator.

```python
from llama_stack_client.lib.agents.client_tool import client_tool


# Example tool definition
@client_tool
def my_tool(input: int) -> int:
    """
    Runs my awesome tool.

    :param input: some int parameter
    """
    return input * 2
```

> **NOTE:** We employ Python docstrings to describe the tool and the parameters. It is important to document the tool and the parameters so that the model can use the tool correctly. It is recommended to experiment with different docstrings to see how they affect the model's behavior.

Once defined, simply pass the tool to the agent config. `Agent` will take care of the rest (calling the model with the tool definition, executing the tool, and returning the result to the model for the next iteration).

```python
# Example agent config with client provided tools
client_tools = [
    my_tool,
]

agent_config = AgentConfig(
    ...,
    client_tools=[client_tool.get_tool_definition() for client_tool in client_tools],
)
agent = Agent(client, agent_config, client_tools)
```

Refer to [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/blob/main/examples/agents/e2e_loop_with_client_tools.py) for an example of how to use client provided tools.
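For completeness, a short sketch of a turn that can exercise the custom tool; the prompt and session name are made up, and the model decides on its own whether to call `my_tool`:

```python
# Sketch: a turn in which the model may decide to call my_tool.
session_id = agent.create_session(session_name="custom-tool-demo")

response = agent.create_turn(
    session_id=session_id,
    messages=[{"role": "user", "content": "Use my_tool to double the number 21."}],
    stream=False,
)
print(response.output_message.content)

# Tool activity shows up in the turn's steps.
for step in response.steps:
    print(step.step_type)
```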
## Tool Invocation