llama-stack/docs/source/building_applications/agent_execution_loop.md
ehhuang ca2910d27a
docs: update test_agents to use new Agent SDK API (#1402)
# Summary:
new Agent SDK API is added in
https://github.com/meta-llama/llama-stack-client-python/pull/178

Update docs and test to reflect this.

Closes https://github.com/meta-llama/llama-stack/issues/1365

# Test Plan:
```bash
py.test -v -s --nbval-lax ./docs/getting_started.ipynb

LLAMA_STACK_CONFIG=fireworks \
   pytest -s -v tests/integration/agents/test_agents.py \
  --safety-shield meta-llama/Llama-Guard-3-8B --text-model meta-llama/Llama-3.1-8B-Instruct
```
2025-03-06 15:21:12 -08:00

134 lines
4.4 KiB
Markdown

## Agent Execution Loop
Agents are the heart of complex AI applications. They combine inference, memory, safety, and tool usage into coherent workflows. At its core, an agent follows a sophisticated execution loop that enables multi-step reasoning, tool usage, and safety checks.
Each agent turn follows these key steps:
1. **Initial Safety Check**: The user's input is first screened through configured safety shields
2. **Context Retrieval**:
- If RAG is enabled, the agent can choose to query relevant documents from memory banks. You can use the `instructions` field to steer the agent.
- For new documents, they are first inserted into the memory bank.
- Retrieved context is provided to the LLM as a tool response in the message history.
3. **Inference Loop**: The agent enters its main execution loop:
- The LLM receives a user prompt (with previous tool outputs)
- The LLM generates a response, potentially with [tool calls](tools)
- If tool calls are present:
- Tool inputs are safety-checked
- Tools are executed (e.g., web search, code execution)
- Tool responses are fed back to the LLM for synthesis
- The loop continues until:
- The LLM provides a final response without tool calls
- Maximum iterations are reached
- Token limit is exceeded
4. **Final Safety Check**: The agent's final response is screened through safety shields
```{mermaid}
sequenceDiagram
participant U as User
participant E as Executor
participant M as Memory Bank
participant L as LLM
participant T as Tools
participant S as Safety Shield
Note over U,S: Agent Turn Start
U->>S: 1. Submit Prompt
activate S
S->>E: Input Safety Check
deactivate S
loop Inference Loop
E->>L: 2.1 Augment with Context
L-->>E: 2.2 Response (with/without tool calls)
alt Has Tool Calls
E->>S: Check Tool Input
S->>T: 3.1 Execute Tool
T-->>E: 3.2 Tool Response
E->>L: 4.1 Tool Response
L-->>E: 4.2 Synthesized Response
end
opt Stop Conditions
Note over E: Break if:
Note over E: - No tool calls
Note over E: - Max iterations reached
Note over E: - Token limit exceeded
end
end
E->>S: Output Safety Check
S->>U: 5. Final Response
```
Each step in this process can be monitored and controlled through configurations. Here's an example that demonstrates monitoring the agent's execution:
```python
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from rich.pretty import pprint
# Replace host and port
client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")
agent = Agent(
client,
# Check with `llama-stack-client models list`
model="Llama3.2-3B-Instruct",
instructions="You are a helpful assistant",
# Enable both RAG and tool usage
tools=[
{
"name": "builtin::rag/knowledge_search",
"args": {"vector_db_ids": ["my_docs"]},
},
"builtin::code_interpreter",
],
# Configure safety (optional)
input_shields=["llama_guard"],
output_shields=["llama_guard"],
# Control the inference loop
max_infer_iters=5,
sampling_params={
"strategy": {"type": "top_p", "temperature": 0.7, "top_p": 0.95},
"max_tokens": 2048,
},
)
session_id = agent.create_session("monitored_session")
# Stream the agent's execution steps
response = agent.create_turn(
messages=[{"role": "user", "content": "Analyze this code and run it"}],
documents=[
{
"content": "https://raw.githubusercontent.com/example/code.py",
"mime_type": "text/plain",
}
],
session_id=session_id,
)
# Monitor each step of execution
for log in EventLogger().log(response):
log.print()
# Using non-streaming API, the response contains input, steps, and output.
response = agent.create_turn(
messages=[{"role": "user", "content": "Analyze this code and run it"}],
documents=[
{
"content": "https://raw.githubusercontent.com/example/code.py",
"mime_type": "text/plain",
}
],
session_id=session_id,
)
pprint(f"Input: {response.input_messages}")
pprint(f"Output: {response.output_message.content}")
pprint(f"Steps: {response.steps}")
```