# Building AI Applications

Llama Stack provides all the building blocks needed to create sophisticated AI applications. This guide will walk you through how to use these components effectively.

## Basic Inference

The foundation of any AI application is the ability to interact with LLM models. Llama Stack provides a simple interface for both completion and chat-based inference:
```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5001")

# List available models
models = client.models.list()

# Simple chat completion
response = client.inference.chat_completion(
    model_id="Llama3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about coding"},
    ],
)
print(response.completion_message.content)
```
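For interactive applications you can stream tokens as they are generated instead of waiting for the full reply. A minimal sketch, assuming the client accepts a `stream=True` flag and yields event chunks carrying a text delta (the exact chunk shape is an assumption and varies across client versions):

```python
# Stream a chat completion chunk-by-chunk (chunk shape is an assumption)
response = client.inference.chat_completion(
    model_id="Llama3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True,
)
for chunk in response:
    # Each chunk carries an incremental piece of the generated text
    print(chunk.event.delta, end="", flush=True)
```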
## Adding Memory & RAG

Memory enables your applications to reference and recall information from previous interactions or external documents. Llama Stack's memory system is built around the concept of Memory Banks:

1. **Vector Memory Banks**: For semantic search and retrieval
2. **Key-Value Memory Banks**: For structured data storage
3. **Keyword Memory Banks**: For basic text search
4. **Graph Memory Banks**: For relationship-based retrieval

Here's how to set up a vector memory bank for RAG:
```python
# Register a memory bank
bank_id = "my_documents"
response = client.memory_banks.register(
    memory_bank_id=bank_id,
    params={
        "memory_bank_type": "vector",
        "embedding_model": "all-MiniLM-L6-v2",
        "chunk_size_in_tokens": 512,
    },
)

# Insert documents
documents = [
    {
        "document_id": "doc1",
        "content": "Your document text here",
        "mime_type": "text/plain",
    }
]
client.memory.insert(bank_id, documents)

# Query documents
results = client.memory.query(
    bank_id=bank_id,
    query="What do you know about...",
)
```
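When you are not using the agent framework, you can stitch the retrieved chunks into a prompt yourself. A minimal sketch, assuming each result in `results.chunks` exposes its text as `chunk.content` (field names are assumptions and may differ by version):

```python
# Build an augmented prompt from retrieved chunks (field names are assumptions)
context = "\n\n".join(str(chunk.content) for chunk in results.chunks)
response = client.inference.chat_completion(
    model_id="Llama3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": "What do you know about..."},
    ],
)
```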
## Implementing Safety Guardrails

Safety is a critical component of any AI application. Llama Stack provides a Shield system that can be applied at multiple touchpoints:
```python
# Register a safety shield
shield_id = "content_safety"
client.shields.register(
    shield_id=shield_id,
    provider_shield_id="llama-guard-basic",
)

# Run content through shield
response = client.safety.run_shield(
    shield_id=shield_id,
    messages=[{"role": "user", "content": "User message here"}],
)

if response.violation:
    print(f"Safety violation detected: {response.violation.user_message}")
```
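The agent framework applies shields automatically (via the `input_shields`/`output_shields` configuration shown later), but when calling inference directly you can wrap both sides of the exchange yourself. A minimal sketch using the shield registered above:

```python
def guarded_chat(client, model_id, messages, shield_id="content_safety"):
    # Screen the user's input before running inference
    check = client.safety.run_shield(shield_id=shield_id, messages=messages)
    if check.violation:
        return check.violation.user_message

    response = client.inference.chat_completion(model_id=model_id, messages=messages)
    reply = response.completion_message

    # Screen the model's output before returning it
    check = client.safety.run_shield(shield_id=shield_id, messages=[reply])
    if check.violation:
        return check.violation.user_message
    return reply.content
```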
## Building Agents

Agents are the heart of complex AI applications. They combine inference, memory, safety, and tool usage into coherent workflows. At its core, an agent follows a sophisticated execution loop that enables multi-step reasoning, tool usage, and safety checks.

### The Agent Execution Loop

Each agent turn follows these key steps:
1. **Initial Safety Check**: The user's input is first screened through the configured safety shields

2. **Context Retrieval**:
   - If RAG is enabled, the agent queries relevant documents from memory banks
   - New documents are first inserted into the memory bank
   - Retrieved context is appended to the user's prompt

3. **Inference Loop**: The agent enters its main execution loop:
   - The LLM receives the augmented prompt (with context and/or previous tool outputs)
   - The LLM generates a response, potentially with tool calls
   - If tool calls are present:
     - Tool inputs are safety-checked
     - Tools are executed (e.g., web search, code execution)
     - Tool responses are fed back to the LLM for synthesis
   - The loop continues until:
     - The LLM provides a final response without tool calls
     - The maximum number of iterations is reached
     - The token limit is exceeded

4. **Final Safety Check**: The agent's final response is screened through the configured safety shields
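The sketch below condenses these steps into simplified Python-style pseudocode. The helper names are illustrative only, not actual Llama Stack APIs; the real executor runs inside the Llama Stack server:

```python
# Illustrative pseudocode for one agent turn (helper names are hypothetical)
def agent_turn(user_input):
    check_input_shields(user_input)            # 1. initial safety check
    context = query_memory_banks(user_input)   # 2. context retrieval
    prompt = augment(user_input, context)

    for _ in range(max_infer_iters):           # 3. inference loop
        response = llm.generate(prompt)        # (token-limit checks omitted)
        if not response.tool_calls:
            break                              # stop: final answer, no tool calls
        for call in response.tool_calls:
            check_tool_input_shields(call)     # tool inputs are safety-checked
            prompt = augment(prompt, execute_tool(call))

    check_output_shields(response)             # 4. final safety check
    return response
```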
```{mermaid}
sequenceDiagram
    participant U as User
    participant E as Executor
    participant M as Memory Bank
    participant L as LLM
    participant T as Tools
    participant S as Safety Shield

    Note over U,S: Agent Turn Start
    U->>S: 1. Submit Prompt
    activate S
    S->>E: Input Safety Check
    deactivate S

    E->>M: 2.1 Query Context
    M-->>E: 2.2 Retrieved Documents

    loop Inference Loop
        E->>L: 3.1 Augment with Context
        L-->>E: 3.2 Response (with/without tool calls)

        alt Has Tool Calls
            E->>S: Check Tool Input
            S->>T: 4.1 Execute Tool
            T-->>E: 4.2 Tool Response
            E->>L: 5.1 Tool Response
            L-->>E: 5.2 Synthesized Response
        end

        opt Stop Conditions
            Note over E: Break if:
            Note over E: - No tool calls
            Note over E: - Max iterations reached
            Note over E: - Token limit exceeded
        end
    end

    E->>S: Output Safety Check
    S->>U: 6. Final Response
```
Each step in this process can be monitored and controlled through configuration. Here's an example that demonstrates monitoring the agent's execution:

```python
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig

agent_config = AgentConfig(
    model="Llama3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    # Enable both RAG and tool usage
    tools=[
        {
            "type": "memory",
            "memory_bank_configs": [{"type": "vector", "bank_id": "my_docs"}],
            "max_tokens_in_context": 4096,
        },
        {
            "type": "code_interpreter",
            "enable_inline_code_execution": True,
        },
    ],
    # Configure safety
    input_shields=["content_safety"],
    output_shields=["content_safety"],
    # Control the inference loop
    max_infer_iters=5,
    sampling_params={
        "temperature": 0.7,
        "max_tokens": 2048,
    },
)

agent = Agent(client, agent_config)
session_id = agent.create_session("monitored_session")

# Stream the agent's execution steps
response = agent.create_turn(
    messages=[{"role": "user", "content": "Analyze this code and run it"}],
    attachments=[
        {
            "content": "https://raw.githubusercontent.com/example/code.py",
            "mime_type": "text/plain",
        }
    ],
    session_id=session_id,
)

# Monitor each step of execution
for log in EventLogger().log(response):
    if log.event.step_type == "memory_retrieval":
        print("Retrieved context:", log.event.retrieved_context)
    elif log.event.step_type == "inference":
        print("LLM output:", log.event.model_response)
    elif log.event.step_type == "tool_execution":
        print("Tool call:", log.event.tool_call)
        print("Tool response:", log.event.tool_response)
    elif log.event.step_type == "shield_call":
        if log.event.violation:
            print("Safety violation:", log.event.violation)
```
This example shows how each step of an agent's execution can be inspected. For the common case, Llama Stack provides a high-level agent framework:

```python
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.types.agent_create_params import AgentConfig

# Configure an agent
agent_config = AgentConfig(
    model="Llama3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    tools=[
        {
            "type": "memory",
            "memory_bank_configs": [],
            "query_generator_config": {
                "type": "default",
                "sep": " ",
            },
        }
    ],
    input_shields=["content_safety"],
    output_shields=["content_safety"],
    enable_session_persistence=True,
)

# Create an agent
agent = Agent(client, agent_config)
session_id = agent.create_session("my_session")

# Run agent turns
response = agent.create_turn(
    messages=[{"role": "user", "content": "Your question here"}],
    session_id=session_id,
)
```
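Because `enable_session_persistence=True`, turns in the same session share conversation history, so follow-up questions can build on earlier ones:

```python
# A follow-up turn in the same session sees the earlier exchange
response = agent.create_turn(
    messages=[{"role": "user", "content": "Can you elaborate on that answer?"}],
    session_id=session_id,
)
```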
### Adding Tools to Agents

Agents can be enhanced with various tools:

1. **Search**: Web search capabilities through providers like Brave
2. **Code Interpreter**: Execute code snippets
3. **RAG**: Memory and document retrieval
4. **Function Calling**: Custom function execution
5. **WolframAlpha**: Mathematical computations
6. **Photogen**: Image generation

Example of configuring an agent with tools:
```python
agent_config = AgentConfig(
    model="Llama3.2-3B-Instruct",
    tools=[
        {
            "type": "brave_search",
            "api_key": "YOUR_API_KEY",
            "engine": "brave",
        },
        {
            "type": "code_interpreter",
            "enable_inline_code_execution": True,
        },
    ],
    tool_choice="auto",
    tool_prompt_format="json",
)
```
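Custom function calling (item 4 in the list above) declares a function signature that the model may invoke; your application executes matching tool calls and returns the results. A hedged sketch, assuming a `function_call` tool type with `function_name` and `parameters` fields (the exact schema is an assumption and may differ across versions):

```python
# Declare a custom function the model may call (schema fields are assumptions)
agent_config = AgentConfig(
    model="Llama3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    tools=[
        {
            "type": "function_call",
            "function_name": "get_stock_price",
            "description": "Look up the latest price for a ticker symbol",
            "parameters": {
                "ticker": {
                    "param_type": "str",
                    "description": "Stock ticker, e.g. AAPL",
                    "required": True,
                }
            },
        }
    ],
    tool_choice="auto",
)
```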
## Building RAG-Enhanced Agents

One of the most powerful patterns is combining agents with RAG capabilities. Here's a complete example:
```python
from llama_stack_client.types import Attachment

# Create attachments from documents
attachments = [
    Attachment(
        content="https://raw.githubusercontent.com/example/doc.rst",
        mime_type="text/plain",
    )
]

# Configure agent with memory
agent_config = AgentConfig(
    model="Llama3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    tools=[
        {
            "type": "memory",
            "memory_bank_configs": [],
            "query_generator_config": {"type": "default", "sep": " "},
            "max_tokens_in_context": 4096,
            "max_chunks": 10,
        }
    ],
    enable_session_persistence=True,
)

agent = Agent(client, agent_config)
session_id = agent.create_session("rag_session")

# Initial document ingestion
response = agent.create_turn(
    messages=[
        {"role": "user", "content": "I am providing some documents for reference."}
    ],
    attachments=attachments,
    session_id=session_id,
)

# Query with RAG
response = agent.create_turn(
    messages=[
        {"role": "user", "content": "What are the key topics in the documents?"}
    ],
    session_id=session_id,
)
```
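`create_turn` returns a stream of events; to print the agent's reply as it arrives, reuse the `EventLogger` shown earlier:

```python
from llama_stack_client.lib.agents.event_logger import EventLogger

# Print each streamed event (inference tokens, tool calls, shield results)
for log in EventLogger().log(response):
    log.print()
```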
## Testing & Evaluation

Llama Stack provides built-in tools for evaluating your applications:

1. **Benchmarking**: Test against standard datasets
2. **Application Evaluation**: Score your application's outputs
3. **Custom Metrics**: Define your own evaluation criteria

Here's how to set up basic evaluation:
```python
# Create an evaluation task
response = client.eval_tasks.register(
    eval_task_id="my_eval",
    dataset_id="my_dataset",
    scoring_functions=["accuracy", "relevance"],
)

# Run evaluation
job = client.eval.run_eval(
    task_id="my_eval",
    task_config={
        "type": "app",
        "eval_candidate": {
            "type": "agent",
            "config": agent_config,
        },
    },
)

# Get results
result = client.eval.job_result(
    task_id="my_eval",
    job_id=job.job_id,
)
```
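Evaluation jobs run asynchronously, so poll for completion before reading results. A hedged sketch, assuming a `job_status` call and a `scores` mapping on the result (both names are assumptions and may differ by version):

```python
import time

# Poll until the job leaves the running state (API and status names are assumptions)
while client.eval.job_status(task_id="my_eval", job_id=job.job_id) == "in_progress":
    time.sleep(5)

result = client.eval.job_result(task_id="my_eval", job_id=job.job_id)
for scoring_fn, score in result.scores.items():
    print(f"{scoring_fn}: {score}")
```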
## Debugging & Monitoring

Llama Stack includes comprehensive telemetry for debugging and monitoring your applications:

1. **Tracing**: Track request flows across components
2. **Metrics**: Measure performance and usage
3. **Logging**: Debug issues and track behavior

The telemetry system supports multiple output formats:

- OpenTelemetry for visualization in tools like Jaeger
- SQLite for local storage and querying
- Console output for development

Example of querying traces:
```python
# Query traces for a session
traces = client.telemetry.query_traces(
    attribute_filters=[
        {
            "key": "session_id",
            "op": "eq",
            "value": session_id,
        }
    ]
)

# Get detailed span information
span_tree = client.telemetry.get_span_tree(
    span_id=traces[0].root_span_id,
)
```
For details on how to use the telemetry system to debug your applications, export traces to a dataset, and run evaluations, see the [Telemetry](telemetry) section.

```{toctree}
:hidden:
:maxdepth: 3

telemetry
```