# Llama Stack RAG Lifecycle

In this notebook, we will walk through the lifecycle of building and evaluating a RAG pipeline using Llama Stack. 

**Example: Torchtune Knowledge Agent** 

Throughout this notebook, we will build a knowledge agent that can answer questions about the Torchtune project. 

## 0. Setup

In [13]:
from llama_stack_client import LlamaStackClient, Agent
from llama_stack.core.library_client import LlamaStackAsLibraryClient
from rich.pretty import pprint
import json
import uuid
from pydantic import BaseModel
import rich
import os
try:
    from google.colab import userdata
    os.environ['FIREWORKS_API_KEY'] = userdata.get('FIREWORKS_API_KEY')
except ImportError:
    print("Not in Google Colab environment")

# Uncomment to run Llama Stack as a library instead of connecting to a server
# client = LlamaStackAsLibraryClient("fireworks", provider_data = {"fireworks_api_key": os.environ['FIREWORKS_API_KEY']})
# _ = client.initialize()

# Connect to a hosted Llama Stack server (comment out if using the library client above)
client = LlamaStackClient(base_url="http://localhost:8321")

MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"

Not in Google Colab environment


## 1. Simple Vanilla Agent

First, we will build a simple vanilla agent with no access to an external knowledge base or tools, and check how it performs on a few questions.


In [14]:
# First, let's come up with a couple of examples to test the agent
examples = [
    {
        "input_query": "What precision formats does torchtune support?",
        "expected_answer": "Torchtune supports two data types for precision: fp32 (full-precision) which uses 4 bytes per model and optimizer parameter, and bfloat16 (half-precision) which uses 2 bytes per model and optimizer parameter."
    },
    {
        "input_query": "What does DoRA stand for in torchtune?",
        "expected_answer": "Weight-Decomposed Low-Rank Adaptation"
    },
    {
        "input_query": "How does the CPUOffloadOptimizer reduce GPU memory usage?",
        "expected_answer": "The CPUOffloadOptimizer reduces GPU memory usage by keeping optimizer states on CPU and performing optimizer steps on CPU. It can also optionally offload gradients to CPU by using offload_gradients=True"
    },
    {
        "input_query": "How do I ensure only LoRA parameters are trainable when fine-tuning?",
        "expected_answer": "You can set only LoRA parameters to trainable using torchtune's utility functions: first fetch all LoRA parameters with lora_params = get_adapter_params(lora_model), then set them as trainable with set_trainable_params(lora_model, lora_params). The LoRA recipe handles this automatically."
    }
]

In [16]:
simple_agent = Agent(client,
                     model=MODEL_ID, 
                     instructions="You are a helpful assistant that can answer questions about the Torchtune project.")
for example in examples:
    simple_session_id = simple_agent.create_session(session_name=f"simple_session_{uuid.uuid4()}")
    response = simple_agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": example["input_query"]
            }
        ],
        session_id=simple_session_id,
        stream=False
    )
    rich.print(f"[bold cyan]Question:[/bold cyan] {example['input_query']}")
    rich.print(f"[bold yellow]Agent Answer:[/bold yellow] {response.output_message.content}")

#### 1.1 Evaluate Agent Responses
Let's gather the agent's logs and evaluate its performance. We can see that the agent's responses deviate substantially from the expected answers.

In [17]:
eval_rows = []
for i, session_id in enumerate(simple_agent.sessions):
    session_response = client.agents.session.retrieve(agent_id=simple_agent.agent_id, session_id=session_id)
    for turn in session_response.turns:
        eval_rows.append({
            "input_query": examples[i]["input_query"],
            "expected_answer": examples[i]["expected_answer"],
            "generated_answer": turn.output_message.content,
        })

scoring_params = {
    "braintrust::factuality": None,
}
scoring_response = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions=scoring_params,
)
pprint(scoring_response)
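
The same row-gathering loop is repeated for every agent in this notebook, so it can be factored into a helper. The sketch below is one possible way to do that; `collect_eval_rows` is a hypothetical name, and it relies only on the calls and attributes already used in the cell above (`client.agents.session.retrieve`, `agent.sessions`, `agent.agent_id`, `turn.output_message.content`). The stand-in objects at the bottom are made-up data so the sketch runs without a server.

```python
from types import SimpleNamespace


def collect_eval_rows(client, agent, examples):
    """Pair every turn of each session with the example that produced it.

    Assumes agent.sessions is in the same order as examples, which holds
    when one session is created per example, as in the loops in this notebook.
    """
    rows = []
    for example, session_id in zip(examples, agent.sessions):
        session = client.agents.session.retrieve(
            agent_id=agent.agent_id, session_id=session_id
        )
        for turn in session.turns:
            rows.append({
                "input_query": example["input_query"],
                "expected_answer": example["expected_answer"],
                "generated_answer": turn.output_message.content,
            })
    return rows


# Stand-in objects (hypothetical data) that mimic the shapes used above.
def _fake_retrieve(agent_id, session_id):
    message = SimpleNamespace(content=f"answer from {session_id}")
    return SimpleNamespace(turns=[SimpleNamespace(output_message=message)])


fake_client = SimpleNamespace(
    agents=SimpleNamespace(session=SimpleNamespace(retrieve=_fake_retrieve))
)
fake_agent = SimpleNamespace(agent_id="agent-0", sessions=["s0", "s1"])
fake_examples = [
    {"input_query": "q0", "expected_answer": "a0"},
    {"input_query": "q1", "expected_answer": "a1"},
]
demo_rows = collect_eval_rows(fake_client, fake_agent, fake_examples)
```

With a real client, `eval_rows = collect_eval_rows(client, simple_agent, examples)` would replace the loop above. Note that `zip` stops at the shorter of the two sequences, so a mismatch between sessions and examples is silently truncated rather than raising an error.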

## 2. Search Agent

Now, let's see how we can improve the agent's performance by adding a search tool.

In [18]:
search_agent = Agent(client, 
                     model=MODEL_ID,
                     instructions="You are a helpful assistant that can answer questions about the Torchtune project. You should always use the search tool to answer questions.",
                     tools=["builtin::websearch"])
for example in examples:
    search_session_id = search_agent.create_session(session_name=f"search_session_{uuid.uuid4()}")
    response = search_agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": example["input_query"]
            }
        ],
        session_id=search_session_id,
        stream=False
    )
    rich.print(f"[bold cyan]Question:[/bold cyan] {example['input_query']}")
    rich.print(f"[bold yellow]Agent Answer:[/bold yellow] {response.output_message.content}")

#### 2.1 Evaluate Agent Responses

We can see that with a search tool, the agent performs much better and hallucinates less.

In [19]:
eval_rows = []
for i, session_id in enumerate(search_agent.sessions):
    session_response = client.agents.session.retrieve(agent_id=search_agent.agent_id, session_id=session_id)
    for turn in session_response.turns:
        eval_rows.append({
            "input_query": examples[i]["input_query"],
            "expected_answer": examples[i]["expected_answer"],
            "generated_answer": turn.output_message.content,
        })

scoring_params = {
    "braintrust::factuality": None,
}
scoring_response = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions=scoring_params,
)
pprint(scoring_response)

## 3. RAG Agent

Now, let's see how we can improve the agent's performance by adding a RAG tool that explicitly retrieves information from Torchtune's documentation. 

In [None]:
from llama_stack_client.types import Document
urls = [
    "memory_optimizations.rst",
    "chat.rst",
    "llama3.rst",
    "qat_finetune.rst",
    "lora_finetune.rst",
]
documents = [
    Document(
        document_id=f"num-{i}",
        content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
        mime_type="text/plain",
        metadata={},
    )
    for i, url in enumerate(urls)
]

vector_providers = [
    provider for provider in client.providers.list() if provider.api == "vector_io"
]
selected_vector_provider = vector_providers[0]
vector_db_id = f"test_vector_db_{uuid.uuid4()}"
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model="nomic-embed-text-v1.5",
    embedding_dimension=768,
    provider_id=selected_vector_provider.provider_id,
)

client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=512,
)
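
To build intuition for what `chunk_size_in_tokens=512` means, here is a minimal sketch of fixed-size chunking. It is not the RAG tool's actual implementation: it treats whitespace-separated words as tokens (a real pipeline would use a model tokenizer), and the `overlap` parameter and the `chunk_tokens` name are assumptions added for illustration.

```python
def chunk_tokens(text, chunk_size=512, overlap=0):
    """Split text into chunks of at most chunk_size whitespace tokens.

    Consecutive chunks share `overlap` tokens. Whitespace splitting is a
    stand-in for a real tokenizer, so counts are only approximate.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text, so that an
        # overlapping tail is not emitted as a redundant extra chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Smaller chunks make each retrieved passage more focused but can split a definition from the term it defines; larger chunks keep more context together at the cost of noisier matches.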

In [27]:
rag_agent = Agent(
    client,
    model=MODEL_ID,
    instructions="You are a helpful assistant that can answer questions about the Torchtune project. You should always use the RAG tool to answer questions.",
    tools=[{
        "name": "builtin::rag",
        "args": {"vector_db_ids": [vector_db_id]},
    }],
)

for example in examples:
    rag_session_id = rag_agent.create_session(session_name=f"rag_session_{uuid.uuid4()}")
    response = rag_agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": example["input_query"]
            }
        ],
        session_id=rag_session_id,
        stream=False
    )
    rich.print(f"[bold cyan]Question:[/bold cyan] {example['input_query']}")
    rich.print(f"[bold yellow]Agent Answer:[/bold yellow] {response.output_message.content}")

In [28]:
eval_rows = []
for i, session_id in enumerate(rag_agent.sessions):
    session_response = client.agents.session.retrieve(agent_id=rag_agent.agent_id, session_id=session_id)
    for turn in session_response.turns:
        eval_rows.append({
            "input_query": examples[i]["input_query"],
            "expected_answer": examples[i]["expected_answer"],
            "generated_answer": turn.output_message.content,
        })

scoring_params = {
    "braintrust::factuality": None,
}
scoring_response = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions=scoring_params,
)
pprint(scoring_response)

#### Deep dive into RAG Tool Performance
- Now, let's take a closer look at how the RAG tool is doing, specifically on the second example, where the agent fails to identify what DoRA stands for.
- Notice that the issue lies in the retrieval step: the retrieved document is not relevant to the question.

In [29]:
session_response = client.agents.session.retrieve(agent_id=rag_agent.agent_id, session_id=rag_agent.sessions[1])
pprint(session_response.turns)
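
Reading the printed turns by eye works for one example, but a rough automated check can flag retrieval misses across many questions. The sketch below is an assumption-laden illustration, not part of Llama Stack: `keyword_coverage` is a hypothetical helper that takes retrieved chunk texts as plain strings (copied from the tool response shown in the output above) and reports what fraction of expected keywords appear in them. The sample chunks are made-up strings.

```python
def keyword_coverage(chunks, keywords):
    """Return the fraction of keywords found in at least one chunk.

    Matching is a case-insensitive substring test, which is crude: it
    measures whether the answer could be in the retrieved text, not
    whether the text is actually relevant.
    """
    if not keywords:
        return 0.0
    lowered = [chunk.lower() for chunk in chunks]
    hits = sum(
        1 for kw in keywords if any(kw.lower() in chunk for chunk in lowered)
    )
    return hits / len(keywords)


# Made-up retrieved chunks standing in for a real tool response.
sample_chunks = [
    "LoRA adds low-rank adapters to the attention projections.",
    "QAT simulates quantization during fine-tuning.",
]
dora_coverage = keyword_coverage(sample_chunks, ["DoRA", "Weight-Decomposed"])
```

A coverage of zero for the DoRA question would point to the retrieval step, rather than the model, as the source of the wrong answer.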

### 3.1 Improved RAG with Long Context

- Instead of calling the retrieval tool, we send the documents as attachments to the agent and let it use the entire document context.
- Note that the model can now draw on the full context of the documentation and answers the questions with better factuality.
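
Sending whole documents only works if they fit in the model's context window, so it is worth estimating their size first. The sketch below is a rough heuristic under stated assumptions: the ratio of about 4 characters per token, the `reserved_tokens` budget for instructions and the answer, and both function names are illustrative choices, not values taken from Llama Stack or from the model.

```python
import math


def estimate_tokens(doc_texts, chars_per_token=4.0):
    """Roughly estimate the token count of a list of document strings.

    chars_per_token is a heuristic; use the model's tokenizer for real counts.
    """
    total_chars = sum(len(text) for text in doc_texts)
    return math.ceil(total_chars / chars_per_token)


def fits_in_context(doc_texts, context_window_tokens, reserved_tokens=1024):
    """Check whether the documents plus a reserved budget fit in the window."""
    needed = estimate_tokens(doc_texts) + reserved_tokens
    return needed <= context_window_tokens
```

If the estimate exceeds the window, chunked retrieval as in section 3 remains the fallback.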

In [19]:
urls = [
    "memory_optimizations.rst",
    "chat.rst",
    "llama3.rst",
    "qat_finetune.rst",
    "lora_finetune.rst",
]

attachments = [
    {
        "content": {
            "uri": f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
        },
        "mime_type": "text/plain",
    }
    for url in urls
]

rag_attachment_agent = Agent(
    client,
    model=MODEL_ID,
    instructions="You are a helpful assistant that can answer questions about the Torchtune project. Use context from attached documentation for Torchtune to answer questions.",
)

for example in examples:
    session_id = rag_attachment_agent.create_session(session_name=f"rag_attachment_session_{uuid.uuid4()}")
    response = rag_attachment_agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": example["input_query"]
            }
        ],
        session_id=session_id,
        documents=attachments,
        stream=False
    )
    rich.print(f"[bold cyan]Question:[/bold cyan] {example['input_query']}")
    rich.print(f"[bold yellow]Agent Answer:[/bold yellow] {response.output_message.content}")



In [16]:
eval_rows = []
for i, session_id in enumerate(rag_attachment_agent.sessions):
    session_response = client.agents.session.retrieve(agent_id=rag_attachment_agent.agent_id, session_id=session_id)
    for turn in session_response.turns:
        eval_rows.append({
            "input_query": examples[i]["input_query"],
            "expected_answer": examples[i]["expected_answer"],
            "generated_answer": turn.output_message.content,
        })

scoring_params = {
    "braintrust::factuality": None,
}
scoring_response = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions=scoring_params,
)
pprint(scoring_response)