{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Llama Stack RAG Lifecycle\n", "\n", "In this notebook, we will walk through the lifecycle of building and evaluating a RAG pipeline using Llama Stack. \n", "\n", "**Example: Torchtune Knowledge Agent** \n", "\n", "Throughout this notebook, we will build a knowledge agent that can answer questions about the Torchtune project. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Setup" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Not in Google Colab environment\n" ] } ], "source": [ "from llama_stack_client import LlamaStackClient, Agent\n", "from llama_stack.distribution.library_client import LlamaStackAsLibraryClient\n", "from rich.pretty import pprint\n", "import json\n", "import uuid\n", "from pydantic import BaseModel\n", "import rich\n", "import os\n", "try:\n", "    from google.colab import userdata\n", "    os.environ['FIREWORKS_API_KEY'] = userdata.get('FIREWORKS_API_KEY')\n", "except ImportError:\n", "    print(\"Not in Google Colab environment\")\n", "\n", "# client = LlamaStackAsLibraryClient(\"fireworks\", provider_data = {\"fireworks_api_key\": os.environ['FIREWORKS_API_KEY']})\n", "# _ = client.initialize()\n", "\n", "# Connect to a hosted Llama Stack server; to use the in-process library client instead, uncomment the two lines above and comment out this one\n", "client = LlamaStackClient(base_url=\"http://localhost:8321\")\n", "\n", "MODEL_ID = \"meta-llama/Llama-3.3-70B-Instruct\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Simple Vanilla Agent\n", "\n", "First, we will build a simple vanilla agent without access to an external knowledge base or tools, and check how it performs on a couple of questions. 
\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# First, let's come up with a couple of examples to test the agent\n", "examples = [\n", " {\n", " \"input_query\": \"What precision formats does torchtune support?\",\n", " \"expected_answer\": \"Torchtune supports two data types for precision: fp32 (full-precision) which uses 4 bytes per model and optimizer parameter, and bfloat16 (half-precision) which uses 2 bytes per model and optimizer parameter.\"\n", " },\n", " {\n", " \"input_query\": \"What does DoRA stand for in torchtune?\",\n", " \"expected_answer\": \"Weight-Decomposed Low-Rank Adaptation\"\n", " },\n", " {\n", " \"input_query\": \"How does the CPUOffloadOptimizer reduce GPU memory usage?\",\n", " \"expected_answer\": \"The CPUOffloadOptimizer reduces GPU memory usage by keeping optimizer states on CPU and performing optimizer steps on CPU. It can also optionally offload gradients to CPU by using offload_gradients=True\"\n", " },\n", " {\n", " \"input_query\": \"How do I ensure only LoRA parameters are trainable when fine-tuning?\",\n", " \"expected_answer\": \"You can set only LoRA parameters to trainable using torchtune's utility functions: first fetch all LoRA parameters with lora_params = get_adapter_params(lora_model), then set them as trainable with set_trainable_params(lora_model, lora_params). The LoRA recipe handles this automatically.\"\n", " }\n", "]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Question: What precision formats does torchtune support?\n",
       "\n"
      ],
      "text/plain": [
       "\u001b[1;36mQuestion:\u001b[0m What precision formats does torchtune support?\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Agent Answer: Torchtune supports the following precision formats:\n", "\n", "* Full precision (FP32)\n", "* Mixed precision (FP16)\n", "\n", "It may also support other formats such as INT8 and BF16 in the future, but currently, it primarily focuses on FP32 \n", "and FP16. \n", "\n", "Please note that the specific precision formats supported by Torchtune may change over time, and it's always best \n", "to check the official documentation for the most up-to-date information.\n", "\n" ], "text/plain": [ "\u001b[1;33mAgent Answer:\u001b[0m Torchtune supports the following precision formats:\n", "\n", "* Full precision \u001b[1m(\u001b[0mFP32\u001b[1m)\u001b[0m\n", "* Mixed precision \u001b[1m(\u001b[0mFP16\u001b[1m)\u001b[0m\n", "\n", "It may also support other formats such as INT8 and BF16 in the future, but currently, it primarily focuses on FP32 \n", "and FP16. \n", "\n", "Please note that the specific precision formats supported by Torchtune may change over time, and it's always best \n", "to check the official documentation for the most up-to-date information.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Question: What does DoRA stand for in torchtune?\n",
       "\n"
      ],
      "text/plain": [
       "\u001b[1;36mQuestion:\u001b[0m What does DoRA stand for in torchtune?\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Agent Answer: In the context of the Torchtune project, DoRA stands for \"Decoupled Optimizer for Reparameterized \n", "Architectures\".\n", "\n" ], "text/plain": [ "\u001b[1;33mAgent Answer:\u001b[0m In the context of the Torchtune project, DoRA stands for \u001b[32m\"Decoupled Optimizer for Reparameterized \u001b[0m\n", "\u001b[32mArchitectures\"\u001b[0m.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Question: How does the CPUOffloadOptimizer reduce GPU memory usage?\n",
       "\n"
      ],
      "text/plain": [
       "\u001b[1;36mQuestion:\u001b[0m How does the CPUOffloadOptimizer reduce GPU memory usage?\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Agent Answer: The CPUOffloadOptimizer in the Torchtune project is designed to reduce GPU memory usage by offloading\n", "certain computations from the GPU to the CPU. Here's how it works:\n", "\n", "1. **Identifying offloadable operations**: The optimizer analyzes the computation graph of the model and identifies\n", "operations that can be offloaded from the GPU to the CPU. These operations are typically those that don't require \n", "the massive parallel processing capabilities of the GPU, such as data preprocessing, encoding, or decoding.\n", "2. **Offloading operations to CPU**: The optimizer offloads the identified operations to the CPU, which frees up \n", "GPU memory and reduces the amount of data that needs to be transferred between the GPU and CPU.\n", "3. **Minimizing data transfer**: The optimizer minimizes the amount of data that needs to be transferred between \n", "the GPU and CPU by only transferring the necessary data for the offloaded operations. This reduces the overhead of \n", "data transfer and helps to conserve GPU memory.\n", "4. **Optimizing CPU-GPU synchronization**: The optimizer ensures that the CPU and GPU are properly synchronized, \n", "which helps to prevent unnecessary memory allocations and deallocations on the GPU.\n", "5. 
**Dynamic memory allocation**: The optimizer can dynamically allocate and deallocate memory on the GPU as \n", "needed, which helps to reduce memory fragmentation and waste.\n", "\n", "By offloading computations to the CPU and minimizing data transfer, the CPUOffloadOptimizer can significantly \n", "reduce GPU memory usage, which can lead to:\n", "\n", "* Improved model training and inference performance\n", "* Increased batch sizes and throughput\n", "* Reduced out-of-memory errors\n", "* Better support for larger models and datasets\n", "\n", "Overall, the CPUOffloadOptimizer is a powerful tool for optimizing GPU memory usage in deep learning workloads, and\n", "can help to improve the overall performance and efficiency of the Torchtune project.\n", "\n" ], "text/plain": [ "\u001b[1;33mAgent Answer:\u001b[0m The CPUOffloadOptimizer in the Torchtune project is designed to reduce GPU memory usage by offloading\n", "certain computations from the GPU to the CPU. Here's how it works:\n", "\n", "\u001b[1;36m1\u001b[0m. **Identifying offloadable operations**: The optimizer analyzes the computation graph of the model and identifies\n", "operations that can be offloaded from the GPU to the CPU. These operations are typically those that don't require \n", "the massive parallel processing capabilities of the GPU, such as data preprocessing, encoding, or decoding.\n", "\u001b[1;36m2\u001b[0m. **Offloading operations to CPU**: The optimizer offloads the identified operations to the CPU, which frees up \n", "GPU memory and reduces the amount of data that needs to be transferred between the GPU and CPU.\n", "\u001b[1;36m3\u001b[0m. **Minimizing data transfer**: The optimizer minimizes the amount of data that needs to be transferred between \n", "the GPU and CPU by only transferring the necessary data for the offloaded operations. This reduces the overhead of \n", "data transfer and helps to conserve GPU memory.\n", "\u001b[1;36m4\u001b[0m. 
**Optimizing CPU-GPU synchronization**: The optimizer ensures that the CPU and GPU are properly synchronized, \n", "which helps to prevent unnecessary memory allocations and deallocations on the GPU.\n", "\u001b[1;36m5\u001b[0m. **Dynamic memory allocation**: The optimizer can dynamically allocate and deallocate memory on the GPU as \n", "needed, which helps to reduce memory fragmentation and waste.\n", "\n", "By offloading computations to the CPU and minimizing data transfer, the CPUOffloadOptimizer can significantly \n", "reduce GPU memory usage, which can lead to:\n", "\n", "* Improved model training and inference performance\n", "* Increased batch sizes and throughput\n", "* Reduced out-of-memory errors\n", "* Better support for larger models and datasets\n", "\n", "Overall, the CPUOffloadOptimizer is a powerful tool for optimizing GPU memory usage in deep learning workloads, and\n", "can help to improve the overall performance and efficiency of the Torchtune project.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Question: How do I ensure only LoRA parameters are trainable when fine-tuning?\n",
       "\n"
      ],
      "text/plain": [
       "\u001b[1;36mQuestion:\u001b[0m How do I ensure only LoRA parameters are trainable when fine-tuning?\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Agent Answer: To ensure only LoRA (Low-Rank Adaptation) parameters are trainable when fine-tuning a model with \n", "Torchtune, you can follow these steps:\n", "\n", "1. **Freeze the original model weights**: Before fine-tuning, you need to freeze the original model weights to \n", "prevent them from being updated during the fine-tuning process. You can do this by setting the `requires_grad` \n", "attribute of the model parameters to `False`. This will prevent the original model weights from being updated.\n", "\n", "2. **Create LoRA parameters**: Create LoRA parameters for the layers you want to fine-tune. LoRA parameters are \n", "typically added to the original model weights to adapt the model to the new task.\n", "\n", "3. **Set LoRA parameters as trainable**: Set the LoRA parameters as trainable by setting their `requires_grad` \n", "attribute to `True`. This will allow the LoRA parameters to be updated during the fine-tuning process.\n", "\n", "Here's a sample code snippet to illustrate this:\n", "```python\n", "import torch\n", "import torch.nn as nn\n", "\n", "# Assume 'model' is your pre-trained model\n", "model = ...\n", "\n", "# Freeze the original model weights\n", "for param in model.parameters():\n", " param.requires_grad = False\n", "\n", "# Create LoRA parameters\n", "lora_params = []\n", "for name, module in model.named_modules():\n", " if isinstance(module, nn.Linear): # or any other module you want to fine-tune\n", " lora_param = nn.Parameter(torch.randn(module.weight.shape))\n", " lora_params.append(lora_param)\n", " setattr(model, f\"{name}_lora\", lora_param)\n", "\n", "# Set LoRA parameters as trainable\n", "for param in lora_params:\n", " param.requires_grad = True\n", "\n", "# Fine-tune the model with LoRA parameters\n", "optimizer = torch.optim.Adam(lora_params, lr=1e-4)\n", "```\n", "By following these steps, you can ensure that only the LoRA parameters are trainable during fine-tuning, while \n", "keeping the original model 
weights frozen.\n", "\n" ], "text/plain": [ "\u001b[1;33mAgent Answer:\u001b[0m To ensure only LoRA \u001b[1m(\u001b[0mLow-Rank Adaptation\u001b[1m)\u001b[0m parameters are trainable when fine-tuning a model with \n", "Torchtune, you can follow these steps:\n", "\n", "\u001b[1;36m1\u001b[0m. **Freeze the original model weights**: Before fine-tuning, you need to freeze the original model weights to \n", "prevent them from being updated during the fine-tuning process. You can do this by setting the `requires_grad` \n", "attribute of the model parameters to `\u001b[3;91mFalse\u001b[0m`. This will prevent the original model weights from being updated.\n", "\n", "\u001b[1;36m2\u001b[0m. **Create LoRA parameters**: Create LoRA parameters for the layers you want to fine-tune. LoRA parameters are \n", "typically added to the original model weights to adapt the model to the new task.\n", "\n", "\u001b[1;36m3\u001b[0m. **Set LoRA parameters as trainable**: Set the LoRA parameters as trainable by setting their `requires_grad` \n", "attribute to `\u001b[3;92mTrue\u001b[0m`. 
This will allow the LoRA parameters to be updated during the fine-tuning process.\n", "\n", "Here's a sample code snippet to illustrate this:\n", "```python\n", "import torch\n", "import torch.nn as nn\n", "\n", "# Assume \u001b[32m'model'\u001b[0m is your pre-trained model\n", "model = \u001b[33m...\u001b[0m\n", "\n", "# Freeze the original model weights\n", "for param in \u001b[1;35mmodel.parameters\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m:\n", " param.requires_grad = \u001b[3;91mFalse\u001b[0m\n", "\n", "# Create LoRA parameters\n", "lora_params = \u001b[1m[\u001b[0m\u001b[1m]\u001b[0m\n", "for name, module in \u001b[1;35mmodel.named_modules\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m:\n", " if \u001b[1;35misinstance\u001b[0m\u001b[1m(\u001b[0mmodule, nn.Linear\u001b[1m)\u001b[0m: # or any other module you want to fine-tune\n", " lora_param = \u001b[1;35mnn.Parameter\u001b[0m\u001b[1m(\u001b[0m\u001b[1;35mtorch.randn\u001b[0m\u001b[1m(\u001b[0mmodule.weight.shape\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m\n", " \u001b[1;35mlora_params.append\u001b[0m\u001b[1m(\u001b[0mlora_param\u001b[1m)\u001b[0m\n", " \u001b[1;35msetattr\u001b[0m\u001b[1m(\u001b[0mmodel, f\"\u001b[1m{\u001b[0mname\u001b[1m}\u001b[0m_lora\", lora_param\u001b[1m)\u001b[0m\n", "\n", "# Set LoRA parameters as trainable\n", "for param in lora_params:\n", " param.requires_grad = \u001b[3;92mTrue\u001b[0m\n", "\n", "# Fine-tune the model with LoRA parameters\n", "optimizer = \u001b[1;35mtorch.optim.Adam\u001b[0m\u001b[1m(\u001b[0mlora_params, \u001b[33mlr\u001b[0m=\u001b[1;36m1e\u001b[0m\u001b[1;36m-4\u001b[0m\u001b[1m)\u001b[0m\n", "```\n", "By following these steps, you can ensure that only the LoRA parameters are trainable during fine-tuning, while \n", "keeping the original model weights frozen.\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "simple_agent = Agent(client,\n", " model=MODEL_ID, \n", " instructions=\"You are a helpful assistant that can answer questions 
about the Torchtune project.\")\n", "for example in examples:\n", "    simple_session_id = simple_agent.create_session(session_name=f\"simple_session_{uuid.uuid4()}\")\n", "    response = simple_agent.create_turn(\n", "        messages=[\n", "            {\n", "                \"role\": \"user\",\n", "                \"content\": example[\"input_query\"]\n", "            }\n", "        ],\n", "        session_id=simple_session_id,\n", "        stream=False\n", "    )\n", "    rich.print(f\"[bold cyan]Question:[/bold cyan] {example['input_query']}\")\n", "    rich.print(f\"[bold yellow]Agent Answer:[/bold yellow] {response.output_message.content}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.1 Evaluate Agent Responses\n", "Let's collect the agent's session logs and score each generated answer against the expected answer. The results below show that the vanilla agent performs poorly: the average factuality score is only 0.3, and the first two answers directly contradict the expected ones." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
ScoringScoreResponse(\n", "│ results={\n", "│ │ 'braintrust::factuality': ScoringResult(\n", "│ │ │ aggregated_results={'average': {'average': 0.3}},\n", "│ │ │ score_rows=[\n", "│ │ │ │ {\n", "│ │ │ │ │ 'score': 0.0,\n", "│ │ │ │ │ 'metadata': {\n", "│ │ │ │ │ │ 'choice': 'D',\n", "│ │ │ │ │ │ 'rationale': '1. **Expert Answer**: The expert states that Torchtune supports two precision formats: fp32 (full-precision) and bfloat16 (half-precision).\\n\\n2. **Submitted Answer**: The submission mentions that Torchtune supports full precision (FP32) and mixed precision (FP16). It also speculates about potential future support for other formats like INT8 and BF16, but emphasizes the current focus on FP32 and FP16.\\n\\n3. **Comparison**:\\n - Both answers agree on the support for FP32.\\n - The expert mentions bfloat16 (BF16), while the submission mentions FP16 and speculates about BF16 in the future. This is a key difference as the expert confirms BF16 support, whereas the submission does not.\\n - The submission introduces FP16, which is not mentioned by the expert.\\n - The submission also speculates about future support for INT8 and BF16, which is not addressed by the expert.\\n\\n4. **Conclusion**: There is a disagreement between the submitted answer and the expert answer regarding the precision formats supported by Torchtune. The expert confirms BF16 support, while the submission does not, and instead mentions FP16, which the expert does not confirm. Therefore, the correct choice is (D).'\n", "│ │ │ │ │ }\n", "│ │ │ │ },\n", "│ │ │ │ {\n", "│ │ │ │ │ 'score': 0.0,\n", "│ │ │ │ │ 'metadata': {\n", "│ │ │ │ │ │ 'choice': 'D',\n", "│ │ │ │ │ │ 'rationale': '1. The expert answer states that DoRA stands for \"Weight-Decomposed Low-Rank Adaptation\".\\n2. The submitted answer states that DoRA stands for \"Decoupled Optimizer for Reparameterized Architectures\".\\n3. The two answers provide completely different expansions for the acronym DoRA.\\n4. 
Since the expansions are different, there is a clear disagreement between the submitted answer and the expert answer regarding what DoRA stands for in the context of torchtune.\\n5. Therefore, the correct choice is (D) There is a disagreement between the submitted answer and the expert answer.'\n", "│ │ │ │ │ }\n", "│ │ │ │ },\n", "│ │ │ │ {\n", "│ │ │ │ │ 'score': 0.6,\n", "│ │ │ │ │ 'metadata': {\n", "│ │ │ │ │ │ 'choice': 'B',\n", "│ │ │ │ │ │ 'rationale': '1. The expert answer states that the CPUOffloadOptimizer reduces GPU memory usage by keeping optimizer states on the CPU and performing optimizer steps on the CPU. It also mentions the optional offloading of gradients to the CPU.\\n2. The submitted answer describes a broader mechanism of offloading computations from the GPU to the CPU, including identifying offloadable operations, minimizing data transfer, optimizing CPU-GPU synchronization, and dynamic memory allocation.\\n3. The submitted answer does not explicitly mention keeping optimizer states on the CPU or performing optimizer steps on the CPU, which are key points in the expert answer.\\n4. The submitted answer provides additional details about the process of offloading operations and its benefits, which are not mentioned in the expert answer.\\n5. The submitted answer does not conflict with the expert answer but rather expands on the concept of offloading to the CPU with additional mechanisms and benefits.\\n\\nBased on this analysis, the submitted answer is a superset of the expert answer and is fully consistent with it, as it includes all the information from the expert answer and adds more details.'\n", "│ │ │ │ │ }\n", "│ │ │ │ },\n", "│ │ │ │ {\n", "│ │ │ │ │ 'score': 0.6,\n", "│ │ │ │ │ 'metadata': {\n", "│ │ │ │ │ │ 'choice': 'B',\n", "│ │ │ │ │ │ 'rationale': \"1. **Expert Answer Summary**: The expert answer provides a concise method to ensure only LoRA parameters are trainable by using torchtune's utility functions. 
It mentions fetching LoRA parameters with `get_adapter_params(lora_model)` and setting them as trainable with `set_trainable_params(lora_model, lora_params)`. It also notes that the LoRA recipe handles this automatically.\\n\\n2. **Submitted Answer Summary**: The submitted answer provides a more detailed explanation, including steps to freeze the original model weights, create LoRA parameters, and set them as trainable. It includes a code snippet demonstrating these steps, using PyTorch to manually set `requires_grad` attributes.\\n\\n3. **Comparison**:\\n - Both answers aim to ensure only LoRA parameters are trainable.\\n - The expert answer uses torchtune's utility functions, while the submitted answer provides a manual method using PyTorch.\\n - The submitted answer includes additional steps and a code example, which are not present in the expert answer.\\n\\n4. **Conclusion**: The submitted answer is a superset of the expert answer. It includes all the information from the expert answer (ensuring only LoRA parameters are trainable) and adds more detail on how to achieve this manually. 
There is no conflict between the two answers, as they both achieve the same goal using different methods.\\n\\nTherefore, the correct choice is (B) The submitted answer is a superset of the expert answer and is fully consistent with it.\"\n", "│ │ │ │ │ }\n", "│ │ │ │ }\n", "│ │ │ ]\n", "│ │ )\n", "│ }\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mScoringScoreResponse\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ \u001b[0m\u001b[33mresults\u001b[0m=\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[32m'braintrust::factuality'\u001b[0m: \u001b[1;35mScoringResult\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[33maggregated_results\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'average'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'average'\u001b[0m: \u001b[1;36m0.3\u001b[0m\u001b[1m}\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[33mscore_rows\u001b[0m=\u001b[1m[\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.0\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'choice'\u001b[0m: \u001b[32m'D'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'rationale'\u001b[0m: \u001b[32m'1. **Expert Answer**: The expert states that Torchtune supports two precision formats: fp32 \u001b[0m\u001b[32m(\u001b[0m\u001b[32mfull-precision\u001b[0m\u001b[32m)\u001b[0m\u001b[32m and bfloat16 \u001b[0m\u001b[32m(\u001b[0m\u001b[32mhalf-precision\u001b[0m\u001b[32m)\u001b[0m\u001b[32m.\\n\\n2. **Submitted Answer**: The submission mentions that Torchtune supports full precision \u001b[0m\u001b[32m(\u001b[0m\u001b[32mFP32\u001b[0m\u001b[32m)\u001b[0m\u001b[32m and mixed precision \u001b[0m\u001b[32m(\u001b[0m\u001b[32mFP16\u001b[0m\u001b[32m)\u001b[0m\u001b[32m. 
It also speculates about potential future support for other formats like INT8 and BF16, but emphasizes the current focus on FP32 and FP16.\\n\\n3. **Comparison**:\\n - Both answers agree on the support for FP32.\\n - The expert mentions bfloat16 \u001b[0m\u001b[32m(\u001b[0m\u001b[32mBF16\u001b[0m\u001b[32m)\u001b[0m\u001b[32m, while the submission mentions FP16 and speculates about BF16 in the future. This is a key difference as the expert confirms BF16 support, whereas the submission does not.\\n - The submission introduces FP16, which is not mentioned by the expert.\\n - The submission also speculates about future support for INT8 and BF16, which is not addressed by the expert.\\n\\n4. **Conclusion**: There is a disagreement between the submitted answer and the expert answer regarding the precision formats supported by Torchtune. The expert confirms BF16 support, while the submission does not, and instead mentions FP16, which the expert does not confirm. Therefore, the correct choice is \u001b[0m\u001b[32m(\u001b[0m\u001b[32mD\u001b[0m\u001b[32m)\u001b[0m\u001b[32m.'\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.0\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'choice'\u001b[0m: \u001b[32m'D'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'rationale'\u001b[0m: \u001b[32m'1. The expert answer states that DoRA stands for \"Weight-Decomposed Low-Rank Adaptation\".\\n2. The submitted answer states that DoRA stands for \"Decoupled Optimizer for Reparameterized Architectures\".\\n3. The two answers provide completely different expansions for the acronym DoRA.\\n4. 
Since the expansions are different, there is a clear disagreement between the submitted answer and the expert answer regarding what DoRA stands for in the context of torchtune.\\n5. Therefore, the correct choice is \u001b[0m\u001b[32m(\u001b[0m\u001b[32mD\u001b[0m\u001b[32m)\u001b[0m\u001b[32m There is a disagreement between the submitted answer and the expert answer.'\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.6\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'choice'\u001b[0m: \u001b[32m'B'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'rationale'\u001b[0m: \u001b[32m'1. The expert answer states that the CPUOffloadOptimizer reduces GPU memory usage by keeping optimizer states on the CPU and performing optimizer steps on the CPU. It also mentions the optional offloading of gradients to the CPU.\\n2. The submitted answer describes a broader mechanism of offloading computations from the GPU to the CPU, including identifying offloadable operations, minimizing data transfer, optimizing CPU-GPU synchronization, and dynamic memory allocation.\\n3. The submitted answer does not explicitly mention keeping optimizer states on the CPU or performing optimizer steps on the CPU, which are key points in the expert answer.\\n4. The submitted answer provides additional details about the process of offloading operations and its benefits, which are not mentioned in the expert answer.\\n5. 
The submitted answer does not conflict with the expert answer but rather expands on the concept of offloading to the CPU with additional mechanisms and benefits.\\n\\nBased on this analysis, the submitted answer is a superset of the expert answer and is fully consistent with it, as it includes all the information from the expert answer and adds more details.'\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.6\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'choice'\u001b[0m: \u001b[32m'B'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'rationale'\u001b[0m: \u001b[32m\"1. **Expert Answer Summary**: The expert answer provides a concise method to ensure only LoRA parameters are trainable by using torchtune's utility functions. It mentions fetching LoRA parameters with `get_adapter_params\u001b[0m\u001b[32m(\u001b[0m\u001b[32mlora_model\u001b[0m\u001b[32m)\u001b[0m\u001b[32m` and setting them as trainable with `set_trainable_params\u001b[0m\u001b[32m(\u001b[0m\u001b[32mlora_model, lora_params\u001b[0m\u001b[32m)\u001b[0m\u001b[32m`. It also notes that the LoRA recipe handles this automatically.\\n\\n2. **Submitted Answer Summary**: The submitted answer provides a more detailed explanation, including steps to freeze the original model weights, create LoRA parameters, and set them as trainable. It includes a code snippet demonstrating these steps, using PyTorch to manually set `requires_grad` attributes.\\n\\n3. 
**Comparison**:\\n - Both answers aim to ensure only LoRA parameters are trainable.\\n - The expert answer uses torchtune's utility functions, while the submitted answer provides a manual method using PyTorch.\\n - The submitted answer includes additional steps and a code example, which are not present in the expert answer.\\n\\n4. **Conclusion**: The submitted answer is a superset of the expert answer. It includes all the information from the expert answer \u001b[0m\u001b[32m(\u001b[0m\u001b[32mensuring only LoRA parameters are trainable\u001b[0m\u001b[32m)\u001b[0m\u001b[32m and adds more detail on how to achieve this manually. There is no conflict between the two answers, as they both achieve the same goal using different methods.\\n\\nTherefore, the correct choice is \u001b[0m\u001b[32m(\u001b[0m\u001b[32mB\u001b[0m\u001b[32m)\u001b[0m\u001b[32m The submitted answer is a superset of the expert answer and is fully consistent with it.\"\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[1m]\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m)\u001b[0m\n", "\u001b[2;32m│ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Build one evaluation row per turn, pairing each session with its example\n", "eval_rows = []\n", "for i, session_id in enumerate(simple_agent.sessions):\n", "    session_response = client.agents.session.retrieve(agent_id=simple_agent.agent_id, session_id=session_id)\n", "    for turn in session_response.turns:\n", "        eval_rows.append({\n", "            \"input_query\": examples[i][\"input_query\"],\n", "            \"expected_answer\": examples[i][\"expected_answer\"],\n", "            \"generated_answer\": turn.output_message.content,\n", "        })\n", "\n", "# Score the generated answers against the expected answers for factuality\n", "scoring_params = {\n", "    \"braintrust::factuality\": None,\n", "}\n", "scoring_response = client.scoring.score(\n", "    input_rows=eval_rows,\n", "    scoring_functions=scoring_params,\n", ")\n", "pprint(scoring_response)" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "## 2. Search Agent\n", "\n", "Now, let's see how we can improve the agent's performance by adding a search tool." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Question: What precision formats does torchtune support?\n",
       "\n"
      ],
      "text/plain": [
       "\u001b[1;36mQuestion:\u001b[0m What precision formats does torchtune support?\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Agent Answer: Torchtune supports the following precision formats:\n", "\n", "* bf16 (16-bit floating-point format)\n", "* fp32 (32-bit floating-point format, also known as \"full-precision\")\n", "\n", "It's worth noting that torchtune also provides support for mixed-precision techniques, which allow for the use of \n", "different precision formats for different parts of the model or during different stages of training.\n", "\n" ], "text/plain": [ "\u001b[1;33mAgent Answer:\u001b[0m Torchtune supports the following precision formats:\n", "\n", "* bf16 \u001b[1m(\u001b[0m\u001b[1;36m16\u001b[0m-bit floating-point format\u001b[1m)\u001b[0m\n", "* fp32 \u001b[1m(\u001b[0m\u001b[1;36m32\u001b[0m-bit floating-point format, also known as \u001b[32m\"full-precision\"\u001b[0m\u001b[1m)\u001b[0m\n", "\n", "It's worth noting that torchtune also provides support for mixed-precision techniques, which allow for the use of \n", "different precision formats for different parts of the model or during different stages of training.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Question: What does DoRA stand for in torchtune?\n",
       "\n"
      ],
      "text/plain": [
       "\u001b[1;36mQuestion:\u001b[0m What does DoRA stand for in torchtune?\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Agent Answer: DoRA stands for \"Decoupled Orthogonal Random Adaptation\" in torchtune.\n", "\n" ], "text/plain": [ "\u001b[1;33mAgent Answer:\u001b[0m DoRA stands for \u001b[32m\"Decoupled Orthogonal Random Adaptation\"\u001b[0m in torchtune.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Question: How does the CPUOffloadOptimizer reduce GPU memory usage?\n",
       "\n"
      ],
      "text/plain": [
       "\u001b[1;36mQuestion:\u001b[0m How does the CPUOffloadOptimizer reduce GPU memory usage?\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Agent Answer: The CPUOffloadOptimizer reduces GPU memory usage by offloading gradients and trainable parameters to \n", "the CPU, allowing for more efficient use of GPU memory. This can be achieved by setting `offload_gradients=True` in\n", "the CPUOffloadOptimizer, which frees gradients once device-to-host transfer finishes. Additionally, using paged \n", "Adam with `optimizer_in_bwd=True` can also help reduce memory usage. However, it's important to note that the \n", "actual memory usage may vary depending on the specific use case and model architecture.\n", "\n" ], "text/plain": [ "\u001b[1;33mAgent Answer:\u001b[0m The CPUOffloadOptimizer reduces GPU memory usage by offloading gradients and trainable parameters to \n", "the CPU, allowing for more efficient use of GPU memory. This can be achieved by setting `\u001b[33moffload_gradients\u001b[0m=\u001b[3;92mTrue\u001b[0m` in\n", "the CPUOffloadOptimizer, which frees gradients once device-to-host transfer finishes. Additionally, using paged \n", "Adam with `\u001b[33moptimizer_in_bwd\u001b[0m=\u001b[3;92mTrue\u001b[0m` can also help reduce memory usage. However, it's important to note that the \n", "actual memory usage may vary depending on the specific use case and model architecture.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Question: How do I ensure only LoRA parameters are trainable when fine-tuning?\n",
       "\n"
      ],
      "text/plain": [
       "\u001b[1;36mQuestion:\u001b[0m How do I ensure only LoRA parameters are trainable when fine-tuning?\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Agent Answer: To ensure only LoRA parameters are trainable when fine-tuning, you can use the `set_trainable_params`\n", "function from the `torchtune.modules.peft.peft_utils` module. This function allows you to specify which parameters \n", "to make trainable, and you can use it to set only the LoRA parameters as trainable.\n", "\n", "Here is an example of how to do this:\n", "```\n", "import torch\n", "from torchtune.models.llama2 import llama2_7b, lora_llama2_7b\n", "from torchtune.modules.peft.peft_utils import get_adapter_params, set_trainable_params\n", "\n", "# Load the model and adapter\n", "model = llama2_7b()\n", "adapter = lora_llama2_7b()\n", "\n", "# Get the adapter parameters\n", "adapter_params = get_adapter_params(adapter)\n", "\n", "# Set only the adapter parameters as trainable\n", "set_trainable_params(model, adapter_params)\n", "```\n", "This code loads the LLaMA-2 model and the LoRA adapter, gets the adapter parameters, and then sets only those \n", "parameters as trainable using the `set_trainable_params` function. This ensures that only the LoRA parameters are \n", "updated during fine-tuning, while the rest of the model remains frozen.\n", "\n" ], "text/plain": [ "\u001b[1;33mAgent Answer:\u001b[0m To ensure only LoRA parameters are trainable when fine-tuning, you can use the `set_trainable_params`\n", "function from the `torchtune.modules.peft.peft_utils` module. 
This function allows you to specify which parameters \n", "to make trainable, and you can use it to set only the LoRA parameters as trainable.\n", "\n", "Here is an example of how to do this:\n", "```\n", "import torch\n", "from torchtune.models.llama2 import llama2_7b, lora_llama2_7b\n", "from torchtune.modules.peft.peft_utils import get_adapter_params, set_trainable_params\n", "\n", "# Load the model and adapter\n", "model = \u001b[1;35mllama2_7b\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\n", "adapter = \u001b[1;35mlora_llama2_7b\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\n", "\n", "# Get the adapter parameters\n", "adapter_params = \u001b[1;35mget_adapter_params\u001b[0m\u001b[1m(\u001b[0madapter\u001b[1m)\u001b[0m\n", "\n", "# Set only the adapter parameters as trainable\n", "\u001b[1;35mset_trainable_params\u001b[0m\u001b[1m(\u001b[0mmodel, adapter_params\u001b[1m)\u001b[0m\n", "```\n", "This code loads the LLaMA-\u001b[1;36m2\u001b[0m model and the LoRA adapter, gets the adapter parameters, and then sets only those \n", "parameters as trainable using the `set_trainable_params` function. This ensures that only the LoRA parameters are \n", "updated during fine-tuning, while the rest of the model remains frozen.\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "search_agent = Agent(client, \n", " model=MODEL_ID,\n", " instructions=\"You are a helpful assistant that can answer questions about the Torchtune project. 
You should always use the search tool to answer questions.\",\n", " tools=[\"builtin::websearch\"])\n", "for example in examples:\n", "    search_session_id = search_agent.create_session(session_name=f\"search_session_{uuid.uuid4()}\")\n", "    response = search_agent.create_turn(\n", "        messages=[\n", "            {\n", "                \"role\": \"user\",\n", "                \"content\": example[\"input_query\"]\n", "            }\n", "        ],\n", "        session_id=search_session_id,\n", "        stream=False\n", "    )\n", "    rich.print(f\"[bold cyan]Question:[/bold cyan] {example['input_query']}\")\n", "    rich.print(f\"[bold yellow]Agent Answer:[/bold yellow] {response.output_message.content}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1 Evaluate Agent Responses\n", "\n", "We can see that with a search tool, the agent's answers are better grounded and contain fewer hallucinations, although it still gets the DoRA expansion wrong (score 0.0). " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
ScoringScoreResponse(\n", "│ results={\n", "│ │ 'braintrust::factuality': ScoringResult(\n", "│ │ │ aggregated_results={'average': {'average': 0.44999999999999996}},\n", "│ │ │ score_rows=[\n", "│ │ │ │ {\n", "│ │ │ │ │ 'score': 0.6,\n", "│ │ │ │ │ 'metadata': {\n", "│ │ │ │ │ │ 'choice': 'B',\n", "│ │ │ │ │ │ 'rationale': '1. **Expert Answer Details**: The expert answer states that Torchtune supports two precision formats: fp32 (full-precision) and bfloat16 (half-precision).\\n\\n2. **Submitted Answer Details**: The submitted answer mentions two precision formats: bf16 (16-bit floating-point format) and fp32 (32-bit floating-point format, also known as \"full-precision\"). It also adds that Torchtune supports mixed-precision techniques.\\n\\n3. **Comparison of Precision Formats**:\\n - The expert answer uses \"bfloat16\" while the submitted answer uses \"bf16\". These are equivalent terms, as \"bf16\" is a common abbreviation for \"bfloat16\".\\n - Both answers mention \"fp32\" as a supported precision format.\\n\\n4. **Additional Information in Submission**: The submitted answer includes additional information about mixed-precision techniques, which is not mentioned in the expert answer.\\n\\n5. **Consistency Check**: The submitted answer includes all the information from the expert answer and adds more details about mixed-precision techniques. There is no conflict between the two answers.\\n\\nBased on the above analysis, the submitted answer is a superset of the expert answer and is fully consistent with it.'\n", "│ │ │ │ │ }\n", "│ │ │ │ },\n", "│ │ │ │ {\n", "│ │ │ │ │ 'score': 0.0,\n", "│ │ │ │ │ 'metadata': {\n", "│ │ │ │ │ │ 'choice': 'D',\n", "│ │ │ │ │ │ 'rationale': '1. The expert answer states that DoRA stands for \"Weight-Decomposed Low-Rank Adaptation.\"\\n2. The submitted answer states that DoRA stands for \"Decoupled Orthogonal Random Adaptation.\"\\n3. The two answers provide completely different expansions for the acronym DoRA.\\n4. 
Since the expansions are different, there is a clear disagreement between the submitted answer and the expert answer regarding what DoRA stands for in torchtune.\\n5. Therefore, the correct choice is (D) There is a disagreement between the submitted answer and the expert answer.'\n", "│ │ │ │ │ }\n", "│ │ │ │ },\n", "│ │ │ │ {\n", "│ │ │ │ │ 'score': 0.6,\n", "│ │ │ │ │ 'metadata': {\n", "│ │ │ │ │ │ 'choice': 'B',\n", "│ │ │ │ │ │ 'rationale': '1. **Expert Answer Analysis**: The expert answer states that the CPUOffloadOptimizer reduces GPU memory usage by keeping optimizer states on the CPU and performing optimizer steps on the CPU. It also mentions the optional offloading of gradients to the CPU by setting `offload_gradients=True`.\\n\\n2. **Submitted Answer Analysis**: The submitted answer mentions offloading gradients and trainable parameters to the CPU, which allows for more efficient use of GPU memory. It specifies the use of `offload_gradients=True` to free gradients after device-to-host transfer. Additionally, it introduces the concept of using paged Adam with `optimizer_in_bwd=True` to help reduce memory usage. It also notes that actual memory usage may vary depending on the use case and model architecture.\\n\\n3. **Comparison**:\\n - Both answers mention offloading gradients to the CPU using `offload_gradients=True`.\\n - The expert answer focuses on keeping optimizer states and performing optimizer steps on the CPU, while the submitted answer expands on this by mentioning trainable parameters and the use of paged Adam.\\n - The submitted answer provides additional context about memory usage variability and the use of paged Adam, which is not mentioned in the expert answer.\\n\\n4. **Conclusion**: The submitted answer is a superset of the expert answer as it includes all the information from the expert answer and adds more details about trainable parameters, paged Adam, and memory usage variability. 
There is no conflict between the two answers, and the additional information in the submitted answer is consistent with the expert answer.\\n\\nTherefore, the correct choice is (B) The submitted answer is a superset of the expert answer and is fully consistent with it.'\n", "│ │ │ │ │ }\n", "│ │ │ │ },\n", "│ │ │ │ {\n", "│ │ │ │ │ 'score': 0.6,\n", "│ │ │ │ │ 'metadata': {\n", "│ │ │ │ │ │ 'choice': 'B',\n", "│ │ │ │ │ │ 'rationale': \"1. **Expert Answer Analysis**: The expert answer provides a method to ensure only LoRA parameters are trainable by using torchtune's utility functions. It mentions fetching LoRA parameters with `get_adapter_params(lora_model)` and setting them as trainable with `set_trainable_params(lora_model, lora_params)`. It also notes that the LoRA recipe handles this automatically.\\n\\n2. **Submitted Answer Analysis**: The submitted answer provides a detailed example of how to ensure only LoRA parameters are trainable. It uses the `set_trainable_params` function from `torchtune.modules.peft.peft_utils` and provides a code example that includes loading a model and adapter, fetching adapter parameters, and setting them as trainable.\\n\\n3. **Comparison**:\\n - Both answers mention the use of `set_trainable_params` to set LoRA parameters as trainable.\\n - Both answers involve fetching LoRA parameters using a function (`get_adapter_params`).\\n - The submitted answer provides additional context by including a code example and specifying the module path for the functions used.\\n - The expert answer mentions that the LoRA recipe handles this automatically, which is not explicitly stated in the submitted answer.\\n\\n4. **Conclusion**: The submitted answer is a superset of the expert answer. It includes all the information from the expert answer and adds more detail, such as a code example and specific module paths. 
There is no conflict between the two answers, and the additional information in the submitted answer is consistent with the expert answer.\"\n", "│ │ │ │ │ }\n", "│ │ │ │ }\n", "│ │ │ ]\n", "│ │ )\n", "│ }\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mScoringScoreResponse\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ \u001b[0m\u001b[33mresults\u001b[0m=\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[32m'braintrust::factuality'\u001b[0m: \u001b[1;35mScoringResult\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[33maggregated_results\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'average'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'average'\u001b[0m: \u001b[1;36m0.44999999999999996\u001b[0m\u001b[1m}\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[33mscore_rows\u001b[0m=\u001b[1m[\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.6\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'choice'\u001b[0m: \u001b[32m'B'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'rationale'\u001b[0m: \u001b[32m'1. **Expert Answer Details**: The expert answer states that Torchtune supports two precision formats: fp32 \u001b[0m\u001b[32m(\u001b[0m\u001b[32mfull-precision\u001b[0m\u001b[32m)\u001b[0m\u001b[32m and bfloat16 \u001b[0m\u001b[32m(\u001b[0m\u001b[32mhalf-precision\u001b[0m\u001b[32m)\u001b[0m\u001b[32m.\\n\\n2. **Submitted Answer Details**: The submitted answer mentions two precision formats: bf16 \u001b[0m\u001b[32m(\u001b[0m\u001b[32m16-bit floating-point format\u001b[0m\u001b[32m)\u001b[0m\u001b[32m and fp32 \u001b[0m\u001b[32m(\u001b[0m\u001b[32m32-bit floating-point format, also known as \"full-precision\"\u001b[0m\u001b[32m)\u001b[0m\u001b[32m. It also adds that Torchtune supports mixed-precision techniques.\\n\\n3. 
**Comparison of Precision Formats**:\\n - The expert answer uses \"bfloat16\" while the submitted answer uses \"bf16\". These are equivalent terms, as \"bf16\" is a common abbreviation for \"bfloat16\".\\n - Both answers mention \"fp32\" as a supported precision format.\\n\\n4. **Additional Information in Submission**: The submitted answer includes additional information about mixed-precision techniques, which is not mentioned in the expert answer.\\n\\n5. **Consistency Check**: The submitted answer includes all the information from the expert answer and adds more details about mixed-precision techniques. There is no conflict between the two answers.\\n\\nBased on the above analysis, the submitted answer is a superset of the expert answer and is fully consistent with it.'\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.0\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'choice'\u001b[0m: \u001b[32m'D'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'rationale'\u001b[0m: \u001b[32m'1. The expert answer states that DoRA stands for \"Weight-Decomposed Low-Rank Adaptation.\"\\n2. The submitted answer states that DoRA stands for \"Decoupled Orthogonal Random Adaptation.\"\\n3. The two answers provide completely different expansions for the acronym DoRA.\\n4. Since the expansions are different, there is a clear disagreement between the submitted answer and the expert answer regarding what DoRA stands for in torchtune.\\n5. 
Therefore, the correct choice is \u001b[0m\u001b[32m(\u001b[0m\u001b[32mD\u001b[0m\u001b[32m)\u001b[0m\u001b[32m There is a disagreement between the submitted answer and the expert answer.'\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.6\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'choice'\u001b[0m: \u001b[32m'B'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'rationale'\u001b[0m: \u001b[32m'1. **Expert Answer Analysis**: The expert answer states that the CPUOffloadOptimizer reduces GPU memory usage by keeping optimizer states on the CPU and performing optimizer steps on the CPU. It also mentions the optional offloading of gradients to the CPU by setting `\u001b[0m\u001b[32moffload_gradients\u001b[0m\u001b[32m=\u001b[0m\u001b[32mTrue\u001b[0m\u001b[32m`.\\n\\n2. **Submitted Answer Analysis**: The submitted answer mentions offloading gradients and trainable parameters to the CPU, which allows for more efficient use of GPU memory. It specifies the use of `\u001b[0m\u001b[32moffload_gradients\u001b[0m\u001b[32m=\u001b[0m\u001b[32mTrue\u001b[0m\u001b[32m` to free gradients after device-to-host transfer. Additionally, it introduces the concept of using paged Adam with `\u001b[0m\u001b[32moptimizer_in_bwd\u001b[0m\u001b[32m=\u001b[0m\u001b[32mTrue\u001b[0m\u001b[32m` to help reduce memory usage. It also notes that actual memory usage may vary depending on the use case and model architecture.\\n\\n3. 
**Comparison**:\\n - Both answers mention offloading gradients to the CPU using `\u001b[0m\u001b[32moffload_gradients\u001b[0m\u001b[32m=\u001b[0m\u001b[32mTrue\u001b[0m\u001b[32m`.\\n - The expert answer focuses on keeping optimizer states and performing optimizer steps on the CPU, while the submitted answer expands on this by mentioning trainable parameters and the use of paged Adam.\\n - The submitted answer provides additional context about memory usage variability and the use of paged Adam, which is not mentioned in the expert answer.\\n\\n4. **Conclusion**: The submitted answer is a superset of the expert answer as it includes all the information from the expert answer and adds more details about trainable parameters, paged Adam, and memory usage variability. There is no conflict between the two answers, and the additional information in the submitted answer is consistent with the expert answer.\\n\\nTherefore, the correct choice is \u001b[0m\u001b[32m(\u001b[0m\u001b[32mB\u001b[0m\u001b[32m)\u001b[0m\u001b[32m The submitted answer is a superset of the expert answer and is fully consistent with it.'\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.6\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'choice'\u001b[0m: \u001b[32m'B'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'rationale'\u001b[0m: \u001b[32m\"1. **Expert Answer Analysis**: The expert answer provides a method to ensure only LoRA parameters are trainable by using torchtune's utility functions. 
It mentions fetching LoRA parameters with `get_adapter_params\u001b[0m\u001b[32m(\u001b[0m\u001b[32mlora_model\u001b[0m\u001b[32m)\u001b[0m\u001b[32m` and setting them as trainable with `set_trainable_params\u001b[0m\u001b[32m(\u001b[0m\u001b[32mlora_model, lora_params\u001b[0m\u001b[32m)\u001b[0m\u001b[32m`. It also notes that the LoRA recipe handles this automatically.\\n\\n2. **Submitted Answer Analysis**: The submitted answer provides a detailed example of how to ensure only LoRA parameters are trainable. It uses the `set_trainable_params` function from `torchtune.modules.peft.peft_utils` and provides a code example that includes loading a model and adapter, fetching adapter parameters, and setting them as trainable.\\n\\n3. **Comparison**:\\n - Both answers mention the use of `set_trainable_params` to set LoRA parameters as trainable.\\n - Both answers involve fetching LoRA parameters using a function \u001b[0m\u001b[32m(\u001b[0m\u001b[32m`get_adapter_params`\u001b[0m\u001b[32m)\u001b[0m\u001b[32m.\\n - The submitted answer provides additional context by including a code example and specifying the module path for the functions used.\\n - The expert answer mentions that the LoRA recipe handles this automatically, which is not explicitly stated in the submitted answer.\\n\\n4. **Conclusion**: The submitted answer is a superset of the expert answer. It includes all the information from the expert answer and adds more detail, such as a code example and specific module paths. 
There is no conflict between the two answers, and the additional information in the submitted answer is consistent with the expert answer.\"\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[1m]\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m)\u001b[0m\n", "\u001b[2;32m│ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "eval_rows = []\n", "for i, session_id in enumerate(search_agent.sessions):\n", "    session_response = client.agents.session.retrieve(agent_id=search_agent.agent_id, session_id=session_id)\n", "    for turn in session_response.turns:\n", "        eval_rows.append({\n", "            \"input_query\": examples[i][\"input_query\"],\n", "            \"expected_answer\": examples[i][\"expected_answer\"],\n", "            \"generated_answer\": turn.output_message.content,\n", "        })\n", "\n", "scoring_params = {\n", "    \"braintrust::factuality\": None,\n", "}\n", "scoring_response = client.scoring.score(\n", "    input_rows=eval_rows,\n", "    scoring_functions=scoring_params,\n", ")\n", "pprint(scoring_response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. RAG Agent\n", "\n", "Now, let's see how we can improve the agent's performance by adding a RAG tool that explicitly retrieves information from Torchtune's documentation. 
" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "from llama_stack_client.types import Document\n", "urls = [\n", " \"memory_optimizations.rst\",\n", " \"chat.rst\",\n", " \"llama3.rst\",\n", " \"qat_finetune.rst\",\n", " \"lora_finetune.rst\",\n", "]\n", "documents = [\n", " Document(\n", " document_id=f\"num-{i}\",\n", " content=f\"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}\",\n", " mime_type=\"text/plain\",\n", " metadata={},\n", " )\n", " for i, url in enumerate(urls)\n", "]\n", "\n", "vector_providers = [\n", " provider for provider in client.providers.list() if provider.api == \"vector_io\"\n", "]\n", "selected_vector_provider = vector_providers[0]\n", "vector_db_id = f\"test_vector_db_{uuid.uuid4()}\"\n", "client.vector_dbs.register(\n", " vector_db_id=vector_db_id,\n", " embedding_model=\"all-MiniLM-L6-v2\",\n", " embedding_dimension=384,\n", " provider_id=selected_vector_provider.provider_id,\n", ")\n", "\n", "client.tool_runtime.rag_tool.insert(\n", " documents=documents,\n", " vector_db_id=vector_db_id,\n", " chunk_size_in_tokens=512,\n", ")" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Question: What precision formats does torchtune support?\n",
       "\n"
      ],
      "text/plain": [
       "\u001b[1;36mQuestion:\u001b[0m What precision formats does torchtune support?\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Agent Answer: Torchtune supports the following precision formats:\n", "\n", "* bfloat16 (half-precision)\n", "* fp32 (full-precision)\n", "* int8 (integer 8-bit)\n", "* int4 (integer 4-bit)\n", "\n", "Note that mixed-precision training is not currently supported in torchtune.\n", "\n" ], "text/plain": [ "\u001b[1;33mAgent Answer:\u001b[0m Torchtune supports the following precision formats:\n", "\n", "* bfloat16 \u001b[1m(\u001b[0mhalf-precision\u001b[1m)\u001b[0m\n", "* fp32 \u001b[1m(\u001b[0mfull-precision\u001b[1m)\u001b[0m\n", "* int8 \u001b[1m(\u001b[0minteger \u001b[1;36m8\u001b[0m-bit\u001b[1m)\u001b[0m\n", "* int4 \u001b[1m(\u001b[0minteger \u001b[1;36m4\u001b[0m-bit\u001b[1m)\u001b[0m\n", "\n", "Note that mixed-precision training is not currently supported in torchtune.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Question: What does DoRA stand for in torchtune?\n",
       "\n"
      ],
      "text/plain": [
       "\u001b[1;36mQuestion:\u001b[0m What does DoRA stand for in torchtune?\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Agent Answer: DoRA stands for \"Decoupled Orthogonal Random Axes\" in the context of the Torchtune project.\n", "\n" ], "text/plain": [ "\u001b[1;33mAgent Answer:\u001b[0m DoRA stands for \u001b[32m\"Decoupled Orthogonal Random Axes\"\u001b[0m in the context of the Torchtune project.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Question: How does the CPUOffloadOptimizer reduce GPU memory usage?\n",
       "\n"
      ],
      "text/plain": [
       "\u001b[1;36mQuestion:\u001b[0m How does the CPUOffloadOptimizer reduce GPU memory usage?\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Agent Answer: The CPUOffloadOptimizer reduces GPU memory usage by offloading optimizer states and gradients to CPU,\n",
       "thus reducing the memory usage on the GPU. This is especially useful when training large models or when using \n",
       "stateful optimizers, as it can significantly reduce the memory requirements. However, it may come at the cost of \n",
       "increased CPU RAM usage and potentially slower training speeds.\n",
       "\n"
      ],
      "text/plain": [
       "\u001b[1;33mAgent Answer:\u001b[0m The CPUOffloadOptimizer reduces GPU memory usage by offloading optimizer states and gradients to CPU,\n",
       "thus reducing the memory usage on the GPU. This is especially useful when training large models or when using \n",
       "stateful optimizers, as it can significantly reduce the memory requirements. However, it may come at the cost of \n",
       "increased CPU RAM usage and potentially slower training speeds.\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Question: How do I ensure only LoRA parameters are trainable when fine-tuning?\n",
       "\n"
      ],
      "text/plain": [
       "\u001b[1;36mQuestion:\u001b[0m How do I ensure only LoRA parameters are trainable when fine-tuning?\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Agent Answer: To ensure only LoRA parameters are trainable when fine-tuning, you can use the `get_adapter_params` \n", "and `set_trainable_params` functions from `torchtune.modules.peft.peft_utils`. \n", "\n", "Here is how to do it:\n", "\n", "```python\n", "from torchtune.modules.peft.peft_utils import get_adapter_params, set_trainable_params\n", "\n", "# Fetch all params from the model that are associated with LoRA.\n", "lora_params = get_adapter_params(lora_model)\n", "\n", "# Set requires_grad=True on lora_params, and requires_grad=False on all others.\n", "set_trainable_params(lora_model, lora_params)\n", "```\n", "\n" ], "text/plain": [ "\u001b[1;33mAgent Answer:\u001b[0m To ensure only LoRA parameters are trainable when fine-tuning, you can use the `get_adapter_params` \n", "and `set_trainable_params` functions from `torchtune.modules.peft.peft_utils`. \n", "\n", "Here is how to do it:\n", "\n", "```python\n", "from torchtune.modules.peft.peft_utils import get_adapter_params, set_trainable_params\n", "\n", "# Fetch all params from the model that are associated with LoRA.\n", "lora_params = \u001b[1;35mget_adapter_params\u001b[0m\u001b[1m(\u001b[0mlora_model\u001b[1m)\u001b[0m\n", "\n", "# Set \u001b[33mrequires_grad\u001b[0m=\u001b[3;92mTrue\u001b[0m on lora_params, and \u001b[33mrequires_grad\u001b[0m=\u001b[3;91mFalse\u001b[0m on all others.\n", "\u001b[1;35mset_trainable_params\u001b[0m\u001b[1m(\u001b[0mlora_model, lora_params\u001b[1m)\u001b[0m\n", "```\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "rag_agent = Agent(\n", " client,\n", " model=MODEL_ID,\n", " instructions=\"You are a helpful assistant that can answer questions about the Torchtune project. 
You should always use the RAG tool to answer questions.\",\n", " tools=[{\n", " \"name\": \"builtin::rag\",\n", " \"args\": {\"vector_db_ids\": [vector_db_id]},\n", " }],\n", ")\n", "\n", "for example in examples:\n", " rag_session_id = rag_agent.create_session(session_name=f\"rag_session_{uuid.uuid4()}\")\n", " response = rag_agent.create_turn(\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": example[\"input_query\"]\n", " }\n", " ],\n", " session_id=rag_session_id,\n", " stream=False\n", " )\n", " rich.print(f\"[bold cyan]Question:[/bold cyan] {example['input_query']}\")\n", " rich.print(f\"[bold yellow]Agent Answer:[/bold yellow] {response.output_message.content}\")" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
ScoringScoreResponse(\n", "│ results={\n", "│ │ 'braintrust::factuality': ScoringResult(\n", "│ │ │ aggregated_results={'average': {'average': 0.3}},\n", "│ │ │ score_rows=[\n", "│ │ │ │ {\n", "│ │ │ │ │ 'score': 0.0,\n", "│ │ │ │ │ 'metadata': {\n", "│ │ │ │ │ │ 'choice': 'D',\n", "│ │ │ │ │ │ 'rationale': '1. The expert answer states that Torchtune supports two precision formats: fp32 and bfloat16.\\n2. The submitted answer lists four precision formats: bfloat16, fp32, int8, and int4.\\n3. The submitted answer includes the two formats mentioned by the expert (bfloat16 and fp32), but also adds int8 and int4, which are not mentioned by the expert.\\n4. The submitted answer also states that mixed-precision training is not supported, which is not addressed in the expert answer.\\n5. Since the submitted answer includes additional precision formats (int8 and int4) that are not mentioned by the expert, there is a factual disagreement between the two answers regarding the supported precision formats.\\n6. Therefore, the correct choice is (D) There is a disagreement between the submitted answer and the expert answer.'\n", "│ │ │ │ │ }\n", "│ │ │ │ },\n", "│ │ │ │ {\n", "│ │ │ │ │ 'score': 0.0,\n", "│ │ │ │ │ 'metadata': {\n", "│ │ │ │ │ │ 'choice': 'D',\n", "│ │ │ │ │ │ 'rationale': '1. The expert answer states that DoRA stands for \"Weight-Decomposed Low-Rank Adaptation.\"\\n2. The submitted answer states that DoRA stands for \"Decoupled Orthogonal Random Axes.\"\\n3. The two answers provide completely different expansions for the acronym DoRA.\\n4. Since the expansions are different, there is a clear disagreement between the submitted answer and the expert answer.\\n5. Therefore, the correct choice is (D) There is a disagreement between the submitted answer and the expert answer.'\n", "│ │ │ │ │ }\n", "│ │ │ │ },\n", "│ │ │ │ {\n", "│ │ │ │ │ 'score': 0.6,\n", "│ │ │ │ │ 'metadata': {\n", "│ │ │ │ │ │ 'choice': 'B',\n", "│ │ │ │ │ │ 'rationale': '1. 
The expert answer states that the CPUOffloadOptimizer reduces GPU memory usage by keeping optimizer states on CPU and performing optimizer steps on CPU. It also mentions the optional offloading of gradients to CPU using offload_gradients=True.\\n2. The submitted answer states that the CPUOffloadOptimizer reduces GPU memory usage by offloading optimizer states and gradients to CPU. It also mentions that this is useful for large models or stateful optimizers and notes potential downsides like increased CPU RAM usage and slower training speeds.\\n3. The submitted answer includes all the points mentioned in the expert answer: offloading optimizer states and optionally gradients to CPU.\\n4. Additionally, the submitted answer provides extra context about the usefulness for large models and potential downsides, which are not mentioned in the expert answer.\\n5. There is no factual disagreement between the two answers; the submitted answer simply provides more information.\\n\\nBased on this analysis, the submitted answer is a superset of the expert answer and is fully consistent with it.'\n", "│ │ │ │ │ }\n", "│ │ │ │ },\n", "│ │ │ │ {\n", "│ │ │ │ │ 'score': 0.6,\n", "│ │ │ │ │ 'metadata': {\n", "│ │ │ │ │ │ 'choice': 'B',\n", "│ │ │ │ │ │ 'rationale': \"1. **Identify the core content of both answers:**\\n - The expert answer explains how to set only LoRA parameters as trainable using torchtune's utility functions by fetching all LoRA parameters with `get_adapter_params(lora_model)` and setting them as trainable with `set_trainable_params(lora_model, lora_params)`. It also mentions that the LoRA recipe handles this automatically.\\n - The submitted answer provides a similar explanation, detailing the use of `get_adapter_params` and `set_trainable_params` from `torchtune.modules.peft.peft_utils` to ensure only LoRA parameters are trainable. It includes a code snippet demonstrating the process.\\n\\n2. 
**Compare the factual content:**\\n - Both answers describe the same process of fetching LoRA parameters and setting them as trainable using the same functions.\\n - The submitted answer includes additional details such as the import statement and a code snippet, which are not present in the expert answer.\\n - The expert answer mentions that the LoRA recipe handles this automatically, which is not mentioned in the submission.\\n\\n3. **Determine the relationship between the answers:**\\n - The submitted answer is a superset of the expert answer because it includes all the information provided by the expert and adds more details, such as the import statement and code snippet.\\n - There is no conflict between the two answers; the submission expands on the expert's explanation.\\n\\nBased on this analysis, the submitted answer is a superset of the expert answer and is fully consistent with it.\"\n", "│ │ │ │ │ }\n", "│ │ │ │ }\n", "│ │ │ ]\n", "│ │ )\n", "│ }\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mScoringScoreResponse\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ \u001b[0m\u001b[33mresults\u001b[0m=\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[32m'braintrust::factuality'\u001b[0m: \u001b[1;35mScoringResult\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[33maggregated_results\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'average'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'average'\u001b[0m: \u001b[1;36m0.3\u001b[0m\u001b[1m}\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[33mscore_rows\u001b[0m=\u001b[1m[\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.0\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'choice'\u001b[0m: \u001b[32m'D'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'rationale'\u001b[0m: \u001b[32m'1. 
The expert answer states that Torchtune supports two precision formats: fp32 and bfloat16.\\n2. The submitted answer lists four precision formats: bfloat16, fp32, int8, and int4.\\n3. The submitted answer includes the two formats mentioned by the expert \u001b[0m\u001b[32m(\u001b[0m\u001b[32mbfloat16 and fp32\u001b[0m\u001b[32m)\u001b[0m\u001b[32m, but also adds int8 and int4, which are not mentioned by the expert.\\n4. The submitted answer also states that mixed-precision training is not supported, which is not addressed in the expert answer.\\n5. Since the submitted answer includes additional precision formats \u001b[0m\u001b[32m(\u001b[0m\u001b[32mint8 and int4\u001b[0m\u001b[32m)\u001b[0m\u001b[32m that are not mentioned by the expert, there is a factual disagreement between the two answers regarding the supported precision formats.\\n6. Therefore, the correct choice is \u001b[0m\u001b[32m(\u001b[0m\u001b[32mD\u001b[0m\u001b[32m)\u001b[0m\u001b[32m There is a disagreement between the submitted answer and the expert answer.'\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.0\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'choice'\u001b[0m: \u001b[32m'D'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'rationale'\u001b[0m: \u001b[32m'1. The expert answer states that DoRA stands for \"Weight-Decomposed Low-Rank Adaptation.\"\\n2. The submitted answer states that DoRA stands for \"Decoupled Orthogonal Random Axes.\"\\n3. The two answers provide completely different expansions for the acronym DoRA.\\n4. Since the expansions are different, there is a clear disagreement between the submitted answer and the expert answer.\\n5. 
Therefore, the correct choice is \u001b[0m\u001b[32m(\u001b[0m\u001b[32mD\u001b[0m\u001b[32m)\u001b[0m\u001b[32m There is a disagreement between the submitted answer and the expert answer.'\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.6\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'choice'\u001b[0m: \u001b[32m'B'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'rationale'\u001b[0m: \u001b[32m'1. The expert answer states that the CPUOffloadOptimizer reduces GPU memory usage by keeping optimizer states on CPU and performing optimizer steps on CPU. It also mentions the optional offloading of gradients to CPU using \u001b[0m\u001b[32moffload_gradients\u001b[0m\u001b[32m=\u001b[0m\u001b[32mTrue\u001b[0m\u001b[32m.\\n2. The submitted answer states that the CPUOffloadOptimizer reduces GPU memory usage by offloading optimizer states and gradients to CPU. It also mentions that this is useful for large models or stateful optimizers and notes potential downsides like increased CPU RAM usage and slower training speeds.\\n3. The submitted answer includes all the points mentioned in the expert answer: offloading optimizer states and optionally gradients to CPU.\\n4. Additionally, the submitted answer provides extra context about the usefulness for large models and potential downsides, which are not mentioned in the expert answer.\\n5. 
There is no factual disagreement between the two answers; the submitted answer simply provides more information.\\n\\nBased on this analysis, the submitted answer is a superset of the expert answer and is fully consistent with it.'\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.6\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'choice'\u001b[0m: \u001b[32m'B'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'rationale'\u001b[0m: \u001b[32m\"1. **Identify the core content of both answers:**\\n - The expert answer explains how to set only LoRA parameters as trainable using torchtune's utility functions by fetching all LoRA parameters with `get_adapter_params\u001b[0m\u001b[32m(\u001b[0m\u001b[32mlora_model\u001b[0m\u001b[32m)\u001b[0m\u001b[32m` and setting them as trainable with `set_trainable_params\u001b[0m\u001b[32m(\u001b[0m\u001b[32mlora_model, lora_params\u001b[0m\u001b[32m)\u001b[0m\u001b[32m`. It also mentions that the LoRA recipe handles this automatically.\\n - The submitted answer provides a similar explanation, detailing the use of `get_adapter_params` and `set_trainable_params` from `torchtune.modules.peft.peft_utils` to ensure only LoRA parameters are trainable. It includes a code snippet demonstrating the process.\\n\\n2. 
**Compare the factual content:**\\n - Both answers describe the same process of fetching LoRA parameters and setting them as trainable using the same functions.\\n - The submitted answer includes additional details such as the import statement and a code snippet, which are not present in the expert answer.\\n - The expert answer mentions that the LoRA recipe handles this automatically, which is not mentioned in the submission.\\n\\n3. **Determine the relationship between the answers:**\\n - The submitted answer is a superset of the expert answer because it includes all the information provided by the expert and adds more details, such as the import statement and code snippet.\\n - There is no conflict between the two answers; the submission expands on the expert's explanation.\\n\\nBased on this analysis, the submitted answer is a superset of the expert answer and is fully consistent with it.\"\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[1m]\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m)\u001b[0m\n", "\u001b[2;32m│ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "eval_rows = []\n", "for i, session_id in enumerate(rag_agent.sessions):\n", " session_response = client.agents.session.retrieve(agent_id=rag_agent.agent_id, session_id=session_id)\n", " for turn in session_response.turns:\n", " eval_rows.append({\n", " \"input_query\": examples[i][\"input_query\"],\n", " \"expected_answer\": examples[i][\"expected_answer\"],\n", " \"generated_answer\": turn.output_message.content,\n", " })\n", "\n", "scoring_params = {\n", " \"braintrust::factuality\": None,\n", "}\n", "scoring_response = client.scoring.score(\n", " input_rows=eval_rows,\n", " scoring_functions=scoring_params,\n", ")\n", "pprint(scoring_response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 
Deep dive into RAG Tool Performance\n", "- Now, let's take a closer look at how the RAG tool performs, specifically on the second example, where the agent gives the wrong expansion for DoRA. \n", "- Notice that the issue lies with the retrieval step: none of the retrieved chunks contains what DoRA stands for, so the agent has nothing relevant to ground its answer in. " ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[\n", "│ Turn(\n", "│ │ input_messages=[UserMessage(content='What does DoRA stand for in torchtune?', role='user', context=None)],\n", "│ │ output_message=CompletionMessage(\n", "│ │ │ content='DoRA stands for \"Decoupled Orthogonal Random Axes\" in the context of the Torchtune project.',\n", "│ │ │ role='assistant',\n", "│ │ │ stop_reason='end_of_turn',\n", "│ │ │ tool_calls=[]\n", "│ │ ),\n", "│ │ session_id='b5b5b9c5-1f14-404a-9677-cdb413b9f328',\n", "│ │ started_at=datetime.datetime(2025, 3, 7, 10, 35, 24, 235903, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=57600))),\n", "│ │ steps=[\n", "│ │ │ InferenceStep(\n", "│ │ │ │ api_model_response=CompletionMessage(\n", "│ │ │ │ │ content='',\n", "│ │ │ │ │ role='assistant',\n", "│ │ │ │ │ stop_reason='end_of_turn',\n", "│ │ │ │ │ tool_calls=[\n", "│ │ │ │ │ │ ToolCall(\n", "│ │ │ │ │ │ │ arguments={'query': 'DoRA meaning in Torchtune'},\n", "│ │ │ │ │ │ │ call_id='c2c088b9-cf2f-41b5-a050-dd5743112f48',\n", "│ │ │ │ │ │ │ tool_name='knowledge_search'\n", "│ │ │ │ │ │ )\n", "│ │ │ │ │ ]\n", "│ │ │ │ ),\n", "│ │ │ │ step_id='27ba55cd-0252-4cff-8141-129b3b8dd021',\n", "│ │ │ │ step_type='inference',\n", "│ │ │ │ turn_id='bb111412-e2e9-40ca-9cd2-87df200807ab',\n", "│ │ │ │ completed_at=datetime.datetime(2025, 3, 7, 10, 35, 26, 226185, tzinfo=TzInfo(-08:00)),\n", "│ │ │ │ started_at=datetime.datetime(2025, 3, 7, 10, 35, 24, 236359, tzinfo=TzInfo(-08:00))\n", "│ │ │ ),\n", "│ │ │ ToolExecutionStep(\n", "│ │ │ │ step_id='e7da6bb1-a704-4a2e-9954-5d54d8a1fc5d',\n", "│ │ │ │ step_type='tool_execution',\n", "│ │ │ │ tool_calls=[\n", "│ │ │ │ │ ToolCall(\n", "│ │ │ │ │ │ arguments={'query': 'DoRA meaning in Torchtune'},\n", "│ │ │ │ │ │ call_id='c2c088b9-cf2f-41b5-a050-dd5743112f48',\n", "│ │ │ │ │ │ tool_name='knowledge_search'\n", "│ │ │ │ │ )\n", "│ │ │ │ ],\n", "│ │ │ │ tool_responses=[\n", "│ │ │ │ │ ToolResponse(\n", "│ │ │ │ │ │ call_id='c2c088b9-cf2f-41b5-a050-dd5743112f48',\n", "│ │ │ │ │ │ content=[\n", 
"│ │ │ │ │ │ │ TextContentItem(\n", "│ │ │ │ │ │ │ │ text='knowledge_search tool found 5 chunks:\\nBEGIN of knowledge_search tool results.\\n',\n", "│ │ │ │ │ │ │ │ type='text'\n", "│ │ │ │ │ │ │ ),\n", "│ │ │ │ │ │ │ TextContentItem(\n", "│ │ │ │ │ │ │ │ text='Result 1:\\nDocument_id:num-0\\nContent: etune\\n:func:`torchtune.models.llama3.llama3_8b` with DoRA, you would use :func:`torchtune.models.llama3.lora_llama3_8b` with ``use_dora=True``:\\n\\n.. code-block:: bash\\n\\n tune run lora_finetune_single_device --config llama3/8B_lora_single_device \\\\\\n model.use_dora=True\\n\\n.. code-block:: yaml\\n\\n model:\\n _component_: torchtune.models.lora_llama3_8b\\n use_dora: True\\n\\nSince DoRA extends LoRA, the parameters for :ref:`customizing LoRA <glossary_lora>` are identical. You can also quantize the base model weights like in :ref:`glossary_qlora` by using ``quantize=True`` to reap\\neven more memory savings!\\n\\n.. code-block:: bash\\n\\n tune run lora_finetune_single_device --config llama3/8B_lora_single_device \\\\\\n model.apply_lora_to_mlp=True \\\\\\n model.lora_attn_modules=[\"q_proj\",\"k_proj\",\"v_proj\"] \\\\\\n model.lora_rank=16 \\\\\\n model.lora_alpha=32 \\\\\\n model.use_dora=True \\\\\\n model.quantize_base=True\\n\\n.. code-block:: yaml\\n\\n model:\\n _component_: torchtune.models.lora_llama3_8b\\n apply_lora_to_mlp: True\\n lora_attn_modules: [\"q_proj\", \"k_proj\", \"v_proj\"]\\n lora_rank: 16\\n lora_alpha: 32\\n use_dora: True\\n quantize_base: True\\n\\n\\n.. note::\\n\\n Under the hood, we\\'ve enabled DoRA by adding the :class:`~torchtune.modules.peft.DoRALinear` module, which we swap\\n out for :class:`~torchtune.modules.peft.LoRALinear` when ``use_dora=True``.\\n\\n.. _glossary_distrib:\\n\\n\\n.. TODO\\n\\n.. Distributed\\n.. -----------\\n\\n.. .. _glossary_fsdp:\\n\\n.. Fully Sharded Data Parallel (FSDP)\\n.. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n\\n.. 
All our ``_distributed`` recipes use `FSDP <https://pytorch.org/docs/stable/fsdp.html>`.\\n.. .. _glossary_fsdp2:\\n\\n',\n", "│ │ │ │ │ │ │ │ type='text'\n", "│ │ │ │ │ │ │ ),\n", "│ │ │ │ │ │ │ TextContentItem(\n", "│ │ │ │ │ │ │ │ text='Result 2:\\nDocument_id:num-1\\nContent: conversational data, :func:`~torchtune.datasets.chat_dataset` seems to be a good fit. For any\\ncustom local dataset we always need to specify ``source``, ``data_files``, and ``split`` for any dataset\\nbuilder in torchtune. For :func:`~torchtune.datasets.chat_dataset`, we additionally need to specify\\n``conversation_column`` and ``conversation_style``. Our data follows the ``\"sharegpt\"`` format, so\\nwe can specify that here. Altogether, our :func:`~torchtune.datasets.chat_dataset` call should\\nlook like so:\\n\\n.. code-block:: python\\n\\n from torchtune.datasets import chat_dataset\\n from torchtune.models.llama3 import llama3_tokenizer\\n\\n tokenizer = llama3_tokenizer(\"/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model\")\\n ds = chat_dataset(\\n tokenizer=tokenizer,\\n source=\"json\",\\n data_files=\"data/my_data.json\",\\n split=\"train\",\\n conversation_column=\"dialogue\",\\n conversation_style=\"sharegpt\",\\n )\\n\\n.. code-block:: yaml\\n\\n # In config\\n tokenizer:\\n _component_: torchtune.models.llama3.llama3_tokenizer\\n path: /tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model\\n\\n dataset:\\n _component_: torchtune.datasets.chat_dataset\\n source: json\\n data_files: data/my_data.json\\n split: train\\n conversation_column: dialogue\\n conversation_style: sharegpt\\n\\n.. note::\\n You can pass in any keyword argument for `load_dataset <https://huggingface.co/docs/datasets/v2.20.0/en/package_reference/loading_methods#datasets.load_dataset>`_ into all our\\n Dataset classes and they will honor them. 
This is useful for common parameters\\n such as specifying the data split with :code:`split` or configuration with\\n :code:`name`\\n\\nIf you needed to add a prompt template, you would simply pass it into the tokenizer.\\nSince we\\'re fine-tuning Llama3, the tokenizer will handle all formatting for\\nus and prompt templates are optional. Other models such as Mistral\\'s :class:`~torchtune.models.mistral._tokenizer.MistralTokenizer`,\\nuse a chat template by default (:class:`~torchtune.models.mistral.MistralChatTemplate`) to format\\nall messages according to their `recommendations <https://\\n',\n", "│ │ │ │ │ │ │ │ type='text'\n", "│ │ │ │ │ │ │ ),\n", "│ │ │ │ │ │ │ TextContentItem(\n", "│ │ │ │ │ │ │ │ text=\"Result 3:\\nDocument_id:num-5\\nContent: .. _lora_finetune_label:\\n\\n============================\\nFine-Tuning Llama2 with LoRA\\n============================\\n\\nThis guide will teach you about `LoRA <https://arxiv.org/abs/2106.09685>`_, a parameter-efficient finetuning technique,\\nand show you how you can use torchtune to finetune a Llama2 model with LoRA.\\nIf you already know what LoRA is and want to get straight to running\\nyour own LoRA finetune in torchtune, you can jump to :ref:`LoRA finetuning recipe in torchtune<lora_recipe_label>`.\\n\\n.. grid:: 2\\n\\n .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn\\n\\n * What LoRA is and how it saves memory during finetuning\\n * An overview of LoRA components in torchtune\\n * How to run a LoRA finetune using torchtune\\n * How to experiment with different LoRA configurations\\n\\n .. 
grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites\\n\\n * Be familiar with :ref:`torchtune<overview_label>`\\n * Make sure to :ref:`install torchtune<install_label>`\\n * Make sure you have downloaded the :ref:`Llama2-7B model weights<download_llama_label>`\\n\\nWhat is LoRA?\\n-------------\\n\\n`LoRA <https://arxiv.org/abs/2106.09685>`_ is an adapter-based method for\\nparameter-efficient finetuning that adds trainable low-rank decomposition matrices to different layers of a neural network,\\nthen freezes the network's remaining parameters. LoRA is most commonly applied to\\ntransformer models, in which case it is common to add the low-rank matrices\\nto some of the linear projections in each transformer layer's self-attention.\\n\\n.. note::\\n\\n If you're unfamiliar, check out these references for the `definition of rank <https://en.wikipedia.org/wiki/Rank_(linear_algebra)>`_\\n and discussion of `low-rank approximations <https://en.wikipedia.org/wiki/Low-rank_approximation>`_.\\n\\nBy finetuning with LoRA (as opposed to finetuning all model parameters),\\nyou can expect to see memory savings due to a substantial reduction in the\\nnumber of parameters with gradients. When using an optimizer with momentum,\\nlike `AdamW <https://py\\n\",\n", "│ │ │ │ │ │ │ │ type='text'\n", "│ │ │ │ │ │ │ ),\n", "│ │ │ │ │ │ │ TextContentItem(\n", "│ │ │ │ │ │ │ │ text='Result 4:\\nDocument_id:num-0\\nContent: use the :class:`torch.optim.AdamW` optimizer with ``fused=True`` as the base optimizer. For example, to use this optimizer to offload\\nboth optimizer states and gradients to CPU:\\n\\n.. code-block:: bash\\n\\n tune run <RECIPE> --config <CONFIG> \\\\\\n optimizer=optimizer=torchao.prototype.low_bit_optim.CPUOffloadOptimizer \\\\\\n optimizer.offload_gradients=True \\\\\\n lr=4e-5\\n\\n\\nor by directly :ref:`modifying a config file<config_tutorial_label>`:\\n\\n.. 
code-block:: yaml\\n\\n optimizer:\\n _component_: torchao.prototype.low_bit_optim.CPUOffloadOptimizer\\n offload_gradients: True\\n # additional key-word arguments can be passed to torch.optim.AdamW\\n lr: 4e-5\\n\\nor using it directly in your code, which allows you to change the base optimizer:\\n\\n.. code-block:: python\\n\\n from torchao.prototype.low_bit_optim import CPUOffloadOptimizer\\n from torch.optim import Adam\\n\\n optimizer = CPUOffloadOptimizer(\\n model.parameters(), # your model here\\n Adam,\\n lr=1e-5,\\n fused=True\\n )\\n\\nSome helpful hints from the ``torchao`` `CPUOffloadOptimizer page <https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload>`_:\\n\\n* The CPU optimizer step is often the bottleneck when optimizer CPU offload is used. To minimize the slowdown, it is recommended to (1) use full ``bf16`` training so that parameters, gradients, and optimizer states are in ``bf16``; and (2) give GPU more work per optimizer step to amortize the offloading time (e.g. larger batch size with activation checkpointing, gradient accumulation).\\n* Gradient accumulation should always be set to 1 when ``offload_gradients=True``, as gradients are cleared on GPU every backward pass.\\n* This optimizer works by keeping a copy of parameters and pre-allocating gradient memory on CPU. Therefore, expect your RAM usage to increase by 4x model size.\\n* This optimizer is only supported for single-device recipes. To use CPU-offloading in distributed recipes, use ``fsdp_cpu_offload=True`` instead. See :class:`torch.distributed.fsdp.FullyShardedDataParallel` for more details and `FSDP1 vs FSDP2 <https://github.com/pytorch/torchtitan/blob/main/docs/fsdp\\n',\n", "│ │ │ │ │ │ │ │ type='text'\n", "│ │ │ │ │ │ │ ),\n", "│ │ │ │ │ │ │ TextContentItem(\n", "│ │ │ │ │ │ │ │ text='Result 5:\\nDocument_id:num-5\\nContent: from our Llama2\\nmodel without any wrappers or custom checkpoint conversion logic.\\n\\n.. 
code-block:: python\\n\\n # Assuming that base_model already has the pretrained Llama2 weights,\\n # this will directly load them into your LoRA model without any conversion necessary.\\n lora_model.load_state_dict(base_model.state_dict(), strict=False)\\n\\n.. note::\\n Whenever loading weights with :code:`strict=False`, you should verify that any missing or extra keys in\\n the loaded :code:`state_dict` are as expected. torchtune\\'s LoRA recipes do this by default via\\n :func:`validate_missing_and_unexpected_for_lora() <torchtune.modules.peft.validate_missing_and_unexpected_for_lora>`.\\n\\nOnce we\\'ve loaded the base model weights, we also want to set only LoRA parameters to trainable.\\n\\n.. _setting_trainable_params:\\n\\n.. code-block:: python\\n\\n from torchtune.modules.peft.peft_utils import get_adapter_params, set_trainable_params\\n\\n # Fetch all params from the model that are associated with LoRA.\\n lora_params = get_adapter_params(lora_model)\\n\\n # Set requires_grad=True on lora_params, and requires_grad=False on all others.\\n set_trainable_params(lora_model, lora_params)\\n\\n # Print the total number of parameters\\n total_params = sum([p.numel() for p in lora_model.parameters()])\\n trainable_params = sum([p.numel() for p in lora_model.parameters() if p.requires_grad])\\n print(\\n f\"\"\"\\n {total_params} total params,\\n {trainable_params}\" trainable params,\\n {(100.0 * trainable_params / total_params):.2f}% of all params are trainable.\\n \"\"\"\\n )\\n\\n 6742609920 total params,\\n 4194304 trainable params,\\n 0.06% of all params are trainable.\\n\\n.. note::\\n If you are directly using the LoRA recipe (as detailed :ref:`here<lora_recipe_label>`), you need only pass the\\n relevant checkpoint path. Loading model weights and setting trainable parameters will be taken care\\n of in the recipe.\\n\\n\\n.. 
_lora_recipe_label:\\n\\nLoRA finetuning recipe in torchtune\\n-----------------------------------\\n\\nFinally, we can put it all together and finetune a model using torchtune\\'s `LoRA recipe <https://github.com/pytorch/torchtune/blob/48626d19d2108f92\\n',\n", "│ │ │ │ │ │ │ │ type='text'\n", "│ │ │ │ │ │ │ ),\n", "│ │ │ │ │ │ │ TextContentItem(text='END of knowledge_search tool results.\\n', type='text')\n", "│ │ │ │ │ │ ],\n", "│ │ │ │ │ │ tool_name='knowledge_search',\n", "│ │ │ │ │ │ metadata={'document_ids': ['num-0', 'num-1', 'num-5', 'num-0', 'num-5']}\n", "│ │ │ │ │ )\n", "│ │ │ │ ],\n", "│ │ │ │ turn_id='bb111412-e2e9-40ca-9cd2-87df200807ab',\n", "│ │ │ │ completed_at=datetime.datetime(2025, 3, 7, 10, 35, 26, 339563, tzinfo=TzInfo(-08:00)),\n", "│ │ │ │ started_at=datetime.datetime(2025, 3, 7, 10, 35, 26, 264752, tzinfo=TzInfo(-08:00))\n", "│ │ │ ),\n", "│ │ │ InferenceStep(\n", "│ │ │ │ api_model_response=CompletionMessage(\n", "│ │ │ │ │ content='DoRA stands for \"Decoupled Orthogonal Random Axes\" in the context of the Torchtune project.',\n", "│ │ │ │ │ role='assistant',\n", "│ │ │ │ │ stop_reason='end_of_turn',\n", "│ │ │ │ │ tool_calls=[]\n", "│ │ │ │ ),\n", "│ │ │ │ step_id='400e49e1-f33e-41da-b22a-f1d2338a27c8',\n", "│ │ │ │ step_type='inference',\n", "│ │ │ │ turn_id='bb111412-e2e9-40ca-9cd2-87df200807ab',\n", "│ │ │ │ completed_at=datetime.datetime(2025, 3, 7, 10, 35, 27, 281430, tzinfo=TzInfo(-08:00)),\n", "│ │ │ │ started_at=datetime.datetime(2025, 3, 7, 10, 35, 26, 351029, tzinfo=TzInfo(-08:00))\n", "│ │ │ )\n", "│ │ ],\n", "│ │ turn_id='bb111412-e2e9-40ca-9cd2-87df200807ab',\n", "│ │ completed_at=datetime.datetime(2025, 3, 7, 10, 35, 27, 294253, tzinfo=TzInfo(-08:00)),\n", "│ │ output_attachments=[]\n", "│ )\n", "]\n", "\n" ], "text/plain": [ "\u001b[1m[\u001b[0m\n", "\u001b[2;32m│ \u001b[0m\u001b[1;35mTurn\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ │ 
\u001b[0m\u001b[33minput_messages\u001b[0m=\u001b[1m[\u001b[0m\u001b[1;35mUserMessage\u001b[0m\u001b[1m(\u001b[0m\u001b[33mcontent\u001b[0m=\u001b[32m'What does DoRA stand for in torchtune?'\u001b[0m, \u001b[33mrole\u001b[0m=\u001b[32m'user'\u001b[0m, \u001b[33mcontext\u001b[0m=\u001b[3;35mNone\u001b[0m\u001b[1m)\u001b[0m\u001b[1m]\u001b[0m,\n", "\u001b[2;32m│ │ \u001b[0m\u001b[33moutput_message\u001b[0m=\u001b[1;35mCompletionMessage\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[33mcontent\u001b[0m=\u001b[32m'DoRA stands for \"Decoupled Orthogonal Random Axes\" in the context of the Torchtune project.'\u001b[0m,\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[33mrole\u001b[0m=\u001b[32m'assistant'\u001b[0m,\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[33mstop_reason\u001b[0m=\u001b[32m'end_of_turn'\u001b[0m,\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[33mtool_calls\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m]\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m)\u001b[0m,\n", "\u001b[2;32m│ │ \u001b[0m\u001b[33msession_id\u001b[0m=\u001b[32m'b5b5b9c5-1f14-404a-9677-cdb413b9f328'\u001b[0m,\n", "\u001b[2;32m│ │ \u001b[0m\u001b[33mstarted_at\u001b[0m=\u001b[1;35mdatetime\u001b[0m\u001b[1;35m.datetime\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m2025\u001b[0m, \u001b[1;36m3\u001b[0m, \u001b[1;36m7\u001b[0m, \u001b[1;36m10\u001b[0m, \u001b[1;36m35\u001b[0m, \u001b[1;36m24\u001b[0m, \u001b[1;36m235903\u001b[0m, \u001b[33mtzinfo\u001b[0m=\u001b[1;35mdatetime\u001b[0m\u001b[1;35m.timezone\u001b[0m\u001b[1m(\u001b[0m\u001b[1;35mdatetime.timedelta\u001b[0m\u001b[1m(\u001b[0m\u001b[33mdays\u001b[0m=\u001b[1;36m-1\u001b[0m, \u001b[33mseconds\u001b[0m=\u001b[1;36m57600\u001b[0m\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m,\n", "\u001b[2;32m│ │ \u001b[0m\u001b[33msteps\u001b[0m=\u001b[1m[\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[1;35mInferenceStep\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ │ │ │ 
\u001b[0m\u001b[33mapi_model_response\u001b[0m=\u001b[1;35mCompletionMessage\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[33mcontent\u001b[0m=\u001b[32m''\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[33mrole\u001b[0m=\u001b[32m'assistant'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[33mstop_reason\u001b[0m=\u001b[32m'end_of_turn'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[33mtool_calls\u001b[0m=\u001b[1m[\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[1;35mToolCall\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ │ \u001b[0m\u001b[33marguments\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'query'\u001b[0m: \u001b[32m'DoRA meaning in Torchtune'\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ │ \u001b[0m\u001b[33mcall_id\u001b[0m=\u001b[32m'c2c088b9-cf2f-41b5-a050-dd5743112f48'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ │ \u001b[0m\u001b[33mtool_name\u001b[0m=\u001b[32m'knowledge_search'\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[1m)\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1m]\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m)\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[33mstep_id\u001b[0m=\u001b[32m'27ba55cd-0252-4cff-8141-129b3b8dd021'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[33mstep_type\u001b[0m=\u001b[32m'inference'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[33mturn_id\u001b[0m=\u001b[32m'bb111412-e2e9-40ca-9cd2-87df200807ab'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[33mcompleted_at\u001b[0m=\u001b[1;35mdatetime\u001b[0m\u001b[1;35m.datetime\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m2025\u001b[0m, \u001b[1;36m3\u001b[0m, \u001b[1;36m7\u001b[0m, \u001b[1;36m10\u001b[0m, \u001b[1;36m35\u001b[0m, \u001b[1;36m26\u001b[0m, \u001b[1;36m226185\u001b[0m, \u001b[33mtzinfo\u001b[0m=\u001b[1;35mTzInfo\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m-08\u001b[0m:\u001b[1;36m00\u001b[0m\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m,\n", "\u001b[2;32m│ │ │ │ 
\u001b[0m\u001b[33mstarted_at\u001b[0m=\u001b[1;35mdatetime\u001b[0m\u001b[1;35m.datetime\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m2025\u001b[0m, \u001b[1;36m3\u001b[0m, \u001b[1;36m7\u001b[0m, \u001b[1;36m10\u001b[0m, \u001b[1;36m35\u001b[0m, \u001b[1;36m24\u001b[0m, \u001b[1;36m236359\u001b[0m, \u001b[33mtzinfo\u001b[0m=\u001b[1;35mTzInfo\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m-08\u001b[0m:\u001b[1;36m00\u001b[0m\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[1m)\u001b[0m,\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[1;35mToolExecutionStep\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[33mstep_id\u001b[0m=\u001b[32m'e7da6bb1-a704-4a2e-9954-5d54d8a1fc5d'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[33mstep_type\u001b[0m=\u001b[32m'tool_execution'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[33mtool_calls\u001b[0m=\u001b[1m[\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1;35mToolCall\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[33marguments\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'query'\u001b[0m: \u001b[32m'DoRA meaning in Torchtune'\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[33mcall_id\u001b[0m=\u001b[32m'c2c088b9-cf2f-41b5-a050-dd5743112f48'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[33mtool_name\u001b[0m=\u001b[32m'knowledge_search'\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1m)\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m]\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[33mtool_responses\u001b[0m=\u001b[1m[\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1;35mToolResponse\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[33mcall_id\u001b[0m=\u001b[32m'c2c088b9-cf2f-41b5-a050-dd5743112f48'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[33mcontent\u001b[0m=\u001b[1m[\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ │ \u001b[0m\u001b[1;35mTextContentItem\u001b[0m\u001b[1m(\u001b[0m\n", 
"\u001b[2;32m│ │ │ │ │ │ │ │ \u001b[0m\u001b[33mtext\u001b[0m=\u001b[32m'knowledge_search tool found 5 chunks:\\nBEGIN of knowledge_search tool results.\\n'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ │ │ \u001b[0m\u001b[33mtype\u001b[0m=\u001b[32m'text'\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ │ \u001b[0m\u001b[1m)\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ │ \u001b[0m\u001b[1;35mTextContentItem\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ │ │ \u001b[0m\u001b[33mtext\u001b[0m=\u001b[32m'Result 1:\\nDocument_id:num-0\\nContent: etune\\n:func:`torchtune.models.llama3.llama3_8b` with DoRA, you would use :func:`torchtune.models.llama3.lora_llama3_8b` with ``\u001b[0m\u001b[32muse_dora\u001b[0m\u001b[32m=\u001b[0m\u001b[32mTrue\u001b[0m\u001b[32m``:\\n\\n.. code-block:: bash\\n\\n tune run lora_finetune_single_device --config llama3/8B_lora_single_device \\\\\\n model.\u001b[0m\u001b[32muse_dora\u001b[0m\u001b[32m=\u001b[0m\u001b[32mTrue\u001b[0m\u001b[32m\\n\\n.. code-block:: yaml\\n\\n model:\\n _component_: torchtune.models.lora_llama3_8b\\n use_dora: True\\n\\nSince DoRA extends LoRA, the parameters for :ref:`customizing LoRA \u001b[0m\u001b[32m<\u001b[0m\u001b[32mglossary_lora\u001b[0m\u001b[32m>` are identical. You can also quantize the base model weights like in :ref:`glossary_qlora` by using ``\u001b[0m\u001b[32mquantize\u001b[0m\u001b[32m=\u001b[0m\u001b[32mTrue\u001b[0m\u001b[32m`` to reap\\neven more memory savings!\\n\\n.. 
code-block:: bash\\n\\n tune run lora_finetune_single_device --config llama3/8B_lora_single_device \\\\\\n model.\u001b[0m\u001b[32mapply_lora_to_mlp\u001b[0m\u001b[32m=\u001b[0m\u001b[32mTrue\u001b[0m\u001b[32m \\\\\\n model.\u001b[0m\u001b[32mlora_attn_modules\u001b[0m\u001b[32m=\u001b[0m\u001b[32m[\u001b[0m\u001b[32m\"q_proj\",\"k_proj\",\"v_proj\"\u001b[0m\u001b[32m]\u001b[0m\u001b[32m \\\\\\n model.\u001b[0m\u001b[32mlora_rank\u001b[0m\u001b[32m=\u001b[0m\u001b[32m16\u001b[0m\u001b[32m \\\\\\n model.\u001b[0m\u001b[32mlora_alpha\u001b[0m\u001b[32m=\u001b[0m\u001b[32m32\u001b[0m\u001b[32m \\\\\\n model.\u001b[0m\u001b[32muse_dora\u001b[0m\u001b[32m=\u001b[0m\u001b[32mTrue\u001b[0m\u001b[32m \\\\\\n model.\u001b[0m\u001b[32mquantize_base\u001b[0m\u001b[32m=\u001b[0m\u001b[32mTrue\u001b[0m\u001b[32m\\n\\n.. code-block:: yaml\\n\\n model:\\n _component_: torchtune.models.lora_llama3_8b\\n apply_lora_to_mlp: True\\n lora_attn_modules: \u001b[0m\u001b[32m[\u001b[0m\u001b[32m\"q_proj\", \"k_proj\", \"v_proj\"\u001b[0m\u001b[32m]\u001b[0m\u001b[32m\\n lora_rank: 16\\n lora_alpha: 32\\n use_dora: True\\n quantize_base: True\\n\\n\\n.. note::\\n\\n Under the hood, we\\'ve enabled DoRA by adding the :class:`~torchtune.modules.peft.DoRALinear` module, which we swap\\n out for :class:`~torchtune.modules.peft.LoRALinear` when ``\u001b[0m\u001b[32muse_dora\u001b[0m\u001b[32m=\u001b[0m\u001b[32mTrue\u001b[0m\u001b[32m``.\\n\\n.. _glossary_distrib:\\n\\n\\n.. TODO\\n\\n.. Distributed\\n.. -----------\\n\\n.. .. _glossary_fsdp:\\n\\n.. Fully Sharded Data Parallel \u001b[0m\u001b[32m(\u001b[0m\u001b[32mFSDP\u001b[0m\u001b[32m)\u001b[0m\u001b[32m\\n.. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n\\n.. All our ``_distributed`` recipes use `FSDP
Agent Answer: Torchtune supports two precision formats: `fp32` (full-precision) and `bfloat16` (half-precision). \n", "The `bfloat16` format uses 2 bytes per model parameter, which is half the memory of `fp32`, and also improves \n", "training speed.\n", "\n" ], "text/plain": [ "\u001b[1;33mAgent Answer:\u001b[0m Torchtune supports two precision formats: `fp32` \u001b[1m(\u001b[0mfull-precision\u001b[1m)\u001b[0m and `bfloat16` \u001b[1m(\u001b[0mhalf-precision\u001b[1m)\u001b[0m. \n", "The `bfloat16` format uses \u001b[1;36m2\u001b[0m bytes per model parameter, which is half the memory of `fp32`, and also improves \n", "training speed.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Question: What does DoRA stand for in torchtune?\n",
       "\n"
      ],
      "text/plain": [
       "\u001b[1;36mQuestion:\u001b[0m What does DoRA stand for in torchtune?\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Agent Answer: DoRA stands for Weight-Decomposed Low-Rank Adaptation. It is a variant of LoRA (Low-Rank Adaptation) \n", "that further decomposes the pre-trained weights into two components: magnitude and direction. The magnitude \n", "component is a scalar vector that adjusts the scale, while the direction component corresponds to the original LoRA\n", "decomposition and updates the orientation of weights. DoRA adds a small overhead to LoRA training due to the \n", "addition of the magnitude parameter, but it has been shown to improve the performance of LoRA, particularly at low \n", "ranks.\n", "\n" ], "text/plain": [ "\u001b[1;33mAgent Answer:\u001b[0m DoRA stands for Weight-Decomposed Low-Rank Adaptation. It is a variant of LoRA \u001b[1m(\u001b[0mLow-Rank Adaptation\u001b[1m)\u001b[0m \n", "that further decomposes the pre-trained weights into two components: magnitude and direction. The magnitude \n", "component is a scalar vector that adjusts the scale, while the direction component corresponds to the original LoRA\n", "decomposition and updates the orientation of weights. DoRA adds a small overhead to LoRA training due to the \n", "addition of the magnitude parameter, but it has been shown to improve the performance of LoRA, particularly at low \n", "ranks.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Question: How does the CPUOffloadOptimizer reduce GPU memory usage?\n",
       "\n"
      ],
      "text/plain": [
       "\u001b[1;36mQuestion:\u001b[0m How does the CPUOffloadOptimizer reduce GPU memory usage?\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Agent Answer: The CPUOffloadOptimizer reduces GPU memory usage by offloading optimizer states and gradients to the \n",
       "CPU, and performing optimizer steps on the CPU. This can significantly reduce GPU memory usage at the cost of CPU \n",
       "RAM and training speed.\n",
       "\n"
      ],
      "text/plain": [
       "\u001b[1;33mAgent Answer:\u001b[0m The CPUOffloadOptimizer reduces GPU memory usage by offloading optimizer states and gradients to the \n",
       "CPU, and performing optimizer steps on the CPU. This can significantly reduce GPU memory usage at the cost of CPU \n",
       "RAM and training speed.\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Question: How do I ensure only LoRA parameters are trainable when fine-tuning?\n",
       "\n"
      ],
      "text/plain": [
       "\u001b[1;36mQuestion:\u001b[0m How do I ensure only LoRA parameters are trainable when fine-tuning?\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Agent Answer: To ensure only LoRA parameters are trainable when fine-tuning, you can use the `set_trainable_params`\n", "function from `torchtune.modules.peft.peft_utils` to set the `requires_grad` attribute of the LoRA parameters to \n", "`True` and the `requires_grad` attribute of the other parameters to `False`.\n", "\n", "Here is an example:\n", "```python\n", "from torchtune.modules.peft.peft_utils import get_adapter_params, set_trainable_params\n", "\n", "# Get the LoRA parameters\n", "lora_params = get_adapter_params(model)\n", "\n", "# Set the LoRA parameters to trainable and the other parameters to non-trainable\n", "set_trainable_params(model, lora_params)\n", "```\n", "This will ensure that only the LoRA parameters are updated during fine-tuning, while the other parameters remain \n", "frozen.\n", "\n", "Alternatively, you can also use the `lora_finetune` recipe in torchtune, which automatically sets the LoRA \n", "parameters to trainable and the other parameters to non-trainable. You can run the recipe using the following \n", "command:\n", "```bash\n", "tune run lora_finetune --config llama2/7B_lora\n", "```\n", "This will fine-tune the LoRA parameters of the Llama2 model using the default settings. 
You can modify the config \n", "file to change the hyperparameters or the model architecture.\n", "\n" ], "text/plain": [ "\u001b[1;33mAgent Answer:\u001b[0m To ensure only LoRA parameters are trainable when fine-tuning, you can use the `set_trainable_params`\n", "function from `torchtune.modules.peft.peft_utils` to set the `requires_grad` attribute of the LoRA parameters to \n", "`\u001b[3;92mTrue\u001b[0m` and the `requires_grad` attribute of the other parameters to `\u001b[3;91mFalse\u001b[0m`.\n", "\n", "Here is an example:\n", "```python\n", "from torchtune.modules.peft.peft_utils import get_adapter_params, set_trainable_params\n", "\n", "# Get the LoRA parameters\n", "lora_params = \u001b[1;35mget_adapter_params\u001b[0m\u001b[1m(\u001b[0mmodel\u001b[1m)\u001b[0m\n", "\n", "# Set the LoRA parameters to trainable and the other parameters to non-trainable\n", "\u001b[1;35mset_trainable_params\u001b[0m\u001b[1m(\u001b[0mmodel, lora_params\u001b[1m)\u001b[0m\n", "```\n", "This will ensure that only the LoRA parameters are updated during fine-tuning, while the other parameters remain \n", "frozen.\n", "\n", "Alternatively, you can also use the `lora_finetune` recipe in torchtune, which automatically sets the LoRA \n", "parameters to trainable and the other parameters to non-trainable. You can run the recipe using the following \n", "command:\n", "```bash\n", "tune run lora_finetune --config llama2/7B_lora\n", "```\n", "This will fine-tune the LoRA parameters of the Llama2 model using the default settings. 
You can modify the config \n", "file to change the hyperparameters or the model architecture.\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Torchtune tutorial pages to attach as context for every turn\n", "urls = [\n", "    \"memory_optimizations.rst\",\n", "    \"chat.rst\",\n", "    \"llama3.rst\",\n", "    \"qat_finetune.rst\",\n", "    \"lora_finetune.rst\",\n", "]\n", "\n", "# Build one attachment per tutorial page, referenced by its raw GitHub URL\n", "attachments = [\n", "    {\n", "        \"content\": {\n", "            \"uri\": f\"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}\",\n", "        },\n", "        \"mime_type\": \"text/plain\",\n", "    }\n", "    for url in urls\n", "]\n", "\n", "rag_attachment_agent = Agent(\n", "    client,\n", "    model=MODEL_ID,\n", "    instructions=\"You are a helpful assistant that can answer questions about the Torchtune project. Use context from attached documentation for Torchtune to answer questions.\",\n", ")\n", "\n", "# Ask each question in its own session, passing the attachments as documents\n", "for example in examples:\n", "    session_id = rag_attachment_agent.create_session(session_name=f\"rag_attachment_session_{uuid.uuid4()}\")\n", "    response = rag_attachment_agent.create_turn(\n", "        messages=[\n", "            {\n", "                \"role\": \"user\",\n", "                \"content\": example[\"input_query\"]\n", "            }\n", "        ],\n", "        session_id=session_id,\n", "        documents=attachments,\n", "        stream=False\n", "    )\n", "    rich.print(f\"[bold cyan]Question:[/bold cyan] {example['input_query']}\")\n", "    rich.print(f\"[bold yellow]Agent Answer:[/bold yellow] {response.output_message.content}\")\n", "\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
ScoringScoreResponse(\n", "│ results={\n", "│ │ 'braintrust::factuality': ScoringResult(\n", "│ │ │ aggregated_results={'average': {'average': 0.6}},\n", "│ │ │ score_rows=[\n", "│ │ │ │ {\n", "│ │ │ │ │ 'score': 0.6,\n", "│ │ │ │ │ 'metadata': {\n", "│ │ │ │ │ │ 'choice': 'B',\n", "│ │ │ │ │ │ 'rationale': '1. Both the expert and the submitted answers mention that Torchtune supports two precision formats: `fp32` (full-precision) and `bfloat16` (half-precision).\\n2. The expert answer specifies that `fp32` uses 4 bytes per model and optimizer parameter, while `bfloat16` uses 2 bytes per model and optimizer parameter.\\n3. The submitted answer also mentions that `bfloat16` uses 2 bytes per model parameter, which is consistent with the expert answer.\\n4. The submitted answer adds that `bfloat16` improves training speed, which is additional information not present in the expert answer.\\n5. There is no conflict between the submitted answer and the expert answer; the submitted answer simply provides more information.\\n\\nBased on this analysis, the submitted answer is a superset of the expert answer and is fully consistent with it.'\n", "│ │ │ │ │ }\n", "│ │ │ │ },\n", "│ │ │ │ {\n", "│ │ │ │ │ 'score': 0.6,\n", "│ │ │ │ │ 'metadata': {\n", "│ │ │ │ │ │ 'choice': 'B',\n", "│ │ │ │ │ │ 'rationale': '1. The expert answer provides the definition of DoRA as \"Weight-Decomposed Low-Rank Adaptation.\"\\n2. The submitted answer also states that DoRA stands for \"Weight-Decomposed Low-Rank Adaptation,\" which matches the expert answer.\\n3. The submitted answer includes additional information about DoRA, explaining that it is a variant of LoRA and describing how it decomposes pre-trained weights into magnitude and direction components.\\n4. The submitted answer further explains the role of the magnitude component and the direction component, and mentions the performance improvement and overhead associated with DoRA.\\n5. 
The additional details in the submitted answer do not contradict the expert answer; instead, they expand upon it.\\n6. Therefore, the submitted answer is a superset of the expert answer and is fully consistent with it.'\n", "│ │ │ │ │ }\n", "│ │ │ │ },\n", "│ │ │ │ {\n", "│ │ │ │ │ 'score': 0.6,\n", "│ │ │ │ │ 'metadata': {\n", "│ │ │ │ │ │ 'choice': 'B',\n", "│ │ │ │ │ │ 'rationale': '1. The expert answer states that the CPUOffloadOptimizer reduces GPU memory usage by keeping optimizer states on CPU and performing optimizer steps on CPU. It also mentions the optional offloading of gradients to CPU with the parameter offload_gradients=True.\\n\\n2. The submitted answer states that the CPUOffloadOptimizer reduces GPU memory usage by offloading optimizer states and gradients to the CPU, and performing optimizer steps on the CPU. It adds that this can significantly reduce GPU memory usage at the cost of CPU RAM and training speed.\\n\\n3. Comparing both answers:\\n - Both answers agree on offloading optimizer states to the CPU and performing optimizer steps on the CPU.\\n - Both mention the offloading of gradients to the CPU, but the expert answer specifies it as optional with a parameter, while the submission does not specify this detail.\\n - The submission adds additional information about the trade-off involving CPU RAM and training speed, which is not mentioned in the expert answer.\\n\\n4. The submitted answer includes all the details from the expert answer and adds more information about the trade-offs, making it a superset of the expert answer.\\n\\nTherefore, the correct choice is (B) The submitted answer is a superset of the expert answer and is fully consistent with it.'\n", "│ │ │ │ │ }\n", "│ │ │ │ },\n", "│ │ │ │ {\n", "│ │ │ │ │ 'score': 0.6,\n", "│ │ │ │ │ 'metadata': {\n", "│ │ │ │ │ │ 'choice': 'B',\n", "│ │ │ │ │ │ 'rationale': \"1. 
**Expert Answer Analysis**: The expert answer provides a method to ensure only LoRA parameters are trainable by using torchtune's utility functions. It mentions fetching LoRA parameters with `get_adapter_params(lora_model)` and setting them as trainable with `set_trainable_params(lora_model, lora_params)`. It also notes that the LoRA recipe handles this automatically.\\n\\n2. **Submitted Answer Analysis**: The submitted answer provides a similar method using `set_trainable_params` to set the `requires_grad` attribute of LoRA parameters to `True` and other parameters to `False`. It includes a code example demonstrating this process. Additionally, it mentions using the `lora_finetune` recipe in torchtune, which automatically sets the LoRA parameters to trainable.\\n\\n3. **Comparison**: The submitted answer includes all the details from the expert answer regarding the use of `get_adapter_params` and `set_trainable_params`. It also provides additional information about setting the `requires_grad` attribute and using the `lora_finetune` recipe, which is not mentioned in the expert answer.\\n\\n4. **Conclusion**: The submitted answer is a superset of the expert answer as it contains all the information from the expert answer and additional details. 
There is no conflict between the two answers, and the additional information in the submission is consistent with the expert's explanation.\\n\\nTherefore, the correct choice is (B) The submitted answer is a superset of the expert answer and is fully consistent with it.\"\n", "│ │ │ │ │ }\n", "│ │ │ │ }\n", "│ │ │ ]\n", "│ │ )\n", "│ }\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mScoringScoreResponse\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ \u001b[0m\u001b[33mresults\u001b[0m=\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[32m'braintrust::factuality'\u001b[0m: \u001b[1;35mScoringResult\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[33maggregated_results\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'average'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'average'\u001b[0m: \u001b[1;36m0.6\u001b[0m\u001b[1m}\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[33mscore_rows\u001b[0m=\u001b[1m[\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.6\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'choice'\u001b[0m: \u001b[32m'B'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'rationale'\u001b[0m: \u001b[32m'1. Both the expert and the submitted answers mention that Torchtune supports two precision formats: `fp32` \u001b[0m\u001b[32m(\u001b[0m\u001b[32mfull-precision\u001b[0m\u001b[32m)\u001b[0m\u001b[32m and `bfloat16` \u001b[0m\u001b[32m(\u001b[0m\u001b[32mhalf-precision\u001b[0m\u001b[32m)\u001b[0m\u001b[32m.\\n2. The expert answer specifies that `fp32` uses 4 bytes per model and optimizer parameter, while `bfloat16` uses 2 bytes per model and optimizer parameter.\\n3. The submitted answer also mentions that `bfloat16` uses 2 bytes per model parameter, which is consistent with the expert answer.\\n4. 
The submitted answer adds that `bfloat16` improves training speed, which is additional information not present in the expert answer.\\n5. There is no conflict between the submitted answer and the expert answer; the submitted answer simply provides more information.\\n\\nBased on this analysis, the submitted answer is a superset of the expert answer and is fully consistent with it.'\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.6\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'choice'\u001b[0m: \u001b[32m'B'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'rationale'\u001b[0m: \u001b[32m'1. The expert answer provides the definition of DoRA as \"Weight-Decomposed Low-Rank Adaptation.\"\\n2. The submitted answer also states that DoRA stands for \"Weight-Decomposed Low-Rank Adaptation,\" which matches the expert answer.\\n3. The submitted answer includes additional information about DoRA, explaining that it is a variant of LoRA and describing how it decomposes pre-trained weights into magnitude and direction components.\\n4. The submitted answer further explains the role of the magnitude component and the direction component, and mentions the performance improvement and overhead associated with DoRA.\\n5. The additional details in the submitted answer do not contradict the expert answer; instead, they expand upon it.\\n6. 
Therefore, the submitted answer is a superset of the expert answer and is fully consistent with it.'\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.6\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'choice'\u001b[0m: \u001b[32m'B'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'rationale'\u001b[0m: \u001b[32m'1. The expert answer states that the CPUOffloadOptimizer reduces GPU memory usage by keeping optimizer states on CPU and performing optimizer steps on CPU. It also mentions the optional offloading of gradients to CPU with the parameter \u001b[0m\u001b[32moffload_gradients\u001b[0m\u001b[32m=\u001b[0m\u001b[32mTrue\u001b[0m\u001b[32m.\\n\\n2. The submitted answer states that the CPUOffloadOptimizer reduces GPU memory usage by offloading optimizer states and gradients to the CPU, and performing optimizer steps on the CPU. It adds that this can significantly reduce GPU memory usage at the cost of CPU RAM and training speed.\\n\\n3. Comparing both answers:\\n - Both answers agree on offloading optimizer states to the CPU and performing optimizer steps on the CPU.\\n - Both mention the offloading of gradients to the CPU, but the expert answer specifies it as optional with a parameter, while the submission does not specify this detail.\\n - The submission adds additional information about the trade-off involving CPU RAM and training speed, which is not mentioned in the expert answer.\\n\\n4. 
The submitted answer includes all the details from the expert answer and adds more information about the trade-offs, making it a superset of the expert answer.\\n\\nTherefore, the correct choice is \u001b[0m\u001b[32m(\u001b[0m\u001b[32mB\u001b[0m\u001b[32m)\u001b[0m\u001b[32m The submitted answer is a superset of the expert answer and is fully consistent with it.'\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.6\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[32m'metadata'\u001b[0m: \u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'choice'\u001b[0m: \u001b[32m'B'\u001b[0m,\n", "\u001b[2;32m│ │ │ │ │ │ \u001b[0m\u001b[32m'rationale'\u001b[0m: \u001b[32m\"1. **Expert Answer Analysis**: The expert answer provides a method to ensure only LoRA parameters are trainable by using torchtune's utility functions. It mentions fetching LoRA parameters with `get_adapter_params\u001b[0m\u001b[32m(\u001b[0m\u001b[32mlora_model\u001b[0m\u001b[32m)\u001b[0m\u001b[32m` and setting them as trainable with `set_trainable_params\u001b[0m\u001b[32m(\u001b[0m\u001b[32mlora_model, lora_params\u001b[0m\u001b[32m)\u001b[0m\u001b[32m`. It also notes that the LoRA recipe handles this automatically.\\n\\n2. **Submitted Answer Analysis**: The submitted answer provides a similar method using `set_trainable_params` to set the `requires_grad` attribute of LoRA parameters to `True` and other parameters to `False`. It includes a code example demonstrating this process. Additionally, it mentions using the `lora_finetune` recipe in torchtune, which automatically sets the LoRA parameters to trainable.\\n\\n3. **Comparison**: The submitted answer includes all the details from the expert answer regarding the use of `get_adapter_params` and `set_trainable_params`. 
It also provides additional information about setting the `requires_grad` attribute and using the `lora_finetune` recipe, which is not mentioned in the expert answer.\\n\\n4. **Conclusion**: The submitted answer is a superset of the expert answer as it contains all the information from the expert answer and additional details. There is no conflict between the two answers, and the additional information in the submission is consistent with the expert's explanation.\\n\\nTherefore, the correct choice is \u001b[0m\u001b[32m(\u001b[0m\u001b[32mB\u001b[0m\u001b[32m)\u001b[0m\u001b[32m The submitted answer is a superset of the expert answer and is fully consistent with it.\"\u001b[0m\n", "\u001b[2;32m│ │ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[1m]\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m)\u001b[0m\n", "\u001b[2;32m│ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "eval_rows = []\n", "for i, session_id in enumerate(rag_attachment_agent.sessions):\n", " session_response = client.agents.session.retrieve(agent_id=rag_attachment_agent.agent_id, session_id=session_id)\n", " for turn in session_response.turns:\n", " eval_rows.append({\n", " \"input_query\": examples[i][\"input_query\"],\n", " \"expected_answer\": examples[i][\"expected_answer\"],\n", " \"generated_answer\": turn.output_message.content,\n", " })\n", "\n", "scoring_params = {\n", " \"braintrust::factuality\": None,\n", "}\n", "scoring_response = client.scoring.score(\n", " input_rows=eval_rows,\n", " scoring_functions=scoring_params,\n", ")\n", "pprint(scoring_response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "master", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": 
"text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.16" } }, "nbformat": 4, "nbformat_minor": 2 }