llama-stack/docs/zero_to_hero_guide/05_Memory101.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/zero_to_hero_guide/05_Memory101.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Memory "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Getting Started with Memory API Tutorial 🚀\n",
    "Welcome! This interactive tutorial will guide you through using the Memory API, a powerful tool for document storage and retrieval. Whether you're new to vector databases or an experienced developer, this notebook will help you understand the basics and get up and running quickly.\n",
    "What you'll learn:\n",
    "\n",
    "How to set up and configure the Memory API client\n",
    "Creating and managing memory banks (vector stores)\n",
    "Different ways to insert documents into the system\n",
    "How to perform intelligent queries on your documents\n",
    "\n",
    "Prerequisites:\n",
    "\n",
    "Basic Python knowledge\n",
    "A running instance of the Memory API server (we'll use localhost in \n",
    "this tutorial)\n",
    "\n",
    "Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html).\n",
    "\n",
    "Let's start by installing the required packages:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set up your connection parameters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "HOST = \"localhost\"  # Replace with your host\n",
    "PORT = 5000        # Replace with your port"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install the client library and a helper package for colored output\n",
    "#!pip install llama-stack-client termcolor\n",
    "\n",
    "# 💡 Note: If you're running this in a new environment, you might need to restart\n",
    "# your kernel after installation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. **Initial Setup**\n",
    "\n",
    "First, we'll import the necessary libraries and set up some helper functions. Let's break down what each import does:\n",
    "\n",
    "llama_stack_client: Our main interface to the Memory API\n",
    "base64: Helps us encode files for transmission\n",
    "mimetypes: Determines file types automatically\n",
    "termcolor: Makes our output prettier with colors\n",
    "\n",
    "❓ Question: Why do we need to convert files to data URLs?\n",
    "Answer: Data URLs allow us to embed file contents directly in our requests, making it easier to transmit files to the API without needing separate file uploads."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import base64\n",
    "import json\n",
    "import mimetypes\n",
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "from llama_stack_client import LlamaStackClient\n",
    "from llama_stack_client.types.memory_insert_params import Document\n",
    "from termcolor import cprint\n",
    "\n",
    "# Helper function to convert files to data URLs\n",
    "def data_url_from_file(file_path: str) -> str:\n",
    "    \"\"\"Convert a file to a data URL for API transmission\n",
    "\n",
    "    Args:\n",
    "        file_path (str): Path to the file to convert\n",
    "\n",
    "    Returns:\n",
    "        str: Data URL containing the file's contents\n",
    "\n",
    "    Example:\n",
    "        >>> url = data_url_from_file('example.txt')\n",
    "        >>> print(url[:30])  # Preview the start of the URL\n",
    "        'data:text/plain;base64,SGVsbG8='\n",
    "    \"\"\"\n",
    "    if not os.path.exists(file_path):\n",
    "        raise FileNotFoundError(f\"File not found: {file_path}\")\n",
    "\n",
    "    with open(file_path, \"rb\") as file:\n",
    "        file_content = file.read()\n",
    "\n",
    "    base64_content = base64.b64encode(file_content).decode(\"utf-8\")\n",
    "    mime_type, _ = mimetypes.guess_type(file_path)\n",
    "\n",
    "    data_url = f\"data:{mime_type};base64,{base64_content}\"\n",
    "    return data_url"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. **Initialize Client and Create Memory Bank**\n",
    "\n",
    "Now we'll set up our connection to the Memory API and create our first memory bank. A memory bank is like a specialized database that stores document embeddings for semantic search.\n",
    "❓ Key Concepts:\n",
    "\n",
    "embedding_model: The model used to convert text into vector representations\n",
    "chunk_size: How large each piece of text should be when splitting documents\n",
    "overlap_size: How much overlap between chunks (helps maintain context)\n",
    "\n",
    "✨ Pro Tip: Choose your chunk size based on your use case. Smaller chunks (256-512 tokens) are better for precise retrieval, while larger chunks (1024+ tokens) maintain more context."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Available providers:\n",
      "{'inference': [ProviderInfo(provider_id='meta-reference', provider_type='meta-reference'), ProviderInfo(provider_id='meta1', provider_type='meta-reference')], 'safety': [ProviderInfo(provider_id='meta-reference', provider_type='meta-reference')], 'agents': [ProviderInfo(provider_id='meta-reference', provider_type='meta-reference')], 'memory': [ProviderInfo(provider_id='meta-reference', provider_type='meta-reference')], 'telemetry': [ProviderInfo(provider_id='meta-reference', provider_type='meta-reference')]}\n"
     ]
    }
   ],
   "source": [
    "# Configure connection parameters\n",
    "HOST = \"localhost\"  # Replace with your host if using a remote server\n",
    "PORT = 5000       # Replace with your port if different\n",
    "\n",
    "# Initialize client\n",
    "client = LlamaStackClient(\n",
    "    base_url=f\"http://{HOST}:{PORT}\",\n",
    ")\n",
    "\n",
    "# Let's see what providers are available\n",
    "# Providers determine where and how your data is stored\n",
    "providers = client.providers.list()\n",
    "print(\"Available providers:\")\n",
    "#print(json.dumps(providers, indent=2))\n",
    "print(providers)\n",
    "# Create a memory bank with optimized settings for general use\n",
    "client.memory_banks.register(\n",
    "    memory_bank={\n",
    "        \"identifier\": \"tutorial_bank\",  # A unique name for your memory bank\n",
    "        \"embedding_model\": \"all-MiniLM-L6-v2\",  # A lightweight but effective model\n",
    "        \"chunk_size_in_tokens\": 512,  # Good balance between precision and context\n",
    "        \"overlap_size_in_tokens\": 64,  # Helps maintain context between chunks\n",
    "        \"provider_id\": providers[\"memory\"][0].provider_id,  # Use the first available provider\n",
    "    }\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. **Insert Documents**\n",
    "   \n",
    "The Memory API supports multiple ways to add documents. We'll demonstrate two common approaches:\n",
    "\n",
    "Loading documents from URLs\n",
    "Loading documents from local files\n",
    "\n",
    "❓ Important Concepts:\n",
    "\n",
    "Each document needs a unique document_id\n",
    "Metadata helps organize and filter documents later\n",
    "The API automatically processes and chunks documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Documents inserted successfully!\n"
     ]
    }
   ],
   "source": [
    "# Example URLs to documentation\n",
    "# 💡 Replace these with your own URLs or use the examples\n",
    "urls = [\n",
    "    \"memory_optimizations.rst\",\n",
    "    \"chat.rst\",\n",
    "    \"llama3.rst\",\n",
    "]\n",
    "\n",
    "# Create documents from URLs\n",
    "# We add metadata to help organize our documents\n",
    "url_documents = [\n",
    "    Document(\n",
    "        document_id=f\"url-doc-{i}\",  # Unique ID for each document\n",
    "        content=f\"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}\",\n",
    "        mime_type=\"text/plain\",\n",
    "        metadata={\"source\": \"url\", \"filename\": url},  # Metadata helps with organization\n",
    "    )\n",
    "    for i, url in enumerate(urls)\n",
    "]\n",
    "\n",
    "# Example with local files\n",
    "# 💡 Replace these with your actual files\n",
    "local_files = [\"example.txt\", \"readme.md\"]\n",
    "file_documents = [\n",
    "    Document(\n",
    "        document_id=f\"file-doc-{i}\",\n",
    "        content=data_url_from_file(path),\n",
    "        metadata={\"source\": \"local\", \"filename\": path},\n",
    "    )\n",
    "    for i, path in enumerate(local_files)\n",
    "    if os.path.exists(path)\n",
    "]\n",
    "\n",
    "# Combine all documents\n",
    "all_documents = url_documents + file_documents\n",
    "\n",
    "# Insert documents into memory bank\n",
    "response = client.memory.insert(\n",
    "    bank_id=\"tutorial_bank\",\n",
    "    documents=all_documents,\n",
    ")\n",
    "\n",
    "print(\"Documents inserted successfully!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4. **Query the Memory Bank**\n",
    "   \n",
    "Now for the exciting part - querying our documents! The Memory API uses semantic search to find relevant content based on meaning, not just keywords.\n",
    "❓ Understanding Scores:\n",
    "\n",
    "Generally, scores above 0.7 indicate strong relevance\n",
    "Consider your use case when deciding on score thresholds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Query: How do I use LoRA?\n",
      "--------------------------------------------------\n",
      "\n",
      "Result 1 (Score: 1.322)\n",
      "========================================\n",
      "Chunk(content=\"_peft:\\n\\nParameter Efficient Fine-Tuning (PEFT)\\n--------------------------------------\\n\\n.. _glossary_lora:\\n\\nLow Rank Adaptation (LoRA)\\n^^^^^^^^^^^^^^^^^^^^^^^^^^\\n\\n\\n*What's going on here?*\\n\\nYou can read our tutorial on :ref:`finetuning Llama2 with LoRA<lora_finetune_label>` to understand how LoRA works, and how to use it.\\nSimply stated, LoRA greatly reduces the number of trainable parameters, thus saving significant gradient and optimizer\\nmemory during training.\\n\\n*Sounds great! How do I use it?*\\n\\nYou can finetune using any of our recipes with the ``lora_`` prefix, e.g. :ref:`lora_finetune_single_device<lora_finetune_recipe_label>`. These recipes utilize\\nLoRA-enabled model builders, which we support for all our models, and also use the ``lora_`` prefix, e.g.\\nthe :func:`torchtune.models.llama3.llama3` model has a corresponding :func:`torchtune.models.llama3.lora_llama3`.\\nWe aim to provide a comprehensive set of configurations to allow you to get started with training with LoRA quickly,\\njust specify any config with ``_lora`` in its name, e.g:\\n\\n.. code-block:: bash\\n\\n  tune run lora_finetune_single_device --config llama3/8B_lora_single_device\\n\\n\\nThere are two sets of parameters to customize LoRA to suit your needs. Firstly, the parameters which control\\nwhich linear layers LoRA should be applied to in the model:\\n\\n* ``lora_attn_modules: List[str]`` accepts a list of strings specifying which layers of the model to apply\\n  LoRA to:\\n\\n  * ``q_proj`` applies LoRA to the query projection layer.\\n  * ``k_proj`` applies LoRA to the key projection layer.\\n  * ``v_proj`` applies LoRA to the value projection layer.\\n  * ``output_proj`` applies LoRA to the attention output projection layer.\\n\\n  Whilst adding more layers to be fine-tuned may improve model accuracy,\\n  this will come at the cost of increased memory usage and reduced training speed.\\n\\n* ``apply_lora_to_mlp: Bool`` applies LoRA to the MLP in each transformer layer.\\n* ``apply_lora_to_output: Bool`` applies LoRA to the model's final output projection.\\n  This is usually a projection to vocabulary space (e.g. in language models),\", document_id='url-doc-0', token_count=512)\n",
      "========================================\n",
      "\n",
      "Result 2 (Score: 1.322)\n",
      "========================================\n",
      "Chunk(content=\"_peft:\\n\\nParameter Efficient Fine-Tuning (PEFT)\\n--------------------------------------\\n\\n.. _glossary_lora:\\n\\nLow Rank Adaptation (LoRA)\\n^^^^^^^^^^^^^^^^^^^^^^^^^^\\n\\n\\n*What's going on here?*\\n\\nYou can read our tutorial on :ref:`finetuning Llama2 with LoRA<lora_finetune_label>` to understand how LoRA works, and how to use it.\\nSimply stated, LoRA greatly reduces the number of trainable parameters, thus saving significant gradient and optimizer\\nmemory during training.\\n\\n*Sounds great! How do I use it?*\\n\\nYou can finetune using any of our recipes with the ``lora_`` prefix, e.g. :ref:`lora_finetune_single_device<lora_finetune_recipe_label>`. These recipes utilize\\nLoRA-enabled model builders, which we support for all our models, and also use the ``lora_`` prefix, e.g.\\nthe :func:`torchtune.models.llama3.llama3` model has a corresponding :func:`torchtune.models.llama3.lora_llama3`.\\nWe aim to provide a comprehensive set of configurations to allow you to get started with training with LoRA quickly,\\njust specify any config with ``_lora`` in its name, e.g:\\n\\n.. code-block:: bash\\n\\n  tune run lora_finetune_single_device --config llama3/8B_lora_single_device\\n\\n\\nThere are two sets of parameters to customize LoRA to suit your needs. Firstly, the parameters which control\\nwhich linear layers LoRA should be applied to in the model:\\n\\n* ``lora_attn_modules: List[str]`` accepts a list of strings specifying which layers of the model to apply\\n  LoRA to:\\n\\n  * ``q_proj`` applies LoRA to the query projection layer.\\n  * ``k_proj`` applies LoRA to the key projection layer.\\n  * ``v_proj`` applies LoRA to the value projection layer.\\n  * ``output_proj`` applies LoRA to the attention output projection layer.\\n\\n  Whilst adding more layers to be fine-tuned may improve model accuracy,\\n  this will come at the cost of increased memory usage and reduced training speed.\\n\\n* ``apply_lora_to_mlp: Bool`` applies LoRA to the MLP in each transformer layer.\\n* ``apply_lora_to_output: Bool`` applies LoRA to the model's final output projection.\\n  This is usually a projection to vocabulary space (e.g. in language models),\", document_id='url-doc-0', token_count=512)\n",
      "========================================\n",
      "\n",
      "Result 3 (Score: 1.322)\n",
      "========================================\n",
      "Chunk(content=\"_peft:\\n\\nParameter Efficient Fine-Tuning (PEFT)\\n--------------------------------------\\n\\n.. _glossary_lora:\\n\\nLow Rank Adaptation (LoRA)\\n^^^^^^^^^^^^^^^^^^^^^^^^^^\\n\\n\\n*What's going on here?*\\n\\nYou can read our tutorial on :ref:`finetuning Llama2 with LoRA<lora_finetune_label>` to understand how LoRA works, and how to use it.\\nSimply stated, LoRA greatly reduces the number of trainable parameters, thus saving significant gradient and optimizer\\nmemory during training.\\n\\n*Sounds great! How do I use it?*\\n\\nYou can finetune using any of our recipes with the ``lora_`` prefix, e.g. :ref:`lora_finetune_single_device<lora_finetune_recipe_label>`. These recipes utilize\\nLoRA-enabled model builders, which we support for all our models, and also use the ``lora_`` prefix, e.g.\\nthe :func:`torchtune.models.llama3.llama3` model has a corresponding :func:`torchtune.models.llama3.lora_llama3`.\\nWe aim to provide a comprehensive set of configurations to allow you to get started with training with LoRA quickly,\\njust specify any config with ``_lora`` in its name, e.g:\\n\\n.. code-block:: bash\\n\\n  tune run lora_finetune_single_device --config llama3/8B_lora_single_device\\n\\n\\nThere are two sets of parameters to customize LoRA to suit your needs. Firstly, the parameters which control\\nwhich linear layers LoRA should be applied to in the model:\\n\\n* ``lora_attn_modules: List[str]`` accepts a list of strings specifying which layers of the model to apply\\n  LoRA to:\\n\\n  * ``q_proj`` applies LoRA to the query projection layer.\\n  * ``k_proj`` applies LoRA to the key projection layer.\\n  * ``v_proj`` applies LoRA to the value projection layer.\\n  * ``output_proj`` applies LoRA to the attention output projection layer.\\n\\n  Whilst adding more layers to be fine-tuned may improve model accuracy,\\n  this will come at the cost of increased memory usage and reduced training speed.\\n\\n* ``apply_lora_to_mlp: Bool`` applies LoRA to the MLP in each transformer layer.\\n* ``apply_lora_to_output: Bool`` applies LoRA to the model's final output projection.\\n  This is usually a projection to vocabulary space (e.g. in language models),\", document_id='url-doc-0', token_count=512)\n",
      "========================================\n",
      "\n",
      "Query: Tell me about memory optimizations\n",
      "--------------------------------------------------\n",
      "\n",
      "Result 1 (Score: 1.260)\n",
      "========================================\n",
      "Chunk(content='.. _memory_optimization_overview_label:\\n\\n============================\\nMemory Optimization Overview\\n============================\\n\\n**Author**: `Salman Mohammadi <https://github.com/SalmanMohammadi>`_\\n\\ntorchtune comes with a host of plug-and-play memory optimization components which give you lots of flexibility\\nto ``tune`` our recipes to your hardware. This page provides a brief glossary of these components and how you might use them.\\nTo make things easy, we\\'ve summarized these components in the following table:\\n\\n.. csv-table:: Memory optimization components\\n   :header: \"Component\", \"When to use?\"\\n   :widths: auto\\n\\n   \":ref:`glossary_precision`\", \"You\\'ll usually want to leave this as its default ``bfloat16``. It uses 2 bytes per model parameter instead of 4 bytes when using ``float32``.\"\\n   \":ref:`glossary_act_ckpt`\", \"Use when you\\'re memory constrained and want to use a larger model, batch size or context length. Be aware that it will slow down training speed.\"\\n   \":ref:`glossary_act_off`\", \"Similar to activation checkpointing, this can be used when memory constrained, but may decrease training speed. This **should** be used alongside activation checkpointing.\"\\n   \":ref:`glossary_grad_accm`\", \"Helpful when memory-constrained to simulate larger batch sizes. Not compatible with optimizer in backward. Use it when you can already fit at least one sample without OOMing, but not enough of them.\"\\n   \":ref:`glossary_low_precision_opt`\", \"Use when you want to reduce the size of the optimizer state. This is relevant when training large models and using optimizers with momentum, like Adam. Note that lower precision optimizers may reduce training stability/accuracy.\"\\n   \":ref:`glossary_opt_in_bwd`\", \"Use it when you have large gradients and can fit a large enough batch size, since this is not compatible with ``gradient_accumulation_steps``.\"\\n   \":ref:`glossary_cpu_offload`\", \"Offloads optimizer states and (optionally) gradients to CPU, and performs optimizer steps on CPU. This can be used to significantly reduce GPU memory usage at the cost of CPU RAM and training speed. Prioritize using it only if the other techniques are not enough.\"\\n   \":ref:`glossary_lora`\", \"When you want to significantly reduce the number of trainable parameters, saving gradient and optimizer memory', document_id='url-doc-0', token_count=512)\n",
      "========================================\n",
      "\n",
      "Result 2 (Score: 1.260)\n",
      "========================================\n",
      "Chunk(content='.. _memory_optimization_overview_label:\\n\\n============================\\nMemory Optimization Overview\\n============================\\n\\n**Author**: `Salman Mohammadi <https://github.com/SalmanMohammadi>`_\\n\\ntorchtune comes with a host of plug-and-play memory optimization components which give you lots of flexibility\\nto ``tune`` our recipes to your hardware. This page provides a brief glossary of these components and how you might use them.\\nTo make things easy, we\\'ve summarized these components in the following table:\\n\\n.. csv-table:: Memory optimization components\\n   :header: \"Component\", \"When to use?\"\\n   :widths: auto\\n\\n   \":ref:`glossary_precision`\", \"You\\'ll usually want to leave this as its default ``bfloat16``. It uses 2 bytes per model parameter instead of 4 bytes when using ``float32``.\"\\n   \":ref:`glossary_act_ckpt`\", \"Use when you\\'re memory constrained and want to use a larger model, batch size or context length. Be aware that it will slow down training speed.\"\\n   \":ref:`glossary_act_off`\", \"Similar to activation checkpointing, this can be used when memory constrained, but may decrease training speed. This **should** be used alongside activation checkpointing.\"\\n   \":ref:`glossary_grad_accm`\", \"Helpful when memory-constrained to simulate larger batch sizes. Not compatible with optimizer in backward. Use it when you can already fit at least one sample without OOMing, but not enough of them.\"\\n   \":ref:`glossary_low_precision_opt`\", \"Use when you want to reduce the size of the optimizer state. This is relevant when training large models and using optimizers with momentum, like Adam. Note that lower precision optimizers may reduce training stability/accuracy.\"\\n   \":ref:`glossary_opt_in_bwd`\", \"Use it when you have large gradients and can fit a large enough batch size, since this is not compatible with ``gradient_accumulation_steps``.\"\\n   \":ref:`glossary_cpu_offload`\", \"Offloads optimizer states and (optionally) gradients to CPU, and performs optimizer steps on CPU. This can be used to significantly reduce GPU memory usage at the cost of CPU RAM and training speed. Prioritize using it only if the other techniques are not enough.\"\\n   \":ref:`glossary_lora`\", \"When you want to significantly reduce the number of trainable parameters, saving gradient and optimizer memory', document_id='url-doc-0', token_count=512)\n",
      "========================================\n",
      "\n",
      "Result 3 (Score: 1.260)\n",
      "========================================\n",
      "Chunk(content='.. _memory_optimization_overview_label:\\n\\n============================\\nMemory Optimization Overview\\n============================\\n\\n**Author**: `Salman Mohammadi <https://github.com/SalmanMohammadi>`_\\n\\ntorchtune comes with a host of plug-and-play memory optimization components which give you lots of flexibility\\nto ``tune`` our recipes to your hardware. This page provides a brief glossary of these components and how you might use them.\\nTo make things easy, we\\'ve summarized these components in the following table:\\n\\n.. csv-table:: Memory optimization components\\n   :header: \"Component\", \"When to use?\"\\n   :widths: auto\\n\\n   \":ref:`glossary_precision`\", \"You\\'ll usually want to leave this as its default ``bfloat16``. It uses 2 bytes per model parameter instead of 4 bytes when using ``float32``.\"\\n   \":ref:`glossary_act_ckpt`\", \"Use when you\\'re memory constrained and want to use a larger model, batch size or context length. Be aware that it will slow down training speed.\"\\n   \":ref:`glossary_act_off`\", \"Similar to activation checkpointing, this can be used when memory constrained, but may decrease training speed. This **should** be used alongside activation checkpointing.\"\\n   \":ref:`glossary_grad_accm`\", \"Helpful when memory-constrained to simulate larger batch sizes. Not compatible with optimizer in backward. Use it when you can already fit at least one sample without OOMing, but not enough of them.\"\\n   \":ref:`glossary_low_precision_opt`\", \"Use when you want to reduce the size of the optimizer state. This is relevant when training large models and using optimizers with momentum, like Adam. Note that lower precision optimizers may reduce training stability/accuracy.\"\\n   \":ref:`glossary_opt_in_bwd`\", \"Use it when you have large gradients and can fit a large enough batch size, since this is not compatible with ``gradient_accumulation_steps``.\"\\n   \":ref:`glossary_cpu_offload`\", \"Offloads optimizer states and (optionally) gradients to CPU, and performs optimizer steps on CPU. This can be used to significantly reduce GPU memory usage at the cost of CPU RAM and training speed. Prioritize using it only if the other techniques are not enough.\"\\n   \":ref:`glossary_lora`\", \"When you want to significantly reduce the number of trainable parameters, saving gradient and optimizer memory', document_id='url-doc-0', token_count=512)\n",
      "========================================\n",
      "\n",
      "Query: What are the key features of Llama 3?\n",
      "--------------------------------------------------\n",
      "\n",
      "Result 1 (Score: 0.964)\n",
      "========================================\n",
      "Chunk(content=\"8B uses a larger intermediate dimension in its MLP layers than Llama2-7B\\n- Llama3-8B uses a higher base value to calculate theta in its `rotary positional embeddings <https://arxiv.org/abs/2104.09864>`_\\n\\n|\\n\\nGetting access to Llama3-8B-Instruct\\n------------------------------------\\n\\nFor this tutorial, we will be using the instruction-tuned version of Llama3-8B. First, let's download the model from Hugging Face. You will need to follow the instructions\\non the `official Meta page <https://github.com/meta-llama/llama3/blob/main/README.md>`_ to gain access to the model.\\nNext, make sure you grab your Hugging Face token from `here <https://huggingface.co/settings/tokens>`_.\\n\\n\\n.. code-block:: bash\\n\\n    tune download meta-llama/Meta-Llama-3-8B-Instruct \\\\\\n        --output-dir <checkpoint_dir> \\\\\\n        --hf-token <ACCESS TOKEN>\\n\\n|\\n\\nFine-tuning Llama3-8B-Instruct in torchtune\\n-------------------------------------------\\n\\ntorchtune provides `LoRA <https://arxiv.org/abs/2106.09685>`_, `QLoRA <https://arxiv.org/abs/2305.14314>`_, and full fine-tuning\\nrecipes for fine-tuning Llama3-8B on one or more GPUs. For more on LoRA in torchtune, see our :ref:`LoRA Tutorial <lora_finetune_label>`.\\nFor more on QLoRA in torchtune, see our :ref:`QLoRA Tutorial <qlora_finetune_label>`.\\n\\nLet's take a look at how we can fine-tune Llama3-8B-Instruct with LoRA on a single device using torchtune. In this example, we will fine-tune\\nfor one epoch on a common instruct dataset for illustrative purposes. The basic command for a single-device LoRA fine-tune is\\n\\n.. code-block:: bash\\n\\n    tune run lora_finetune_single_device --config llama3/8B_lora_single_device\\n\\n.. note::\\n    To see a full list of recipes and their corresponding configs, simply run ``tune ls`` from the command line.\\n\\nWe can also add :ref:`command-line overrides <cli_override>` as needed, e.g.\\n\\n.. code-block:: bash\\n\\n    tune run lora\", document_id='url-doc-2', token_count=512)\n",
      "========================================\n",
      "\n",
      "Result 2 (Score: 0.964)\n",
      "========================================\n",
      "Chunk(content=\"8B uses a larger intermediate dimension in its MLP layers than Llama2-7B\\n- Llama3-8B uses a higher base value to calculate theta in its `rotary positional embeddings <https://arxiv.org/abs/2104.09864>`_\\n\\n|\\n\\nGetting access to Llama3-8B-Instruct\\n------------------------------------\\n\\nFor this tutorial, we will be using the instruction-tuned version of Llama3-8B. First, let's download the model from Hugging Face. You will need to follow the instructions\\non the `official Meta page <https://github.com/meta-llama/llama3/blob/main/README.md>`_ to gain access to the model.\\nNext, make sure you grab your Hugging Face token from `here <https://huggingface.co/settings/tokens>`_.\\n\\n\\n.. code-block:: bash\\n\\n    tune download meta-llama/Meta-Llama-3-8B-Instruct \\\\\\n        --output-dir <checkpoint_dir> \\\\\\n        --hf-token <ACCESS TOKEN>\\n\\n|\\n\\nFine-tuning Llama3-8B-Instruct in torchtune\\n-------------------------------------------\\n\\ntorchtune provides `LoRA <https://arxiv.org/abs/2106.09685>`_, `QLoRA <https://arxiv.org/abs/2305.14314>`_, and full fine-tuning\\nrecipes for fine-tuning Llama3-8B on one or more GPUs. For more on LoRA in torchtune, see our :ref:`LoRA Tutorial <lora_finetune_label>`.\\nFor more on QLoRA in torchtune, see our :ref:`QLoRA Tutorial <qlora_finetune_label>`.\\n\\nLet's take a look at how we can fine-tune Llama3-8B-Instruct with LoRA on a single device using torchtune. In this example, we will fine-tune\\nfor one epoch on a common instruct dataset for illustrative purposes. The basic command for a single-device LoRA fine-tune is\\n\\n.. code-block:: bash\\n\\n    tune run lora_finetune_single_device --config llama3/8B_lora_single_device\\n\\n.. note::\\n    To see a full list of recipes and their corresponding configs, simply run ``tune ls`` from the command line.\\n\\nWe can also add :ref:`command-line overrides <cli_override>` as needed, e.g.\\n\\n.. code-block:: bash\\n\\n    tune run lora\", document_id='url-doc-2', token_count=512)\n",
      "========================================\n",
      "\n",
      "Result 3 (Score: 0.964)\n",
      "========================================\n",
      "Chunk(content=\"8B uses a larger intermediate dimension in its MLP layers than Llama2-7B\\n- Llama3-8B uses a higher base value to calculate theta in its `rotary positional embeddings <https://arxiv.org/abs/2104.09864>`_\\n\\n|\\n\\nGetting access to Llama3-8B-Instruct\\n------------------------------------\\n\\nFor this tutorial, we will be using the instruction-tuned version of Llama3-8B. First, let's download the model from Hugging Face. You will need to follow the instructions\\non the `official Meta page <https://github.com/meta-llama/llama3/blob/main/README.md>`_ to gain access to the model.\\nNext, make sure you grab your Hugging Face token from `here <https://huggingface.co/settings/tokens>`_.\\n\\n\\n.. code-block:: bash\\n\\n    tune download meta-llama/Meta-Llama-3-8B-Instruct \\\\\\n        --output-dir <checkpoint_dir> \\\\\\n        --hf-token <ACCESS TOKEN>\\n\\n|\\n\\nFine-tuning Llama3-8B-Instruct in torchtune\\n-------------------------------------------\\n\\ntorchtune provides `LoRA <https://arxiv.org/abs/2106.09685>`_, `QLoRA <https://arxiv.org/abs/2305.14314>`_, and full fine-tuning\\nrecipes for fine-tuning Llama3-8B on one or more GPUs. For more on LoRA in torchtune, see our :ref:`LoRA Tutorial <lora_finetune_label>`.\\nFor more on QLoRA in torchtune, see our :ref:`QLoRA Tutorial <qlora_finetune_label>`.\\n\\nLet's take a look at how we can fine-tune Llama3-8B-Instruct with LoRA on a single device using torchtune. In this example, we will fine-tune\\nfor one epoch on a common instruct dataset for illustrative purposes. The basic command for a single-device LoRA fine-tune is\\n\\n.. code-block:: bash\\n\\n    tune run lora_finetune_single_device --config llama3/8B_lora_single_device\\n\\n.. note::\\n    To see a full list of recipes and their corresponding configs, simply run ``tune ls`` from the command line.\\n\\nWe can also add :ref:`command-line overrides <cli_override>` as needed, e.g.\\n\\n.. code-block:: bash\\n\\n    tune run lora\", document_id='url-doc-2', token_count=512)\n",
      "========================================\n"
     ]
    }
   ],
   "source": [
    "def print_query_results(query: str):\n",
    "    \"\"\"Helper function to print query results in a readable format\n",
    "\n",
    "    Args:\n",
    "        query (str): The search query to execute\n",
    "    \"\"\"\n",
    "    print(f\"\\nQuery: {query}\")\n",
    "    print(\"-\" * 50)\n",
    "    response = client.memory.query(\n",
    "        bank_id=\"tutorial_bank\",\n",
    "        query=[query],  # The API accepts multiple queries at once!\n",
    "    )\n",
    "\n",
    "    for i, (chunk, score) in enumerate(zip(response.chunks, response.scores)):\n",
    "        print(f\"\\nResult {i+1} (Score: {score:.3f})\")\n",
    "        print(\"=\" * 40)\n",
    "        print(chunk)\n",
    "        print(\"=\" * 40)\n",
    "\n",
    "# Let's try some example queries\n",
    "queries = [\n",
    "    \"How do I use LoRA?\",  # Technical question\n",
    "    \"Tell me about memory optimizations\",  # General topic\n",
    "    \"What are the key features of Llama 3?\"  # Product-specific\n",
    "]\n",
    "\n",
    "\n",
    "for query in queries:\n",
    "    print_query_results(query)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Awesome, now we can embed all our notes with Llama-stack and ask it about the meaning of life :)\n",
    "\n",
    "Next up, we will learn about the safety features and how to use them: [notebook link](./05_Safety101.ipynb)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}