# What does this PR do?
This PR contains two sets of notebooks that serve as reference material for developers getting started with Llama Stack using the NVIDIA Provider. Developers should be able to execute these notebooks end-to-end, pointing to their NeMo Microservices deployment.
1. `beginner_e2e/`: Notebook that walks through a beginner end-to-end workflow that covers creating datasets, running inference, customizing and evaluating models, and running safety checks.
2. `tool_calling/`: Notebook that is ported over from the [Data Flywheel & Tool Calling notebook](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/nemo/data-flywheel) that is referenced in the NeMo Microservices docs. I updated the notebook to use the Llama Stack client wherever possible, and added relevant instructions.

## Test Plan
- Both notebook folders contain READMEs with prerequisites. To manually test these notebooks, you'll need a deployment of the NeMo Microservices Platform and to update the `config.py` file with your deployment's information.
- I've run through these notebooks manually end-to-end to verify each step works.

Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part 1: Preparing Datasets for Fine-tuning and Evaluation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook showcases transforming a dataset for finetuning and evaluating an LLM for tool calling with NeMo Microservices."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Prerequisites"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Deploy NeMo Microservices\n",
|
|
"Ensure the NeMo Microservices platform is up and running, including the model downloading step for `meta/llama-3.2-1b-instruct`. Please refer to the [installation guide](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-platform/index.html) for instructions."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"You can verify the `meta/llama-3.1-8b-instruct` is deployed by querying the NIM endpoint. The response should include a model with an `id` of `meta/llama-3.1-8b-instruct`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"```bash\n",
|
|
"# URL to NeMo deployment management service\n",
|
|
"export NEMO_URL=\"http://nemo.test\"\n",
|
|
"\n",
|
|
"curl -X GET \"$NEMO_URL/v1/models\" \\\n",
|
|
" -H \"Accept: application/json\"\n",
|
|
"```"
|
|
]
|
|
},
|
|
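{
"cell_type": "markdown",
"metadata": {},
"source": [
"Equivalently, you can run the same check from Python. The following is a minimal sketch, assuming the `requests` package is installed and that the endpoint returns an OpenAI-style model list under a `data` key:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"# URL to NeMo deployment management service (adjust for your deployment)\n",
"NEMO_URL = \"http://nemo.test\"\n",
"\n",
"resp = requests.get(f\"{NEMO_URL}/v1/models\", headers={\"Accept\": \"application/json\"})\n",
"resp.raise_for_status()\n",
"\n",
"# Check that the target model is among the returned ids\n",
"model_ids = [model[\"id\"] for model in resp.json()[\"data\"]]\n",
"print(\"meta/llama-3.2-1b-instruct\" in model_ids)"
]
},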
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set up Developer Environment\n",
"Set up your development environment on your machine. The project uses `uv` to manage Python dependencies. From the root of the project, install dependencies and create your virtual environment:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```bash\n",
"uv sync --extra dev\n",
"uv pip install -e .\n",
"source .venv/bin/activate\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Build Llama Stack Image\n",
"Build the Llama Stack image using the virtual environment you just created. For local development, set `LLAMA_STACK_DIR` to ensure your local code is used in the image. To use the production version of `llama-stack`, omit `LLAMA_STACK_DIR`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```bash\n",
"LLAMA_STACK_DIR=$(pwd) llama stack build --template nvidia --image-type venv\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, import the necessary libraries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"import random\n",
"from pprint import pprint\n",
"from typing import Any, Dict, List, Union\n",
"\n",
"import numpy as np\n",
"import torch\n",
"from datasets import load_dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set a random seed for reproducibility."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"SEED = 1234\n",
"\n",
"# Limit tools to at most N properties (used in the evaluation step below)\n",
"LIMIT_TOOL_PROPERTIES = 8\n",
"\n",
"torch.manual_seed(SEED)\n",
"torch.cuda.manual_seed_all(SEED)\n",
"np.random.seed(SEED)\n",
"random.seed(SEED)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define the data root directory and create the necessary directories for storing processed data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Processed data will be stored here\n",
"DATA_ROOT = os.path.join(os.getcwd(), \"sample_data\")\n",
"CUSTOMIZATION_DATA_ROOT = os.path.join(DATA_ROOT, \"customization\")\n",
"VALIDATION_DATA_ROOT = os.path.join(DATA_ROOT, \"validation\")\n",
"EVALUATION_DATA_ROOT = os.path.join(DATA_ROOT, \"evaluation\")\n",
"\n",
"os.makedirs(DATA_ROOT, exist_ok=True)\n",
"os.makedirs(CUSTOMIZATION_DATA_ROOT, exist_ok=True)\n",
"os.makedirs(VALIDATION_DATA_ROOT, exist_ok=True)\n",
"os.makedirs(EVALUATION_DATA_ROOT, exist_ok=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Download xLAM Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step loads the xLAM dataset from Hugging Face.\n",
"\n",
"Ensure that you have followed the prerequisites mentioned above, obtained a Hugging Face access token, and configured it in `config.py`. In addition to getting an access token, you need to apply for access to the xLAM dataset [here](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k), which will be approved instantly."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"from config import HF_TOKEN\n",
"\n",
"os.environ[\"HF_TOKEN\"] = HF_TOKEN\n",
"os.environ[\"HF_ENDPOINT\"] = \"https://huggingface.co\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Download from Hugging Face\n",
"dataset = load_dataset(\"Salesforce/xlam-function-calling-60k\")\n",
"\n",
"# Inspect a sample\n",
"example = dataset['train'][0]\n",
"pprint(example)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more details on the structure of this data, refer to the [data structure of the xLAM dataset](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k#structure) in the Hugging Face documentation."
]
},
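{
"cell_type": "markdown",
"metadata": {},
"source": [
"For orientation, each record has roughly the following shape (a schematic sketch with hypothetical values, not an actual record). Note that `tools` and `answers` are stored as JSON-encoded strings, which is why the helpers below parse them with `json.loads`:\n",
"\n",
"```json\n",
"{\n",
"  \"id\": 0,\n",
"  \"query\": \"What is the current weather in London?\",\n",
"  \"tools\": \"[{\\\"name\\\": \\\"get_weather\\\", \\\"description\\\": \\\"Fetch the current weather for a city\\\", \\\"parameters\\\": {\\\"city\\\": {\\\"description\\\": \\\"Name of the city\\\", \\\"type\\\": \\\"str\\\", \\\"default\\\": \\\"London\\\"}}}]\",\n",
"  \"answers\": \"[{\\\"name\\\": \\\"get_weather\\\", \\\"arguments\\\": {\\\"city\\\": \\\"London\\\"}}]\"\n",
"}\n",
"```"
]
},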
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Prepare Data for Customization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For Customization, the NeMo Microservices platform uses the OpenAI data format, which is composed of `messages` and `tools`:\n",
"- `messages` includes the user query, as well as the ground-truth `assistant` response to the query. This response contains the function name(s) and associated argument(s) in a `tool_calls` list.\n",
"- `tools` includes a list of functions and parameters available to the LLM to choose from, as well as their descriptions."
]
},
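{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concretely, a converted record has the following shape (a schematic example with hypothetical values, mirroring what the helper functions below emit):\n",
"\n",
"```json\n",
"{\n",
"  \"messages\": [\n",
"    {\"role\": \"user\", \"content\": \"What is the current weather in London?\"},\n",
"    {\"role\": \"assistant\", \"content\": \"\", \"tool_calls\": [{\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"arguments\": {\"city\": \"London\"}}}]}\n",
"  ],\n",
"  \"tools\": [\n",
"    {\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"description\": \"Fetch the current weather for a city\", \"parameters\": {\"type\": \"object\", \"properties\": {\"city\": {\"description\": \"Name of the city\", \"type\": \"string\", \"default\": \"London\"}}}}}\n",
"  ]\n",
"}\n",
"```"
]
},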
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following helper functions convert a single xLAM JSON data point into OpenAI format."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"def normalize_type(param_type: str) -> str:\n",
"    \"\"\"\n",
"    Normalize Python type hints and parameter definitions to OpenAI function spec types.\n",
"\n",
"    Args:\n",
"        param_type: Type string that could include default values or complex types\n",
"\n",
"    Returns:\n",
"        Normalized type string according to OpenAI function spec\n",
"    \"\"\"\n",
"    # Remove whitespace\n",
"    param_type = param_type.strip()\n",
"\n",
"    # Handle types with default values (e.g. \"str, default='London'\")\n",
"    if \",\" in param_type and \"default\" in param_type:\n",
"        param_type = param_type.split(\",\")[0].strip()\n",
"\n",
"    # Handle types with just default values (e.g. \"default='London'\")\n",
"    if param_type.startswith(\"default=\"):\n",
"        return \"string\"  # Default to string if only default value is given\n",
"\n",
"    # Remove \", optional\" suffix if present\n",
"    param_type = param_type.replace(\", optional\", \"\").strip()\n",
"\n",
"    # Handle complex types\n",
"    if param_type.startswith(\"Callable\"):\n",
"        return \"string\"  # Represent callable as string in JSON schema\n",
"    if param_type.startswith(\"Tuple\"):\n",
"        return \"array\"  # Represent tuple as array in JSON schema\n",
"    if param_type.startswith(\"List[\"):\n",
"        return \"array\"\n",
"    if param_type.startswith(\"Set\") or param_type == \"set\":\n",
"        return \"array\"  # Represent set as array in JSON schema\n",
"\n",
"    # Map common type variations to OpenAI spec types\n",
"    type_mapping: Dict[str, str] = {\n",
"        \"str\": \"string\",\n",
"        \"int\": \"integer\",\n",
"        \"float\": \"number\",\n",
"        \"bool\": \"boolean\",\n",
"        \"list\": \"array\",\n",
"        \"dict\": \"object\",\n",
"        \"List\": \"array\",\n",
"        \"Dict\": \"object\",\n",
"        \"set\": \"array\",\n",
"        \"Set\": \"array\"\n",
"    }\n",
"\n",
"    if param_type in type_mapping:\n",
"        return type_mapping[param_type]\n",
"    else:\n",
"        print(f\"Unknown type: {param_type}\")\n",
"        return \"string\"  # Default to string for unknown types\n",
"\n",
"\n",
"def convert_tools_to_openai_spec(tools: Union[str, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:\n",
"    \"\"\"Convert xLAM tool definitions to the OpenAI function spec.\"\"\"\n",
"    # If tools is a string, try to parse it as JSON\n",
"    if isinstance(tools, str):\n",
"        try:\n",
"            tools = json.loads(tools)\n",
"        except json.JSONDecodeError as e:\n",
"            print(f\"Failed to parse tools string as JSON: {e}\")\n",
"            return []\n",
"\n",
"    # Ensure tools is a list\n",
"    if not isinstance(tools, list):\n",
"        print(f\"Expected tools to be a list, but got {type(tools)}\")\n",
"        return []\n",
"\n",
"    openai_tools: List[Dict[str, Any]] = []\n",
"    for tool in tools:\n",
"        # Check if tool is a dictionary\n",
"        if not isinstance(tool, dict):\n",
"            print(f\"Expected tool to be a dictionary, but got {type(tool)}\")\n",
"            continue\n",
"\n",
"        # Check if 'parameters' is a dictionary\n",
"        if not isinstance(tool.get(\"parameters\"), dict):\n",
"            print(f\"Expected 'parameters' to be a dictionary, but got {type(tool.get('parameters'))} for tool: {tool}\")\n",
"            continue\n",
"\n",
"        normalized_parameters: Dict[str, Dict[str, Any]] = {}\n",
"        for param_name, param_info in tool[\"parameters\"].items():\n",
"            if not isinstance(param_info, dict):\n",
"                print(\n",
"                    f\"Expected parameter info to be a dictionary, but got {type(param_info)} for parameter: {param_name}\"\n",
"                )\n",
"                continue\n",
"\n",
"            # Create parameter info without default first\n",
"            param_dict = {\n",
"                \"description\": param_info.get(\"description\", \"\"),\n",
"                \"type\": normalize_type(param_info.get(\"type\", \"\")),\n",
"            }\n",
"\n",
"            # Only add default if it exists, is not None, and is not an empty string\n",
"            default_value = param_info.get(\"default\")\n",
"            if default_value is not None and default_value != \"\":\n",
"                param_dict[\"default\"] = default_value\n",
"\n",
"            normalized_parameters[param_name] = param_dict\n",
"\n",
"        openai_tool = {\n",
"            \"type\": \"function\",\n",
"            \"function\": {\n",
"                \"name\": tool[\"name\"],\n",
"                \"description\": tool[\"description\"],\n",
"                \"parameters\": {\"type\": \"object\", \"properties\": normalized_parameters},\n",
"            },\n",
"        }\n",
"        openai_tools.append(openai_tool)\n",
"    return openai_tools\n",
"\n",
"\n",
"def save_jsonl(filename, data):\n",
"    \"\"\"Write a list of json objects to a .jsonl file\"\"\"\n",
"    with open(filename, \"w\") as f:\n",
"        for entry in data:\n",
"            f.write(json.dumps(entry) + \"\\n\")\n",
"\n",
"\n",
"def convert_tool_calls(xlam_tools):\n",
"    \"\"\"Convert XLAM tool format to OpenAI's tool schema.\"\"\"\n",
"    tools = []\n",
"    for tool in json.loads(xlam_tools):\n",
"        tools.append({\"type\": \"function\", \"function\": {\"name\": tool[\"name\"], \"arguments\": tool.get(\"arguments\", {})}})\n",
"    return tools\n",
"\n",
"\n",
"def convert_example(example, dataset_type='single'):\n",
"    \"\"\"Convert an XLAM dataset example to OpenAI format.\"\"\"\n",
"    obj = {\"messages\": []}\n",
"\n",
"    # User message\n",
"    obj[\"messages\"].append({\"role\": \"user\", \"content\": example[\"query\"]})\n",
"\n",
"    # Tools\n",
"    if example.get(\"tools\"):\n",
"        obj[\"tools\"] = convert_tools_to_openai_spec(example[\"tools\"])\n",
"\n",
"    # Assistant message\n",
"    assistant_message = {\"role\": \"assistant\", \"content\": \"\"}\n",
"    if example.get(\"answers\"):\n",
"        tool_calls = convert_tool_calls(example[\"answers\"])\n",
"\n",
"        if dataset_type == \"single\":\n",
"            # Only include examples with a single tool call\n",
"            if len(tool_calls) == 1:\n",
"                assistant_message[\"tool_calls\"] = tool_calls\n",
"            else:\n",
"                return None\n",
"        else:\n",
"            # For other dataset types, include all tool calls\n",
"            assistant_message[\"tool_calls\"] = tool_calls\n",
"\n",
"    obj[\"messages\"].append(assistant_message)\n",
"\n",
"    return obj"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following code cell converts the example data to the OpenAI format required by NeMo Customizer."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"convert_example(example)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE**: By default, the `convert_example` function only retains data points that have exactly one tool call in the output, because the `llama-3.2-1b-instruct` model does not support parallel tool calls.\n",
"For more information, refer to the [supported models](https://docs.nvidia.com/nim/large-language-models/latest/function-calling.html#supported-models) in the NeMo documentation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Process Entire Dataset\n",
"Convert each example by looping through the dataset."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"all_examples = []\n",
"with open(os.path.join(DATA_ROOT, \"xlam_openai_format.jsonl\"), \"w\") as f:\n",
"    for example in dataset[\"train\"]:\n",
"        converted = convert_example(example)\n",
"        if converted is not None:\n",
"            all_examples.append(converted)\n",
"            f.write(json.dumps(converted) + \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Split Dataset\n",
"This step splits the dataset into train, validation, and test sets. For demonstration, we use a smaller subset of all the examples.\n",
"You may choose to modify `NUM_EXAMPLES` to leverage a larger subset."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# Configure the size of the dataset to use\n",
"NUM_EXAMPLES = 5000\n",
"\n",
"assert NUM_EXAMPLES <= len(all_examples), f\"{NUM_EXAMPLES} exceeds the total number of available data points ({len(all_examples)})\""
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# Randomly choose a subset\n",
"sampled_examples = random.sample(all_examples, NUM_EXAMPLES)\n",
"\n",
"# Split into 70% training, 15% validation, 15% testing\n",
"train_size = int(0.7 * len(sampled_examples))\n",
"val_size = int(0.15 * len(sampled_examples))\n",
"\n",
"train_data = sampled_examples[:train_size]\n",
"val_data = sampled_examples[train_size : train_size + val_size]\n",
"test_data = sampled_examples[train_size + val_size :]\n",
"\n",
"# Save the training and validation splits. We will use the test split in the next section.\n",
"save_jsonl(os.path.join(CUSTOMIZATION_DATA_ROOT, \"training.jsonl\"), train_data)\n",
"save_jsonl(os.path.join(VALIDATION_DATA_ROOT, \"validation.jsonl\"), val_data)"
]
},
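{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, confirm that the three splits add up to `NUM_EXAMPLES`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Verify the 70/15/15 split\n",
"print(f\"Train: {len(train_data)}, Validation: {len(val_data)}, Test: {len(test_data)}\")\n",
"assert len(train_data) + len(val_data) + len(test_data) == NUM_EXAMPLES"
]
},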
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Prepare Data for Evaluation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For evaluation, the NeMo Microservices platform uses a format with a minor modification to the OpenAI format: `tool_calls` is moved out of `messages` into a distinct top-level field.\n",
"- `messages` includes the user query.\n",
"- `tools` includes a list of functions and parameters available to the LLM to choose from, as well as their descriptions.\n",
"- `tool_calls` is the ground-truth response to the user query, containing the function name(s) and associated argument(s)."
]
},
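{
"cell_type": "markdown",
"metadata": {},
"source": [
"Schematically, the hypothetical record from earlier looks like this after the transformation (the `assistant` message is removed and its `tool_calls` becomes a top-level field):\n",
"\n",
"```json\n",
"{\n",
"  \"messages\": [\n",
"    {\"role\": \"user\", \"content\": \"What is the current weather in London?\"}\n",
"  ],\n",
"  \"tools\": [...],\n",
"  \"tool_calls\": [{\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"arguments\": {\"city\": \"London\"}}}]\n",
"}\n",
"```"
]
},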
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following steps transform the test dataset into a format compatible with the NeMo Evaluator microservice.\n",
"This dataset is used to measure accuracy metrics before and after customization."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"def convert_example_eval(entry):\n",
"    \"\"\"Convert a single entry in the dataset to the evaluator format\"\"\"\n",
"\n",
"    # Note: This is a workaround (WAR) for a known bug with tool calling in NIM\n",
"    for tool in entry[\"tools\"]:\n",
"        if len(tool[\"function\"][\"parameters\"][\"properties\"]) > LIMIT_TOOL_PROPERTIES:\n",
"            return None\n",
"\n",
"    new_entry = {\n",
"        \"messages\": [],\n",
"        \"tools\": entry[\"tools\"],\n",
"        \"tool_calls\": []\n",
"    }\n",
"\n",
"    for msg in entry[\"messages\"]:\n",
"        if msg[\"role\"] == \"assistant\" and \"tool_calls\" in msg:\n",
"            new_entry[\"tool_calls\"] = msg[\"tool_calls\"]\n",
"        else:\n",
"            new_entry[\"messages\"].append(msg)\n",
"\n",
"    return new_entry\n",
"\n",
"def convert_dataset_eval(data):\n",
"    \"\"\"Convert the entire dataset for evaluation by restructuring the data format.\"\"\"\n",
"    return [result for entry in data if (result := convert_example_eval(entry)) is not None]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE**: We have implemented a workaround for a known bug where tool calls freeze the NIM if a tool description includes a function with a large number of parameters. As such, we have limited the dataset to examples whose available tools have at most 8 parameters. This will be resolved in the next NIM release."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"test_data_eval = convert_dataset_eval(test_data)\n",
"save_jsonl(os.path.join(EVALUATION_DATA_ROOT, \"xlam-test-single.jsonl\"), test_data_eval)"
]
},
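{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, spot-check one converted entry to confirm the structure before moving on:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pprint(test_data_eval[0])"
]
}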
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}