mirror of
https://github.com/meta-llama/llama-stack.git
synced 2025-07-20 19:56:59 +00:00
Clean up instructions and implementation; reorganize notebooks
This commit is contained in:
parent
0d9d333a4e
commit
4131e8146f
29 changed files with 2756 additions and 89 deletions
595
docs/notebooks/nvidia/tool_calling/1_data_preparation.ipynb
Normal file
@ -0,0 +1,595 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part 1: Preparing Datasets for Fine-tuning and Evaluation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook showcases transforming a dataset for fine-tuning and evaluating an LLM for tool calling with NeMo Microservices."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Deploy NeMo Microservices\n",
"Ensure the NeMo Microservices platform is up and running, including the model downloading step for `meta/llama-3.2-1b-instruct`. Please refer to the [installation guide](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-platform/index.html) for instructions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can verify that `meta/llama-3.2-1b-instruct` is deployed by querying the NIM endpoint. The response should include a model with an `id` of `meta/llama-3.2-1b-instruct`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```bash\n",
"# URL to NeMo deployment management service\n",
"export NEMO_URL=\"http://nemo.test\"\n",
"\n",
"curl -X GET \"$NEMO_URL/v1/models\" \\\n",
"  -H \"Accept: application/json\"\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set up Developer Environment\n",
"Set up your development environment on your machine. The project uses `uv` to manage Python dependencies. From the root of the project, install dependencies and create your virtual environment:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```bash\n",
"uv sync --extra dev\n",
"uv pip install -e .\n",
"source .venv/bin/activate\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Build Llama Stack Image\n",
"Build the Llama Stack image using the virtual environment you just created. For local development, set `LLAMA_STACK_DIR` to ensure your local code is used in the image. To use the production version of `llama-stack`, omit `LLAMA_STACK_DIR`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```bash\n",
"LLAMA_STACK_DIR=$(pwd) llama stack build --template nvidia --image-type venv\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, import the necessary libraries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"import random\n",
"from pprint import pprint\n",
"from typing import Any, Dict, List, Union\n",
"\n",
"import numpy as np\n",
"import torch\n",
"from datasets import load_dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set a random seed for reproducibility."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"SEED = 1234\n",
"\n",
"# Limit to tools with at most N properties\n",
"LIMIT_TOOL_PROPERTIES = 8\n",
"\n",
"torch.manual_seed(SEED)\n",
"torch.cuda.manual_seed_all(SEED)\n",
"np.random.seed(SEED)\n",
"random.seed(SEED)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define the data root directory and create the necessary directories for storing processed data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Processed data will be stored here\n",
"DATA_ROOT = os.path.join(os.getcwd(), \"tmp\")\n",
"CUSTOMIZATION_DATA_ROOT = os.path.join(DATA_ROOT, \"customization\")\n",
"VALIDATION_DATA_ROOT = os.path.join(DATA_ROOT, \"validation\")\n",
"EVALUATION_DATA_ROOT = os.path.join(DATA_ROOT, \"evaluation\")\n",
"\n",
"os.makedirs(DATA_ROOT, exist_ok=True)\n",
"os.makedirs(CUSTOMIZATION_DATA_ROOT, exist_ok=True)\n",
"os.makedirs(VALIDATION_DATA_ROOT, exist_ok=True)\n",
"os.makedirs(EVALUATION_DATA_ROOT, exist_ok=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Download xLAM Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step loads the xLAM dataset from Hugging Face.\n",
"\n",
"Ensure that you have followed the prerequisites mentioned above, obtained a Hugging Face access token, and configured it in `config.py`. In addition to getting an access token, you need to apply for access to the xLAM dataset [here](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k), which will be approved instantly."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"from config import HF_TOKEN\n",
"\n",
"os.environ[\"HF_TOKEN\"] = HF_TOKEN\n",
"os.environ[\"HF_ENDPOINT\"] = \"https://huggingface.co\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Download from Hugging Face\n",
"dataset = load_dataset(\"Salesforce/xlam-function-calling-60k\")\n",
"\n",
"# Inspect a sample\n",
"example = dataset['train'][0]\n",
"pprint(example)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more details on the structure of this data, refer to the [data structure of the xLAM dataset](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k#structure) in the Hugging Face documentation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Prepare Data for Customization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For Customization, the NeMo Microservices platform uses the OpenAI data format, composed of `messages` and `tools`:\n",
"- `messages` includes the user query, as well as the ground-truth `assistant` response to the query. This response contains the function name(s) and associated argument(s) in a `tool_calls` dict.\n",
"- `tools` includes a list of functions and parameters available to the LLM to choose from, as well as their descriptions."
]
},
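{
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration, a single record in this format looks like the following sketch (the query, function name, and arguments are hypothetical examples, not taken from the dataset):\n",
"\n",
"```json\n",
"{\n",
"  \"messages\": [\n",
"    {\"role\": \"user\", \"content\": \"What is the weather in London?\"},\n",
"    {\"role\": \"assistant\", \"content\": \"\", \"tool_calls\": [{\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"arguments\": {\"city\": \"London\"}}}]}\n",
"  ],\n",
"  \"tools\": [{\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"description\": \"Get the current weather for a city\", \"parameters\": {\"type\": \"object\", \"properties\": {\"city\": {\"type\": \"string\", \"description\": \"City name\"}}}}}]\n",
"}\n",
"```"
]
},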
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following helper functions convert a single xLAM JSON data point into the OpenAI format."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"def normalize_type(param_type: str) -> str:\n",
"    \"\"\"\n",
"    Normalize Python type hints and parameter definitions to OpenAI function spec types.\n",
"\n",
"    Args:\n",
"        param_type: Type string that could include default values or complex types\n",
"\n",
"    Returns:\n",
"        Normalized type string according to OpenAI function spec\n",
"    \"\"\"\n",
"    # Remove whitespace\n",
"    param_type = param_type.strip()\n",
"\n",
"    # Handle types with default values (e.g. \"str, default='London'\")\n",
"    if \",\" in param_type and \"default\" in param_type:\n",
"        param_type = param_type.split(\",\")[0].strip()\n",
"\n",
"    # Handle types with just default values (e.g. \"default='London'\")\n",
"    if param_type.startswith(\"default=\"):\n",
"        return \"string\"  # Default to string if only default value is given\n",
"\n",
"    # Remove \", optional\" suffix if present\n",
"    param_type = param_type.replace(\", optional\", \"\").strip()\n",
"\n",
"    # Handle complex types\n",
"    if param_type.startswith(\"Callable\"):\n",
"        return \"string\"  # Represent callable as string in JSON schema\n",
"    if param_type.startswith(\"Tuple\"):\n",
"        return \"array\"  # Represent tuple as array in JSON schema\n",
"    if param_type.startswith(\"List[\"):\n",
"        return \"array\"\n",
"    if param_type.startswith(\"Set\") or param_type == \"set\":\n",
"        return \"array\"  # Represent set as array in JSON schema\n",
"\n",
"    # Map common type variations to OpenAI spec types\n",
"    type_mapping: Dict[str, str] = {\n",
"        \"str\": \"string\",\n",
"        \"int\": \"integer\",\n",
"        \"float\": \"number\",\n",
"        \"bool\": \"boolean\",\n",
"        \"list\": \"array\",\n",
"        \"dict\": \"object\",\n",
"        \"List\": \"array\",\n",
"        \"Dict\": \"object\",\n",
"        \"set\": \"array\",\n",
"        \"Set\": \"array\"\n",
"    }\n",
"\n",
"    if param_type in type_mapping:\n",
"        return type_mapping[param_type]\n",
"    else:\n",
"        print(f\"Unknown type: {param_type}\")\n",
"        return \"string\"  # Default to string for unknown types\n",
"\n",
"\n",
|
||||
"def convert_tools_to_openai_spec(tools: Union[str, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:\n",
|
||||
" # If tools is a string, try to parse it as JSON\n",
|
||||
" if isinstance(tools, str):\n",
|
||||
" try:\n",
|
||||
" tools = json.loads(tools)\n",
|
||||
" except json.JSONDecodeError as e:\n",
|
||||
" print(f\"Failed to parse tools string as JSON: {e}\")\n",
|
||||
" return []\n",
|
||||
"\n",
|
||||
" # Ensure tools is a list\n",
|
||||
" if not isinstance(tools, list):\n",
|
||||
" print(f\"Expected tools to be a list, but got {type(tools)}\")\n",
|
||||
" return []\n",
|
||||
"\n",
|
||||
" openai_tools: List[Dict[str, Any]] = []\n",
|
||||
" for tool in tools:\n",
|
||||
" # Check if tool is a dictionary\n",
|
||||
" if not isinstance(tool, dict):\n",
|
||||
" print(f\"Expected tool to be a dictionary, but got {type(tool)}\")\n",
|
||||
" continue\n",
|
||||
"\n",
|
||||
" # Check if 'parameters' is a dictionary\n",
|
||||
" if not isinstance(tool.get(\"parameters\"), dict):\n",
|
||||
" print(f\"Expected 'parameters' to be a dictionary, but got {type(tool.get('parameters'))} for tool: {tool}\")\n",
|
||||
" continue\n",
|
||||
"\n",
|
||||
" \n",
|
||||
"\n",
|
||||
" normalized_parameters: Dict[str, Dict[str, Any]] = {}\n",
|
||||
" for param_name, param_info in tool[\"parameters\"].items():\n",
|
||||
" if not isinstance(param_info, dict):\n",
|
||||
" print(\n",
|
||||
" f\"Expected parameter info to be a dictionary, but got {type(param_info)} for parameter: {param_name}\"\n",
|
||||
" )\n",
|
||||
" continue\n",
|
||||
"\n",
|
||||
" # Create parameter info without default first\n",
|
||||
" param_dict = {\n",
|
||||
" \"description\": param_info.get(\"description\", \"\"),\n",
|
||||
" \"type\": normalize_type(param_info.get(\"type\", \"\")),\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" # Only add default if it exists, is not None, and is not an empty string\n",
|
||||
" default_value = param_info.get(\"default\")\n",
|
||||
" if default_value is not None and default_value != \"\":\n",
|
||||
" param_dict[\"default\"] = default_value\n",
|
||||
"\n",
|
||||
" normalized_parameters[param_name] = param_dict\n",
|
||||
"\n",
|
||||
" openai_tool = {\n",
|
||||
" \"type\": \"function\",\n",
|
||||
" \"function\": {\n",
|
||||
" \"name\": tool[\"name\"],\n",
|
||||
" \"description\": tool[\"description\"],\n",
|
||||
" \"parameters\": {\"type\": \"object\", \"properties\": normalized_parameters},\n",
|
||||
" },\n",
|
||||
" }\n",
|
||||
" openai_tools.append(openai_tool)\n",
|
||||
" return openai_tools\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def save_jsonl(filename, data):\n",
|
||||
" \"\"\"Write a list of json objects to a .jsonl file\"\"\"\n",
|
||||
" with open(filename, \"w\") as f:\n",
|
||||
" for entry in data:\n",
|
||||
" f.write(json.dumps(entry) + \"\\n\")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def convert_tool_calls(xlam_tools):\n",
|
||||
" \"\"\"Convert XLAM tool format to OpenAI's tool schema.\"\"\"\n",
|
||||
" tools = []\n",
|
||||
" for tool in json.loads(xlam_tools):\n",
|
||||
" tools.append({\"type\": \"function\", \"function\": {\"name\": tool[\"name\"], \"arguments\": tool.get(\"arguments\", {})}})\n",
|
||||
" return tools\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def convert_example(example, dataset_type='single'):\n",
|
||||
" \"\"\"Convert an XLAM dataset example to OpenAI format.\"\"\"\n",
|
||||
" obj = {\"messages\": []}\n",
|
||||
"\n",
|
||||
" # User message\n",
|
||||
" obj[\"messages\"].append({\"role\": \"user\", \"content\": example[\"query\"]})\n",
|
||||
"\n",
|
||||
" # Tools\n",
|
||||
" if example.get(\"tools\"):\n",
|
||||
" obj[\"tools\"] = convert_tools_to_openai_spec(example[\"tools\"])\n",
|
||||
"\n",
|
||||
" # Assistant message\n",
|
||||
" assistant_message = {\"role\": \"assistant\", \"content\": \"\"}\n",
|
||||
" if example.get(\"answers\"):\n",
|
||||
" tool_calls = convert_tool_calls(example[\"answers\"])\n",
|
||||
" \n",
|
||||
" if dataset_type == \"single\":\n",
|
||||
" # Only include examples with a single tool call\n",
|
||||
" if len(tool_calls) == 1:\n",
|
||||
" assistant_message[\"tool_calls\"] = tool_calls\n",
|
||||
" else:\n",
|
||||
" return None\n",
|
||||
" else:\n",
|
||||
" # For other dataset types, include all tool calls\n",
|
||||
" assistant_message[\"tool_calls\"] = tool_calls\n",
|
||||
" \n",
|
||||
" obj[\"messages\"].append(assistant_message)\n",
|
||||
"\n",
|
||||
" return obj"
|
||||
]
|
||||
},
|
||||
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following code cell converts the example data point to the OpenAI format required by NeMo Customizer."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"convert_example(example)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE**: By default, the `convert_example` function only retains data points that have exactly one tool call in the output, because the llama-3.2-1b-instruct model does not support parallel tool calls.\n",
"For more information, refer to the [supported models](https://docs.nvidia.com/nim/large-language-models/latest/function-calling.html#supported-models) in the NeMo documentation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Process Entire Dataset\n",
"Convert each example by looping through the dataset."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"all_examples = []\n",
"with open(os.path.join(DATA_ROOT, \"xlam_openai_format.jsonl\"), \"w\") as f:\n",
"    for example in dataset[\"train\"]:\n",
"        converted = convert_example(example)\n",
"        if converted is not None:\n",
"            all_examples.append(converted)\n",
"            f.write(json.dumps(converted) + \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Split Dataset\n",
"This step splits the dataset into train, validation, and test sets. For demonstration, we use a smaller subset of all the examples.\n",
"You may choose to modify `NUM_EXAMPLES` to leverage a larger subset."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# Configure the size of the dataset to use\n",
"NUM_EXAMPLES = 5000\n",
"\n",
"assert NUM_EXAMPLES <= len(all_examples), f\"{NUM_EXAMPLES} exceeds the total number of available ({len(all_examples)}) data points\""
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# Randomly choose a subset\n",
"sampled_examples = random.sample(all_examples, NUM_EXAMPLES)\n",
"\n",
"# Split into 70% training, 15% validation, 15% testing\n",
"train_size = int(0.7 * len(sampled_examples))\n",
"val_size = int(0.15 * len(sampled_examples))\n",
"\n",
"train_data = sampled_examples[:train_size]\n",
"val_data = sampled_examples[train_size : train_size + val_size]\n",
"test_data = sampled_examples[train_size + val_size :]\n",
"\n",
"# Save the training and validation splits. We will use the test split in the next section\n",
"save_jsonl(os.path.join(CUSTOMIZATION_DATA_ROOT, \"training.jsonl\"), train_data)\n",
"save_jsonl(os.path.join(VALIDATION_DATA_ROOT, \"validation.jsonl\"), val_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Prepare Data for Evaluation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For evaluation, the NeMo Microservices platform uses a format with a minor modification to the OpenAI format: `tool_calls` is moved out of `messages` to create a distinct, parallel field.\n",
"- `messages` includes the user query.\n",
"- `tools` includes a list of functions and parameters available to the LLM to choose from, as well as their descriptions.\n",
"- `tool_calls` is the ground-truth response to the user query, containing the function name(s) and associated argument(s)."
]
},
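{
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration, a record in the evaluator format looks like the following sketch (the query, function name, and arguments are hypothetical examples). Note that `tool_calls` sits alongside `messages` rather than inside the assistant message:\n",
"\n",
"```json\n",
"{\n",
"  \"messages\": [{\"role\": \"user\", \"content\": \"What is the weather in London?\"}],\n",
"  \"tools\": [{\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"description\": \"Get the current weather for a city\", \"parameters\": {\"type\": \"object\", \"properties\": {\"city\": {\"type\": \"string\", \"description\": \"City name\"}}}}}],\n",
"  \"tool_calls\": [{\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"arguments\": {\"city\": \"London\"}}}]\n",
"}\n",
"```"
]
},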
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following steps transform the test dataset into a format compatible with the NeMo Evaluator microservice.\n",
"This dataset is used for measuring accuracy metrics before and after customization."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"def convert_example_eval(entry):\n",
"    \"\"\"Convert a single entry in the dataset to the evaluator format\"\"\"\n",
"\n",
"    # Note: This is a workaround (WAR) for a known bug with tool calling in NIM\n",
"    for tool in entry[\"tools\"]:\n",
"        if len(tool[\"function\"][\"parameters\"][\"properties\"]) > LIMIT_TOOL_PROPERTIES:\n",
"            return None\n",
"\n",
"    new_entry = {\n",
"        \"messages\": [],\n",
"        \"tools\": entry[\"tools\"],\n",
"        \"tool_calls\": []\n",
"    }\n",
"\n",
"    for msg in entry[\"messages\"]:\n",
"        if msg[\"role\"] == \"assistant\" and \"tool_calls\" in msg:\n",
"            new_entry[\"tool_calls\"] = msg[\"tool_calls\"]\n",
"        else:\n",
"            new_entry[\"messages\"].append(msg)\n",
"\n",
"    return new_entry\n",
"\n",
"\n",
"def convert_dataset_eval(data):\n",
"    \"\"\"Convert the entire dataset for evaluation by restructuring the data format.\"\"\"\n",
"    return [result for entry in data if (result := convert_example_eval(entry)) is not None]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`NOTE`: We have implemented a workaround for a known bug where tool calls freeze the NIM if a tool description includes a function with a large number of parameters. As such, we have limited the dataset to examples whose available tools have at most 8 parameters. This will be resolved in the next NIM release."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"test_data_eval = convert_dataset_eval(test_data)\n",
"save_jsonl(os.path.join(EVALUATION_DATA_ROOT, \"xlam-test-single.jsonl\"), test_data_eval)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}