{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 1: Preparing Datasets for Fine-tuning and Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook showcases transforming a dataset for finetuning and evaluating an LLM for tool calling with NeMo Microservices." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deploy NeMo Microservices\n", "Ensure the NeMo Microservices platform is up and running, including the model downloading step for `meta/llama-3.2-1b-instruct`. Please refer to the [installation guide](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-platform/index.html) for instructions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can verify the `meta/llama-3.1-8b-instruct` is deployed by querying the NIM endpoint. The response should include a model with an `id` of `meta/llama-3.1-8b-instruct`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```bash\n", "# URL to NeMo deployment management service\n", "export NEMO_URL=\"http://nemo.test\"\n", "\n", "curl -X GET \"$NEMO_URL/v1/models\" \\\n", " -H \"Accept: application/json\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up Developer Environment\n", "Set up your development environment on your machine. The project uses `uv` to manage Python dependencies. From the root of the project, install dependencies and create your virtual environment:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```bash\n", "uv sync --extra dev\n", "uv pip install -e .\n", "source .venv/bin/activate\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Build Llama Stack Image\n", "Build the Llama Stack image using the virtual environment you just created. For local development, set `LLAMA_STACK_DIR` to ensure your local code is use in the image. To use the production version of `llama-stack`, omit `LLAMA_STACK_DIR`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```bash\n", "LLAMA_STACK_DIR=$(pwd) llama stack build --template nvidia --image-type venv\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, import the necessary libraries." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import json\n", "import random\n", "from pprint import pprint\n", "from typing import Any, Dict, List, Union\n", "\n", "import numpy as np\n", "import torch\n", "from datasets import load_dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set a random seed for reproducibility." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "SEED = 1234\n", "\n", "# Limits to at most N tool properties\n", "LIMIT_TOOL_PROPERTIES = 8\n", "\n", "torch.manual_seed(SEED)\n", "torch.cuda.manual_seed_all(SEED)\n", "np.random.seed(SEED)\n", "random.seed(SEED)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define the data root directory and create necessary directoryies for storing processed data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Processed data will be stored here\n", "DATA_ROOT = os.path.join(os.getcwd(), \"sample_data\")\n", "CUSTOMIZATION_DATA_ROOT = os.path.join(DATA_ROOT, \"customization\")\n", "VALIDATION_DATA_ROOT = os.path.join(DATA_ROOT, \"validation\")\n", "EVALUATION_DATA_ROOT = os.path.join(DATA_ROOT, \"evaluation\")\n", "\n", "os.makedirs(DATA_ROOT, exist_ok=True)\n", "os.makedirs(CUSTOMIZATION_DATA_ROOT, exist_ok=True)\n", "os.makedirs(VALIDATION_DATA_ROOT, exist_ok=True)\n", "os.makedirs(EVALUATION_DATA_ROOT, exist_ok=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Download xLAM Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This step loads the xLAM dataset from Hugging Face.\n", "\n", "Ensure that you have followed the prerequisites mentioned above, obtained a Hugging Face access token, and configured it in config.py. In addition to getting an access token, you need to apply for access to the xLAM dataset [here](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k), which will be approved instantly." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "from config import HF_TOKEN\n", "\n", "os.environ[\"HF_TOKEN\"] = HF_TOKEN\n", "os.environ[\"HF_ENDPOINT\"] = \"https://huggingface.co\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Download from Hugging Face\n", "dataset = load_dataset(\"Salesforce/xlam-function-calling-60k\")\n", "\n", "# Inspect a sample\n", "example = dataset['train'][0]\n", "pprint(example)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more details on the structure of this data, refer to the [data structure of the xLAM dataset](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k#structure) in the Hugging Face documentation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Prepare Data for Customization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For Customization, the NeMo Microservices platform leverages the OpenAI data format, comprised of messages and tools:\n", "- `messages` include the user query, as well as the ground truth `assistant` response to the query. This response contains the function name(s) and associated argument(s) in a `tool_calls` dict\n", "- `tools` include a list of functions and parameters available to the LLM to choose from, as well as their descriptions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following helper functions convert a single xLAM JSON data point into OpenAI format." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def normalize_type(param_type: str) -> str:\n", " \"\"\"\n", " Normalize Python type hints and parameter definitions to OpenAI function spec types.\n", "\n", " Args:\n", " param_type: Type string that could include default values or complex types\n", "\n", " Returns:\n", " Normalized type string according to OpenAI function spec\n", " \"\"\"\n", " # Remove whitespace\n", " param_type = param_type.strip()\n", "\n", " # Handle types with default values (e.g. \"str, default='London'\")\n", " if \",\" in param_type and \"default\" in param_type:\n", " param_type = param_type.split(\",\")[0].strip()\n", "\n", " # Handle types with just default values (e.g. 
\"default='London'\")\n", " if param_type.startswith(\"default=\"):\n", " return \"string\" # Default to string if only default value is given\n", "\n", " # Remove \", optional\" suffix if present\n", " param_type = param_type.replace(\", optional\", \"\").strip()\n", "\n", " # Handle complex types\n", " if param_type.startswith(\"Callable\"):\n", " return \"string\" # Represent callable as string in JSON schema\n", " if param_type.startswith(\"Tuple\"):\n", " return \"array\" # Represent tuple as array in JSON schema\n", " if param_type.startswith(\"List[\"):\n", " return \"array\"\n", " if param_type.startswith(\"Set\") or param_type == \"set\":\n", " return \"array\" # Represent set as array in JSON schema\n", "\n", " # Map common type variations to OpenAI spec types\n", " type_mapping: Dict[str, str] = {\n", " \"str\": \"string\",\n", " \"int\": \"integer\",\n", " \"float\": \"number\",\n", " \"bool\": \"boolean\",\n", " \"list\": \"array\",\n", " \"dict\": \"object\",\n", " \"List\": \"array\",\n", " \"Dict\": \"object\",\n", " \"set\": \"array\",\n", " \"Set\": \"array\"\n", " }\n", "\n", " if param_type in type_mapping:\n", " return type_mapping[param_type]\n", " else:\n", " print(f\"Unknown type: {param_type}\")\n", " return \"string\" # Default to string for unknown types\n", "\n", "\n", "def convert_tools_to_openai_spec(tools: Union[str, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:\n", " # If tools is a string, try to parse it as JSON\n", " if isinstance(tools, str):\n", " try:\n", " tools = json.loads(tools)\n", " except json.JSONDecodeError as e:\n", " print(f\"Failed to parse tools string as JSON: {e}\")\n", " return []\n", "\n", " # Ensure tools is a list\n", " if not isinstance(tools, list):\n", " print(f\"Expected tools to be a list, but got {type(tools)}\")\n", " return []\n", "\n", " openai_tools: List[Dict[str, Any]] = []\n", " for tool in tools:\n", " # Check if tool is a dictionary\n", " if not isinstance(tool, dict):\n", " print(f\"Expected tool to be a dictionary, but got {type(tool)}\")\n", " continue\n", "\n", " # Check if 'parameters' is a dictionary\n", " if not isinstance(tool.get(\"parameters\"), dict):\n", " print(f\"Expected 'parameters' to be a dictionary, but got {type(tool.get('parameters'))} for tool: {tool}\")\n", " continue\n", "\n", " \n", "\n", " normalized_parameters: Dict[str, Dict[str, Any]] = {}\n", " for param_name, param_info in tool[\"parameters\"].items():\n", " if not isinstance(param_info, dict):\n", " print(\n", " f\"Expected parameter info to be a dictionary, but got {type(param_info)} for parameter: {param_name}\"\n", " )\n", " continue\n", "\n", " # Create parameter info without default first\n", " param_dict = {\n", " \"description\": param_info.get(\"description\", \"\"),\n", " \"type\": normalize_type(param_info.get(\"type\", \"\")),\n", " }\n", "\n", " # Only add default if it exists, is not None, and is not an empty string\n", " default_value = param_info.get(\"default\")\n", " if default_value is not None and default_value != \"\":\n", " param_dict[\"default\"] = default_value\n", "\n", " normalized_parameters[param_name] = param_dict\n", "\n", " openai_tool = {\n", " \"type\": \"function\",\n", " \"function\": {\n", " \"name\": tool[\"name\"],\n", " \"description\": tool[\"description\"],\n", " \"parameters\": {\"type\": \"object\", \"properties\": normalized_parameters},\n", " },\n", " }\n", " openai_tools.append(openai_tool)\n", " return openai_tools\n", "\n", "\n", "def save_jsonl(filename, data):\n", " \"\"\"Write a list of 
JSON objects to a .jsonl file.\"\"\"\n", "    with open(filename, \"w\") as f:\n", "        for entry in data:\n", "            f.write(json.dumps(entry) + \"\\n\")\n", "\n", "\n", "def convert_tool_calls(xlam_tools):\n", "    \"\"\"Convert the xLAM tool-call format to OpenAI's tool schema.\"\"\"\n", "    tools = []\n", "    for tool in json.loads(xlam_tools):\n", "        tools.append({\"type\": \"function\", \"function\": {\"name\": tool[\"name\"], \"arguments\": tool.get(\"arguments\", {})}})\n", "    return tools\n", "\n", "\n", "def convert_example(example, dataset_type='single'):\n", "    \"\"\"Convert an xLAM dataset example to the OpenAI format.\"\"\"\n", "    obj = {\"messages\": []}\n", "\n", "    # User message\n", "    obj[\"messages\"].append({\"role\": \"user\", \"content\": example[\"query\"]})\n", "\n", "    # Tools\n", "    if example.get(\"tools\"):\n", "        obj[\"tools\"] = convert_tools_to_openai_spec(example[\"tools\"])\n", "\n", "    # Assistant message\n", "    assistant_message = {\"role\": \"assistant\", \"content\": \"\"}\n", "    if example.get(\"answers\"):\n", "        tool_calls = convert_tool_calls(example[\"answers\"])\n", "\n", "        if dataset_type == \"single\":\n", "            # Only include examples with a single tool call\n", "            if len(tool_calls) == 1:\n", "                assistant_message[\"tool_calls\"] = tool_calls\n", "            else:\n", "                return None\n", "        else:\n", "            # For other dataset types, include all tool calls\n", "            assistant_message[\"tool_calls\"] = tool_calls\n", "\n", "    obj[\"messages\"].append(assistant_message)\n", "\n", "    return obj" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code cell converts the example data to the OpenAI format required by NeMo Customizer." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "convert_example(example)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE**: By default, the `convert_example` function only retains data points that have exactly one tool call in the output, because the `llama-3.2-1b-instruct` model does not support parallel tool calls. For more information, refer to the [supported models](https://docs.nvidia.com/nim/large-language-models/latest/function-calling.html#supported-models) in the NIM documentation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Process Entire Dataset\n", "Convert each example by looping through the dataset, skipping any that `convert_example` filters out." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "all_examples = []\n", "with open(os.path.join(DATA_ROOT, \"xlam_openai_format.jsonl\"), \"w\") as f:\n", "    for example in dataset[\"train\"]:\n", "        converted = convert_example(example)\n", "        if converted is not None:\n", "            all_examples.append(converted)\n", "            f.write(json.dumps(converted) + \"\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Split Dataset\n", "This step splits the dataset into train, validation, and test sets. For demonstration, we use a smaller subset of the examples. You can modify `NUM_EXAMPLES` to use a larger subset; the quick check below shows how many examples are available."
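] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick check before choosing the subset size, you can confirm how many examples survived the single-tool-call filter and peek at one converted record. A minimal check, assuming the `all_examples` list built in the previous step:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Number of examples that survived the single-tool-call filter\n", "print(f\"Converted examples available: {len(all_examples)}\")\n", "\n", "# Peek at one converted record to confirm the OpenAI-style structure\n", "pprint(all_examples[0])"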
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Configure to change the size of dataset to use\n", "NUM_EXAMPLES = 5000\n", "\n", "assert NUM_EXAMPLES <= len(all_examples), f\"{NUM_EXAMPLES} exceeds the total number of available ({len(all_examples)}) data points\"" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ " # Randomly choose a subset\n", "sampled_examples = random.sample(all_examples, NUM_EXAMPLES)\n", "\n", "# Split into 70% training, 15% validation, 15% testing\n", "train_size = int(0.7 * len(sampled_examples))\n", "val_size = int(0.15 * len(sampled_examples))\n", "\n", "train_data = sampled_examples[:train_size]\n", "val_data = sampled_examples[train_size : train_size + val_size]\n", "test_data = sampled_examples[train_size + val_size :]\n", "\n", "# Save the training and validation splits. We will use test split in the next section\n", "save_jsonl(os.path.join(CUSTOMIZATION_DATA_ROOT, \"training.jsonl\"), train_data)\n", "save_jsonl(os.path.join(VALIDATION_DATA_ROOT,\"validation.jsonl\"), val_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Prepare Data for Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For evaluation, the NeMo Microservices platform uses a format with a minor modification to the OpenAI format. This requires `tools_calls` to be brought out of messages to create a distinct parallel field.\n", "- `messages` includes the user querytools includes a list of functions and parameters available to the LLM to choose from, as well as their descriptions.\n", "- `tool_calls` is the ground truth response to the user query. This response contains the function name(s) and associated argument(s) in a \"tool_calls\" dict." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following steps transform the test dataset into a format compatible with the NeMo Evaluator microservice.\n", "This dataset is for measuring accuracy metrics before and after customization." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def convert_example_eval(entry):\n", " \"\"\"Convert a single entry in the dataset to the evaluator format\"\"\"\n", "\n", " # Note: This is a WAR for a known bug with tool calling in NIM\n", " for tool in entry[\"tools\"]:\n", " if len(tool[\"function\"][\"parameters\"][\"properties\"]) > LIMIT_TOOL_PROPERTIES:\n", " return None\n", " \n", " new_entry = {\n", " \"messages\": [],\n", " \"tools\": entry[\"tools\"],\n", " \"tool_calls\": []\n", " }\n", " \n", " for msg in entry[\"messages\"]:\n", " if msg[\"role\"] == \"assistant\" and \"tool_calls\" in msg:\n", " new_entry[\"tool_calls\"] = msg[\"tool_calls\"]\n", " else:\n", " new_entry[\"messages\"].append(msg)\n", " \n", " return new_entry\n", "\n", "def convert_dataset_eval(data):\n", " \"\"\"Convert the entire dataset for evaluation by restructuring the data format.\"\"\"\n", " return [result for entry in data if (result := convert_example_eval(entry)) is not None]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`NOTE`: We have implemented a workaround for a known bug where tool calls freeze the NIM if a tool description includes a function with a larger number of parameters. As such, we have limited the dataset to use examples with available tools having at most 8 parameters. This will be resolved in the next NIM release." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "test_data_eval = convert_dataset_eval(test_data)\n", "save_jsonl(os.path.join(EVALUATION_DATA_ROOT, \"xlam-test-single.jsonl\"), test_data_eval)" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" } }, "nbformat": 4, "nbformat_minor": 2 }