[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/notebooks/Alpha_Llama_Stack_Post_Training.ipynb)

# [Alpha] Llama Stack Post Training
This notebook uses a real-world problem (improving an LLM as a tax preparer) to walk through the main Llama Stack post-training APIs, which help improve LLM performance for agentic apps. Supervised fine-tuning is supported today; RLHF and knowledge distillation are coming soon.

We also show how to use the existing Llama Stack [inference APIs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/apis/inference/inference.py) (with ollama as the provider) to get the new model's output, and the [eval APIs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/apis/eval/eval.py) to measure the new model's performance. We hope the flywheel of post-training -> eval -> inference will greatly empower agentic app development.


- Read more about Llama Stack: https://llamastack.github.io/
- Read more about post training APIs definition: https://github.com/meta-llama/llama-stack/blob/main/llama_stack/apis/post_training/post_training.py


Resource requirements:
- You can run this notebook with the Llama 3.2 3B Instruct model on Colab's **FREE** T4 GPU
- You can run this notebook with the Llama 3.1 8B Instruct model on Colab's A100 GPU or any GPU with more than 22GB of memory
- You need to spin up an ollama server on localhost (step-by-step instructions are provided in section 0.1)

> **Note**: Llama Stack post training APIs are in alpha release stage and still under heavy development


# 0. Bootstrapping Llama Stack Library
To run post training on Llama models, you need a post-training provider. Currently, the post-training APIs are powered by **torchtune** as the provider.

To learn more about torchtune: https://github.com/pytorch/torchtune

We will use [experimental-post-training](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/distributions/experimental-post-training) as the distribution template

####  0.0. Prerequisite: Have an OpenAI API key
In this showcase, we use [braintrust](https://www.braintrust.dev/) as the scoring provider for eval. It uses an OpenAI model as the judge for scoring, so you need an API key from the [OpenAI developer platform](https://platform.openai.com/docs/overview).


> **Note:** Set the API key in the Secrets of this notebook as `OPENAI_API_KEY`
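
When running outside Colab there is no Secrets panel, so the key has to come from somewhere else. The following is a minimal sketch, assuming the key has been exported as an environment variable; `require_env` is a hypothetical helper and not part of Llama Stack.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Failing early is easier to debug than a scoring error deep inside an eval run.
        raise RuntimeError(f"{name} is not set; the judge model needs it for scoring.")
    return value
```

Calling `require_env("OPENAI_API_KEY")` before initializing the client surfaces a missing key immediately.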

You can choose the [scoring providers](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/inline/scoring) and scoring functions that fit your needs.



In [None]:
!pip install git+https://github.com/meta-llama/llama-stack.git #TODO: update this after the next pkg release

Collecting git+https://github.com/meta-llama/llama-stack.git
  Cloning https://github.com/meta-llama/llama-stack.git (to revision hf_format_checkpointer) to /tmp/pip-req-build-j_1bxqzm
  Running command git clone --filter=blob:none --quiet https://github.com/meta-llama/llama-stack.git /tmp/pip-req-build-j_1bxqzm
  Running command git checkout -b hf_format_checkpointer --track origin/hf_format_checkpointer
  Switched to a new branch 'hf_format_checkpointer'
  Branch 'hf_format_checkpointer' set up to track remote branch 'hf_format_checkpointer' from 'origin'.
  Resolved https://github.com/meta-llama/llama-stack.git to commit 0fb674d77bb1a84d4e2dc9825102849ea06ba17b
  Running command git submodule update --init --recursive -q


In [None]:
!llama stack build --distro experimental-post-training --image-type venv --image-name __system__

Installing dependencies in system Python environment
Using Python 3.11.11 environment at: /usr
Audited 1 package in 176ms
Installing pip dependencies
Using Python 3.11.11 environment at: /usr
Resolved 130 packages in 1.82s
   Building fairscale==0.4.13
   Building antlr4-python3-runtime==4.9.3
   Building zmq==0.0.0
Preparing packages... (0/46)
   Building antlr4-python

#### 0.1. spin up ollama server

We need to spin up an [ollama](https://github.com/ollama/ollama) server on localhost to run inference and eval.

First, we install xterm so that we can run command-line tools.

In [None]:
!pip install uv colab-xterm
%load_ext colabxterm

Collecting uv
  Downloading uv-0.6.3-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting colab-xterm
  Downloading colab_xterm-0.2.0-py3-none-any.whl.metadata (1.2 kB)
Downloading uv-0.6.3-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.2 MB)
Downloading colab_xterm-0.2.0-py3-none-any.whl (115 kB)
Installing collected packages: uv, colab-xterm
Successfully installed colab-xterm-0.2.0 uv-0.6.3


In [None]:
!curl https://ollama.ai/install.sh | sh

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13269    0 13269    0     0  37986      0 --:--:-- --:--:-- --:--:-- 38020
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
############################################################################################# 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


Next, launch xterm to run ollama as an independent process that stays alive. We chose the Llama 3.2 3B Instruct model for our tax preparation task, so that is the model we run on ollama:


```
ollama serve &
ollama run llama3.2:3b --keepalive 120m
```

In [None]:
%xterm

# ollama serve &
# ollama run llama3.2:3b --keepalive 120m

Launching Xterm...

<IPython.core.display.Javascript object>

Check which model is running on ollama

In [None]:
!ollama ps

NAME           ID              SIZE      PROCESSOR    UNTIL            
llama3.2:3b    a80c4f17acd5    4.0 GB    100% GPU     2 hours from now    
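
To check the running model programmatically instead of reading the table by eye, the `ollama ps` output can be parsed. This is a rough sketch that assumes columns are separated by two or more spaces, as in the output above; `parse_ollama_ps` is a hypothetical helper, and the table layout may differ between ollama versions.

```python
import re


def parse_ollama_ps(text: str) -> list[dict[str, str]]:
    """Parse the table printed by `ollama ps` into one dict per running model."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    # Columns are split on runs of 2+ spaces so values like "4.0 GB" stay intact.
    header = re.split(r"\s{2,}", lines[0])
    return [dict(zip(header, re.split(r"\s{2,}", line))) for line in lines[1:]]


# Sample copied from the cell output above.
sample = """NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.2:3b    a80c4f17acd5    4.0 GB    100% GPU     2 hours from now
"""
running = {row["NAME"] for row in parse_ollama_ps(sample)}
```

In the notebook, the text could come from `subprocess.run(["ollama", "ps"], capture_output=True, text=True).stdout`, and a check such as `"llama3.2:3b" in running` can guard the later eval steps.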


In [None]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-5.3.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.3.0-py3-none-any.whl (300 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.3.0


Start the llama stack server

In [None]:
import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

from llama_stack.core.library_client import LlamaStackAsLibraryClient
client = LlamaStackAsLibraryClient("experimental-post-training")
_ = client.initialize()

INFO:datasets:PyTorch version 2.5.1+cu124 available.
INFO:datasets:Polars version 1.9.0 available.
INFO:datasets:Duckdb version 1.1.3 available.
INFO:datasets:TensorFlow version 2.18.0 available.
INFO:datasets:JAX version 0.4.33 available.
INFO:llama_stack.core.stack:Scoring_fns: basic::equality served by basic
INFO:llama_stack.core.stack:Scoring_fns: basic::subset_of served by basic
INFO:llama_stack.core.stack:Scoring_fns: basic::regex_parser_multiple_choice_answer served by basic
INFO:llama_stack.core.stack:Scoring_fns: braintrust::factuality served by braintrust
INFO:llama_stack.core.stack:Scoring_fns: braintrust::answer-correctness served by braintrust
INFO:llama_stack.core.stack:Scoring_fns: braintrust::answer-relevancy served by braintrust
INFO:llama_stack.core.stack:Scoring_fns: braintrust::answer-similarity served by braintrust
INFO:llama_stack.core.stack:Scoring_fns: braintrust::faithfulness served by braintrust
INFO:llama_stack.core.stack:Scoring_fns: braintrust::context-enti

# 1. Eval the native Llama model
First, we measure the performance of the native Llama 3.2 3B Instruct model as a tax preparer.

#### 1.0. Prepare the eval dataset

We prepared a synthetic tax Q&A dataset, generated with the Llama 3.3 70B model: [tax_preparation_eval.csv](https://gist.github.com/SLR722/0420c558ec681b00ed05fa1171505a38) (data source: https://github.com/shadi-fsai/modeluniversity/blob/main/test_questions.json).

- You can create your own eval dataset, as long as it respects the Llama Stack [eval dataset format](https://github.com/meta-llama/llama-stack/blob/91907b714e825a1bfbca5271e0f403aab5f10752/llama_stack/providers/utils/common/data_schema_validator.py#L43)



In [None]:
import requests

# Upload the example dataset from github to notebook
url = 'https://gist.githubusercontent.com/SLR722/0420c558ec681b00ed05fa1171505a38/raw/dbc7ab86e71e808c4bae50b68b8bff60c1d239a5/tax_preparation_eval.csv'
r = requests.get(url)
with open('tax_preparation_eval.csv', 'wb') as f:
    f.write(r.content)

# You can use the below comment out code to upload your local file to the notebook
# from google.colab import files

# uploaded = files.upload()

In [None]:
import mimetypes
import base64

# encode the dataset file into data_url
def data_url_from_file(file_path: str) -> str:
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"File not found: {file_path}")

    with open(file_path, "rb") as file:
        file_content = file.read()

    base64_content = base64.b64encode(file_content).decode("utf-8")
    mime_type, _ = mimetypes.guess_type(file_path)

    data_url = f"data:{mime_type};base64,{base64_content}"

    return data_url

data_url = data_url_from_file("tax_preparation_eval.csv")

# register the eval dataset
response = client.datasets.register(
    purpose="eval/messages-answer",
    source={
        "type": "uri",
        "uri": data_url,
    },
    dataset_id="eval_dataset",
)

00:30:00.325 [START] /v1/datasets


#### 1.1. Register the eval model candidate with [models APIs](https://github.com/meta-llama/llama-stack/blob/e3f187fb83f2c45d5f838663658a873fb0fcc6d9/llama_stack/apis/models/models.py)
Since we use ollama as the inference provider, we set `provider_id` to 'ollama' during model registration.


In [None]:
from rich.pretty import pprint

response = client.models.register(
    model="meta-llama/Llama-3.2-3B-Instruct",
    provider_id="ollama",
    provider_model_id="llama3.2:3b",
    # base model id
    metadata={"llama_model": "meta-llama/Llama-3.2-3B-Instruct"},
)

pprint(response)

INFO:httpx:HTTP Request: GET http://localhost:11434/api/ps "HTTP/1.1 200 OK"


00:30:29.540 [START] /v1/models


#### 1.2. Kick-off eval job
- More details on Llama-stack eval: https://llamastack.github.io/latest/references/evals_reference/index.html
  - Define an EvalCandidate
  - Run evaluate on datasets (we choose braintrust's answer-similarity as the scoring function, with an OpenAI model as the judge model)

  > **Note**: If the eval process gets stuck, restart the ollama server and try again




In [None]:
eval_rows = client.datasetio.get_rows_paginated(
    dataset_id="eval_dataset",
    limit=-1,
)

from tqdm import tqdm

client.benchmarks.register(
    benchmark_id="llama3.2-3B-instruct:tax_eval",
    dataset_id="eval_dataset",
    scoring_functions=["braintrust::answer-similarity"]
)

response = client.eval.evaluate_rows(
    benchmark_id="llama3.2-3B-instruct:tax_eval",
    input_rows=eval_rows.data,
    scoring_functions=["braintrust::answer-similarity"],
    benchmark_config={
        "type": "benchmark",
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama-3.2-3B-Instruct",
            "sampling_params": {
                "temperature": 0.0,
                "max_tokens": 4096,
                "top_p": 0.9,
                "repeat_penalty": 1.0,
            },
        }
    }
)
pprint(response)

00:35:56.357 [START] /v1/datasetio/rows
00:35:56.357 [END] /v1/datasetio/rows [StatusCode.OK] (0.31ms)
00:35:56.369 [START] /v1/eval/benchmarks


  0%|          | 0/43 [00:00<?, ?it/s]

00:35:56.378 [END] /v1/eval/benchmarks [StatusCode.OK] (8.48ms)
00:35:56.397 [START] /v1/eval/benchmarks/llama3.2-3B-instruct:tax_eval/evaluations


INFO:httpx:HTTP Request: POST http://localhost:11434/api/generate "HTTP/1.1 200 OK"
  2%|▏         | 1/43 [00:02<01:56,  2.78s/it]INFO:httpx:HTTP Request: POST http://localhost:11434/api/generate "HTTP/1.1 200 OK"
  5%|▍         | 2/43 [00:07<02:39,  3.89s/it]INFO:httpx:HTTP Request: POST http://localhost:11434/api/generate "HTTP/1.1 200 OK"
  7%|▋         | 3/43 [00:10<02:20,  3.52s/it]INFO:httpx:HTTP Request: POST http://localhost:11434/api/generate "HTTP/1.1 200 OK"
  9%|▉         | 4/43 [00:14<02:18,  3.56s/it]INFO:httpx:HTTP Request: POST http://localhost:11434/api/generate "HTTP/1.1 200 OK"
 12%|█▏        | 5/43 [00:17<02:17,  3.63s/it]INFO:httpx:HTTP Request: POST http://localhost:11434/api/generate "HTTP/1.1 200 OK"
 14%|█▍        | 6/43 [00:21<02:09,  3.49s/it]INFO:httpx:HTTP Request: POST http://localhost:11434/api/generate "HTTP/1.1 200 OK"
 16%|█▋        | 7/43 [00:24<02:07,  3.55s/it]INFO:httpx:HTTP Request: POST http://localhost:11434/api/generate "HTTP/1.1 200 OK"
 19%|█

The results show that the native Llama 3.2 3B Instruct model achieved an average score of 0.4899 on the tax Q&A eval dataset. Let's see if we can boost the LLM's performance with post training.
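
The average is simply the mean of the per-row scores returned by the scoring function. The sketch below shows that computation on toy data; it assumes each scored row is a dict with a `score` key (the exact shape of the eval response is an assumption here, so inspect the printed `response` before relying on it). `average_score` is a hypothetical helper.

```python
from statistics import mean


def average_score(score_rows):
    """Mean of the per-row scores, skipping rows the judge could not score."""
    scores = [float(row["score"]) for row in score_rows if row.get("score") is not None]
    if not scores:
        raise ValueError("no scored rows to average")
    return mean(scores)


# Toy rows for illustration only; these are not results from this notebook.
toy_rows = [{"score": 0.25}, {"score": 0.75}, {"score": None}]
toy_average = average_score(toy_rows)
```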

# 2. Start Post Training
Currently, the Llama Stack post-training APIs support [Supervised Fine-tune](https://cameronrwolfe.substack.com/p/understanding-and-using-supervised), a straightforward and effective way to boost model performance on specific tasks.

We start with the [LoRA finetune algorithm](https://pytorch.org/torchtune/main/tutorials/lora_finetune.html#what-is-lora), which significantly reduces GPU memory usage during finetuning and needs less data.
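
The memory saving comes from training far fewer parameters. LoRA freezes a weight matrix W of shape (d_out, d_in) and learns a low-rank update scaled by alpha / rank, built from two small matrices A (rank x d_in) and B (d_out x rank). The sketch below counts trainable parameters for one projection; the 3072 x 3072 size is an illustrative assumption, while rank 8 and alpha 16 match the values used in the training config in step 2.2.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when finetuning the whole weight matrix."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scale(alpha: float, rank: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return alpha / rank


# Illustrative square projection; the dimension is an assumption, not read from the model.
d = 3072
full = full_params(d, d)
lora = lora_trainable_params(d, d, rank=8)
fraction = lora / full
scale = lora_scale(alpha=16, rank=8)
```

For this example the LoRA factors hold about 0.5% of the parameters of the full matrix.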


#### 2.0. Download the base model
Download the Llama model using the [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/guides/cli).

Since ollama takes checkpoints in the Hugging Face safetensors format, the finetuned checkpoint needs to be output in Hugging Face format, so we download the base model checkpoint from Hugging Face.

> You need to authenticate with Hugging Face by getting your token from [here](https://huggingface.co/settings/tokens) and running `huggingface-cli login`

In [None]:
!huggingface-cli download meta-llama/Llama-3.2-3B-Instruct --local-dir ~/.llama/Llama-3.2-3B-Instruct

####  2.1. Prepare post training dataset
Llama Stack supports two post-training dataset formats (instruct and dialog). You select which format to use in step 2.2.
- instruct dataset:
  - schema:
      - chat_completion_input: string (list of UserMessage, the length of the list is 1)
      - expected_answer: string
  - this format is an abstraction of single-turn QA-style datasets. During training, the tokenized chat_completion_input + expected_answer is the model input, and expected_answer is the label used to calculate the loss
  - [example](https://gist.github.com/SLR722/b4ae7c8b05a0ea1a067e5262eb137ee2)

- dialog dataset:
  - schema:
      - dialog: string (list of interleaved UserMessages and AssistantMessages)
  - this format is an abstraction of multi-turn chat-style datasets. During training, the tokenized UserMessage content + AssistantMessage content + UserMessage content + AssistantMessage content ... are concatenated to form the model input, and the AssistantMessage contents are the labels used to calculate the loss
  - [example](https://gist.github.com/SLR722/20b3929032bc3a94cce3b8cc57788216)


 - Example scripts for converting a JSON-format dataset to a Llama Stack format dataset ([to_llama_stack_dataset_instruct.py](https://gist.github.com/SLR722/3a76491190ce3225be935cc63c5332e6), [to_llama_stack_dataset_dialog.py](https://gist.github.com/SLR722/89dd6e41fab4505c327bd3fa99ea2f54))
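
The difference between "model input" and "label" in the two formats can be illustrated with a small masking sketch. It is conceptual only: it works on already-tokenized integer ids, uses -100 (the default `ignore_index` of PyTorch's cross-entropy loss) to mark positions that do not contribute to the loss, and leaves out special tokens and next-token shifting. The function names are hypothetical and this is not the torchtune implementation.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_instruct_example(prompt_ids, answer_ids):
    """Instruct format: input is prompt + answer, only answer tokens are labels."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels


def build_dialog_example(turns):
    """Dialog format: all turns are concatenated, only assistant tokens are labels.

    `turns` is a list of (role, token_ids) pairs with role "user" or "assistant".
    """
    input_ids, labels = [], []
    for role, ids in turns:
        input_ids.extend(ids)
        labels.extend(ids if role == "assistant" else [IGNORE_INDEX] * len(ids))
    return input_ids, labels


instruct = build_instruct_example([1, 2, 3], [4, 5])
dialog = build_dialog_example(
    [("user", [1, 2]), ("assistant", [3]), ("user", [4]), ("assistant", [5, 6])]
)
```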



In our tax preparer example, we prepared a tax Q&A training dataset with synthetic data generated by the Llama 3.3 70B model: [tax_preparation_train.csv](https://gist.github.com/SLR722/49a8ce78fc705c0437523d3625c29b5d) (data source: https://github.com/shadi-fsai/modeluniversity/blob/main/trainable_data.json). It has no overlap with the eval dataset.

Since the tax Q&A dataset is single-turn Q&A, we use the instruct dataset format for post training.
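
If you want to build your own instruct-format file, the sketch below turns question/answer pairs into a CSV with the two columns from the schema above. It assumes that `chat_completion_input` holds a JSON-encoded list with a single user message using `role` and `content` keys; treat the linked example gist and conversion scripts as the authoritative reference for the exact encoding. The helper names and the sample pair are illustrative.

```python
import csv
import io
import json

COLUMNS = ["chat_completion_input", "expected_answer"]


def to_instruct_rows(qa_pairs):
    """Convert (question, answer) pairs into instruct-format rows."""
    rows = []
    for question, answer in qa_pairs:
        # Assumed encoding: a JSON list holding exactly one user message.
        messages = [{"role": "user", "content": question}]
        rows.append(
            {"chat_completion_input": json.dumps(messages), "expected_answer": answer}
        )
    return rows


def rows_to_csv(rows) -> str:
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Toy pair for illustration only.
csv_text = rows_to_csv(to_instruct_rows([("What is a W-2?", "A wage and tax statement.")]))
```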

> **Note:** If you hit an input schema issue, you probably need to restart the runtime for your fix to take effect

In [None]:
import requests

# Upload the example dataset from github to notebook
url = 'https://gist.githubusercontent.com/SLR722/49a8ce78fc705c0437523d3625c29b5d/raw/045f05be9cb6ebd5171fbdfce3306644ee435469/tax_preparation_train.csv'
r = requests.get(url)
with open('tax_preparation_train.csv', 'wb') as f:
    f.write(r.content)

# You can use the below comment out code to upload your local file to the notebook
# from google.colab import files

# uploaded = files.upload()

In [None]:
import os
import mimetypes
import base64

# encode the dataset file into data_url
def data_url_from_file(file_path: str) -> str:
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"File not found: {file_path}")

    with open(file_path, "rb") as file:
        file_content = file.read()

    base64_content = base64.b64encode(file_content).decode("utf-8")
    mime_type, _ = mimetypes.guess_type(file_path)

    data_url = f"data:{mime_type};base64,{base64_content}"

    return data_url

data_url = data_url_from_file("tax_preparation_train.csv")

# register post training dataset
# use the below commented out version for dialog dataset
response = client.datasets.register(
    purpose="post-training/messages",
    source={
        "type": "uri",
        "uri": data_url,
    },
    dataset_id="post_training_dataset",
)


# response = client.datasets.register(
#     dataset_id="post_training_dataset",
#     provider_id="localfs",
#     url={"uri": data_url},
#     dataset_schema={
#         "dialog": {"type": "dialog"},
#     },
# )

00:42:16.035 [START] /v1/datasets


#### 2.2. Kick-off Post Training Job

You can find the definition of post-training configs and APIs [here for server side](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/apis/post_training/post_training.py) and [here for client side](https://github.com/meta-llama/llama-stack-client-python/blob/d6f3ef24b740c996b29c0540bc6b4e996de0a168/src/llama_stack_client/types/post_training_supervised_fine_tune_params.py).

> **Note**: If you hit a 'Job xxx already exists' error, also check the error logs above it. Because of the retry logic, 'Job xxx already exists' may not be the root cause of the job failure

In [None]:
from llama_stack_client.types.post_training_supervised_fine_tune_params import (
    TrainingConfig,
    TrainingConfigDataConfig,
    TrainingConfigEfficiencyConfig,
    TrainingConfigOptimizerConfig,
)
from llama_stack_client.types.algorithm_config_param import LoraFinetuningConfig
from rich.pretty import pprint

algorithm_config = LoraFinetuningConfig(
    type="LoRA",
    # List of which linear layers LoRA should be applied to in each self-attention block
    # Options are {"q_proj", "k_proj", "v_proj", "output_proj"}.
    lora_attn_modules=["q_proj", "v_proj", "output_proj"],
    # Whether to apply LoRA to the MLP in each transformer layer. Default: False
    apply_lora_to_mlp=True,
    # Whether to apply LoRA to the model's final output projection. Default: False
    apply_lora_to_output=False,
    # Rank of each low-rank approximation
    rank=8,
    # Scaling factor for the low-rank approximation
    alpha=16,
)

data_config = TrainingConfigDataConfig(
    # Identifier of the registered dataset for finetune
    # Use client.datasets.list() to check all the available datasets
    dataset_id="post_training_dataset",
    # Identifier of the registered dataset used to validate the finetuned model
    # on validation_loss and perplexity
    # Skip this if you don't want to run validation on the model
    validation_dataset_id="post_training_dataset",
    # Training data batch size
    batch_size=2,
    # Whether to shuffle the dataset.
    shuffle=False,
    # dataset format, select from ['instruct', 'dialog']
    # change it to 'dialog' if you use dialog format dataset
    data_format='instruct',
)
optimizer_config = TrainingConfigOptimizerConfig(
    # Currently only support adamw
    optimizer_type="adamw",
    # Learning rate
    lr=3e-4,
    # adamw weight decay coefficient
    weight_decay=0.1,
    # The number of steps for the warmup phase for lr scheduler
    num_warmup_steps=10,
)
efficiency_config = TrainingConfigEfficiencyConfig(
    # Help reduce memory by recalculating some intermediate activations
    # during backward
    enable_activation_checkpointing=True,
    # We offer another memory efficiency flag called enable_activation_offloading
    # which moves certain activations from GPU memory to CPU memory
    # This further reduces GPU memory usage at the cost of additional
    # data transfer overhead and possible slowdowns
    # enable_activation_offloading=False,
)
training_config = TrainingConfig(
    # num of training epochs
    n_epochs=1,
    data_config=data_config,
    efficiency_config=efficiency_config,
    optimizer_config=optimizer_config,
    # max num of training steps per epoch
    max_steps_per_epoch=10000,
    # max num of steps for validation
    max_validation_steps=10,
    # Number of steps over which gradients are accumulated before each
    # model parameter update. This simulates large-batch training when memory is limited
    gradient_accumulation_steps=4,
)

# call supervised finetune API
training_job = client.post_training.supervised_fine_tune(
    job_uuid="1234",
    # Base Llama model to be finetuned on
    model="meta-llama/Llama-3.2-3B-Instruct",
    algorithm_config=algorithm_config,
    training_config=training_config,
    # Base model checkpoint dir
    # By default, the implementation will look at ~/.llama/checkpoints/<model>
    checkpoint_dir="null",
    # logger_config and hyperparam_search_config are not supported yet
    logger_config={},
    hyperparam_search_config={},
)

pprint(training_job)


DEBUG:torchtune.utils._logging:Setting manual seed to local seed 28602197. Local seed is seed + rank = 28602197 + 0
INFO:torchtune.utils._logging:Identified model_type = Llama3_2. Ignoring output.weight in checkpoint in favor of the tok_embedding.weight tied weights.


00:43:22.604 [START] /v1/post-training/supervised-fine-tune


INFO:torchtune.utils._logging:Memory stats after model init:
	GPU peak memory allocation: 6.07 GiB
	GPU peak memory reserved: 6.11 GiB
	GPU peak memory active: 6.07 GiB
INFO:llama_stack.providers.inline.post_training.torchtune.recipes.lora_finetuning_single_device:Model is initialized with precision torch.bfloat16.
INFO:llama_stack.providers.inline.post_training.torchtune.recipes.lora_finetuning_single_device:Tokenizer is initialized.
INFO:llama_stack.providers.inline.post_training.torchtune.recipes.lora_finetuning_single_device:Optimizer is initialized.
INFO:llama_stack.providers.inline.post_training.torchtune.recipes.lora_finetuning_single_device:Loss is initialized.
INFO:llama_stack.providers.inline.post_training.torchtune.recipes.lora_finetuning_single_device:Dataset and Sampler are initialized.
INFO:llama_stack.providers.inline.post_training.torchtune.recipes.lora_finetuning_single_device:Learning rate scheduler is initialized.


Writing logs to /root/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/log/log_1740530605.txt


1|1|Loss: 1.389875888824463:   1%|          | 1/153 [00:02<06:02,  2.38s/it]INFO:torchtune.utils._logging:Memory stats after model init:
	GPU peak memory allocation: 6.30 GiB
	GPU peak memory reserved: 6.47 GiB
	GPU peak memory active: 6.30 GiB
1|2|Loss: 1.416195273399353:   1%|▏         | 2/153 [00:03<04:24,  1.75s/it]INFO:torchtune.utils._logging:Memory stats after model init:
	GPU peak memory allocation: 6.35 GiB
	GPU peak memory reserved: 6.47 GiB
	GPU peak memory active: 6.35 GiB
1|3|Loss: 1.5175566673278809:   2%|▏         | 3/153 [00:05<03:54,  1.56s/it]INFO:torchtune.utils._logging:Memory stats after model init:
	GPU peak memory allocation: 6.30 GiB
	GPU peak memory reserved: 6.50 GiB
	GPU peak memory active: 6.30 GiB
1|4|Loss: 1.463149905204773:   3%|▎         | 4/153 [00:06<03:55,  1.58s/it] INFO:torchtune.utils._logging:Memory stats after model init:
	GPU peak memory allocation: 6.33 GiB
	GPU peak memory reserved: 6.50 GiB
	GPU peak memory active: 6.33 GiB
1|5|Loss: 1.500417
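
The `batch_size` and `gradient_accumulation_steps` settings in the training config combine into an effective batch size, which in turn determines how many parameter updates an epoch contains. The sketch below shows the arithmetic under a simplifying assumption (partial batches are rounded up); the exact rounding in the torchtune recipe may differ, and the 1,000-example dataset is hypothetical.

```python
import math


def effective_batch_size(batch_size: int, gradient_accumulation_steps: int) -> int:
    """Number of examples that contribute to one parameter update."""
    return batch_size * gradient_accumulation_steps


def optimizer_steps_per_epoch(
    num_examples: int,
    batch_size: int,
    gradient_accumulation_steps: int,
    max_steps_per_epoch=None,
) -> int:
    """Approximate parameter updates per epoch, optionally capped."""
    steps = math.ceil(
        num_examples / effective_batch_size(batch_size, gradient_accumulation_steps)
    )
    if max_steps_per_epoch is not None:
        steps = min(steps, max_steps_per_epoch)
    return steps


# Config values from this notebook: batch_size=2, gradient_accumulation_steps=4.
per_update = effective_batch_size(2, 4)
# Hypothetical dataset size, for illustration only.
steps = optimizer_steps_per_epoch(1000, 2, 4, max_steps_per_epoch=10000)
```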

#### 2.3. list all the post training jobs

In [None]:
job_list = client.post_training.job.list()
pprint(job_list)

00:48:43.629 [START] /v1/post-training/jobs


#### 2.4. query the job status of a given post training job
The job status includes the finetuned checkpoint metadata (with validation metrics, if available) and the job metadata.
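
When a job runs asynchronously, the status can be polled until it reaches a terminal state. The helper below is a generic sketch: it takes any zero-argument callable that returns a status string, so it does not depend on the client. The status strings and the `wait_for_job` name are assumptions; check the printed `job_status` object for the actual field and values.

```python
import time

# Assumed terminal status strings; verify them against the real job status output.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(get_status, poll_seconds=30.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Poll `get_status()` until it returns a terminal status or the timeout passes."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still '{status}' after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

With the client from this notebook, the callable could wrap `client.post_training.job.status(job_uuid="1234")` and return whichever field of the response carries the status.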

In [None]:
job_status = client.post_training.job.status(job_uuid='1234')
pprint(job_status)

00:49:06.134 [START] /v1/post-training/job/status


#### 2.5. get list of post training job artifacts (finetuned checkpoints)

In [None]:
job_artifacts = client.post_training.job.artifacts(job_uuid='1234')
pprint(job_artifacts)

00:49:12.609 [START] /v1/post-training/job/artifacts


# 3. Run Inference on the new model
Woohoo! The new model, finetuned on tax Q&A data, is ready. Now it's time to run inference and see some responses from the model we just made!

#### 3.0. Create a new model on ollama
Please refer to [this doc](https://github.com/ollama/ollama/blob/main/docs/import.md) for more details on how to create a customized model from a Hugging Face safetensors-format adapter.

We need to launch xterm and enter the commands below:


```
mkdir adapter

# copy the adapter checkpoints of the finetuned model from Colab to xterm
cp /root/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/adapter/adapter_config.json ./adapter/
cp /root/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/adapter/adapter_model.safetensors ./adapter/

# create a Modelfile
# Configure the base model in FROM
# and the path to the adapter checkpoints in ADAPTER
# ('>' overwrites, so re-running this step does not append duplicate lines)
echo -e "FROM llama3.2\nADAPTER /content/adapter" > Modelfile

# create the new model
ollama create llama_3_2_finetuned
ollama run llama_3_2_finetuned --keepalive 120m
```

> **TODO**: We plan to streamline this part by managing the finetuned checkpoints across the post-training and inference providers with the /files API, and by moving the ollama custom-model creation above into the register_model method
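
The Modelfile can also be written from a notebook cell instead of xterm. The sketch below produces the same two directives (`FROM` and `ADAPTER`) as the shell command above; `build_modelfile` and `write_modelfile` are hypothetical helpers, and the paths in the usage note are the ones used in this notebook.

```python
from pathlib import Path


def build_modelfile(base_model: str, adapter_dir: str) -> str:
    """Return Modelfile text that layers a LoRA adapter on top of a base model."""
    return f"FROM {base_model}\nADAPTER {adapter_dir}\n"


def write_modelfile(path, base_model: str, adapter_dir: str) -> Path:
    """Write the Modelfile, overwriting any previous version at `path`."""
    target = Path(path)
    target.write_text(build_modelfile(base_model, adapter_dir))
    return target
```

For example, `write_modelfile("Modelfile", "llama3.2", "/content/adapter")` followed by `ollama create llama_3_2_finetuned` in the same directory mirrors the manual steps.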

In [None]:
%xterm

Launching Xterm...

<IPython.core.display.Javascript object>

Check that the finetuned model is running on the ollama server

In [None]:
!ollama ps

NAME                          ID              SIZE      PROCESSOR    UNTIL            
llama_3_2_finetuned:latest    a73e7ad20955    4.0 GB    100% GPU     2 hours from now    


#### 3.1. Register the new model

In [None]:
response = client.models.register(
    # the model id here needs to be the finetuned checkpoint identifier
    model="meta-llama/Llama-3.2-3B-Instruct-sft-0",
    provider_id="ollama",
    provider_model_id="llama_3_2_finetuned:latest",
    # base model id
    metadata={"llama_model": "meta-llama/Llama-3.2-3B-Instruct"},
)

pprint(response)

INFO:httpx:HTTP Request: GET http://localhost:11434/api/ps "HTTP/1.1 200 OK"


00:53:05.319 [START] /v1/models


#### 3.2. Call the Llama Stack [inference APIs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/apis/inference/inference.py) to run inference

In [None]:
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct-sft-0",
    messages=[
        {"role": "user", "content": "What is the primary purpose of a W-2 form in relation to income tax?"}
    ],
)

print(response.choices[0].message.content)

00:53:56.013 [START] /v1/inference/chat-completion


INFO:httpx:HTTP Request: POST http://localhost:11434/api/generate "HTTP/1.1 200 OK"


To report an employee's income and taxes withheld. My explanation: The W-2 form is used by employers to report an employee's income, taxes withheld, and other relevant information to the IRS.


# 4. Run evaluation on the finetuned checkpoints
The finetuned checkpoint is naturally compatible with the Llama Stack [eval APIs](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/apis/eval/eval.py).

Let's re-run the evaluation sub-steps from step 1 to see if post training gives us a meaningful improvement.

In [None]:
# Fetch every row of the eval dataset (limit=-1)
eval_rows = client.datasetio.get_rows_paginated(
    dataset_id="eval_dataset",
    limit=-1,
)

from tqdm import tqdm


system_message = {
    "role": "system",
    "content": "You are a tax preparer.",
}

client.benchmarks.register(
    benchmark_id="Llama-3.2-3B-Instruct-sft-0:tax_eval",
    dataset_id="eval_dataset",
    scoring_functions=["braintrust::answer-similarity"]
)

response = client.eval.evaluate_rows(
    benchmark_id="Llama-3.2-3B-Instruct-sft-0:tax_eval",
    input_rows=eval_rows.data,
    scoring_functions=["braintrust::answer-similarity"],
    benchmark_config={
        "type": "benchmark",
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama-3.2-3B-Instruct-sft-0",
            "sampling_params": {
                "temperature": 0.0,
                "max_tokens": 4096,
                "top_p": 0.9,
                "repeat_penalty": 1.0,
            },
            "system_message": system_message
        }
    }
)
pprint(response)

00:55:41.833 [START] /v1/datasetio/rows
00:55:41.833 [END] /v1/datasetio/rows [StatusCode.OK] (0.21ms)
00:55:41.848 [START] /v1/eval/benchmarks
00:55:41.858 [END] /v1/eval/benchmarks [StatusCode.OK] (9.47ms)
00:55:41.874 [START] /v1/eval/benchmarks/Llama-3.2-3B-Instruct-sft-0:tax_eval/evaluations


  0%|          | 0/43 [00:00<?, ?it/s]INFO:httpx:HTTP Request: POST http://localhost:11434/api/generate "HTTP/1.1 200 OK"
  2%|▏         | 1/43 [00:00<00:33,  1.27it/s]INFO:httpx:HTTP Request: POST http://localhost:11434/api/generate "HTTP/1.1 200 OK"
  5%|▍         | 2/43 [00:01<00:29,  1.40it/s]INFO:httpx:HTTP Request: POST http://localhost:11434/api/generate "HTTP/1.1 200 OK"
  7%|▋         | 3/43 [00:02<00:28,  1.42it/s]INFO:httpx:HTTP Request: POST http://localhost:11434/api/generate "HTTP/1.1 200 OK"
  9%|▉         | 4/43 [00:02<00:25,  1.53it/s]INFO:httpx:HTTP Request: POST http://localhost:11434/api/generate "HTTP/1.1 200 OK"
 12%|█▏        | 5/43 [00:03<00:23,  1.64it/s]INFO:httpx:HTTP Request: POST http://localhost:11434/api/generate "HTTP/1.1 200 OK"
 14%|█▍        | 6/43 [00:03<00:21,  1.68it/s]INFO:httpx:HTTP Request: POST http://localhost:11434/api/generate "HTTP/1.1 200 OK"
 16%|█▋        | 7/43 [00:04<00:20,  1.74it/s]INFO:httpx:HTTP Request: POST http://localhost:11434

Wow, see that? We improved the eval score from 0.4899 to 0.5803 (**18.5% improvement**) with a ~1000-sample dataset and a few minutes of training on a single GPU!
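
The 18.5% figure is the relative gain over the baseline score, not the absolute difference between the two scores. A short check of the arithmetic, using the two averages reported in this notebook (`relative_improvement` is a hypothetical helper):

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Relative change of `new` over `baseline`, as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


baseline_score = 0.4899   # native Llama 3.2 3B Instruct (step 1)
finetuned_score = 0.5803  # finetuned checkpoint (step 4)

absolute_gain = finetuned_score - baseline_score
relative_gain = relative_improvement(baseline_score, finetuned_score)
print(f"absolute: {absolute_gain:.4f}, relative: {relative_gain:.1%}")
```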


This is just a start. There are several tricks in parameter tuning, training dataset processing, and more that you can explore to further boost finetuning performance.

Now, it's time to enhance your own agentic application with post training. Happy tuning!