Commit graph

18 commits

Author SHA1 Message Date
Botao Chen
d0a72cc288 fix misc 2024-12-13 14:55:01 -08:00
Botao Chen
d55a8343ea merge 2024-12-13 12:55:21 -08:00
Botao Chen
e2a0dce8ad Merge branch 'main' into post_training_v3 2024-12-13 12:09:01 -08:00
Botao Chen
aeb76390fc
[1/n] torchtune <> llama-stack integration skeleton (#540)
### Context
This is the first of a series of PRs that integrate torchtune with
llama-stack as the meta reference post-training implementation. For the
MVP, we focus on single-device LoRA SFT.

Though this PR is still WIP, we want early feedback on the high-level
design of this skeleton while we work out the remaining details.

### Scope
To limit the scope of this PR, we focus on the skeleton of the
implementation.

**What is included?**
- refine the post-training SFT APIs
- skeleton of the supervised_fine_tune implementation. We verified that
we can call the supervised_fine_tune API successfully from the llama
stack client SDK (client-side PR:
https://github.com/meta-llama/llama-stack-client-python/pull/51)
- a very basic single-device LoRA training recipe built on torchtune
core components
- parity check against the torchtune library and a post-training API
unit test

**What is not included?**
- implementation of the remaining job management and training-artifact
retrieval APIs (separate PR)
- refactoring the meta reference inference logic to support eval on the
finetuned model (separate PR)
- several necessary pieces of functionality in the training recipe, such
as logging and validation (separate PR)
- interop with telemetry for tracing and metrics logging; we currently
log to local disk temporarily (separate PR)

### Testing
**e2e test**
Although we haven't added detailed testing and a numerical parity check
with torchtune yet, we did a simple E2E test from client to server:
1. Set up the server with `llama stack build --template
experimental-post-training --image-type conda` and `llama stack run
experimental-post-training`.
2. On the client, run `llama-stack-client --endpoint
http://devgpu018.nha2.facebook.com:5000 post_training
supervised_fine_tune`.
3. Training finishes successfully. On the server side, the finetune
checkpoints appear under the output dir. On the client side, we get the
job uuid.
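As a rough illustration of what the client submits in step 2, here is a
minimal Python sketch of the job payload and a client-side sanity check.
The field names (`job_uuid`, `model`, `algorithm`, `dataset`) are
assumptions based on this PR's scope (single-device LoRA SFT), not the
final API surface:

```python
# Hypothetical shape of a supervised_fine_tune job request; all field
# names are assumptions for illustration, not the confirmed API.
job_config = {
    "job_uuid": "sft-job-001",        # client-chosen job identifier (assumed)
    "model": "Llama3.2-3B-Instruct",  # base model used in the parity check
    "algorithm": "LoRA",              # MVP focuses on single-device LoRA SFT
    "dataset": "alpaca",              # dataset used in the parity check
}

def validate_job_config(cfg: dict) -> bool:
    """Check that all required fields are present before submitting."""
    required = {"job_uuid", "model", "algorithm", "dataset"}
    return required.issubset(cfg)

print(validate_job_config(job_config))  # prints: True
```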

server 
<img width="1110" alt="Screenshot 2024-12-02 at 5 52 32 PM"
src="https://github.com/user-attachments/assets/b548eb90-7a9b-4edc-a858-ee237cc4361d">

client 
<img width="807" alt="Screenshot 2024-12-02 at 5 52 37 PM"
src="https://github.com/user-attachments/assets/1138ffa8-4698-40fa-b190-3d7b99646838">

**parity check**
The torchtune dataloader output and the llama-stack post-training
dataloader output are the same:
<img width="1116" alt="Screenshot 2024-12-04 at 8 18 46 PM"
src="https://github.com/user-attachments/assets/5e295cdc-4c24-4ea6-82c0-ca96ef1bd6ee">

torchtune LoRA SFT and llama-stack post-training LoRA SFT on the alpaca
dataset with the llama3.2 3B instruct model match numerically:

<img width="860" alt="Screenshot 2024-12-04 at 8 17 01 PM"
src="https://github.com/user-attachments/assets/c05cf0a8-c674-4d2e-9f0a-c5d01b2dca99">

<img width="1049" alt="Screenshot 2024-12-04 at 8 17 06 PM"
src="https://github.com/user-attachments/assets/b911d4e2-e7b1-41a9-b62c-d75529b6d443">

**unit test**
2024-12-13 11:05:35 -08:00
Botao Chen
e5993c565e misc 2024-12-10 15:24:46 -08:00
Botao Chen
214d0645ae add unit test 2024-12-10 14:57:03 -08:00
Botao Chen
c9a009b5e7 temp commit 2024-12-09 20:24:30 -08:00
Botao Chen
9c1ae088f9 refine 2024-12-09 13:35:44 -08:00
Botao Chen
9c80a57667 remove unnecessary provider apis from expermental post training template 2024-12-04 20:26:52 -08:00
Botao Chen
12eef58543 address comment 2024-12-04 15:19:54 -08:00
Botao Chen
2a15a8a005 temp commit 2024-12-04 13:59:40 -08:00
Botao Chen
41cf2bb0a7 refine api 2024-12-03 20:01:27 -08:00
Botao Chen
5838b7211d fix pre-commit 2024-12-02 17:59:53 -08:00
Botao Chen
79c525be94 temp commit 2024-12-02 17:24:25 -08:00
Botao Chen
6c709abc4d temp commit 2024-11-27 16:46:29 -08:00
Botao Chen
bfc782c054 temp commit 2024-11-27 15:22:55 -08:00
Botao Chen
9a976bcabd temp commit 2024-11-26 10:49:03 -08:00
Botao Chen
d7598c68d7 temp commit 2024-11-25 17:27:26 -08:00