phoenix-oss/llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-07-01 12:08:39 +00:00

Commit graph


Charlie Doern

6494658a10

feat: add finetune_multi_device recipe with fsdp support

The HF SFTTrainer supports distributed training using FSDP.

Add a new recipe, `finetune_multi_device`, which supports multi-GPU (CUDA) training
using FSDP and optionally LoRA.

transformers hides _a lot_ of its usage of FSDP behind the training args:
a6b51e7341/src/transformers/training_args.py (L1535)

You need to pass both `fsdp` and `fsdp_config` to get it to work properly. However,
it seems many of the `fsdp_config` entries are silently ignored. The key settings needed to get this working were:
full_shard
offload (cpu offload)
transformer_layer_cls_to_wrap (model specific wrapping)
cpu_ram_efficient_loading
sharding_strategy
limit_all_gathers
sync_module_states
backward_prefetch
use_orig_params

These can be seen in both `fsdp=` and `fsdp_config=` in the `SFTConfig` call.
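As a rough illustration of the settings listed above, the sketch below assembles the two arguments as plain Python data. It is not the code from this commit: the helper name `build_fsdp_kwargs`, the `auto_wrap` option, the example layer class name, and every value shown are assumptions chosen for illustration. Only the key names come from the commit message, and per that message some of them may be silently ignored.

```python
from typing import Any


def build_fsdp_kwargs(layer_cls: str, cpu_offload: bool = True) -> dict[str, Any]:
    """Build illustrative `fsdp` / `fsdp_config` keyword arguments.

    Hypothetical helper: key names follow the commit message, values are
    assumptions. The result is meant to be unpacked into a training config,
    e.g. SFTConfig(**kwargs, ...).
    """
    # `fsdp` is a space-separated option string. "auto_wrap" is an assumption
    # added here so that per-layer wrapping has something to act on.
    fsdp_options = ["full_shard", "auto_wrap"]
    if cpu_offload:
        fsdp_options.append("offload")  # CPU offload

    fsdp_config = {
        # Model-specific wrapping: the decoder layer class name of the model.
        "transformer_layer_cls_to_wrap": [layer_cls],
        "cpu_ram_efficient_loading": True,
        "sharding_strategy": "FULL_SHARD",
        "limit_all_gathers": True,
        "sync_module_states": True,
        "backward_prefetch": "backward_pre",
        "use_orig_params": True,
    }
    return {"fsdp": " ".join(fsdp_options), "fsdp_config": fsdp_config}


# Example layer class name; the right value depends on the model architecture.
kwargs = build_fsdp_kwargs("LlamaDecoderLayer")
print(kwargs["fsdp"])
```

Keeping both arguments in one helper makes it harder to set one and forget the other, which is the failure mode the message describes.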

I have successfully tested this with different model architectures, with and without LoRA.

The user can now toggle `recipe` in their provider config between `single` and `multi` to access the two different recipes.

For debugging purposes, NCCL logging settings can now be accessed via the provider config as well.
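A minimal sketch of how such a toggle could be wired up is shown below. The class `ProviderConfigSketch` and the field names `nccl_debug` and `nccl_debug_subsys` are hypothetical and not taken from the provider. Only the `recipe` values (`single`, `multi`), the two recipe names, and the standard NCCL environment variables `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are grounded.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Maps the `recipe` toggle onto the two recipe names from the commit messages.
RECIPES = {
    "single": "finetune_single_device",
    "multi": "finetune_multi_device",
}


@dataclass
class ProviderConfigSketch:
    """Hypothetical stand-in for the provider config (field names assumed)."""

    recipe: Literal["single", "multi"] = "single"
    nccl_debug: Optional[str] = None         # e.g. "INFO" (assumed field)
    nccl_debug_subsys: Optional[str] = None  # e.g. "ALL" (assumed field)


def select_recipe(config: ProviderConfigSketch) -> str:
    """Return the recipe name for the configured toggle."""
    if config.recipe not in RECIPES:
        raise ValueError(f"unknown recipe: {config.recipe!r}")
    return RECIPES[config.recipe]


def nccl_env(config: ProviderConfigSketch) -> dict[str, str]:
    """Translate the debug fields into NCCL environment variables."""
    env: dict[str, str] = {}
    if config.nccl_debug:
        env["NCCL_DEBUG"] = config.nccl_debug
    if config.nccl_debug_subsys:
        env["NCCL_DEBUG_SUBSYS"] = config.nccl_debug_subsys
    return env


config = ProviderConfigSketch(recipe="multi", nccl_debug="INFO")
print(select_recipe(config), nccl_env(config))
```

The environment mapping would need to be applied (for example via `os.environ.update(...)`) before the distributed process group is initialized for NCCL to pick it up.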

Signed-off-by: Charlie Doern <cdoern@redhat.com>

2025-06-12 13:33:33 -04:00

Charlie Doern

f02f7b28c1

feat: add huggingface post_training impl (#2132)

# What does this PR do?


Adds an inline HF SFTTrainer provider. Alongside torchtune, this is a
very popular option for running training jobs. The config allows a user
to specify some key fields such as a model, chat_template, device, etc.

The provider comes with one recipe, `finetune_single_device`, which works
both with and without LoRA.

Any model that is a valid HF identifier can be given, and the model will
be pulled.

This has been tested so far with CPU and MPS device types, but should be
compatible with CUDA out of the box.

The provider processes the given dataset into the proper format,
establishes the steps per epoch, steps per save, and steps per eval,
sets a sane SFTConfig, and runs `n_epochs` of training.
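The step bookkeeping can be sketched roughly as follows. This is an assumption about how such values might be derived, not the provider's actual logic: the function name `compute_steps` and the `saves_per_epoch` / `evals_per_epoch` parameters are hypothetical.

```python
import math


def compute_steps(
    n_rows: int,
    batch_size: int,
    n_epochs: int,
    saves_per_epoch: int = 1,   # hypothetical knob
    evals_per_epoch: int = 1,   # hypothetical knob
) -> dict[str, int]:
    """Derive step counts from dataset size, batch size, and epoch count."""
    # A partial final batch still costs one optimizer step, hence the ceiling.
    steps_per_epoch = max(1, math.ceil(n_rows / batch_size))
    return {
        "steps_per_epoch": steps_per_epoch,
        "max_steps": steps_per_epoch * n_epochs,
        # Never go below one step between saves or evals.
        "save_steps": max(1, steps_per_epoch // saves_per_epoch),
        "eval_steps": max(1, steps_per_epoch // evals_per_epoch),
    }


print(compute_steps(n_rows=10, batch_size=4, n_epochs=2))
```

With 10 rows and a batch size of 4, each epoch takes 3 steps, so two epochs come to 6 steps in total.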

If `checkpoint_dir` is None, no model is saved. If there is a checkpoint
dir, a model is saved every `save_steps` and at the end of training.
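One way to express that rule as training arguments is sketched below. The helper `checkpoint_kwargs` is hypothetical; `output_dir`, `save_strategy` (with values `"no"` and `"steps"`), and `save_steps` are standard Hugging Face training-argument names, but whether the provider sets them this way is an assumption.

```python
from typing import Any, Optional


def checkpoint_kwargs(checkpoint_dir: Optional[str], save_steps: int) -> dict[str, Any]:
    """Return save-related training arguments for the given checkpoint dir."""
    if checkpoint_dir is None:
        # No directory configured: disable checkpointing entirely.
        return {"save_strategy": "no"}
    return {
        "output_dir": checkpoint_dir,
        "save_strategy": "steps",
        "save_steps": save_steps,
    }


def should_save_final_model(checkpoint_dir: Optional[str]) -> bool:
    """The end-of-training save only happens when a directory is set."""
    return checkpoint_dir is not None


print(checkpoint_kwargs(None, save_steps=10))
print(checkpoint_kwargs("/tmp/checkpoints", save_steps=10))
```

The final save would typically be a call such as `trainer.save_model(checkpoint_dir)` after training, guarded by `should_save_final_model`.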


## Test Plan

Re-enabled the post_training integration test suite with a single test
that loads the simpleqa dataset:
https://huggingface.co/datasets/llamastack/simpleqa and a tiny granite
model: https://huggingface.co/ibm-granite/granite-3.3-2b-instruct. The
test now uses the llama stack client and the proper post_training API.

It runs one step with a batch_size of 1. This test runs on CPU on the
Ubuntu runner, so it needs to be a small batch and a single step.

---------

Signed-off-by: Charlie Doern <cdoern@redhat.com>

2025-05-16 14:41:28 -07:00

2 commits