llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-07-14 00:56:09 +00:00

Charlie Doern ce48d47543 feat: DistributedJobScheduler		2025-06-12 16:05:45 -04:00

Rather than handling multi-GPU training within a recipe, distributed training should be one of our scheduler offerings. Introduce the DistributedJobScheduler, which launches a `finetune_handler.py` script using torchrun. This handler parses the training arguments via argparse and calls the right recipe, as `post_training.py` used to do. Torchrun takes care of environment variables such as world_size, local_rank, etc.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
recipes	feat: DistributedJobScheduler	2025-06-12 16:05:45 -04:00
__init__.py	feat: add huggingface post_training impl (#2132)	2025-05-16 14:41:28 -07:00
config.py	feat: add finetune_multi_device recipe with fsdp support	2025-06-12 13:33:33 -04:00
finetune_handler.py	feat: DistributedJobScheduler	2025-06-12 16:05:45 -04:00
post_training.py	feat: DistributedJobScheduler	2025-06-12 16:05:45 -04:00
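
The handler flow described in the commit message above (parse arguments with argparse, read the rank information torchrun exports, dispatch to a recipe) can be sketched roughly as follows. This is an illustrative sketch only, not the contents of `finetune_handler.py`: the `--model` and `--recipe` flags, the `RECIPES` table, and the two `run_*` functions are hypothetical names invented for the example. The only grounded detail is that torchrun sets `WORLD_SIZE`, `RANK`, and `LOCAL_RANK` in each worker's environment.

```python
import argparse
import os


def read_torchrun_env(environ=os.environ):
    """Read the rank information that torchrun exports to each worker.

    Falls back to single-process defaults when the variables are absent.
    """
    return {
        "world_size": int(environ.get("WORLD_SIZE", "1")),
        "rank": int(environ.get("RANK", "0")),
        "local_rank": int(environ.get("LOCAL_RANK", "0")),
    }


# Hypothetical stand-ins for real recipes; they only report what they would run.
def run_single_device(args, dist):
    return f"single-device recipe for {args.model} (rank {dist['rank']})"


def run_multi_device(args, dist):
    return (
        f"multi-device recipe for {args.model} "
        f"(rank {dist['rank']} of {dist['world_size']})"
    )


# Hypothetical recipe registry keyed by the value of --recipe.
RECIPES = {"single": run_single_device, "multi": run_multi_device}


def build_parser():
    parser = argparse.ArgumentParser(description="Illustrative fine-tuning handler")
    parser.add_argument("--model", required=True)  # assumed flag name
    parser.add_argument("--recipe", choices=sorted(RECIPES), default="multi")
    return parser


def main(argv=None, environ=os.environ):
    """Parse the training arguments, then call the selected recipe."""
    args = build_parser().parse_args(argv)
    dist = read_torchrun_env(environ)
    return RECIPES[args.recipe](args, dist)
```

With a `__main__` guard that calls `main()`, a script like this could be started with something like `torchrun --nproc_per_node=2 handler.py --model demo`, so that each worker process sees its own `RANK` and `LOCAL_RANK`. Keeping the environment lookup in one small function makes the handler easy to exercise without launching torchrun at all.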