Mirror of https://github.com/meta-llama/llama-stack.git (synced 2025-12-27 21:11:59 +00:00)

Rather than handling multi-GPU training within a recipe, distributed training should be one of our scheduler offerings.

Introduce the `DistributedJobScheduler`, which kicks off a `finetune_handler.py` script using torchrun. This handler processes the training args via argparse and calls the right recipe, as `post_training.py` used to do. Torchrun takes care of env variables like world_size, local_rank, etc.

Signed-off-by: Charlie Doern <cdoern@redhat.com>

||
|---|---|---|
| apis | ||
| cli | ||
| distribution | ||
| models | ||
| providers | ||
| strong_typing | ||
| templates | ||
| ui | ||
| __init__.py | ||
| env.py | ||
| log.py | ||
| schema_utils.py | ||
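
The flow described in the commit message (a scheduler launching `finetune_handler.py` through torchrun, and the handler parsing its arguments with argparse before dispatching to a recipe) can be sketched roughly as follows. This is an illustrative sketch only, not the code from this commit: `build_torchrun_command`, `run_handler`, the `RECIPES` registry, `lora_recipe`, and the `--recipe`/`--model` flags are all hypothetical names. The only real interfaces assumed are torchrun's `--nproc_per_node` option and the `LOCAL_RANK` and `WORLD_SIZE` environment variables that torchrun exports to each worker.

```python
import argparse
import os
import shlex


def build_torchrun_command(script, nproc_per_node, handler_args):
    """Build the argv a scheduler could pass to subprocess.run() (hypothetical helper)."""
    cmd = ["torchrun", f"--nproc_per_node={nproc_per_node}", script]
    for key, value in handler_args.items():
        cmd += [f"--{key}", str(value)]
    return cmd


def lora_recipe(args, local_rank, world_size):
    """Stand-in for a real training recipe; it only reports what it received."""
    return f"lora local_rank={local_rank} world_size={world_size} model={args.model}"


# Hypothetical registry mapping recipe names to callables.
RECIPES = {"lora": lora_recipe}


def run_handler(argv, environ=None):
    """Handler side: parse args with argparse, read torchrun's env vars, dispatch."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(description="Sketch of a finetune handler")
    parser.add_argument("--recipe", choices=sorted(RECIPES), required=True)
    parser.add_argument("--model", required=True)
    args = parser.parse_args(argv)
    # torchrun sets these for every worker; the defaults cover a plain single-process run.
    local_rank = int(environ.get("LOCAL_RANK", "0"))
    world_size = int(environ.get("WORLD_SIZE", "1"))
    return RECIPES[args.recipe](args, local_rank, world_size)


if __name__ == "__main__":
    cmd = build_torchrun_command(
        "finetune_handler.py", 2, {"recipe": "lora", "model": "demo-model"}
    )
    # Scheduler side: show the command rather than spawning processes.
    print(shlex.join(cmd))
    # Handler side: simulate what worker 1 of 2 would see under torchrun.
    print(run_handler(cmd[3:], {"LOCAL_RANK": "1", "WORLD_SIZE": "2"}))
```

Keeping the launch command as a plain argument list means the scheduler never needs to know about ranks; each worker process discovers its own position from the environment.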