llama-stack-mirror/llama_stack/providers
Charlie Doern ce48d47543 feat: DistributedJobScheduler
Rather than handling multi-GPU training within a recipe, distributed training should be one of our scheduler offerings. Introduce the DistributedJobScheduler, which kicks off a `finetune_handler.py` script using torchrun. This handler processes the training args via argparse
and calls the right recipe, as `post_training.py` used to do (a sketch of such a handler follows the commit details below). Torchrun takes care of environment variables like world_size, local_rank, etc.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-06-12 16:05:45 -04:00
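
For reference, here is a minimal sketch of how a torchrun-launched handler like this could be structured. It only illustrates the pattern described in the commit message (argparse for training args, torchrun-provided env variables, dispatch into a recipe); the `--recipe`, `--model`, and `--checkpoint-dir` flags and the `run_lora_recipe` stub are hypothetical and are not taken from the actual `finetune_handler.py`.

```python
# A minimal sketch, not the actual finetune_handler.py from this commit.
# The --recipe/--model/--checkpoint-dir flags and run_lora_recipe() are
# hypothetical placeholders for illustration only.

import argparse
import os


def run_lora_recipe(args: argparse.Namespace, rank: int, world_size: int) -> None:
    # Placeholder for dispatching into a training recipe, the way
    # post_training.py previously did in-process.
    print(f"[rank {rank}/{world_size}] would fine-tune {args.model} "
          f"and write checkpoints to {args.checkpoint_dir}")


def main() -> None:
    parser = argparse.ArgumentParser(description="torchrun entry point for fine-tuning jobs")
    parser.add_argument("--recipe", default="lora", help="which training recipe to run")
    parser.add_argument("--model", required=True, help="model identifier to fine-tune")
    parser.add_argument("--checkpoint-dir", required=True, help="where to write checkpoints")
    args = parser.parse_args()

    # torchrun exports these for every worker process it spawns, so the
    # handler only has to read them rather than compute them itself.
    rank = int(os.environ.get("RANK", "0"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    print(f"starting worker rank={rank} local_rank={local_rank} world_size={world_size}")

    if args.recipe == "lora":
        run_lora_recipe(args, rank, world_size)
    else:
        raise ValueError(f"unknown recipe: {args.recipe}")


if __name__ == "__main__":
    main()
```

A scheduler launching this handler would build a command along the lines of `torchrun --nproc_per_node 4 finetune_handler.py --model ... --checkpoint-dir ...`, with torchrun spawning one worker per GPU and exporting RANK, LOCAL_RANK, and WORLD_SIZE to each process.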
inline         feat: DistributedJobScheduler                                                          2025-06-12 16:05:45 -04:00
registry       feat: add deps dynamically based on metastore config (#2405)                           2025-06-05 14:07:25 -07:00
remote         fix(weaviate): handle case where distance is 0 by setting score to infinity (#2415)    2025-06-12 11:23:59 -04:00
utils          feat: DistributedJobScheduler                                                          2025-06-12 16:05:45 -04:00
__init__.py    API Updates (#73)                                                                      2024-09-17 19:51:35 -07:00
datatypes.py   fix(tools): do not index tools, only index toolgroups (#2261)                          2025-05-25 13:27:52 -07:00