llama-stack-mirror/llama_stack/providers
Charlie Doern ce48d47543 feat: DistributedJobScheduler
Rather than handling multi-GPU training within a recipe, distributed training should be one of our scheduler offerings. Introduce the DistributedJobScheduler, which kicks off a `finetune_handler.py` script using torchrun. This handler processes the training args via argparse
and calls the right recipe, as `post_training.py` used to do (a sketch of such a handler follows the commit details below). Torchrun takes care of environment variables like world_size, local_rank, etc.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-06-12 16:05:45 -04:00
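
For reference, here is a minimal sketch of how a torchrun-launched handler like this could be structured. It only illustrates the pattern described in the commit message (argparse for training args, torchrun-provided env variables, dispatch into a recipe); the `--recipe`, `--model`, and `--checkpoint-dir` flags and the `run_lora_recipe` stub are hypothetical and are not taken from the actual `finetune_handler.py`.

```python
# A minimal sketch, not the actual finetune_handler.py from this commit.
# The --recipe/--model/--checkpoint-dir flags and run_lora_recipe() are
# hypothetical placeholders for illustration only.

import argparse
import os


def run_lora_recipe(args: argparse.Namespace, rank: int, world_size: int) -> None:
    # Placeholder for dispatching into a training recipe, the way
    # post_training.py previously did in-process.
    print(f"[rank {rank}/{world_size}] would fine-tune {args.model} "
          f"and write checkpoints to {args.checkpoint_dir}")


def main() -> None:
    parser = argparse.ArgumentParser(description="torchrun entry point for fine-tuning jobs")
    parser.add_argument("--recipe", default="lora", help="which training recipe to run")
    parser.add_argument("--model", required=True, help="model identifier to fine-tune")
    parser.add_argument("--checkpoint-dir", required=True, help="where to write checkpoints")
    args = parser.parse_args()

    # torchrun exports these for every worker process it spawns, so the
    # handler only has to read them rather than compute them itself.
    rank = int(os.environ.get("RANK", "0"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    print(f"starting worker rank={rank} local_rank={local_rank} world_size={world_size}")

    if args.recipe == "lora":
        run_lora_recipe(args, rank, world_size)
    else:
        raise ValueError(f"unknown recipe: {args.recipe}")


if __name__ == "__main__":
    main()
```

A scheduler launching this handler would build a command along the lines of `torchrun --nproc_per_node 4 finetune_handler.py --model ... --checkpoint-dir ...`, with torchrun spawning one worker per GPU and exporting RANK, LOCAL_RANK, and WORLD_SIZE to each process.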
inline         feat: DistributedJobScheduler                                                          2025-06-12 16:05:45 -04:00
registry       feat: add deps dynamically based on metastore config (#2405)                           2025-06-05 14:07:25 -07:00
remote         fix(weaviate): handle case where distance is 0 by setting score to infinity (#2415)    2025-06-12 11:23:59 -04:00
utils          feat: DistributedJobScheduler                                                          2025-06-12 16:05:45 -04:00
__init__.py    API Updates (#73)                                                                      2024-09-17 19:51:35 -07:00
datatypes.py   fix(tools): do not index tools, only index toolgroups (#2261)                          2025-05-25 13:27:52 -07:00