llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-12-05 10:23:44 +00:00

Charlie Doern 65b4fae51d fix: proper checkpointing logic for HF trainer (#2429)	2025-06-27 17:36:25 -04:00
..
__init__.py	ci: add python package build test (#2457)	2025-06-19 18:57:32 +05:30
finetune_single_device.py	fix: proper checkpointing logic for HF trainer (#2429)	2025-06-27 17:36:25 -04:00

fix: proper checkpointing logic for HF trainer (#2429)

# What does this PR do?

Currently, only the last saved model is reported as a checkpoint and
associated with the job UUID. Since the HF trainer handles checkpoint
collection during training, we need to add all of the `checkpoint-*`
folders as Checkpoint objects. Adjust the save strategy to be per-epoch
to make this easier and to use less storage.

Signed-off-by: Charlie Doern <cdoern@redhat.com>

2025-06-27 17:36:25 -04:00
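
The checkpoint collection described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from `finetune_single_device.py`: `CheckpointRecord` and `collect_checkpoints` are hypothetical names standing in for the project's Checkpoint object and whatever helper the PR actually uses, and the identifier format is an assumption. It relies only on the fact that the HF trainer writes each checkpoint to a `checkpoint-<step>` folder under its output directory. With `save_strategy="epoch"` in `TrainingArguments`, one such folder would be written per epoch.

```python
import re
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CheckpointRecord:
    """Hypothetical stand-in for the project's Checkpoint object."""

    identifier: str
    path: str
    step: int
    job_uuid: str


# The HF trainer names checkpoint folders "checkpoint-<global_step>".
_CHECKPOINT_DIR = re.compile(r"^checkpoint-(\d+)$")


def collect_checkpoints(output_dir, job_uuid):
    """Return one record per checkpoint-* folder, ordered by training step."""
    root = Path(output_dir)
    if not root.is_dir():
        return []

    records = []
    for child in root.iterdir():
        match = _CHECKPOINT_DIR.match(child.name)
        # Skip stray files and unrelated folders such as a final merged model.
        if match is None or not child.is_dir():
            continue
        records.append(
            CheckpointRecord(
                identifier=f"{job_uuid}-{child.name}",  # assumed naming scheme
                path=str(child),
                step=int(match.group(1)),
                job_uuid=job_uuid,
            )
        )
    # Sort numerically so checkpoint-10 comes after checkpoint-2.
    return sorted(records, key=lambda record: record.step)


if __name__ == "__main__":
    # Simulate a trainer output directory with two per-epoch checkpoints.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("checkpoint-2", "checkpoint-10", "merged_model"):
            (Path(tmp) / name).mkdir()
        for record in collect_checkpoints(tmp, "job-123"):
            print(record.identifier, record.step)
```

Sorting by the parsed step number rather than by folder name keeps the checkpoints in training order, so every saved state is associated with the job UUID instead of only the last one.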
