llama-stack-mirror/llama_stack/providers/inline/post_training/torchtune
Charlie Doern 46c5b14a22 feat: handle graceful shutdown
Currently this implementation hangs because `trainer.train()` blocks.

Rewrite the implementation to kick off the model download, device instantiation, dataset processing, and training in a monitored subprocess.

All of these steps must run inside the subprocess; otherwise different devices end up being used, which causes torch errors.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-05-16 16:41:24 -04:00
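
A minimal sketch of the monitored-subprocess approach described above, assuming Python's standard `multiprocessing` module. The function and config names here are hypothetical placeholders, not the actual recipe code: the point is that all device-touching work (download, device setup, dataset processing, `trainer.train()`) happens in the child, while the parent only polls and can terminate the child for a graceful shutdown.

```python
# Hypothetical sketch -- not the actual torchtune recipe implementation.
import multiprocessing


def _run_training(config: dict, result_queue: multiprocessing.Queue) -> None:
    """Child process entry point: do everything device-related here."""
    try:
        # download_model(config)            # placeholder: model download
        # device = setup_device(config)     # placeholder: device instantiation
        # dataset = process_dataset(config) # placeholder: dataset processing
        # trainer.train()                   # the blocking call
        result_queue.put({"status": "completed"})
    except Exception as e:  # report failures back to the parent
        result_queue.put({"status": "failed", "error": str(e)})


def run_monitored(config: dict, poll_interval: float = 1.0) -> dict:
    """Parent-side monitor: start the subprocess and stay responsive to shutdown."""
    ctx = multiprocessing.get_context("spawn")  # avoid inheriting CUDA state
    result_queue = ctx.Queue()
    proc = ctx.Process(target=_run_training, args=(config, result_queue), daemon=True)
    proc.start()
    try:
        while proc.is_alive():
            proc.join(timeout=poll_interval)  # returns periodically instead of blocking forever
        return result_queue.get_nowait() if not result_queue.empty() else {"status": "unknown"}
    except KeyboardInterrupt:
        # Graceful shutdown: stop the child rather than hanging on trainer.train()
        proc.terminate()
        proc.join()
        return {"status": "cancelled"}


if __name__ == "__main__":  # guard required when using the "spawn" start method
    print(run_monitored({"model": "example"}))
```

The `spawn` start method is used in this sketch so the child starts with a clean interpreter and initializes its own torch devices, which is one way to keep all device state confined to a single process.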
common chore: enable pyupgrade fixes (#1806) 2025-05-01 14:23:50 -07:00
datasets chore: enable pyupgrade fixes (#1806) 2025-05-01 14:23:50 -07:00
recipes feat: handle graceful shutdown 2025-05-16 16:41:24 -04:00
__init__.py chore: enable pyupgrade fixes (#1806) 2025-05-01 14:23:50 -07:00
config.py chore: enable pyupgrade fixes (#1806) 2025-05-01 14:23:50 -07:00
post_training.py chore: enable pyupgrade fixes (#1806) 2025-05-01 14:23:50 -07:00