llama-stack/llama_stack/providers/inline/post_training/torchtune
Ihar Hrachyshka c1f7d7f005
fix: miscellaneous job management improvements in torchtune (#1136)
- **refactor: simplify job status extraction a bit**
- **torchtune: save job status on schedule**
- **refactor: get rid of job_list in torchtune job management code**

# What does this PR do?

A failed job is now registered in API, and one can consult its status.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

```
$ llama-stack-client post_training status --job-uuid test-jobe244b5b0-5053-4892-a4d9-d8fc8b116e73                                                      
JobStatusResponse(checkpoints=[], job_uuid='test-jobe244b5b0-5053-4892-a4d9-d8fc8b116e73', status='failed', completed_at=None, resources_allocated=None, scheduled_at=datetime.datetime(2025, 2, 18, 9, 4, 34, 3252), started_at=datetime.datetime(2025, 2, 18, 9, 4, 34, 10688))
```

[//]: # (## Documentation)

---------

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-02-19 19:09:37 -08:00
..
common chore: move all Llama Stack types from llama-models to llama-stack (#1098) 2025-02-14 09:10:59 -08:00
datasets build: format codebase imports using ruff linter (#1028) 2025-02-13 10:06:21 -08:00
recipes chore: move all Llama Stack types from llama-models to llama-stack (#1098) 2025-02-14 09:10:59 -08:00
__init__.py [1/n] torchtune <> llama-stack integration skeleton (#540) 2024-12-13 11:05:35 -08:00
config.py [1/n] torchtune <> llama-stack integration skeleton (#540) 2024-12-13 11:05:35 -08:00
post_training.py fix: miscellaneous job management improvements in torchtune (#1136) 2025-02-19 19:09:37 -08:00