[2/n][torchtune integration] implement job management and return training artifacts (#593)

### Context In this PR, we - Implement the post training job management and get training artifacts apis - get_training_jobs - get_training_job_status - get_training_job_artifacts - get_training_job_logstream is deleted since the trace can be directly accessed by UI with Jaeger https://llama-stack.readthedocs.io/en/latest/building_applications/telemetry.html#jaeger-to-visualize-traces - Refactor the post training and training types definition to make them more intuitive. - Rewrite the checkpointer to make it compatible with llama-stack file system and can be recognized during inference ### Test Unit test `pytest llama_stack/providers/tests/post_training/test_post_training.py -m "torchtune_post_training_huggingface_datasetio" -v -s --tb=short --disable-warnings` <img width="1506" alt="Screenshot 2024-12-10 at 4 06 17 PM" src="https://github.com/user-attachments/assets/16225029-bdb7-48c4-9d13-e580cc769c0a"> e2e test with client side call <img width="888" alt="Screenshot 2024-12-10 at 4 09 44 PM" src="https://github.com/user-attachments/assets/de375e4c-ef67-4dcc-a045-4037d9489191">
2024-12-13 15:00:04 -08:00 · 2024-12-13 15:00:04 -08:00 · c294a01c4b
commit c294a01c4b
parent 5764a95912
8 changed files with 331 additions and 67 deletions
--- a/llama_stack/apis/common/job_types.py
+++ b/llama_stack/apis/common/job_types.py
@ -18,3 +18,5 @@ class Job(BaseModel):
 class JobStatus(Enum):
    completed = "completed"
    in_progress = "in_progress"
+    failed = "failed"
+    scheduled = "scheduled"
--- a/llama_stack/apis/common/training_types.py
+++ b/llama_stack/apis/common/training_types.py
@ -4,13 +4,26 @@
 # This source code is licensed under the terms described in the LICENSE file in
 # the root directory of this source tree.

-from llama_models.llama3.api.datatypes import URL
+from datetime import datetime
+from typing import Optional
+
 from llama_models.schema_utils import json_schema_type
 from pydantic import BaseModel


+@json_schema_type
+class PostTrainingMetric(BaseModel):
+    epoch: int
+    train_loss: float
+    validation_loss: float
+    perplexity: float
+
+
@json_schema_type(schema={"description": "Checkpoint created during training runs"})
 class Checkpoint(BaseModel):
-    iters: int
-    path: URL
+    identifier: str
+    created_at: datetime
    epoch: int
+    post_training_job_id: str
+    path: str
+    training_metrics: Optional[PostTrainingMetric] = None