llama-stack/tests/unit
Ihar Hrachyshka 3ed4316ed5
feat: Implement async job execution for torchtune training (#1437)
# What does this PR do?

Now a separate thread is started to execute training jobs. Training
requests now return job ID before the job completes. (Which fixes API
timeouts for any jobs that take longer than a minute.)

Note: the scheduler code is meant to be spun out in the future into a
common provider service that can be reused for different APIs and
providers. It is also expected to back the /jobs API proposed here:

https://github.com/meta-llama/llama-stack/discussions/1238

Hence its somewhat generalized form which is expected to simplify its
adoption elsewhere in the future.

Note: this patch doesn't attempt to implement missing APIs (e.g. cancel
or job removal). This work will belong to follow-up PRs.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

Added unit tests for the scheduler module. For the API coverage, did
manual testing and was able to run a training cycle on GPU. The initial
call returned job ID before the training completed, as (now) expected.
Artifacts are returned as expected.

```
JobArtifactsResponse(checkpoints=[{'identifier': 'meta-llama/Llama-3.2-3B-Instruct-sft-0', 'created_at': '2025-03-07T22:45:19.892714', 'epoch': 0, 'post_training_job_id': 'test-job2ee77104-2fd3-4a4e-84cf-f83f8b8f1f50', 'path': '/home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0', 'training_metrics': None}], job_uuid='test-job2ee77104-2fd3-4a4e-84cf-f83f8b8f1f50')
```

The integration test is currently disabled for the provider. I will look
into how it can be enabled in a different PR / issue context.

[//]: # (## Documentation)

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-04-14 08:59:11 -07:00
..
cli refactor: tests/unittests -> tests/unit; tests/api -> tests/integration 2025-03-04 09:57:00 -08:00
distribution feat: ability to execute external providers (#1672) 2025-04-09 10:30:41 +02:00
models feat: support '-' in tool names (#1807) 2025-04-12 14:23:03 -07:00
providers feat: Implement async job execution for torchtune training (#1437) 2025-04-14 08:59:11 -07:00
rag chore: Get sqlite_vec and vector_store unit tests passing (#1413) 2025-03-05 13:20:13 -05:00
registry fix: handle registry errors gracefully (#1732) 2025-03-20 15:24:07 -07:00
server feat(server): add attribute based access control for resources (#1703) 2025-03-19 21:28:52 -07:00