llama-stack

forked from phoenix-oss/llama-stack-mirror

Author	SHA1	Message	Date
Kevin Postlethwait	2aca7265b3	fix: add todo for schema validation (#1991 ) # What does this PR do? Change validation to TODO same as was done [here](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/inline/eval/meta_reference/eval.py#L87) until validation can be implemented Closes #1849 ## Test Plan Signed-off-by: Kevin <kpostlet@redhat.com>	2025-04-29 09:59:35 +02:00
Ihar Hrachyshka	0cbb7f7f21	chore: fix mypy violations in post_training modules (#1548 ) # What does this PR do? Fixes a bunch of violations. Note: this patch touches all files but post_training.py that will be significantly changed by #1437, hence leaving it out of the picture for now. [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan Testing with https://github.com/meta-llama/llama-stack/pull/1543 Also checked that GPU training works with the change: ``` INFO: ::1:53316 - "POST /v1/post-training/supervised-fine-tune HTTP/1.1" 200 OK INFO: ::1:53316 - "GET /v1/post-training/job/status?job_uuid=test-jobb5ca2d84-d541-42f8-883b-762828b4c0e7 HTTP/1.1" 200 OK INFO: ::1:53316 - "GET /v1/post-training/job/artifacts?job_uuid=test-jobb5ca2d84-d541-42f8-883b-762828b4c0e7 HTTP/1.1" 200 OK 21:24:01.161 [END] /v1/post-training/supervised-fine-tune [StatusCode.OK] (32526.75ms) 21:23:28.769 [DEBUG] Setting manual seed to local seed 3918872849. Local seed is seed + rank = 3918872849 + 0 21:23:28.996 [INFO] Identified model_type = Llama3_2. Ignoring output.weight in checkpoint in favor of the tok_embedding.weight tied weights. 21:23:29.933 [INFO] Memory stats after model init: GPU peak memory allocation: 6.05 GiB GPU peak memory reserved: 6.10 GiB GPU peak memory active: 6.05 GiB 21:23:29.934 [INFO] Model is initialized with precision torch.bfloat16. 21:23:30.115 [INFO] Tokenizer is initialized. 21:23:30.118 [INFO] Optimizer is initialized. 21:23:30.119 [INFO] Loss is initialized. 21:23:30.896 [INFO] Dataset and Sampler are initialized. 21:23:30.898 [INFO] Learning rate scheduler is initialized. 21:23:31.618 [INFO] Memory stats after model init: GPU peak memory allocation: 6.24 GiB GPU peak memory reserved: 6.30 GiB GPU peak memory active: 6.24 GiB 21:23:31.620 [INFO] Starting checkpoint save... 21:23:59.428 [INFO] Model checkpoint of size 6.43 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/consolidated.00.pth 21:23:59.445 [INFO] Adapter checkpoint of size 0.00 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/adapter/adapter.pth ``` [//]: # (## Documentation) Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>	2025-03-18 14:58:16 -07:00
Yuan Tang	34ab7a3b6c	Fix precommit check after moving to ruff (#927 ) Lint check in main branch is failing. This fixes the lint check after we moved to ruff in https://github.com/meta-llama/llama-stack/pull/921. We need to move to a `ruff.toml` file as well as fixing and ignoring some additional checks. Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-02-02 06:46:45 -08:00
Botao Chen	25c1d9b037	[post training] define llama stack post training dataset format (#717 ) ## context In this PR, we defined 2 llama stack dataset formats (instruct, dialog) - For instruct dataset format, the column schema will be [chat_completion_input, expected_answer], which is consistent with the eval data format. This dataset format is the abstract of single turn QA style post training data - For dialog dataset format, the column schema will be [dialog], which is a list of user messages and assistant messages that interleave together. During training, the whole list will be the model input and the loss is calculated on assistant messages only. This dataset format is the abstract of multi turn chat style post training data ## changes - defined the 2 llama stack dataset formats - an adapter to convert llama stack dataset format to torchtune dataset format - move dataset format validation to post training level instead of torchtune level since it's not specific to torchtune - add localfs as datasetio provider ## test instruct format - use https://huggingface.co/datasets/llamastack/evals as dataset and the training works as expected <img width="1443" alt="Screenshot 2025-01-09 at 5 15 14 PM" src="https://github.com/user-attachments/assets/2c37a936-c67a-4726-90e0-23fa0ba7000f" /> - use my generated local dataset and the training works as expected <img width="1617" alt="Screenshot 2025-01-09 at 5 19 11 PM" src="https://github.com/user-attachments/assets/0bdccbbf-bac2-472a-a365-15213e49bbfa" /> dialog format - use my generated local dataset and the training works as expected <img width="1588" alt="Screenshot 2025-01-09 at 5 23 16 PM" src="https://github.com/user-attachments/assets/893915ba-41a3-4d51-948b-e872060ecede" />	2025-01-14 12:48:49 -08:00

4 commits