llama-stack/llama_stack/apis

Latest commit 25c1d9b037 by Botao Chen (2025-01-14 12:48:49 -08:00):
[post training] define llama stack post training dataset format (#717)
## context
In this PR, we define two llama stack dataset formats: instruct and dialog.

- For the instruct format, the column schema is `[chat_completion_input, expected_answer]`, which is consistent with the eval data format. This format is the abstraction for single-turn QA-style post-training data.
- For the dialog format, the column schema is `[dialog]`, where each row holds a list of interleaved user and assistant messages. During training, the whole list is the model input and the loss is computed on the assistant messages only. This format is the abstraction for multi-turn chat-style post-training data. Example rows for both formats are sketched below.
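
A minimal sketch of what a row in each format might look like; the column names follow the schemas above, while the message field names and the encoding of `chat_completion_input` are illustrative assumptions, not the exact wire format:

```python
# Illustrative rows only -- field names and value encodings are assumptions.

# instruct format: single-turn QA, columns [chat_completion_input, expected_answer]
instruct_row = {
    "chat_completion_input": "What is the capital of France?",
    "expected_answer": "Paris",
}

# dialog format: a single [dialog] column holding interleaved user/assistant messages;
# the whole list is the model input and loss is computed on assistant messages only.
dialog_row = {
    "dialog": [
        {"role": "user", "content": "Can you help me plan a weekend trip?"},
        {"role": "assistant", "content": "Sure! Where would you like to go?"},
        {"role": "user", "content": "Somewhere warm in December."},
        {"role": "assistant", "content": "Lisbon or the Canary Islands are good options."},
    ],
}
```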

## changes
- defined the two llama stack dataset formats (instruct and dialog)
- added an adapter that converts the llama stack dataset formats into the torchtune dataset format (see the sketch after this list)
- moved dataset format validation to the post-training level instead of the torchtune level, since it is not specific to torchtune
- added localfs as a datasetio provider
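
A rough sketch of the adapter and validation ideas, not the actual implementation in this PR; the function names and the dict-based message representation are assumptions:

```python
# Sketch only: names and shapes here are assumptions, not the adapter's real interface.

def validate_dialog_row(row: dict) -> None:
    """Format validation happens at the post-training level, not inside torchtune."""
    if not isinstance(row.get("dialog"), list):
        raise ValueError("dialog format requires a 'dialog' column containing a list of messages")


def dialog_row_to_messages(row: dict) -> list[dict]:
    """Convert a dialog-format row into torchtune-style messages."""
    validate_dialog_row(row)
    messages = []
    for msg in row["dialog"]:
        messages.append(
            {
                "role": msg["role"],
                "content": msg["content"],
                # loss is computed on assistant messages only, so mask everything else
                "masked": msg["role"] != "assistant",
            }
        )
    return messages
```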


## test 
**instruct format**
- used https://huggingface.co/datasets/llamastack/evals as the dataset; training works as expected
<img width="1443" alt="Screenshot 2025-01-09 at 5 15 14 PM"
src="https://github.com/user-attachments/assets/2c37a936-c67a-4726-90e0-23fa0ba7000f"
/>

- used a locally generated dataset; training works as expected

<img width="1617" alt="Screenshot 2025-01-09 at 5 19 11 PM"
src="https://github.com/user-attachments/assets/0bdccbbf-bac2-472a-a365-15213e49bbfa"
/>


**dialog format**
- used a locally generated dataset; training works as expected
<img width="1588" alt="Screenshot 2025-01-09 at 5 23 16 PM"
src="https://github.com/user-attachments/assets/893915ba-41a3-4d51-948b-e872060ecede"
/>
| Name | Latest commit | Committed |
| --- | --- | --- |
| `agents` | Update spec | 2025-01-13 23:16:53 -08:00 |
| `batch_inference` | remove conflicting default for tool prompt format in chat completion (#742) | 2025-01-10 10:41:53 -08:00 |
| `common` | [post training] define llama stack post training dataset format (#717) | 2025-01-14 12:48:49 -08:00 |
| `datasetio` | [remove import *] clean up import *'s (#689) | 2024-12-27 15:45:44 -08:00 |
| `datasets` | Update the "InterleavedTextMedia" type (#635) | 2024-12-17 11:18:31 -08:00 |
| `eval` | Import from the right path (#708) | 2025-01-02 13:15:31 -08:00 |
| `eval_tasks` | Add version to REST API url (#478) | 2024-11-18 22:44:14 -08:00 |
| `inference` | introduce and use a generic ContentDelta | 2025-01-13 23:16:53 -08:00 |
| `inspect` | add --version to llama stack CLI & /version endpoint (#732) | 2025-01-08 16:30:06 -08:00 |
| `memory` | Update the "InterleavedTextMedia" type (#635) | 2024-12-17 11:18:31 -08:00 |
| `memory_banks` | [tests] add client-sdk pytests & delete client.py (#638) | 2024-12-16 12:04:56 -08:00 |
| `models` | [tests] add client-sdk pytests & delete client.py (#638) | 2024-12-16 12:04:56 -08:00 |
| `post_training` | [post training] define llama stack post training dataset format (#717) | 2025-01-14 12:48:49 -08:00 |
| `safety` | Update the "InterleavedTextMedia" type (#635) | 2024-12-17 11:18:31 -08:00 |
| `scoring` | [rag evals] refactor & add ability to eval retrieval + generation in agentic eval pipeline (#664) | 2025-01-02 11:21:33 -08:00 |
| `scoring_functions` | [/scoring] add ability to define aggregation functions for scoring functions & refactors (#597) | 2024-12-11 10:03:42 -08:00 |
| `shields` | [tests] add client-sdk pytests & delete client.py (#638) | 2024-12-16 12:04:56 -08:00 |
| `synthetic_data_generation` | [remove import *] clean up import *'s (#689) | 2024-12-27 15:45:44 -08:00 |
| `telemetry` | Update Telemetry API so OpenAPI generation can work (#640) | 2024-12-16 13:00:14 -08:00 |
| `tools` | remove conflicting default for tool prompt format in chat completion (#742) | 2025-01-10 10:41:53 -08:00 |
| `__init__.py` | API Updates (#73) | 2024-09-17 19:51:35 -07:00 |
| `resource.py` | Tools API with brave and MCP providers (#639) | 2024-12-19 21:25:17 -08:00 |
| `version.py` | Fix the pyopenapi generator avoid potential circular imports | 2024-11-18 23:37:52 -08:00 |