llama-stack

forked from phoenix-oss/llama-stack-mirror

Author	SHA1	Message	Date
Ihar Hrachyshka	9e6561a1ec	chore: enable pyupgrade fixes (#1806 ) # What does this PR do? The goal of this PR is code base modernization. Schema reflection code needed a minor adjustment to handle UnionTypes and collections.abc.AsyncIterator. (Both are preferred for latest Python releases.) Note to reviewers: almost all changes here are automatically generated by pyupgrade. Some additional unused imports were cleaned up. The only change worth of note can be found under `docs/openapi_generator` and `llama_stack/strong_typing/schema.py` where reflection code was updated to deal with "newer" types. Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>	2025-05-01 14:23:50 -07:00
Ashwin Bharambe	d072b5fa0c	test: add unit test to ensure all config types are instantiable (#1601 )	2025-03-12 22:29:58 -07:00
Botao Chen	123fb9eb24	feat: [post training] support save hf safetensor format checkpoint (#845 ) ## context Now, in llama stack, we only support inference / eval a finetuned checkpoint with meta-reference as inference provider. This is sub-optimal since meta-reference is pretty slow. Our vision is that developer can inference / eval a finetuned checkpoint produced by post training apis with all the inference providers on the stack. To achieve this, we'd like to define an unified output checkpoint format for post training providers. So that, all the inference provider can respect that format for customized model inference. By spotting check how [ollama](https://github.com/ollama/ollama/blob/main/docs/import.md) and [fireworks](https://docs.fireworks.ai/models/uploading-custom-models) do inference on a customized model, we defined the output checkpoint format as /adapter/adapter_config.json and /adapter/adapter_model.safetensors (as we only support LoRA post training now, we begin from adapter only checkpoint) ## test we kick off a post training job and configured checkpoint format as 'huggingface'. Output files ![Screenshot 2025-02-24 at 11 54 33 PM](https://github.com/user-attachments/assets/fb45a5d7-f288-4d30-82f8-b7a8da2859be) we did a proof of concept with ollama to see if ollama can inference our finetuned checkpoint 1. create Modelfile like <img width="799" alt="Screenshot 2025-01-22 at 5 04 18 PM" src="https://github.com/user-attachments/assets/7fca9ac3-a294-44f8-aab1-83852c600609" /> 2. create a customized model with `ollama create llama_3_2_finetuned` and run inference successfully ![Screenshot 2025-02-24 at 11 55 17 PM](https://github.com/user-attachments/assets/1abe7c52-c6a7-491a-b07c-b7a8e3fd1ddd) This is just a proof of concept with ollama cmd line. As next step, we'd like to wrap loading / inference customized model logic in the inference provider implementation.	2025-02-25 23:29:08 -08:00
Botao Chen	aeb76390fc	[1/n] torchtune <> llama-stack integration skeleton (#540 ) ### Context This is the 1st of series PRs that integrate torchtune with llama-stack as meta reference post-training implementation. For MVP, we will focus on single device LoRA SFT. Though this PR is still WIP, we want to get early feedback on the high level design of this skeleton while still working on several details ### Scope To limit the scope of this PR, we focus on the skeleton of the implementation. What are included? - refine the post-training SFT apis - skeleton of supervised_fine_tune implementation. We verified that we can call the supervised_fine_tune API successfully from llama stack client SDK (client side PR: https://github.com/meta-llama/llama-stack-client-python/pull/51) - a very basic single device LoRA training recipe based on torchtune core components - parity check with torchtune library and post training api unit test What are not includes? - implementation of other job management, get training artifacts apis (separate PR) - refactor the meta reference inference logic to support eval on finetuned model (separate PR) - several necessary functionality in the training recipe such as logging, validation etc (separate PR) - interop with telemetry for tracing and metrics logging, currently temporarily log to local disk (separate PR) ### Testing e2e test Although we haven't added detailed testing and numerical parity check with torchtune yet, we did a simple E2E test from client to server 1. setup server with` llama stack build --template experimental-post-training --image-type conda` and `llama stack run experimental-post-training ` 2. On client, run `llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 post_training supervised_fine_tune` 3. Training finishes successfully. On server side, get the finetune checkpoints under output dir. On client side, get the job uuid server <img width="1110" alt="Screenshot 2024-12-02 at 5 52 32 PM" src="https://github.com/user-attachments/assets/b548eb90-7a9b-4edc-a858-ee237cc4361d"> client <img width="807" alt="Screenshot 2024-12-02 at 5 52 37 PM" src="https://github.com/user-attachments/assets/1138ffa8-4698-40fa-b190-3d7b99646838"> parity check torchtune dataloader output and llama-stack post training dataloader output are same <img width="1116" alt="Screenshot 2024-12-04 at 8 18 46 PM" src="https://github.com/user-attachments/assets/5e295cdc-4c24-4ea6-82c0-ca96ef1bd6ee"> torchtune LoRA SFT and llama-stack post training LoRA SFT on alpaca dataset with llama3.2 3B instruct model are numerical match <img width="860" alt="Screenshot 2024-12-04 at 8 17 01 PM" src="https://github.com/user-attachments/assets/c05cf0a8-c674-4d2e-9f0a-c5d01b2dca99"> <img width="1049" alt="Screenshot 2024-12-04 at 8 17 06 PM" src="https://github.com/user-attachments/assets/b911d4e2-e7b1-41a9-b62c-d75529b6d443"> unit test ![Uploading Screenshot 2024-12-09 at 1.35.10 PM.png…]()	2024-12-13 11:05:35 -08:00

4 commits