Update iOS inference instructions for new quantization

2025-12-03 09:53:45 +00:00 · 2024-10-24 14:47:27 -04:00
commit 8eceebec98
parent 8aa8847b4a
1 changed file with 12 additions and 1 deletion
--- a/llama_stack/providers/impls/ios/inference/README.md
+++ b/llama_stack/providers/impls/ios/inference/README.md
@@ -56,9 +56,20 @@ We're working on making LocalInference easier to set up. For now, you'll need t

 ## Preparing a model

-1. Prepare a `.pte` file [following the executorch docs](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/README.md#step-2-prepare-model)
+1. Prepare a `.pte` file [following the executorch docs](https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#step-2-prepare-model)
 2. Bundle the `.pte` and `tokenizer.model` file into your app

+We now support models quantized with SpinQuant and QAT+LoRA, which offer a significant performance boost over BF16 (measured with the demo app on an iPhone 13 Pro):
+
+
+| Llama 3.2 1B | Tokens / Second (total) |  | Time-to-First-Token (sec) |  |
+| :---- | :---- | :---- | :---- | :---- |
+|  | Haiku | Paragraph | Haiku | Paragraph |
+| BF16 | 2.2 | 2.5 | 2.3 | 1.9 |
+| QAT+LoRA | 7.1 | 3.3 | 0.37 | 0.24 |
+| SpinQuant | 10.1 | 5.2 | 0.2 | 0.2 |
+
+
 ## Using LocalInference

 1. Instantiate LocalInference with a DispatchQueue. Optionally, pass it into your agents service: