diff --git a/llama_stack/providers/impls/ios/inference/README.md b/llama_stack/providers/impls/ios/inference/README.md
index d6ce42382..160980759 100644
--- a/llama_stack/providers/impls/ios/inference/README.md
+++ b/llama_stack/providers/impls/ios/inference/README.md
@@ -56,9 +56,20 @@ We're working on making LocalInference easier to set up. For now, you'll need t
 
 ## Preparing a model
 
-1. Prepare a `.pte` file [following the executorch docs](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/README.md#step-2-prepare-model)
+1. Prepare a `.pte` file [following the executorch docs](https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#step-2-prepare-model)
 2. Bundle the `.pte` and `tokenizer.model` file into your app
 
+We now support models quantized using SpinQuant and QAT-LoRA, which offer a significant performance boost (demo app on iPhone 13 Pro):
+
+
+| Llama 3.2 1B | Tokens / Second (total) |  | Time-to-First-Token (sec) |  |
+| :---- | :---- | :---- | :---- | :---- |
+|  | Haiku | Paragraph | Haiku | Paragraph |
+| BF16 | 2.2 | 2.5 | 2.3 | 1.9 |
+| QAT+LoRA | 7.1 | 3.3 | 0.37 | 0.24 |
+| SpinQuant | 10.1 | 5.2 | 0.2 | 0.2 |
+
+
 ## Using LocalInference
 
 1. Instantiate LocalInference with a DispatchQueue. Optionally, pass it into your agents service:
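
The hunk's trailing context stops right where the README introduces its Swift snippet for step 1 of "Using LocalInference". For orientation only, here is a minimal sketch of what that instantiation can look like; the `LocalInference` and `LocalAgents` type names, their initializer signatures, and the queue label are assumptions for illustration, not part of this diff:

```swift
import Foundation

// Placeholder stand-ins for the SDK types (assumed names); replace these
// with the real LocalInference / LocalAgents imports from the iOS package.
class LocalInference { init(queue: DispatchQueue) {} }
class LocalAgents { init(inference: LocalInference) {} }

final class InferenceServices {
  let runnerQueue: DispatchQueue
  let inferenceService: LocalInference
  let agentsService: LocalAgents

  init() {
    // A dedicated serial queue keeps model loading and token
    // generation off the main thread.
    runnerQueue = DispatchQueue(label: "org.llamastack.localinference")
    inferenceService = LocalInference(queue: runnerQueue)
    // Optionally hand the inference service to the agents service.
    agentsService = LocalAgents(inference: inferenceService)
  }
}
```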