	LocalInference
LocalInference provides a local inference implementation powered by executorch.
Llama Stack currently supports on-device inference for iOS with Android coming soon. You can run on-device inference on Android today using executorch, PyTorch’s on-device inference library.
Installation
We're working on making LocalInference easier to set up. For now, you'll need to import it via .xcframework:
- Clone the executorch submodule in this repo and its dependencies: git submodule update --init --recursive
- Install CMake for the executorch build
- Drag LocalInference.xcodeproj into your project
- Add LocalInference as a framework in your app target
- Add a package dependency on https://github.com/pytorch/executorch (branch latest)
- Add all the kernels / backends from executorch (but not executorch itself!) as frameworks in your app target:
  - backend_coreml
  - backend_mps
  - backend_xnnpack
  - kernels_custom
  - kernels_optimized
  - kernels_portable
  - kernels_quantized
- In "Build Settings" > "Other Linker Flags" > "Any iOS Simulator SDK", add:
    -force_load $(BUILT_PRODUCTS_DIR)/libkernels_optimized-simulator-release.a
    -force_load $(BUILT_PRODUCTS_DIR)/libkernels_custom-simulator-release.a
    -force_load $(BUILT_PRODUCTS_DIR)/libkernels_quantized-simulator-release.a
    -force_load $(BUILT_PRODUCTS_DIR)/libbackend_xnnpack-simulator-release.a
    -force_load $(BUILT_PRODUCTS_DIR)/libbackend_coreml-simulator-release.a
    -force_load $(BUILT_PRODUCTS_DIR)/libbackend_mps-simulator-release.a
- In "Build Settings" > "Other Linker Flags" > "Any iOS SDK", add:
    -force_load $(BUILT_PRODUCTS_DIR)/libkernels_optimized-ios-release.a
    -force_load $(BUILT_PRODUCTS_DIR)/libkernels_custom-ios-release.a
    -force_load $(BUILT_PRODUCTS_DIR)/libkernels_quantized-ios-release.a
    -force_load $(BUILT_PRODUCTS_DIR)/libbackend_xnnpack-ios-release.a
    -force_load $(BUILT_PRODUCTS_DIR)/libbackend_coreml-ios-release.a
    -force_load $(BUILT_PRODUCTS_DIR)/libbackend_mps-ios-release.a
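The clone and CMake steps can be scripted. The following is a minimal sketch, assuming it is run from the root of this repository. The `require_tool` and `prepare_executorch` helpers are hypothetical names introduced here, and suggesting Homebrew for CMake is an assumption (any installation method works).

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the named tool is on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing tool: $1" >&2
    return 1
}

# Hypothetical wrapper for the clone and CMake steps.
# Run it from the repository root.
prepare_executorch() {
    require_tool git || return 1
    # Assumption: CMake is installed via Homebrew; use whatever method you prefer.
    require_tool cmake || { echo "try: brew install cmake" >&2; return 1; }
    # Fetch the executorch submodule and its nested dependencies.
    git submodule update --init --recursive
}
```

Calling `prepare_executorch` stops early with a message if `git` or `cmake` is missing, which is easier to diagnose than a failed build later in Xcode.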
Preparing a model
- Prepare a .pte file following the executorch docs
- Bundle the .pte and tokenizer.model files into your app
We now support models quantized using SpinQuant and QAT+LoRA, which offer a significant performance boost (measured with the demo app on an iPhone 13 Pro):
| Llama 3.2 1B | Tokens / Second (total), Haiku | Tokens / Second (total), Paragraph | Time-to-First-Token (sec), Haiku | Time-to-First-Token (sec), Paragraph |
|---|---|---|---|---|
| BF16 | 2.2 | 2.5 | 2.3 | 1.9 |
| QAT+LoRA | 7.1 | 3.3 | 0.37 | 0.24 |
| SpinQuant | 10.1 | 5.2 | 0.2 | 0.2 |
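To put the table in relative terms, the ratios against the BF16 baseline can be computed directly. This is a small sketch using `awk` for the floating-point division; the `speedup` helper is a hypothetical name, and the inputs are the haiku figures from the table above.

```shell
#!/bin/sh
# speedup <baseline> <value>: print value / baseline to one decimal place.
speedup() {
    awk -v base="$1" -v val="$2" 'BEGIN { printf "%.1f\n", val / base }'
}

# Haiku throughput (tokens / second) relative to BF16.
speedup 2.2 7.1    # QAT+LoRA  -> 3.2
speedup 2.2 10.1   # SpinQuant -> 4.6

# Haiku time-to-first-token: baseline and value are swapped so that
# a larger ratio still means "faster".
speedup 0.37 2.3   # QAT+LoRA  -> 6.2
speedup 0.2 2.3    # SpinQuant -> 11.5
```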
Using LocalInference
- Instantiate LocalInference with a DispatchQueue. Optionally, pass it into your agents service:
  init () {
    runnerQueue = DispatchQueue(label: "org.meta.llamastack")
    inferenceService = LocalInferenceService(queue: runnerQueue)
    agentsService = LocalAgentsService(inference: inferenceService)
  }
- Before making any inference calls, load your model from your bundle:
let mainBundle = Bundle.main
inferenceService.loadModel(
    modelPath: mainBundle.url(forResource: "llama32_1b_spinquant", withExtension: "pte"),
    tokenizerPath: mainBundle.url(forResource: "tokenizer", withExtension: "model"),
    completion: {_ in } // use to handle load failures
)
- Make inference calls (or agents calls) as you normally would with LlamaStack:
for await chunk in try await agentsService.initAndCreateTurn(
    messages: [
    .UserMessage(Components.Schemas.UserMessage(
        content: .case1("Call functions as needed to handle any actions in the following text:\n\n" + text),
        role: .user))
    ]
) {
    // handle each streamed chunk here
}
Troubleshooting
If you receive errors like "missing package product" or "invalid checksum", try cleaning the build folder and resetting the Swift package cache:
(Opt+Click) Product > Clean Build Folder Immediately
rm -rf \
  ~/Library/org.swift.swiftpm \
  ~/Library/Caches/org.swift.swiftpm \
  ~/Library/Caches/com.apple.dt.Xcode \
  ~/Library/Developer/Xcode/DerivedData
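If you reset these caches often, the command above can be wrapped in a small function. The following is a sketch only: `reset_swiftpm_caches` is a hypothetical helper that removes the same four directories, reports what it deleted, and refuses to run with an empty base path.

```shell
#!/bin/sh
# Hypothetical helper wrapping the rm -rf command above.
# Usage: reset_swiftpm_caches [home_dir]   (defaults to $HOME)
reset_swiftpm_caches() {
    home_dir="${1:-$HOME}"
    # Refuse an empty base path so the system-wide /Library is never touched.
    [ -n "$home_dir" ] || { echo "no home directory given" >&2; return 1; }
    for rel in \
        Library/org.swift.swiftpm \
        Library/Caches/org.swift.swiftpm \
        Library/Caches/com.apple.dt.Xcode \
        Library/Developer/Xcode/DerivedData
    do
        # Only report directories that actually existed.
        if [ -e "$home_dir/$rel" ]; then
            rm -rf "$home_dir/$rel"
            echo "removed $home_dir/$rel"
        fi
    done
}
```

Quit Xcode before calling `reset_swiftpm_caches`, then reopen the project so the packages are resolved again.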