# iOS SDK
We offer both remote and on-device use of Llama Stack in Swift via two components:

1. [llama-stack-client-swift](https://github.com/meta-llama/llama-stack-client-swift/), a client for remote Llama Stack distributions
2. LocalInference, an on-device inference provider powered by executorch (covered below)
```{image} remote_or_local.gif
:alt: Seamlessly switching between local, on-device inference and remote hosted inference
:width: 412px
:align: center
```
## Remote Only
If you don't want to run inference on-device, you can connect to any hosted Llama Stack distribution using component #1 alone:
1. Add https://github.com/meta-llama/llama-stack-client-swift/ as a Package Dependency in Xcode (or declare it in `Package.swift`; see the sketch below)
2. Add `LlamaStackClient` as a framework to your app target
3. Call an API:
```swift
import LlamaStackClient

let agents = RemoteAgents(url: URL(string: "http://localhost:5000")!)
let request = Components.Schemas.CreateAgentTurnRequest(
  agent_id: agentId,
  messages: [
    .UserMessage(Components.Schemas.UserMessage(
      content: .case1("Hello Llama!"),
      role: .user
    ))
  ],
  session_id: self.agenticSystemSessionId,
  stream: true
)

for try await chunk in try await agents.createTurn(request: request) {
  let payload = chunk.event.payload
  // ...
}
```
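If you manage dependencies in a `Package.swift` rather than through the Xcode UI, the equivalent declaration looks roughly like this (a sketch; the branch and target name are assumptions, so pin whatever revision your app needs):

```swift
// Package.swift (sketch, not the canonical setup): pull in the client via SwiftPM.
dependencies: [
    .package(url: "https://github.com/meta-llama/llama-stack-client-swift", branch: "main")  // assumption: track main
],
targets: [
    .target(
        name: "YourApp",  // hypothetical target name
        dependencies: [
            .product(name: "LlamaStackClient", package: "llama-stack-client-swift")
        ]
    )
]
```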
Check out iOSCalendarAssistant for a complete app demo.
## LocalInference
LocalInference provides a local inference implementation powered by [executorch](https://github.com/pytorch/executorch). Llama Stack currently supports on-device inference on iOS, with Android support coming soon. (You can run on-device inference on Android today by using executorch, PyTorch's on-device inference library, directly.)
The APIs work the same as remote – the only difference is you'll instead use the `LocalAgents` / `LocalInference` classes and pass in a `DispatchQueue`:
```swift
private let runnerQueue = DispatchQueue(label: "org.llamastack.stacksummary")
let inference = LocalInference(queue: runnerQueue)
let agents = LocalAgents(inference: self.inference)
```
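Because the surface area matches the remote client, the streaming call from the Remote Only section works unchanged; a minimal sketch, reusing the `request` built in that example:

```swift
// Sketch: the CreateAgentTurnRequest from the Remote Only example
// streams through LocalAgents exactly as it did through RemoteAgents.
for try await chunk in try await agents.createTurn(request: request) {
  let payload = chunk.event.payload
  // ...
}
```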
Check out iOSCalendarAssistantWithLocalInf for a complete app demo.
### Installation
We're working on making LocalInference easier to set up. For now, you'll need to import it via `.xcframework`:
1. Clone the executorch submodule in this repo and its dependencies:

   ```bash
   git submodule update --init --recursive
   ```
2. Install CMake for the executorch build
3. Drag `LocalInference.xcodeproj` into your project
Add
LocalInference
as a framework in your app target -
5. Add a package dependency on https://github.com/pytorch/executorch (branch `latest`)
6. Add all the kernels / backends from executorch (but not executorch itself!) as frameworks in your app target:
   - backend_coreml
   - backend_mps
   - backend_xnnpack
   - kernels_custom
   - kernels_optimized
   - kernels_portable
   - kernels_quantized
7. In "Build Settings" > "Other Linker Flags" > "Any iOS Simulator SDK", add:

   ```
   -force_load $(BUILT_PRODUCTS_DIR)/libkernels_optimized-simulator-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libkernels_custom-simulator-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libkernels_quantized-simulator-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libbackend_xnnpack-simulator-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libbackend_coreml-simulator-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libbackend_mps-simulator-release.a
   ```
8. In "Build Settings" > "Other Linker Flags" > "Any iOS SDK", add:

   ```
   -force_load $(BUILT_PRODUCTS_DIR)/libkernels_optimized-ios-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libkernels_custom-ios-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libkernels_quantized-ios-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libbackend_xnnpack-ios-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libbackend_coreml-ios-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libbackend_mps-ios-release.a
   ```
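The `-force_load` flags matter because the kernels and backends register themselves with executorch through static initializers; without force-loading, the linker strips those otherwise-unreferenced objects and the ops are missing at runtime.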
### Preparing a model
- Prepare a `.pte` file following the executorch docs
- Bundle the `.pte` and `tokenizer.model` files into your app
We now support models quantized using SpinQuant and QAT-LoRA, which offer a significant performance boost. Numbers below are from the demo app on an iPhone 13 Pro:
| Llama 3.2 1B | Tokens / sec (Haiku) | Tokens / sec (Paragraph) | TTFT sec (Haiku) | TTFT sec (Paragraph) |
|---|---|---|---|---|
| BF16 | 2.2 | 2.5 | 2.3 | 1.9 |
| QAT+LoRA | 7.1 | 3.3 | 0.37 | 0.24 |
| SpinQuant | 10.1 | 5.2 | 0.2 | 0.2 |
### Using LocalInference
- Instantiate LocalInference with a DispatchQueue. Optionally, pass it into your agents service:

```swift
init() {
  runnerQueue = DispatchQueue(label: "org.meta.llamastack")
  inferenceService = LocalInferenceService(queue: runnerQueue)
  agentsService = LocalAgentsService(inference: inferenceService)
}
```
- Before making any inference calls, load your model from your bundle (see the failure-handling sketch after these steps):

```swift
let mainBundle = Bundle.main
inferenceService.loadModel(
  modelPath: mainBundle.url(forResource: "llama32_1b_spinquant", withExtension: "pte"),
  tokenizerPath: mainBundle.url(forResource: "tokenizer", withExtension: "model"),
  completion: { _ in }  // use to handle load failures
)
```
- Make inference calls (or agents calls) as you normally would with LlamaStack:

```swift
for await chunk in try await agentsService.initAndCreateTurn(
  messages: [
    .UserMessage(Components.Schemas.UserMessage(
      content: .case1("Call functions as needed to handle any actions in the following text:\n\n" + text),
      role: .user))
  ]
) {
  // handle each streamed chunk, e.g. via chunk.event.payload as above
}
```
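Since `Bundle.url(forResource:withExtension:)` returns an optional, it's worth failing fast when a bundled resource is missing rather than discovering it at inference time. A minimal sketch, assuming a crash-early policy and the resource names from the load step above:

```swift
// Sketch: guard the bundle lookups so a missing or misnamed .pte /
// tokenizer.model surfaces immediately, then load as shown earlier.
guard
  let modelURL = Bundle.main.url(forResource: "llama32_1b_spinquant", withExtension: "pte"),
  let tokenizerURL = Bundle.main.url(forResource: "tokenizer", withExtension: "model")
else {
  fatalError("Model or tokenizer missing from app bundle")  // assumption: crash-early policy
}

inferenceService.loadModel(
  modelPath: modelURL,
  tokenizerPath: tokenizerURL,
  completion: { _ in }  // use to handle load failures
)
```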
### Troubleshooting
If you receive errors like "missing package product" or "invalid checksum", try cleaning the build folder and resetting the Swift package cache:

(Opt+Click) Product > Clean Build Folder Immediately

```bash
rm -rf \
  ~/Library/org.swift.swiftpm \
  ~/Library/Caches/org.swift.swiftpm \
  ~/Library/Caches/com.apple.dt.Xcode \
  ~/Library/Developer/Xcode/DerivedData
```
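If you'd rather not delete the caches by hand, Xcode's File > Packages > Reset Package Caches performs a similar reset from inside the IDE.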