# iOS SDK
We offer both remote and on-device use of Llama Stack in Swift via two components:

1. [llama-stack-client-swift](https://github.com/meta-llama/llama-stack-client-swift/), a client for remote Llama Stack distributions
2. LocalInference, an on-device inference provider powered by executorch (covered below)
```{image} remote_or_local.gif
:alt: Seamlessly switching between local, on-device inference and remote hosted inference
:width: 412px
:align: center
```
## Remote Only
If you don't want to run inference on-device, you can connect to any hosted Llama Stack distribution using component #1 alone:
1. Add https://github.com/meta-llama/llama-stack-client-swift/ as a Package Dependency in Xcode (or declare it in `Package.swift`; see the sketch below)
2. Add `LlamaStackClient` as a framework to your app target
3. Call an API:
```swift
import LlamaStackClient

let agents = RemoteAgents(url: URL(string: "http://localhost:5000")!)
let request = Components.Schemas.CreateAgentTurnRequest(
  agent_id: agentId,
  messages: [
    .UserMessage(Components.Schemas.UserMessage(
      content: .case1("Hello Llama!"),
      role: .user
    ))
  ],
  session_id: self.agenticSystemSessionId,
  stream: true
)

for try await chunk in try await agents.createTurn(request: request) {
  let payload = chunk.event.payload
  // ...
}
```
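If you manage dependencies in a `Package.swift` rather than through the Xcode UI, the equivalent declaration looks roughly like this (a sketch; the branch and target name are assumptions, so pin whatever revision your app needs):

```swift
// Package.swift (sketch, not the canonical setup): pull in the client via SwiftPM.
dependencies: [
    .package(url: "https://github.com/meta-llama/llama-stack-client-swift", branch: "main")  // assumption: track main
],
targets: [
    .target(
        name: "YourApp",  // hypothetical target name
        dependencies: [
            .product(name: "LlamaStackClient", package: "llama-stack-client-swift")
        ]
    )
]
```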
Check out iOSCalendarAssistant for a complete app demo.
## LocalInference
LocalInference provides a local inference implementation powered by [executorch](https://github.com/pytorch/executorch). Llama Stack currently supports on-device inference on iOS, with Android support coming soon. (You can run on-device inference on Android today by using executorch, PyTorch's on-device inference library, directly.)
The APIs work the same as remote – the only difference is you'll instead use the `LocalAgents` / `LocalInference` classes and pass in a `DispatchQueue`:
```swift
private let runnerQueue = DispatchQueue(label: "org.llamastack.stacksummary")
let inference = LocalInference(queue: runnerQueue)
let agents = LocalAgents(inference: self.inference)
```
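Because the surface area matches the remote client, the streaming call from the Remote Only section works unchanged; a minimal sketch, reusing the `request` built in that example:

```swift
// Sketch: the CreateAgentTurnRequest from the Remote Only example
// streams through LocalAgents exactly as it did through RemoteAgents.
for try await chunk in try await agents.createTurn(request: request) {
  let payload = chunk.event.payload
  // ...
}
```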
Check out iOSCalendarAssistantWithLocalInf for a complete app demo.
### Installation
We're working on making LocalInference easier to set up. For now, you'll need to import it via `.xcframework`:
1. Clone the executorch submodule in this repo and its dependencies:

   ```bash
   git submodule update --init --recursive
   ```
2. Install CMake for the executorch build
3. Drag `LocalInference.xcodeproj` into your project
Add
LocalInference
as a framework in your app target -
5. Add a package dependency on https://github.com/pytorch/executorch (branch `latest`)
6. Add all the kernels / backends from executorch (but not executorch itself!) as frameworks in your app target:
   - backend_coreml
   - backend_mps
   - backend_xnnpack
   - kernels_custom
   - kernels_optimized
   - kernels_portable
   - kernels_quantized
7. In "Build Settings" > "Other Linker Flags" > "Any iOS Simulator SDK", add:

   ```
   -force_load $(BUILT_PRODUCTS_DIR)/libkernels_optimized-simulator-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libkernels_custom-simulator-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libkernels_quantized-simulator-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libbackend_xnnpack-simulator-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libbackend_coreml-simulator-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libbackend_mps-simulator-release.a
   ```
8. In "Build Settings" > "Other Linker Flags" > "Any iOS SDK", add:

   ```
   -force_load $(BUILT_PRODUCTS_DIR)/libkernels_optimized-ios-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libkernels_custom-ios-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libkernels_quantized-ios-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libbackend_xnnpack-ios-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libbackend_coreml-ios-release.a
   -force_load $(BUILT_PRODUCTS_DIR)/libbackend_mps-ios-release.a
   ```
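The `-force_load` flags matter because the kernels and backends register themselves with executorch through static initializers; without force-loading, the linker strips those otherwise-unreferenced objects and the ops are missing at runtime.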
### Preparing a model
- Prepare a `.pte` file following the executorch docs
- Bundle the `.pte` and `tokenizer.model` files into your app
We now support models quantized using SpinQuant and QAT-LoRA, which offer a significant performance boost. Numbers below are from the demo app on an iPhone 13 Pro:
| Llama 3.2 1B | Tokens / sec (Haiku) | Tokens / sec (Paragraph) | TTFT sec (Haiku) | TTFT sec (Paragraph) |
|---|---|---|---|---|
| BF16 | 2.2 | 2.5 | 2.3 | 1.9 |
| QAT+LoRA | 7.1 | 3.3 | 0.37 | 0.24 |
| SpinQuant | 10.1 | 5.2 | 0.2 | 0.2 |
### Using LocalInference
- Instantiate LocalInference with a DispatchQueue. Optionally, pass it into your agents service:

```swift
init() {
  runnerQueue = DispatchQueue(label: "org.meta.llamastack")
  inferenceService = LocalInferenceService(queue: runnerQueue)
  agentsService = LocalAgentsService(inference: inferenceService)
}
```
- Before making any inference calls, load your model from your bundle (see the failure-handling sketch after these steps):

```swift
let mainBundle = Bundle.main
inferenceService.loadModel(
  modelPath: mainBundle.url(forResource: "llama32_1b_spinquant", withExtension: "pte"),
  tokenizerPath: mainBundle.url(forResource: "tokenizer", withExtension: "model"),
  completion: { _ in }  // use to handle load failures
)
```
- Make inference calls (or agents calls) as you normally would with LlamaStack:

```swift
for await chunk in try await agentsService.initAndCreateTurn(
  messages: [
    .UserMessage(Components.Schemas.UserMessage(
      content: .case1("Call functions as needed to handle any actions in the following text:\n\n" + text),
      role: .user))
  ]
) {
  // handle each streamed chunk, e.g. via chunk.event.payload as above
}
```
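Since `Bundle.url(forResource:withExtension:)` returns an optional, it's worth failing fast when a bundled resource is missing rather than discovering it at inference time. A minimal sketch, assuming a crash-early policy and the resource names from the load step above:

```swift
// Sketch: guard the bundle lookups so a missing or misnamed .pte /
// tokenizer.model surfaces immediately, then load as shown earlier.
guard
  let modelURL = Bundle.main.url(forResource: "llama32_1b_spinquant", withExtension: "pte"),
  let tokenizerURL = Bundle.main.url(forResource: "tokenizer", withExtension: "model")
else {
  fatalError("Model or tokenizer missing from app bundle")  // assumption: crash-early policy
}

inferenceService.loadModel(
  modelPath: modelURL,
  tokenizerPath: tokenizerURL,
  completion: { _ in }  // use to handle load failures
)
```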
### Troubleshooting
If you receive errors like "missing package product" or "invalid checksum", try cleaning the build folder and resetting the Swift package cache:

(Opt+Click) Product > Clean Build Folder Immediately

```bash
rm -rf \
  ~/Library/org.swift.swiftpm \
  ~/Library/Caches/org.swift.swiftpm \
  ~/Library/Caches/com.apple.dt.Xcode \
  ~/Library/Developer/Xcode/DerivedData
```
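If you'd rather not delete the caches by hand, Xcode's File > Packages > Reset Package Caches performs a similar reset from inside the IDE.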