Restructure docs (#494)

Rendered docs at: https://llama-stack.readthedocs.io/en/doc-simplify/
2024-11-20 15:54:47 -08:00 · 2024-11-20 15:54:47 -08:00 · b3f9e8b2f2
commit b3f9e8b2f2
parent 068ac00a3b
20 changed files with 586 additions and 200 deletions
--- a/docs/source/getting_started/distributions/ondevice_distro/ios_sdk.md
+++ b/docs/source/getting_started/distributions/ondevice_distro/ios_sdk.md
@ -1,176 +0,0 @@
-# iOS SDK
-
-We offer both remote and on-device use of Llama Stack in Swift via two components:
-
-1. [llama-stack-client-swift](https://github.com/meta-llama/llama-stack-client-swift/)
-2. [LocalInferenceImpl](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/inline/ios/inference)
-
-```{image} ../../../../_static/remote_or_local.gif
-:alt: Seamlessly switching between local, on-device inference and remote hosted inference
-:width: 412px
-:align: center
-```
-
-## Remote Only
-
-If you don't want to run inference on-device, then you can connect to any hosted Llama Stack distribution with #1.
-
-1. Add `https://github.com/meta-llama/llama-stack-client-swift/` as a Package Dependency in Xcode
-
-2. Add `LlamaStackClient` as a framework to your app target
-
-3. Call an API:
-
-```swift
-import LlamaStackClient
-
-let agents = RemoteAgents(url: URL(string: "http://localhost:5000")!)
-let request = Components.Schemas.CreateAgentTurnRequest(
-        agent_id: agentId,
-        messages: [
-          .UserMessage(Components.Schemas.UserMessage(
-            content: .case1("Hello Llama!"),
-            role: .user
-          ))
-        ],
-        session_id: self.agenticSystemSessionId,
-        stream: true
-      )
-
-      for try await chunk in try await agents.createTurn(request: request) {
-        let payload = chunk.event.payload
-      // ...
-```
-
-Check out [iOSCalendarAssistant](https://github.com/meta-llama/llama-stack-apps/tree/main/examples/ios_calendar_assistant) for a complete app demo.
-
-## LocalInference
-
-LocalInference provides a local inference implementation powered by [executorch](https://github.com/pytorch/executorch/).
-
-Llama Stack currently supports on-device inference for iOS with Android coming soon. You can run on-device inference on Android today using [executorch](https://github.com/pytorch/executorch/tree/main/examples/demo-apps/android/LlamaDemo), PyTorch’s on-device inference library.
-
-The APIs *work the same as remote* – the only difference is you'll instead use the `LocalAgents` / `LocalInference` classes and pass in a `DispatchQueue`:
-
-```swift
-private let runnerQueue = DispatchQueue(label: "org.llamastack.stacksummary")
-let inference = LocalInference(queue: runnerQueue)
-let agents = LocalAgents(inference: self.inference)
-```
-
-Check out [iOSCalendarAssistantWithLocalInf](https://github.com/meta-llama/llama-stack-apps/tree/main/examples/ios_calendar_assistant) for a complete app demo.
-
-### Installation
-
-We're working on making LocalInference easier to set up. For now, you'll need to import it via `.xcframework`:
-
-1. Clone the executorch submodule in this repo and its dependencies: `git submodule update --init --recursive`
-1. Install [Cmake](https://cmake.org/) for the executorch build`
-1. Drag `LocalInference.xcodeproj` into your project
-1. Add `LocalInference` as a framework in your app target
-1. Add a package dependency on https://github.com/pytorch/executorch (branch latest)
-1. Add all the kernels / backends from executorch (but not exectuorch itself!) as frameworks in your app target:
-    - backend_coreml
-    - backend_mps
-    - backend_xnnpack
-    - kernels_custom
-    - kernels_optimized
-    - kernels_portable
-    - kernels_quantized
-1. In "Build Settings" > "Other Linker Flags" > "Any iOS Simulator SDK", add:
-    ```
-    -force_load
-    $(BUILT_PRODUCTS_DIR)/libkernels_optimized-simulator-release.a
-    -force_load
-    $(BUILT_PRODUCTS_DIR)/libkernels_custom-simulator-release.a
-    -force_load
-    $(BUILT_PRODUCTS_DIR)/libkernels_quantized-simulator-release.a
-    -force_load
-    $(BUILT_PRODUCTS_DIR)/libbackend_xnnpack-simulator-release.a
-    -force_load
-    $(BUILT_PRODUCTS_DIR)/libbackend_coreml-simulator-release.a
-    -force_load
-    $(BUILT_PRODUCTS_DIR)/libbackend_mps-simulator-release.a
-    ```
-
-1. In "Build Settings" > "Other Linker Flags" > "Any iOS SDK", add:
-
-    ```
-    -force_load
-    $(BUILT_PRODUCTS_DIR)/libkernels_optimized-simulator-release.a
-    -force_load
-    $(BUILT_PRODUCTS_DIR)/libkernels_custom-simulator-release.a
-    -force_load
-    $(BUILT_PRODUCTS_DIR)/libkernels_quantized-simulator-release.a
-    -force_load
-    $(BUILT_PRODUCTS_DIR)/libbackend_xnnpack-simulator-release.a
-    -force_load
-    $(BUILT_PRODUCTS_DIR)/libbackend_coreml-simulator-release.a
-    -force_load
-    $(BUILT_PRODUCTS_DIR)/libbackend_mps-simulator-release.a
-    ```
-
-### Preparing a model
-
-1. Prepare a `.pte` file [following the executorch docs](https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#step-2-prepare-model)
-2. Bundle the `.pte` and `tokenizer.model` file into your app
-
-We now support models quantized using SpinQuant and QAT-LoRA which offer a significant performance boost (demo app on iPhone 13 Pro):
-
-
-| Llama 3.2 1B | Tokens / Second (total) |  | Time-to-First-Token (sec) |  |
-| :---- | :---- | :---- | :---- | :---- |
-|  | Haiku | Paragraph | Haiku | Paragraph |
-| BF16 | 2.2 | 2.5 | 2.3 | 1.9 |
-| QAT+LoRA | 7.1 | 3.3 | 0.37 | 0.24 |
-| SpinQuant | 10.1 | 5.2 | 0.2 | 0.2 |
-
-
-### Using LocalInference
-
-1. Instantiate LocalInference with a DispatchQueue. Optionally, pass it into your agents service:
-
-```swift
-  init () {
-    runnerQueue = DispatchQueue(label: "org.meta.llamastack")
-    inferenceService = LocalInferenceService(queue: runnerQueue)
-    agentsService = LocalAgentsService(inference: inferenceService)
-  }
-```
-
-2. Before making any inference calls, load your model from your bundle:
-
-```swift
-let mainBundle = Bundle.main
-inferenceService.loadModel(
-    modelPath: mainBundle.url(forResource: "llama32_1b_spinquant", withExtension: "pte"),
-    tokenizerPath: mainBundle.url(forResource: "tokenizer", withExtension: "model"),
-    completion: {_ in } // use to handle load failures
-)
-```
-
-3. Make inference calls (or agents calls) as you normally would with LlamaStack:
-
-```
-for await chunk in try await agentsService.initAndCreateTurn(
-    messages: [
-    .UserMessage(Components.Schemas.UserMessage(
-        content: .case1("Call functions as needed to handle any actions in the following text:\n\n" + text),
-        role: .user))
-    ]
-) {
-```
-
-### Troubleshooting
-
-If you receive errors like "missing package product" or "invalid checksum", try cleaning the build folder and resetting the Swift package cache:
-
-(Opt+Click) Product > Clean Build Folder Immediately
-
-```
-rm -rf \
-  ~/Library/org.swift.swiftpm \
-  ~/Library/Caches/org.swift.swiftpm \
-  ~/Library/Caches/com.apple.dt.Xcode \
-  ~/Library/Developer/Xcode/DerivedData
-```