forked from phoenix-oss/llama-stack-mirror
[docs] update documentations (#356)
* move docs -> source * Add files via upload * mv image * Add files via upload * colocate iOS setup doc * delete image * Add files via upload * fix * delete image * Add files via upload * Update developer_cookbook.md * toctree * wip subfolder * docs update * subfolder * updates * name * updates * index * updates * refactor structure * depth * docs * content * docs * getting started * distributions * fireworks * fireworks * update * theme * theme * theme * pdj theme * pytorch theme * css * theme * agents example * format * index * headers * copy button * test tabs * test tabs * fix * tabs * tab * tabs * sphinx_design * quick start commands * size * width * css * css * download models * asthetic fix * tab format * update * css * width * css * docs * tab based * tab * tabs * docs * style * image * css * color * typo * update docs * missing links * list templates * links * links update * troubleshooting * fix * distributions * docs * fix table * kill llamastack-local-gpu/cpu * Update index.md * Update index.md * mv ios_setup.md * Update ios_setup.md * Add remote_or_local.gif * Update ios_setup.md * release notes * typos * Add ios_setup to index * nav bar * hide torctree * ios image * links update * rename * rename * docs * rename * links * distributions * distributions * distributions * distributions * remove release * remote --------- Co-authored-by: dltn <6599399+dltn@users.noreply.github.com> Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
This commit is contained in: parent ac93dd89cf, commit c810a4184d
37 changed files with 1777 additions and 2154 deletions
@ -0,0 +1,9 @@

# On-Device Distribution

On-device distributions are Llama Stack distributions that run locally on your iOS / Android device.

```{toctree}
:maxdepth: 1

ios_sdk
```

@ -0,0 +1,176 @@

# iOS SDK

We offer both remote and on-device use of Llama Stack in Swift via two components:

1. [llama-stack-client-swift](https://github.com/meta-llama/llama-stack-client-swift/)
2. [LocalInferenceImpl](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/impls/ios/inference)

```{image} ../../../../_static/remote_or_local.gif
:alt: Seamlessly switching between local, on-device inference and remote hosted inference
:width: 412px
:align: center
```

## Remote Only

If you don't want to run inference on-device, then you can connect to any hosted Llama Stack distribution with #1.

1. Add `https://github.com/meta-llama/llama-stack-client-swift/` as a Package Dependency in Xcode

2. Add `LlamaStackClient` as a framework to your app target

3. Call an API:

```swift
import LlamaStackClient

let agents = RemoteAgents(url: URL(string: "http://localhost:5000")!)
let request = Components.Schemas.CreateAgentTurnRequest(
  agent_id: agentId,
  messages: [
    .UserMessage(Components.Schemas.UserMessage(
      content: .case1("Hello Llama!"),
      role: .user
    ))
  ],
  session_id: self.agenticSystemSessionId,
  stream: true
)

for try await chunk in try await agents.createTurn(request: request) {
  let payload = chunk.event.payload
  // ... process each streamed event payload here
}
```

Check out [iOSCalendarAssistant](https://github.com/meta-llama/llama-stack-apps/tree/main/examples/ios_calendar_assistant) for a complete app demo.

## LocalInference

LocalInference provides a local inference implementation powered by [executorch](https://github.com/pytorch/executorch/).

Llama Stack currently supports on-device inference for iOS, with Android coming soon. You can run on-device inference on Android today using [executorch](https://github.com/pytorch/executorch/tree/main/examples/demo-apps/android/LlamaDemo), PyTorch’s on-device inference library.

The APIs *work the same as remote* – the only difference is you'll instead use the `LocalAgents` / `LocalInference` classes and pass in a `DispatchQueue`:

```swift
private let runnerQueue = DispatchQueue(label: "org.llamastack.stacksummary")
let inference = LocalInference(queue: runnerQueue)
let agents = LocalAgents(inference: self.inference)
```

Check out [iOSCalendarAssistantWithLocalInf](https://github.com/meta-llama/llama-stack-apps/tree/main/examples/ios_calendar_assistant) for a complete app demo.

### Installation

We're working on making LocalInference easier to set up. For now, you'll need to import it via `.xcframework`:

1. Clone the executorch submodule in this repo and its dependencies: `git submodule update --init --recursive`
1. Install [CMake](https://cmake.org/) for the executorch build
1. Drag `LocalInference.xcodeproj` into your project
1. Add `LocalInference` as a framework in your app target
1. Add a package dependency on https://github.com/pytorch/executorch (branch `latest`)
1. Add all the kernels / backends from executorch (but not executorch itself!) as frameworks in your app target:
   - backend_coreml
   - backend_mps
   - backend_xnnpack
   - kernels_custom
   - kernels_optimized
   - kernels_portable
   - kernels_quantized
1. In "Build Settings" > "Other Linker Flags" > "Any iOS Simulator SDK", add:
|
||||
```
|
||||
-force_load
|
||||
$(BUILT_PRODUCTS_DIR)/libkernels_optimized-simulator-release.a
|
||||
-force_load
|
||||
$(BUILT_PRODUCTS_DIR)/libkernels_custom-simulator-release.a
|
||||
-force_load
|
||||
$(BUILT_PRODUCTS_DIR)/libkernels_quantized-simulator-release.a
|
||||
-force_load
|
||||
$(BUILT_PRODUCTS_DIR)/libbackend_xnnpack-simulator-release.a
|
||||
-force_load
|
||||
$(BUILT_PRODUCTS_DIR)/libbackend_coreml-simulator-release.a
|
||||
-force_load
|
||||
$(BUILT_PRODUCTS_DIR)/libbackend_mps-simulator-release.a
|
||||
```
|
||||
|
||||
1. In "Build Settings" > "Other Linker Flags" > "Any iOS SDK", add:
|
||||
|
||||
```
|
||||
-force_load
|
||||
$(BUILT_PRODUCTS_DIR)/libkernels_optimized-simulator-release.a
|
||||
-force_load
|
||||
$(BUILT_PRODUCTS_DIR)/libkernels_custom-simulator-release.a
|
||||
-force_load
|
||||
$(BUILT_PRODUCTS_DIR)/libkernels_quantized-simulator-release.a
|
||||
-force_load
|
||||
$(BUILT_PRODUCTS_DIR)/libbackend_xnnpack-simulator-release.a
|
||||
-force_load
|
||||
$(BUILT_PRODUCTS_DIR)/libbackend_coreml-simulator-release.a
|
||||
-force_load
|
||||
$(BUILT_PRODUCTS_DIR)/libbackend_mps-simulator-release.a
|
||||
```
|
||||
|
||||

### Preparing a model

1. Prepare a `.pte` file [following the executorch docs](https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#step-2-prepare-model)
2. Bundle the `.pte` and `tokenizer.model` file into your app

We now support models quantized using SpinQuant and QAT+LoRA, which offer a significant performance boost (demo app on iPhone 13 Pro):

| Llama 3.2 1B | Tokens/second (Haiku) | Tokens/second (Paragraph) | Time-to-first-token, sec (Haiku) | Time-to-first-token, sec (Paragraph) |
| :---- | :---- | :---- | :---- | :---- |
| BF16 | 2.2 | 2.5 | 2.3 | 1.9 |
| QAT+LoRA | 7.1 | 3.3 | 0.37 | 0.24 |
| SpinQuant | 10.1 | 5.2 | 0.2 | 0.2 |

### Using LocalInference

1. Instantiate LocalInference with a DispatchQueue. Optionally, pass it into your agents service:

```swift
init() {
  runnerQueue = DispatchQueue(label: "org.meta.llamastack")
  inferenceService = LocalInferenceService(queue: runnerQueue)
  agentsService = LocalAgentsService(inference: inferenceService)
}
```

2. Before making any inference calls, load your model from your bundle:

```swift
let mainBundle = Bundle.main
inferenceService.loadModel(
  modelPath: mainBundle.url(forResource: "llama32_1b_spinquant", withExtension: "pte"),
  tokenizerPath: mainBundle.url(forResource: "tokenizer", withExtension: "model"),
  completion: {_ in } // use to handle load failures
)
```

3. Make inference calls (or agents calls) as you normally would with Llama Stack:

```swift
for await chunk in try await agentsService.initAndCreateTurn(
  messages: [
    .UserMessage(Components.Schemas.UserMessage(
      content: .case1("Call functions as needed to handle any actions in the following text:\n\n" + text),
      role: .user))
  ]
) {
  // ... process each streamed chunk, as in the remote example above
}
```

### Troubleshooting

If you receive errors like "missing package product" or "invalid checksum", try cleaning the build folder and resetting the Swift package cache:

(Opt+Click) Product > Clean Build Folder Immediately

```
rm -rf \
  ~/Library/org.swift.swiftpm \
  ~/Library/Caches/org.swift.swiftpm \
  ~/Library/Caches/com.apple.dt.Xcode \
  ~/Library/Developer/Xcode/DerivedData
```

@ -0,0 +1,64 @@

# Fireworks Distribution

The `llamastack/distribution-fireworks` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |--------------- |---------------- |-------------------------------------------------- |---------------- |---------------- |
| **Provider(s)** | remote::fireworks | meta-reference | meta-reference | meta-reference | meta-reference |

### Step 0. Prerequisite
- Make sure you have a Fireworks API key. You can get one by visiting [fireworks.ai](https://fireworks.ai/).
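
If you'd rather not hard-code the key in `run.yaml`, one option is to export it in your shell before starting the stack. This is a sketch: the variable name is hypothetical, so check `distributions/fireworks/compose.yaml` for the exact variable it reads, if any.

```bash
# Hypothetical variable name; confirm against your compose file / run.yaml templating.
export FIREWORKS_API_KEY=<your-api-key>
```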

### Step 1. Start the Distribution (Single Node CPU)

#### (Option 1) Start Distribution Via Docker
> [!NOTE]
> This assumes you have a hosted endpoint at Fireworks with an API key.

```
$ cd distributions/fireworks && docker compose up
```

Make sure that in your `run.yaml` file, your inference provider is pointing to the correct Fireworks URL server endpoint. E.g.
```
inference:
  - provider_id: fireworks
    provider_type: remote::fireworks
    config:
      url: https://api.fireworks.ai/inference
      api_key: <optional api key>
```

#### (Option 2) Start Distribution Via Conda

```bash
llama stack build --template fireworks --image-type conda
# -- modify run.yaml to a valid Fireworks server endpoint
llama stack run ./run.yaml
```

### (Optional) Model Serving

Use `llama-stack-client models list` to check the available models served by Fireworks.
```
$ llama-stack-client models list
+------------------------------+------------------------------+---------------+------------+
| identifier                   | llama_model                  | provider_id   | metadata   |
+==============================+==============================+===============+============+
| Llama3.1-8B-Instruct         | Llama3.1-8B-Instruct         | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-70B-Instruct        | Llama3.1-70B-Instruct        | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-405B-Instruct       | Llama3.1-405B-Instruct       | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-1B-Instruct         | Llama3.2-1B-Instruct         | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-3B-Instruct         | Llama3.2-3B-Instruct         | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-11B-Vision-Instruct | Llama3.2-11B-Vision-Instruct | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-90B-Vision-Instruct | Llama3.2-90B-Vision-Instruct | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
```

@ -0,0 +1,15 @@

# Remote-Hosted Distribution

Remote-hosted distributions are Llama Stack distributions that connect to remote hosted services through the Llama Stack server; inference is handled by a remote provider. These are useful if you have an API key for a remote inference provider like Fireworks, Together, etc.

| **Distribution** | **Llama Stack Docker** | Start This Distribution | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|:----------------: |:------------------------------------------: |:-----------------------: |:------------------: |:------------------: |:------------------: |:------------------: |:------------------: |
| Together | [llamastack/distribution-together](https://hub.docker.com/repository/docker/llamastack/distribution-together/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/remote_hosted_distro/together.html) | remote::together | meta-reference | remote::weaviate | meta-reference | meta-reference |
| Fireworks | [llamastack/distribution-fireworks](https://hub.docker.com/repository/docker/llamastack/distribution-fireworks/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/remote_hosted_distro/fireworks.html) | remote::fireworks | meta-reference | remote::weaviate | meta-reference | meta-reference |
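
Once you have picked an endpoint, you can point the client CLI at it and inspect what it serves; a minimal sketch using the `llama-stack-client` CLI shown in the per-distribution guides (the `configure` subcommand stores the endpoint for subsequent commands):

```bash
# Register the remote endpoint with the client, then list the models it serves.
llama-stack-client configure --endpoint https://llama-stack.together.ai
llama-stack-client models list
```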

```{toctree}
:maxdepth: 1

fireworks
together
```

@ -0,0 +1,62 @@

# Together Distribution

### Connect to a Llama Stack Together Endpoint
- You may connect to a hosted endpoint `https://llama-stack.together.ai`, serving a Llama Stack distribution

The `llamastack/distribution-together` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |--------------- |---------------- |-------------------------------------------------- |---------------- |---------------- |
| **Provider(s)** | remote::together | meta-reference | meta-reference, remote::weaviate | meta-reference | meta-reference |

### Docker: Start the Distribution (Single Node CPU)

> [!NOTE]
> This assumes you have a hosted endpoint at Together with an API key.

```
$ cd distributions/together && docker compose up
```

Make sure that in your `run.yaml` file, your inference provider is pointing to the correct Together URL server endpoint. E.g.
```
inference:
  - provider_id: together
    provider_type: remote::together
    config:
      url: https://api.together.xyz/v1
      api_key: <optional api key>
```

### Conda: llama stack run (Single Node CPU)

```bash
llama stack build --template together --image-type conda
# -- modify run.yaml to a valid Together server endpoint
llama stack run ./run.yaml
```

### (Optional) Update Model Serving Configuration

Use `llama-stack-client models list` to check the available models served by Together.

```
$ llama-stack-client models list
+------------------------------+------------------------------+---------------+------------+
| identifier                   | llama_model                  | provider_id   | metadata   |
+==============================+==============================+===============+============+
| Llama3.1-8B-Instruct         | Llama3.1-8B-Instruct         | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-70B-Instruct        | Llama3.1-70B-Instruct        | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-405B-Instruct       | Llama3.1-405B-Instruct       | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-3B-Instruct         | Llama3.2-3B-Instruct         | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-11B-Vision-Instruct | Llama3.2-11B-Vision-Instruct | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-90B-Vision-Instruct | Llama3.2-90B-Vision-Instruct | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
```

@ -0,0 +1,68 @@

# Dell-TGI Distribution

The `llamastack/distribution-tgi` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |--------------- |---------------- |-------------------------------------------------- |---------------- |---------------- |
| **Provider(s)** | remote::tgi | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |

The only difference vs. the `tgi` distribution is that it runs the Dell-TGI server for inference.

### Start the Distribution (Single Node GPU)

> [!NOTE]
> This assumes you have access to a GPU, since the TGI server needs one.

```
$ cd distributions/dell-tgi/
$ ls
compose.yaml  README.md  run.yaml
$ docker compose up
```

The script first starts the TGI server, then starts the Llama Stack distribution server, which connects to the remote TGI provider for inference. You should see output like the following:
```
[text-generation-inference] | 2024-10-15T18:56:33.810397Z INFO text_generation_router::server: router/src/server.rs:1813: Using config Some(Llama)
[text-generation-inference] | 2024-10-15T18:56:33.810448Z WARN text_generation_router::server: router/src/server.rs:1960: Invalid hostname, defaulting to 0.0.0.0
[text-generation-inference] | 2024-10-15T18:56:33.864143Z INFO text_generation_router::server: router/src/server.rs:2353: Connected
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
```

To stop the server:
```
docker compose down
```

### (Alternative) Dell-TGI server + llama stack run (Single Node GPU)

#### Start Dell-TGI server locally
```
docker run -it --shm-size 1g -p 80:80 --gpus 4 \
  -e NUM_SHARD=4 \
  -e MAX_BATCH_PREFILL_TOKENS=32768 \
  -e MAX_INPUT_TOKENS=8000 \
  -e MAX_TOTAL_TOKENS=8192 \
  registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct
```
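
Before pointing Llama Stack at it, you can sanity-check the TGI endpoint directly via TGI's standard `/generate` route (a quick check, assuming the `-p 80:80` port mapping above):

```bash
# Expect a JSON body with "generated_text" if TGI is serving correctly.
curl http://127.0.0.1:80/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 16}}'
```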

#### Start Llama Stack server pointing to TGI server

```
docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-tgi --yaml_config /root/my-run.yaml
```

Make sure that in your `run.yaml` file, your inference provider is pointing to the correct TGI server endpoint. E.g.
```
inference:
  - provider_id: tgi0
    provider_type: remote::tgi
    config:
      url: http://127.0.0.1:5009
```

@ -0,0 +1,20 @@

# Self-Hosted Distribution

We offer deployable distributions where you can host your own Llama Stack server using local inference.

| **Distribution** | **Llama Stack Docker** | Start This Distribution | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|:----------------: |:------------------------------------------: |:-----------------------: |:------------------: |:------------------: |:------------------: |:------------------: |:------------------: |
| Meta Reference | [llamastack/distribution-meta-reference-gpu](https://hub.docker.com/repository/docker/llamastack/distribution-meta-reference-gpu/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/meta-reference-gpu.html) | meta-reference | meta-reference | meta-reference; remote::pgvector; remote::chromadb | meta-reference | meta-reference |
| Meta Reference Quantized | [llamastack/distribution-meta-reference-quantized-gpu](https://hub.docker.com/repository/docker/llamastack/distribution-meta-reference-quantized-gpu/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/meta-reference-quantized-gpu.html) | meta-reference-quantized | meta-reference | meta-reference; remote::pgvector; remote::chromadb | meta-reference | meta-reference |
| Ollama | [llamastack/distribution-ollama](https://hub.docker.com/repository/docker/llamastack/distribution-ollama/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/ollama.html) | remote::ollama | meta-reference | remote::pgvector; remote::chromadb | meta-reference | meta-reference |
| TGI | [llamastack/distribution-tgi](https://hub.docker.com/repository/docker/llamastack/distribution-tgi/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/tgi.html) | remote::tgi | meta-reference | meta-reference; remote::pgvector; remote::chromadb | meta-reference | meta-reference |

```{toctree}
:maxdepth: 1

meta-reference-gpu
meta-reference-quantized-gpu
ollama
tgi
dell-tgi
```

@ -0,0 +1,71 @@

# Meta Reference Distribution

The `llamastack/distribution-meta-reference-gpu` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |--------------- |---------------- |-------------------------------------------------- |---------------- |---------------- |
| **Provider(s)** | meta-reference | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |

### Step 0. Prerequisite - Downloading Models
Please make sure you have Llama model checkpoints downloaded in `~/.llama` before proceeding. See the [installation guide](https://llama-stack.readthedocs.io/en/latest/cli_reference/download_models.html) for how to download the models.

```
$ ls ~/.llama/checkpoints
Llama3.1-8B           Llama3.2-11B-Vision-Instruct  Llama3.2-1B-Instruct  Llama3.2-90B-Vision-Instruct  Llama-Guard-3-8B
Llama3.1-8B-Instruct  Llama3.2-1B                   Llama3.2-3B-Instruct  Llama-Guard-3-1B              Prompt-Guard-86M
```

### Step 1. Start the Distribution

#### (Option 1) Start with Docker
```
$ cd distributions/meta-reference-gpu && docker compose up
```

> [!NOTE]
> This assumes you have access to a GPU, since the local server needs one.

> [!NOTE]
> `~/.llama` should be the path containing downloaded weights of Llama models.

This will download and start running a pre-built Docker container. Alternatively, you may use the following commands:

```
docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml
```

#### (Option 2) Start with Conda

1. Install the `llama` CLI. See the [CLI Reference](https://llama-stack.readthedocs.io/en/latest/cli_reference/index.html)

2. Build the `meta-reference-gpu` distribution

```
$ llama stack build --template meta-reference-gpu --image-type conda
```

3. Start the distribution
```
$ cd distributions/meta-reference-gpu
$ llama stack run ./run.yaml
```

### (Optional) Serving a new model
You may change the `config.model` in `run.yaml` to update the model currently being served by the distribution. Make sure you have the model checkpoint downloaded in your `~/.llama`.
```
inference:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      model: Llama3.2-11B-Vision-Instruct
      quantization: null
      torch_seed: null
      max_seq_len: 4096
      max_batch_size: 1
```

Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
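
For example (a sketch; confirm the exact flags with `llama model download --help`, per the CLI reference linked above):

```bash
# List downloadable models, then fetch a checkpoint into ~/.llama.
llama model list
llama model download --source meta --model-id Llama3.2-11B-Vision-Instruct
```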

@ -0,0 +1,34 @@

# Meta Reference Quantized Distribution

The `llamastack/distribution-meta-reference-quantized-gpu` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |------------------------ |---------------- |-------------------------------------------------- |---------------- |---------------- |
| **Provider(s)** | meta-reference-quantized | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |

The only difference vs. the `meta-reference-gpu` distribution is that it supports more efficient inference -- with fp8, int4 quantization, etc.

### Start the Distribution (Single Node GPU)

> [!NOTE]
> This assumes you have access to a GPU, since the local server needs one.

> [!NOTE]
> `~/.llama` should be the path containing downloaded weights of Llama models.

To download and start running a pre-built Docker container, you may use the following commands:

```
docker run -it -p 5000:5000 -v ~/.llama:/root/.llama \
  -v ./run.yaml:/root/my-run.yaml \
  --gpus=all \
  distribution-meta-reference-quantized-gpu \
  --yaml_config /root/my-run.yaml
```

### Alternative (Build and start distribution locally via conda)

- You may check out the [Getting Started](../../docs/getting_started.md) guide for more details on building locally via conda and starting up the distribution.

@ -0,0 +1,129 @@

# Ollama Distribution

The `llamastack/distribution-ollama` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |---------------- |---------------- |---------------------------------- |---------------- |---------------- |
| **Provider(s)** | remote::ollama | meta-reference | remote::pgvector, remote::chroma | remote::ollama | meta-reference |

### Docker: Start a Distribution (Single Node GPU)

> [!NOTE]
> This assumes you have access to a GPU, since it starts an Ollama server with GPU access.

```
$ cd distributions/ollama/gpu
$ ls
compose.yaml  run.yaml
$ docker compose up
```

You will see output similar to the following:
```
[ollama] | [GIN] 2024/10/18 - 21:19:41 | 200 | 226.841µs | ::1 | GET "/api/ps"
[ollama] | [GIN] 2024/10/18 - 21:19:42 | 200 | 60.908µs | ::1 | GET "/api/ps"
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
[llamastack] | Resolved 12 providers
[llamastack] | inner-inference => ollama0
[llamastack] | models => __routing_table__
[llamastack] | inference => __autorouted__
```

To stop the server:
```
docker compose down
```

### Docker: Start the Distribution (Single Node CPU)

> [!NOTE]
> This starts an Ollama server in CPU-only mode; please see the [Ollama documentation](https://github.com/ollama/ollama) for details on serving models on CPU.

```
$ cd distributions/ollama/cpu
$ ls
compose.yaml  run.yaml
$ docker compose up
```

### Conda: ollama run + llama stack run

If you wish to separately spin up an Ollama server and connect it to Llama Stack, you may use the following commands.

#### Start the Ollama server
- Please check the [Ollama documentation](https://github.com/ollama/ollama) for more details.

**Via Docker**
```
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

**Via CLI**
```
ollama run <model_id>
```
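
To confirm Ollama is serving before wiring up Llama Stack, you can hit its HTTP API directly (a quick check, assuming the default port 11434 from the Docker command above and a model you have already pulled):

```bash
# Returns a JSON completion if the model is loaded and serving.
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.1:8b-instruct-fp16",
  "prompt": "Hello",
  "stream": false
}'
```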

#### Start Llama Stack server pointing to Ollama server

**Via Conda**

```
llama stack build --template ollama --image-type conda
llama stack run ./gpu/run.yaml
```

**Via Docker**
```
docker run --network host -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./gpu/run.yaml:/root/llamastack-run-ollama.yaml --gpus=all llamastack/distribution-ollama --yaml_config /root/llamastack-run-ollama.yaml
```

Make sure in your `run.yaml` file, your inference provider is pointing to the correct Ollama endpoint. E.g.
```
inference:
  - provider_id: ollama0
    provider_type: remote::ollama
    config:
      url: http://127.0.0.1:11434
```

### (Optional) Update Model Serving Configuration

#### Downloading model via Ollama

You can use Ollama to manage model downloads.

```
ollama pull llama3.1:8b-instruct-fp16
ollama pull llama3.1:70b-instruct-fp16
```

> [!NOTE]
> Please check [OLLAMA_SUPPORTED_MODELS](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/adapters/inference/ollama/ollama.py) for the supported Ollama models.

To serve a new model with `ollama`:
```
ollama run <model_name>
```

To make sure that the model is being served correctly, run `ollama ps` to get a list of models being served by Ollama.
```
$ ollama ps

NAME                         ID              SIZE     PROCESSOR    UNTIL
llama3.1:8b-instruct-fp16    4aacac419454    17 GB    100% GPU     4 minutes from now
```

To verify that the model served by Ollama is correctly connected to the Llama Stack server:
```
$ llama-stack-client models list
+----------------------+----------------------+---------------+-----------------------------------------------+
| identifier           | llama_model          | provider_id   | metadata                                      |
+======================+======================+===============+===============================================+
| Llama3.1-8B-Instruct | Llama3.1-8B-Instruct | ollama0       | {'ollama_model': 'llama3.1:8b-instruct-fp16'} |
+----------------------+----------------------+---------------+-----------------------------------------------+
```

@ -0,0 +1,112 @@

# TGI Distribution

The `llamastack/distribution-tgi` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |--------------- |---------------- |-------------------------------------------------- |---------------- |---------------- |
| **Provider(s)** | remote::tgi | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |

### Docker: Start the Distribution (Single Node GPU)

> [!NOTE]
> This assumes you have access to a GPU, since the TGI server needs one.

```
$ cd distributions/tgi/gpu && docker compose up
```

The script first starts the TGI server, then starts the Llama Stack distribution server, which connects to the remote TGI provider for inference. You should see output like the following:
```
[text-generation-inference] | 2024-10-15T18:56:33.810397Z INFO text_generation_router::server: router/src/server.rs:1813: Using config Some(Llama)
[text-generation-inference] | 2024-10-15T18:56:33.810448Z WARN text_generation_router::server: router/src/server.rs:1960: Invalid hostname, defaulting to 0.0.0.0
[text-generation-inference] | 2024-10-15T18:56:33.864143Z INFO text_generation_router::server: router/src/server.rs:2353: Connected
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
```

To stop the server:
```
docker compose down
```

### Docker: Start the Distribution (Single Node CPU)

> [!NOTE]
> This assumes you have a hosted endpoint compatible with the TGI server.

```
$ cd distributions/tgi/cpu && docker compose up
```

Replace `<ENTER_YOUR_TGI_HOSTED_ENDPOINT>` in the `run.yaml` file with your TGI endpoint.
```
inference:
  - provider_id: tgi0
    provider_type: remote::tgi
    config:
      url: <ENTER_YOUR_TGI_HOSTED_ENDPOINT>
```

### Conda: TGI server + llama stack run

If you wish to separately spin up a TGI server and connect it to Llama Stack, you may use the following commands.

#### Start TGI server locally
- Please check the [TGI Getting Started Guide](https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#get-started) to get a TGI endpoint.

```
docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all ghcr.io/huggingface/text-generation-inference:latest --dtype bfloat16 --usage-stats on --sharded false --model-id meta-llama/Llama-3.1-8B-Instruct --port 5009
```

#### Start Llama Stack server pointing to TGI server

**Via Conda**

```bash
llama stack build --template tgi --image-type conda
# -- start a TGI server endpoint
llama stack run ./gpu/run.yaml
```

**Via Docker**
```
docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-tgi --yaml_config /root/my-run.yaml
```

Make sure that in your `run.yaml` file, your inference provider is pointing to the correct TGI server endpoint. E.g.
```
inference:
  - provider_id: tgi0
    provider_type: remote::tgi
    config:
      url: http://127.0.0.1:5009
```

### (Optional) Update Model Serving Configuration
To serve a new model with `tgi`, change the Docker command flag `--model-id <model-to-serve>`.

This can be done by editing the `command` args in `compose.yaml`, e.g. replace `meta-llama/Llama-3.2-1B-Instruct` with the model you want to serve:

```
command: ["--dtype", "bfloat16", "--usage-stats", "on", "--sharded", "false", "--model-id", "meta-llama/Llama-3.2-1B-Instruct", "--port", "5009", "--cuda-memory-fraction", "0.3"]
```

or by changing the `docker run` command's `--model-id` flag:
```
docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all ghcr.io/huggingface/text-generation-inference:latest --dtype bfloat16 --usage-stats on --sharded false --model-id meta-llama/Llama-3.2-1B-Instruct --port 5009
```

In `run.yaml`, make sure the inference provider points to the TGI endpoint serving your model.
```
inference:
  - provider_id: tgi0
    provider_type: remote::tgi
    config:
      url: http://127.0.0.1:5009
```