move distribution folders

This commit is contained in:
Xi Yan 2024-10-18 17:05:41 -07:00
parent fd90d2ae97
commit b4aca0aeb6
13 changed files with 274 additions and 57 deletions

@ -1,16 +1,21 @@
# Ollama GPU Distribution
# Ollama Distribution
The scripts in these folders help you spin up a Llama Stack distribution with the Ollama inference provider.
The `llamastack/distribution-ollama` distribution consists of the following provider configurations.
| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |---------------- |---------------- |---------------------------------- |---------------- |---------------- |
| **Provider(s)** | remote::ollama | meta-reference | remote::pgvector, remote::chroma | remote::ollama | meta-reference |
### Start a Distribution (Single Node GPU)
> [!NOTE]
> This assumes you have access to a GPU on which to start an Ollama server. Please see Ollama CPU Distribution if you wish to run Ollama on CPU.
### Getting Started
> This assumes you have access to a GPU on which to start an Ollama server.
```
$ cd llama_stack/distribution/docker/ollama
$ cd llama-stack/distribution/ollama/gpu
$ ls
compose.yaml ollama-run.yaml
compose.yaml run.yaml
$ docker compose up
```
@ -22,10 +27,10 @@ INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
[llamastack-local-cpu] | Resolved 12 providers
[llamastack-local-cpu] | inner-inference => ollama0
[llamastack-local-cpu] | models => __routing_table__
[llamastack-local-cpu] | inference => __autorouted__
[llamastack] | Resolved 12 providers
[llamastack] | inner-inference => ollama0
[llamastack] | models => __routing_table__
[llamastack] | inference => __autorouted__
```
To kill the server
@ -33,11 +38,23 @@ To kill the server
docker compose down
```
### (Alternative) Docker Run
### Start the Distribution (Single Node CPU)
> [!NOTE]
> This starts an Ollama server on CPU only. See the [Ollama documentation](https://github.com/ollama/ollama) for details on serving models on CPU.
```
$ cd llama-stack/distribution/ollama/cpu
$ ls
compose.yaml run.yaml
$ docker compose up
```
### (Alternative) ollama run + llama stack Run
If you wish to spin up an Ollama server separately and connect it to Llama Stack, you may use the following commands.
##### Start Ollama server.
#### Start Ollama server.
- Please check the [Ollama documentation](https://github.com/ollama/ollama) for more details.
**Via Docker**
@ -50,9 +67,9 @@ docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
ollama run <model_id>
```
#### Start Llama Stack server pointing to Ollama server
##### Start Llama Stack server pointing to Ollama server
**Via Docker**
```
docker run --network host -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./ollama-run.yaml:/root/llamastack-run-ollama.yaml --gpus=all llamastack-local-cpu --yaml_config /root/llamastack-run-ollama.yaml
```
@ -65,3 +82,10 @@ inference:
    config:
      url: http://127.0.0.1:11434
```
**Via Conda**
```
llama stack build --config ./build.yaml
llama stack run ./gpu/run.yaml
```

@ -0,0 +1,13 @@
name: local-ollama
distribution_spec:
  description: Like local, but use ollama for running LLM inference
  providers:
    inference: remote::ollama
    memory:
    - meta-reference
    - remote::chromadb
    - remote::pgvector
    safety: meta-reference
    agents: meta-reference
    telemetry: meta-reference
image_type: conda

@ -0,0 +1,30 @@
services:
  ollama:
    image: ollama/ollama:latest
    network_mode: "host"
    volumes:
      - ollama:/root/.ollama # this solution synchronizes with the docker volume and loads the model rocket fast
    ports:
      - "11434:11434"
    command: []
  llamastack:
    depends_on:
      - ollama
    image: llamastack/llamastack-local-cpu
    network_mode: "host"
    volumes:
      - ~/.llama:/root/.llama
      # Link to ollama run.yaml file
      - ./run.yaml:/root/my-run.yaml
    ports:
      - "5000:5000"
    # Hack: wait for ollama server to start before starting docker
    entrypoint: bash -c "sleep 60; python -m llama_stack.distribution.server.server --yaml_config /root/my-run.yaml"
    deploy:
      restart_policy:
        condition: on-failure
        delay: 3s
        max_attempts: 5
        window: 60s
volumes:
  ollama:
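
The `sleep 60` in the `llamastack` entrypoint above is a fixed wait. One possible alternative is to poll the Ollama port until it accepts connections. The following is a minimal sketch only, not part of this commit: `wait_for_port` is a hypothetical helper name, and a successful TCP connection shows only that something is listening on the port, not that a model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses.

    Hypothetical helper for illustration; it is not part of llama-stack or Ollama.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Each attempt uses the polling interval as its own connect timeout.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False
```

Under these assumptions, calling `wait_for_port("127.0.0.1", 11434)` before launching the Llama Stack server would return as soon as the port published by the `ollama` service is reachable, instead of always waiting the full minute.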

@ -0,0 +1,46 @@
version: '2'
built_at: '2024-10-08T17:40:45.325529'
image_name: local
docker_image: null
conda_env: local
apis:
- shields
- agents
- models
- memory
- memory_banks
- inference
- safety
providers:
  inference:
  - provider_id: ollama0
    provider_type: remote::ollama
    config:
      # Ollama's default port; matches the 11434:11434 mapping in compose.yaml
      url: http://127.0.0.1:11434
  safety:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      llama_guard_shield:
        model: Llama-Guard-3-1B
        excluded_categories: []
        disable_input_check: false
        disable_output_check: false
      prompt_guard_shield:
        model: Prompt-Guard-86M
  memory:
  - provider_id: meta0
    provider_type: meta-reference
    config: {}
  agents:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      persistence_store:
        namespace: null
        type: sqlite
        db_path: ~/.llama/runtime/kvstore.db
  telemetry:
  - provider_id: meta0
    provider_type: meta-reference
    config: {}

@ -1,11 +1,18 @@
# TGI GPU Distribution
# TGI Distribution
The scripts in these folders help you spin up a Llama Stack distribution with the TGI inference provider.
The `llamastack/distribution-tgi` distribution consists of the following provider configurations.
| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |--------------- |---------------- |-------------------------------------------------- |---------------- |---------------- |
| **Provider(s)** | remote::tgi | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |
### Start the Distribution (Single Node GPU)
> [!NOTE]
> This assumes you have access to a GPU on which to start a TGI server. Please see TGI CPU Distribution if you wish to connect to a hosted TGI endpoint.
> This assumes you have access to a GPU on which to start a TGI server.
### Getting Started
```
$ cd llama_stack/distribution/docker/tgi
@ -30,23 +37,37 @@ To kill the server
docker compose down
```
### (Alternative) Docker Run
### Start the Distribution (Single Node CPU)
> [!NOTE]
> This assumes you have a hosted endpoint.
```
$ cd llama-stack/distribution/tgi/cpu
$ ls
compose.yaml run.yaml
$ docker compose up
```
### (Alternative) TGI server + llama stack run (Single Node GPU)
If you wish to spin up a TGI server separately and connect it to Llama Stack, you may use the following commands.
##### Start TGI server.
#### (optional) Start TGI server locally
- Please check the [TGI Getting Started Guide](https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#get-started) to get a TGI endpoint.
```
docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all ghcr.io/huggingface/text-generation-inference:latest --dtype bfloat16 --usage-stats on --sharded false --model-id meta-llama/Llama-3.1-8B-Instruct --port 5009
```
- Please check the [TGI Getting Started Guide](https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#get-started) for more details.
##### Start Llama Stack server pointing to TGI server
#### Start Llama Stack server pointing to TGI server
```
docker run --network host -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./tgi-run.yaml:/root/llamastack-run-tgi.yaml --gpus=all llamastack-local-cpu --yaml_config /root/llamastack-run-tgi.yaml
docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack-local-cpu --yaml_config /root/my-run.yaml
```
Make sure that in your `tgi-run.yaml` file, your inference provider points to the correct TGI endpoint. E.g.
Make sure that in your `run.yaml` file, your inference provider points to the correct TGI server endpoint. E.g.
```
inference:
  - provider_id: tgi0
@ -54,3 +75,11 @@ inference:
    config:
      url: http://127.0.0.1:5009
```
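
Before starting Llama Stack, it can help to confirm that the configured TGI URL actually responds. The sketch below probes the `/health` path, the same path used by the compose healthcheck in this commit. `tgi_is_healthy` is a hypothetical helper name, and treating HTTP 200 as "ready" is an assumption made for illustration.

```python
import urllib.error
import urllib.request


def tgi_is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200.

    Hypothetical helper for illustration; it is not part of llama-stack or TGI.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection errors, timeouts, and non-2xx HTTP responses.
        return False
```

With the example configuration above, the call would be `tgi_is_healthy("http://127.0.0.1:5009")`.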
**Via Conda**
```bash
llama stack build --config ./build.yaml
# -- start a TGI server endpoint
llama stack run ./gpu/run.yaml
```

@ -0,0 +1,54 @@
services:
  text-generation-inference:
    image: ghcr.io/huggingface/text-generation-inference:latest
    network_mode: "host"
    volumes:
      - $HOME/.cache/huggingface:/data
    ports:
      - "5009:5009"
    devices:
      - nvidia.com/gpu=all
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - HF_HOME=/data
      - HF_DATASETS_CACHE=/data
      - HF_MODULES_CACHE=/data
      - HF_HUB_CACHE=/data
    command: ["--dtype", "bfloat16", "--usage-stats", "on", "--sharded", "false", "--model-id", "meta-llama/Llama-3.1-8B-Instruct", "--port", "5009", "--cuda-memory-fraction", "0.3"]
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            # that's the closest analogue to --gpus; provide
            # an integer amount of devices or 'all'
            count: 1
            # Devices are reserved using a list of capabilities, making
            # capabilities the only required field. A device MUST
            # satisfy all the requested capabilities for a successful
            # reservation.
            capabilities: [gpu]
    runtime: nvidia
    healthcheck:
      test: ["CMD", "curl", "-f", "http://text-generation-inference:5009/health"]
      interval: 5s
      timeout: 5s
      retries: 30
  llamastack:
    depends_on:
      text-generation-inference:
        condition: service_healthy
    image: llamastack/llamastack-local-cpu
    network_mode: "host"
    volumes:
      - ~/.llama:/root/.llama
      # Link to run.yaml file
      - ./run.yaml:/root/my-run.yaml
    ports:
      - "5000:5000"
    entrypoint: bash -c "python -m llama_stack.distribution.server.server --yaml_config /root/my-run.yaml"
    restart_policy:
      condition: on-failure
      delay: 3s
      max_attempts: 5
      window: 60s

@ -0,0 +1,46 @@
version: '2'
built_at: '2024-10-08T17:40:45.325529'
image_name: local
docker_image: null
conda_env: local
apis:
- shields
- agents
- models
- memory
- memory_banks
- inference
- safety
providers:
  inference:
  - provider_id: tgi0
    provider_type: remote::tgi
    config:
      url: <ENTER_YOUR_TGI_HOSTED_ENDPOINT>
  safety:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      llama_guard_shield:
        model: Llama-Guard-3-1B
        excluded_categories: []
        disable_input_check: false
        disable_output_check: false
      prompt_guard_shield:
        model: Prompt-Guard-86M
  memory:
  - provider_id: meta0
    provider_type: meta-reference
    config: {}
  agents:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      persistence_store:
        namespace: null
        type: sqlite
        db_path: ~/.llama/runtime/kvstore.db
  telemetry:
  - provider_id: meta0
    provider_type: meta-reference
    config: {}
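
The `url` field above ships as the placeholder `<ENTER_YOUR_TGI_HOSTED_ENDPOINT>` and must be replaced before the server can start. As a rough sketch, a quick sanity check could reject values that are not `http(s)` URLs with a host. `looks_like_endpoint` is a hypothetical helper name, and this check says nothing about whether the endpoint is reachable.

```python
from urllib.parse import urlparse


def looks_like_endpoint(url: str) -> bool:
    """Return True if url has an http(s) scheme and a host part.

    Hypothetical helper for illustration; an unreplaced placeholder such as
    <ENTER_YOUR_TGI_HOSTED_ENDPOINT> has no scheme and is rejected.
    """
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.hostname)
```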

@ -34,7 +34,7 @@ services:
      interval: 5s
      timeout: 5s
      retries: 30
  llamastack-local-cpu:
  llamastack:
    depends_on:
      text-generation-inference:
        condition: service_healthy
@ -43,11 +43,11 @@ services:
    volumes:
      - ~/.llama:/root/.llama
      # Link to TGI run.yaml file
      - ./tgi-run.yaml:/root/llamastack-run-tgi.yaml
      - ./run.yaml:/root/my-run.yaml
    ports:
      - "5000:5000"
    # Hack: wait for TGI server to start before starting docker
    entrypoint: bash -c "sleep 60; python -m llama_stack.distribution.server.server --yaml_config /root/llamastack-run-tgi.yaml"
    entrypoint: bash -c "sleep 60; python -m llama_stack.distribution.server.server --yaml_config /root/my-run.yaml"
    restart_policy:
      condition: on-failure
      delay: 3s

@ -1,28 +0,0 @@
# Docker Compose Scripts
This folder contains scripts to enable starting a distribution using `docker compose`.
#### Example: TGI Inference Adapter
```
$ cd llama_stack/distribution/docker/tgi
$ ls
compose.yaml tgi-run.yaml
$ docker compose up
```
The script will first start up TGI server, then start up Llama Stack distribution server hooking up to the remote TGI provider for inference. You should be able to see the following outputs --
```
[text-generation-inference] | 2024-10-15T18:56:33.810397Z INFO text_generation_router::server: router/src/server.rs:1813: Using config Some(Llama)
[text-generation-inference] | 2024-10-15T18:56:33.810448Z WARN text_generation_router::server: router/src/server.rs:1960: Invalid hostname, defaulting to 0.0.0.0
[text-generation-inference] | 2024-10-15T18:56:33.864143Z INFO text_generation_router::server: router/src/server.rs:2353: Connected
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
```
To kill the server
```
docker compose down
```

@ -3,7 +3,10 @@ distribution_spec:
  description: Like local, but use ollama for running LLM inference
  providers:
    inference: remote::ollama
    memory: meta-reference
    memory:
    - meta-reference
    - remote::chromadb
    - remote::pgvector
    safety: meta-reference
    agents: meta-reference
    telemetry: meta-reference