update distributions compose/readme (#338)

* readme updates

* quantized compose

* dell tgi

* config update
14 changed files with 219 additions and 31 deletions

View file

@@ -0,0 +1,68 @@
# Dell-TGI Distribution
The `llamastack/distribution-tgi` distribution consists of the following provider configurations.
| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |--------------- |---------------- |-------------------------------------------------- |---------------- |---------------- |
| **Provider(s)** | remote::tgi | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |
The only difference from the `tgi` distribution is that this one runs Dell's TGI server for inference.
### Start the Distribution (Single Node GPU)
> [!NOTE]
> This assumes you have a machine with GPUs available, so the TGI server can be started with access to your GPU(s).
```
$ cd distributions/dell-tgi/
$ ls
compose.yaml README.md run.yaml
$ docker compose up
```
The script first starts the TGI server, then starts the Llama Stack distribution server, which connects to the remote TGI provider for inference. You should see output like the following:
```
[text-generation-inference] | 2024-10-15T18:56:33.810397Z INFO text_generation_router::server: router/src/server.rs:1813: Using config Some(Llama)
[text-generation-inference] | 2024-10-15T18:56:33.810448Z WARN text_generation_router::server: router/src/server.rs:1960: Invalid hostname, defaulting to 0.0.0.0
[text-generation-inference] | 2024-10-15T18:56:33.864143Z INFO text_generation_router::server: router/src/server.rs:2353: Connected
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
```
To stop the server:
```
docker compose down
```
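Once the stack is up, you can smoke-test it from another terminal. A minimal check, assuming the default port 5000 and that this build exposes a `GET /models/list` route:
```
curl http://localhost:5000/models/list
```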
### (Alternative) Dell-TGI server + llama stack run (Single Node GPU)
#### Start Dell-TGI server locally
```
docker run -it --shm-size 1g -p 80:80 --gpus 4 \
  -e NUM_SHARD=4 \
  -e MAX_BATCH_PREFILL_TOKENS=32768 \
  -e MAX_INPUT_TOKENS=8000 \
  -e MAX_TOTAL_TOKENS=8192 \
  registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct
```
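Before wiring in Llama Stack, you can verify the TGI server is serving. This sketch assumes the Dell image keeps TGI's standard `/generate` endpoint, using the port 80 mapping from the command above:
```
curl http://localhost:80/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Hello,", "parameters": {"max_new_tokens": 16}}'
```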
#### Start Llama Stack server pointing to TGI server
```
docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-tgi --yaml_config /root/my-run.yaml
```
Make sure the inference provider in your `run.yaml` file points to the correct TGI server endpoint, e.g.
```
inference:
  - provider_id: tgi0
    provider_type: remote::tgi
    config:
      url: http://127.0.0.1:5009
```
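With both servers up, inference requests go through the Llama Stack server, which forwards them to TGI. A sketch of such a request, assuming this build's `/inference/chat_completion` route and `Llama3.1-8B-Instruct` as the served model:
```
curl http://localhost:5000/inference/chat_completion \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Llama3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
```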

View file

@@ -0,0 +1,50 @@
services:
  text-generation-inference:
    image: registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct
    network_mode: "host"
    volumes:
      - $HOME/.cache/huggingface:/data
    ports:
      - "5009:5009"
    devices:
      - nvidia.com/gpu=all
    environment:
      - CUDA_VISIBLE_DEVICES=0,1,2,3,4
      - NUM_SHARD=4
      - MAX_BATCH_PREFILL_TOKENS=32768
      - MAX_INPUT_TOKENS=8000
      - MAX_TOTAL_TOKENS=8192
    command: []
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              # that's the closest analogue to --gpus; provide
              # an integer amount of devices or 'all'
              count: all
              # Devices are reserved using a list of capabilities, making
              # capabilities the only required field. A device MUST
              # satisfy all the requested capabilities for a successful
              # reservation.
              capabilities: [gpu]
    runtime: nvidia
  llamastack:
    depends_on:
      text-generation-inference:
        condition: service_healthy
    image: llamastack/distribution-tgi
    network_mode: "host"
    volumes:
      - ~/.llama:/root/.llama
      # Link to TGI run.yaml file
      - ./run.yaml:/root/my-run.yaml
    ports:
      - "5000:5000"
    # Hack: wait for the TGI server to start before starting the Llama Stack server
    entrypoint: bash -c "sleep 60; python -m llama_stack.distribution.server.server --yaml_config /root/my-run.yaml"
    deploy:
      restart_policy:
        condition: on-failure
        delay: 3s
        max_attempts: 5
        window: 60s

View file

@@ -0,0 +1,46 @@
version: '2'
built_at: '2024-10-08T17:40:45.325529'
image_name: local
docker_image: null
conda_env: local
apis:
- shields
- agents
- models
- memory
- memory_banks
- inference
- safety
providers:
  inference:
    - provider_id: tgi0
      provider_type: remote::tgi
      config:
        url: http://127.0.0.1:80
  safety:
    - provider_id: meta0
      provider_type: meta-reference
      config:
        llama_guard_shield:
          model: Llama-Guard-3-1B
          excluded_categories: []
          disable_input_check: false
          disable_output_check: false
        prompt_guard_shield:
          model: Prompt-Guard-86M
  memory:
    - provider_id: meta0
      provider_type: meta-reference
      config: {}
  agents:
    - provider_id: meta0
      provider_type: meta-reference
      config:
        persistence_store:
          namespace: null
          type: sqlite
          db_path: ~/.llama/runtime/kvstore.db
  telemetry:
    - provider_id: meta0
      provider_type: meta-reference
      config: {}

View file

@@ -33,14 +33,9 @@ providers:
       prompt_guard_shield:
         model: Prompt-Guard-86M
   memory:
-    - provider_id: pgvector
-      provider_type: remote::pgvector
-      config:
-        host: 127.0.0.1
-        port: 5432
-        db: postgres
-        user: postgres
-        password: mysecretpassword
+    - provider_id: meta0
+      provider_type: meta-reference
+      config: {}
   agents:
     - provider_id: meta0
       provider_type: meta-reference

View file

@@ -0,0 +1,35 @@
services:
  llamastack:
    image: llamastack/distribution-meta-reference-quantized-gpu
    network_mode: "host"
    volumes:
      - ~/.llama:/root/.llama
      - ./run.yaml:/root/my-run.yaml
    ports:
      - "5000:5000"
    devices:
      - nvidia.com/gpu=all
    environment:
      - CUDA_VISIBLE_DEVICES=0
    command: []
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              # that's the closest analogue to --gpus; provide
              # an integer amount of devices or 'all'
              count: 1
              # Devices are reserved using a list of capabilities, making
              # capabilities the only required field. A device MUST
              # satisfy all the requested capabilities for a successful
              # reservation.
              capabilities: [gpu]
      restart_policy:
        condition: on-failure
        delay: 3s
        max_attempts: 5
        window: 60s
    runtime: nvidia
    entrypoint: bash -c "python -m llama_stack.distribution.server.server --yaml_config /root/my-run.yaml"

View file

@@ -50,7 +50,7 @@ compose.yaml run.yaml
 $ docker compose up
 ```
-### (Alternative) ollama run + llama stack Run
+### (Alternative) ollama run + llama stack run
 If you wish to separately spin up an Ollama server and connect it with Llama Stack, you may use the following commands.
@@ -71,7 +71,7 @@ ollama run <model_id>
 **Via Docker**
 ```
-docker run --network host -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./gpu/run.yaml:/root/llamastack-run-ollama.yaml --gpus=all distribution-ollama --yaml_config /root/llamastack-run-ollama.yaml
+docker run --network host -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./gpu/run.yaml:/root/llamastack-run-ollama.yaml --gpus=all llamastack/distribution-ollama --yaml_config /root/llamastack-run-ollama.yaml
 ```
 Make sure that in your `run.yaml` file, your inference provider points to the correct Ollama endpoint, e.g.
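For illustration, such an entry might look like the sketch below, which assumes the `remote::ollama` provider type and Ollama's default port 11434:
```
inference:
  - provider_id: ollama0
    provider_type: remote::ollama
    config:
      url: http://127.0.0.1:11434
```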

View file

@@ -10,7 +10,7 @@ services:
   llamastack:
     depends_on:
       - ollama
-    image: llamastack/llamastack-local-cpu
+    image: llamastack/distribution-ollama
     network_mode: "host"
     volumes:
       - ~/.llama:/root/.llama

View file

@@ -25,10 +25,10 @@ services:
               # reservation.
               capabilities: [gpu]
     runtime: nvidia
-  llamastack-local-cpu:
+  llamastack:
     depends_on:
       - ollama
-    image: llamastack/llamastack-local-cpu
+    image: llamastack/distribution-ollama
     network_mode: "host"
     volumes:
       - ~/.llama:/root/.llama

View file

@@ -73,7 +73,7 @@ docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all gh
 #### Start Llama Stack server pointing to TGI server
 ```
-docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack-local-cpu --yaml_config /root/my-run.yaml
+docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-tgi --yaml_config /root/my-run.yaml
 ```
 Make sure that in your `run.yaml` file, your inference provider points to the correct TGI server endpoint, e.g.
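For illustration, the entry follows the same shape as in the other TGI distributions, assuming the TGI server from the previous step listens on port 5009:
```
inference:
  - provider_id: tgi0
    provider_type: remote::tgi
    config:
      url: http://127.0.0.1:5009
```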

View file

@@ -38,7 +38,7 @@ services:
     depends_on:
       text-generation-inference:
         condition: service_healthy
-    image: llamastack/llamastack-local-cpu
+    image: llamastack/distribution-tgi
     network_mode: "host"
     volumes:
       - ~/.llama:/root/.llama

View file

@@ -8,7 +8,7 @@ The `llamastack/distribution-together` distribution consists of the following pr
 | **API** |  **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
 |----------------- |--------------- |---------------- |-------------------------------------------------- |---------------- |---------------- |
-| **Provider(s)** | remote::together | meta-reference | remote::weaviate | meta-reference | meta-reference |
+| **Provider(s)** | remote::together | meta-reference | meta-reference, remote::weaviate | meta-reference | meta-reference |
 ### Start the Distribution (Single Node CPU)
@@ -49,16 +49,6 @@ inference:
     api_key: <optional api key>
 ```
-Together distribution comes with weaviate as Memory provider. We also need to configure the remote weaviate API key and URL in `run.yaml` to get memory API.
-```
-memory:
-  - provider_id: weaviate0
-    provider_type: remote::weaviate
-    config:
-      weaviate_api_key: <ENTER_WEAVIATE_API_KEY>
-      weaviate_cluster_url: <ENTER_WEAVIATE_CLUSTER_URL>
-```
 **Via Conda**
 ```bash

View file

@@ -25,9 +25,7 @@ providers:
   memory:
     - provider_id: meta0
       provider_type: remote::weaviate
-      config:
-        weaviate_api_key: <ENTER_WEAVIATE_API_KEY>
-        weaviate_cluster_url: <ENTER_WEAVIATE_CLUSTER_URL>
+      config: {}
   agents:
     - provider_id: meta0
       provider_type: meta-reference

View file

@@ -3,7 +3,11 @@ distribution_spec:
   description: Use Fireworks.ai for running LLM inference
   providers:
     inference: remote::fireworks
-    memory: meta-reference
+    memory:
+      - meta-reference
+      - remote::weaviate
+      - remote::chromadb
+      - remote::pgvector
     safety: meta-reference
     agents: meta-reference
     telemetry: meta-reference

View file

@@ -3,7 +3,9 @@ distribution_spec:
   description: Use Together.ai for running LLM inference
   providers:
     inference: remote::together
-    memory: remote::weaviate
+    memory:
+      - meta-reference
+      - remote::weaviate
     safety: remote::together
     agents: meta-reference
     telemetry: meta-reference