mirror of https://github.com/meta-llama/llama-stack.git
[docs] update documentations (#356)
* move docs -> source * Add files via upload * mv image * Add files via upload * colocate iOS setup doc * delete image * Add files via upload * fix * delete image * Add files via upload * Update developer_cookbook.md * toctree * wip subfolder * docs update * subfolder * updates * name * updates * index * updates * refactor structure * depth * docs * content * docs * getting started * distributions * fireworks * fireworks * update * theme * theme * theme * pdj theme * pytorch theme * css * theme * agents example * format * index * headers * copy button * test tabs * test tabs * fix * tabs * tab * tabs * sphinx_design * quick start commands * size * width * css * css * download models * asthetic fix * tab format * update * css * width * css * docs * tab based * tab * tabs * docs * style * image * css * color * typo * update docs * missing links * list templates * links * links update * troubleshooting * fix * distributions * docs * fix table * kill llamastack-local-gpu/cpu * Update index.md * Update index.md * mv ios_setup.md * Update ios_setup.md * Add remote_or_local.gif * Update ios_setup.md * release notes * typos * Add ios_setup to index * nav bar * hide torctree * ios image * links update * rename * rename * docs * rename * links * distributions * distributions * distributions * distributions * remove release * remote --------- Co-authored-by: dltn <6599399+dltn@users.noreply.github.com> Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
parent ac93dd89cf, commit c810a4184d
37 changed files with 1777 additions and 2154 deletions
@@ -1,14 +0,0 @@
# Llama Stack Distribution

A Distribution is where APIs and Providers are assembled together to provide a consistent whole to the end application developer. You can mix-and-match providers -- some could be backed by local code and some could be remote. As a hobbyist, you can serve a small model locally, but choose a cloud provider for a large model. Regardless, the higher-level APIs your app needs to work with don't need to change at all. You can even imagine moving across the server / mobile-device boundary while always using the same uniform set of APIs for developing Generative AI applications.
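
To make the mix-and-match idea concrete, here is a minimal illustrative sketch of a `run.yaml` fragment that pairs a local inference provider with a remote memory provider. The provider IDs and fields are assumptions drawn from the examples later in this guide, not a canonical config:

```
# Illustrative sketch only -- pairs a local inference provider with a remote
# memory (vector store) provider; adapt provider IDs and fields to your distribution.
inference:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      model: Llama3.2-3B-Instruct
      max_seq_len: 4096
      max_batch_size: 1
memory:
  - provider_id: pgvector
    provider_type: remote::pgvector
    config:
      host: 127.0.0.1
      port: 5432
      db: postgres
      user: postgres
      password: mysecretpassword
```
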
## Quick Start Llama Stack Distributions Guide

| **Distribution** | **Llama Stack Docker** | Start This Distribution | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|:----------------:|:----------------------:|:-----------------------:|:-------------:|:----------:|:----------:|:----------:|:-------------:|
| Meta Reference | [llamastack/distribution-meta-reference-gpu](https://hub.docker.com/repository/docker/llamastack/distribution-meta-reference-gpu/general) | [Guide](./meta-reference-gpu/) | meta-reference | meta-reference | meta-reference; remote::pgvector; remote::chromadb | meta-reference | meta-reference |
| Meta Reference Quantized | [llamastack/distribution-meta-reference-quantized-gpu](https://hub.docker.com/repository/docker/llamastack/distribution-meta-reference-quantized-gpu/general) | [Guide](./meta-reference-quantized-gpu/) | meta-reference-quantized | meta-reference | meta-reference; remote::pgvector; remote::chromadb | meta-reference | meta-reference |
| Ollama | [llamastack/distribution-ollama](https://hub.docker.com/repository/docker/llamastack/distribution-ollama/general) | [Guide](./ollama/) | remote::ollama | meta-reference | remote::pgvector; remote::chromadb | remote::ollama | meta-reference |
| TGI | [llamastack/distribution-tgi](https://hub.docker.com/repository/docker/llamastack/distribution-tgi/general) | [Guide](./tgi/) | remote::tgi | meta-reference | meta-reference; remote::pgvector; remote::chromadb | meta-reference | meta-reference |
| Together | [llamastack/distribution-together](https://hub.docker.com/repository/docker/llamastack/distribution-together/general) | [Guide](./together/) | remote::together | meta-reference | remote::weaviate | meta-reference | meta-reference |
| Fireworks | [llamastack/distribution-fireworks](https://hub.docker.com/repository/docker/llamastack/distribution-fireworks/general) | [Guide](./fireworks/) | remote::fireworks | meta-reference | remote::weaviate | meta-reference | meta-reference |

@@ -1,68 +0,0 @@
# Dell-TGI Distribution

The `llamastack/distribution-tgi` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|-----------------|---------------|----------------|--------------------------------------------------|----------------|----------------|
| **Provider(s)** | remote::tgi | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |

The only difference vs. the `tgi` distribution is that it runs the Dell-TGI server for inference.

### Start the Distribution (Single Node GPU)

> [!NOTE]
> This assumes you have access to a GPU to start a TGI server with access to your GPU.

```
$ cd distributions/dell-tgi/
$ ls
compose.yaml  README.md  run.yaml
$ docker compose up
```

The script will first start the TGI server, then start the Llama Stack distribution server, which connects to the remote TGI provider for inference. You should see output like the following --
```
[text-generation-inference] | 2024-10-15T18:56:33.810397Z INFO text_generation_router::server: router/src/server.rs:1813: Using config Some(Llama)
[text-generation-inference] | 2024-10-15T18:56:33.810448Z WARN text_generation_router::server: router/src/server.rs:1960: Invalid hostname, defaulting to 0.0.0.0
[text-generation-inference] | 2024-10-15T18:56:33.864143Z INFO text_generation_router::server: router/src/server.rs:2353: Connected
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
```

To stop the server:
```
docker compose down
```

### (Alternative) Dell-TGI server + llama stack run (Single Node GPU)

#### Start Dell-TGI server locally
```
docker run -it --shm-size 1g -p 80:80 --gpus 4 \
  -e NUM_SHARD=4 \
  -e MAX_BATCH_PREFILL_TOKENS=32768 \
  -e MAX_INPUT_TOKENS=8000 \
  -e MAX_TOTAL_TOKENS=8192 \
  registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct
```

#### Start Llama Stack server pointing to TGI server

```
docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-tgi --yaml_config /root/my-run.yaml
```

Make sure the inference provider in your `run.yaml` file points to the correct TGI server endpoint, e.g.
```
inference:
  - provider_id: tgi0
    provider_type: remote::tgi
    config:
      url: http://127.0.0.1:5009
```

@@ -1,79 +0,0 @@
# Fireworks Distribution

The `llamastack/distribution-fireworks` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|-----------------|-------------------|----------------|----------------|----------------|----------------|
| **Provider(s)** | remote::fireworks | meta-reference | meta-reference | meta-reference | meta-reference |

### Start the Distribution (Single Node CPU)

> [!NOTE]
> This assumes you have a hosted endpoint at Fireworks with an API key.

```
$ cd distributions/fireworks
$ ls
compose.yaml  run.yaml
$ docker compose up
```

Make sure the inference provider in your `run.yaml` file points to the correct Fireworks server endpoint, e.g.
```
inference:
  - provider_id: fireworks
    provider_type: remote::fireworks
    config:
      url: https://api.fireworks.ai/inference
      api_key: <optional api key>
```

### (Alternative) llama stack run (Single Node CPU)

```
docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-fireworks --yaml_config /root/my-run.yaml
```

Make sure the inference provider in your `run.yaml` file points to the correct Fireworks server endpoint, e.g.
```
inference:
  - provider_id: fireworks
    provider_type: remote::fireworks
    config:
      url: https://api.fireworks.ai/inference
      api_key: <enter your api key>
```

**Via Conda**

```bash
llama stack build --template fireworks --image-type conda
# -- modify run.yaml to a valid Fireworks server endpoint
llama stack run ./run.yaml
```

### Model Serving

Use `llama-stack-client models list` to check the available models served by Fireworks.
```
$ llama-stack-client models list
+------------------------------+------------------------------+---------------+------------+
| identifier                   | llama_model                  | provider_id   | metadata   |
+==============================+==============================+===============+============+
| Llama3.1-8B-Instruct         | Llama3.1-8B-Instruct         | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-70B-Instruct        | Llama3.1-70B-Instruct        | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-405B-Instruct       | Llama3.1-405B-Instruct       | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-1B-Instruct         | Llama3.2-1B-Instruct         | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-3B-Instruct         | Llama3.2-3B-Instruct         | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-11B-Vision-Instruct | Llama3.2-11B-Vision-Instruct | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-90B-Vision-Instruct | Llama3.2-90B-Vision-Instruct | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
```

@@ -1,102 +0,0 @@
# Meta Reference Distribution

The `llamastack/distribution-meta-reference-gpu` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|-----------------|----------------|----------------|--------------------------------------------------|----------------|----------------|
| **Provider(s)** | meta-reference | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |

### Start the Distribution (Single Node GPU)

```
$ cd distributions/meta-reference-gpu
$ ls
build.yaml  compose.yaml  README.md  run.yaml
$ docker compose up
```

> [!NOTE]
> This assumes you have access to a GPU to start a local server with access to your GPU.

> [!NOTE]
> `~/.llama` should be the path containing downloaded weights of Llama models.

This will download and start running a pre-built docker container. Alternatively, you may use the following command:

```
docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml
```

### Alternative (Build and start distribution locally via conda)

- You may check out the [Getting Started](../../docs/getting_started.md) guide for more details on building locally via conda and starting up a meta-reference distribution; a rough sketch of that flow follows below.
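
As a rough sketch of that conda flow, mirroring the conda commands shown for the other distributions in this guide (the template name below is an assumption, so double-check it against the available templates):

```bash
# Hypothetical sketch of the conda flow for this distribution.
# The template name is an assumption -- verify it before running.
llama stack build --template meta-reference-gpu --image-type conda
# point run.yaml at a model whose checkpoint exists under ~/.llama
llama stack run ./run.yaml
```
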
### Start Distribution With pgvector/chromadb Memory Provider

##### pgvector
1. Start the pgvector server:

```
docker run --network host --name mypostgres -it -p 5432:5432 -e POSTGRES_PASSWORD=mysecretpassword -e POSTGRES_USER=postgres -e POSTGRES_DB=postgres pgvector/pgvector:pg16
```

2. Edit the `run.yaml` file to point to the pgvector server.
```
memory:
  - provider_id: pgvector
    provider_type: remote::pgvector
    config:
      host: 127.0.0.1
      port: 5432
      db: postgres
      user: postgres
      password: mysecretpassword
```

> [!NOTE]
> If you get a `RuntimeError: Vector extension is not installed.`, you will need to run `CREATE EXTENSION IF NOT EXISTS vector;` to enable the vector extension. E.g.

```
docker exec -it mypostgres ./bin/psql -U postgres
postgres=# CREATE EXTENSION IF NOT EXISTS vector;
postgres=# SELECT extname from pg_extension;
 extname
```

3. Run `docker compose up` with the updated `run.yaml` file.

##### chromadb
1. Start the chromadb server:
```
docker run -it --network host --name chromadb -p 6000:6000 -v ./chroma_vdb:/chroma/chroma -e IS_PERSISTENT=TRUE chromadb/chroma:latest
```

2. Edit the `run.yaml` file to point to the chromadb server.
```
memory:
  - provider_id: remote::chromadb
    provider_type: remote::chromadb
    config:
      host: localhost
      port: 6000
```

3. Run `docker compose up` with the updated `run.yaml` file.

### Serving a new model
You may change the `config.model` in `run.yaml` to update the model currently being served by the distribution. Make sure you have the model checkpoint downloaded in your `~/.llama`.
```
inference:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      model: Llama3.2-11B-Vision-Instruct
      quantization: null
      torch_seed: null
      max_seq_len: 4096
      max_batch_size: 1
```

Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
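
For example, a hedged sketch of fetching the checkpoint referenced above (the `--source` and `--model-id` flags are assumptions for illustration; check `llama model download --help` for the exact options):

```bash
# Hypothetical sketch: list downloadable models, then fetch one checkpoint.
# Flags and the model ID are illustrative assumptions.
llama model list
llama model download --source meta --model-id Llama3.2-11B-Vision-Instruct
```
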
@@ -1,34 +0,0 @@
# Meta Reference Quantized Distribution

The `llamastack/distribution-meta-reference-quantized-gpu` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|-----------------|--------------------------|----------------|--------------------------------------------------|----------------|----------------|
| **Provider(s)** | meta-reference-quantized | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |

The only difference vs. the `meta-reference-gpu` distribution is that it supports more efficient inference -- with fp8, int4 quantization, etc.
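
As an illustrative sketch only (the exact shape of the `quantization` block is an assumption and is not taken from this distribution's shipped `run.yaml`), the inference provider entry might look like:

```
# Hypothetical sketch: selecting the quantized provider in run.yaml.
# The quantization block's fields are assumptions for illustration.
inference:
  - provider_id: meta0
    provider_type: meta-reference-quantized
    config:
      model: Llama3.1-8B-Instruct
      quantization:
        type: fp8
      max_seq_len: 4096
      max_batch_size: 1
```
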
### Start the Distribution (Single Node GPU)

> [!NOTE]
> This assumes you have access to a GPU to start a local server with access to your GPU.

> [!NOTE]
> `~/.llama` should be the path containing downloaded weights of Llama models.

To download and start running a pre-built docker container, you may use the following command:

```
docker run -it -p 5000:5000 -v ~/.llama:/root/.llama \
  -v ./run.yaml:/root/my-run.yaml \
  --gpus=all \
  distribution-meta-reference-quantized-gpu \
  --yaml_config /root/my-run.yaml
```

### Alternative (Build and start distribution locally via conda)

- You may check out the [Getting Started](../../docs/getting_started.md) guide for more details on building locally via conda and starting up the distribution.

@@ -1,116 +0,0 @@
# Ollama Distribution

The `llamastack/distribution-ollama` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|-----------------|----------------|----------------|----------------------------------|----------------|----------------|
| **Provider(s)** | remote::ollama | meta-reference | remote::pgvector, remote::chroma | remote::ollama | meta-reference |

### Start a Distribution (Single Node GPU)

> [!NOTE]
> This assumes you have access to a GPU to start an Ollama server with access to your GPU.

```
$ cd distributions/ollama/gpu
$ ls
compose.yaml  run.yaml
$ docker compose up
```

You will see output similar to the following --
```
[ollama] | [GIN] 2024/10/18 - 21:19:41 | 200 | 226.841µs | ::1 | GET "/api/ps"
[ollama] | [GIN] 2024/10/18 - 21:19:42 | 200 | 60.908µs | ::1 | GET "/api/ps"
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
[llamastack] | Resolved 12 providers
[llamastack] | inner-inference => ollama0
[llamastack] | models => __routing_table__
[llamastack] | inference => __autorouted__
```

To stop the server:
```
docker compose down
```

### Start the Distribution (Single Node CPU)

> [!NOTE]
> This will start an Ollama server with CPU only. Please see the [Ollama documentation](https://github.com/ollama/ollama) for serving models on CPU only.

```
$ cd distributions/ollama/cpu
$ ls
compose.yaml  run.yaml
$ docker compose up
```

### (Alternative) ollama run + llama stack run

If you wish to separately spin up an Ollama server and connect it with Llama Stack, you may use the following commands.

#### Start Ollama server
- Please check the [Ollama documentation](https://github.com/ollama/ollama) for more details.

**Via Docker**
```
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

**Via CLI**
```
ollama run <model_id>
```

#### Start Llama Stack server pointing to Ollama server

**Via Docker**
```
docker run --network host -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./gpu/run.yaml:/root/llamastack-run-ollama.yaml --gpus=all llamastack/distribution-ollama --yaml_config /root/llamastack-run-ollama.yaml
```

Make sure the inference provider in your `run.yaml` file points to the correct Ollama endpoint, e.g.
```
inference:
  - provider_id: ollama0
    provider_type: remote::ollama
    config:
      url: http://127.0.0.1:11434
```

**Via Conda**

```
llama stack build --template ollama --image-type conda
llama stack run ./gpu/run.yaml
```

### Model Serving

To serve a new model with `ollama`:
```
ollama run <model_name>
```

To make sure that the model is being served correctly, run `ollama ps` to get the list of models being served by Ollama.
```
$ ollama ps

NAME                         ID              SIZE     PROCESSOR    UNTIL
llama3.1:8b-instruct-fp16    4aacac419454    17 GB    100% GPU     4 minutes from now
```

To verify that the model served by Ollama is correctly connected to the Llama Stack server:
```
$ llama-stack-client models list
+----------------------+----------------------+---------------+-----------------------------------------------+
| identifier           | llama_model          | provider_id   | metadata                                      |
+======================+======================+===============+===============================================+
| Llama3.1-8B-Instruct | Llama3.1-8B-Instruct | ollama0       | {'ollama_model': 'llama3.1:8b-instruct-fp16'} |
+----------------------+----------------------+---------------+-----------------------------------------------+
```

@@ -1,117 +0,0 @@
# TGI Distribution

The `llamastack/distribution-tgi` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|-----------------|---------------|----------------|--------------------------------------------------|----------------|----------------|
| **Provider(s)** | remote::tgi | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |

### Start the Distribution (Single Node GPU)

> [!NOTE]
> This assumes you have access to a GPU to start a TGI server with access to your GPU.

```
$ cd distributions/tgi/gpu
$ ls
compose.yaml  tgi-run.yaml
$ docker compose up
```

The script will first start the TGI server, then start the Llama Stack distribution server, which connects to the remote TGI provider for inference. You should see output like the following --
```
[text-generation-inference] | 2024-10-15T18:56:33.810397Z INFO text_generation_router::server: router/src/server.rs:1813: Using config Some(Llama)
[text-generation-inference] | 2024-10-15T18:56:33.810448Z WARN text_generation_router::server: router/src/server.rs:1960: Invalid hostname, defaulting to 0.0.0.0
[text-generation-inference] | 2024-10-15T18:56:33.864143Z INFO text_generation_router::server: router/src/server.rs:2353: Connected
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
```

To stop the server:
```
docker compose down
```

### Start the Distribution (Single Node CPU)

> [!NOTE]
> This assumes you have a hosted endpoint compatible with the TGI server.

```
$ cd distributions/tgi/cpu
$ ls
compose.yaml  run.yaml
$ docker compose up
```

Replace `<ENTER_YOUR_TGI_HOSTED_ENDPOINT>` in the `run.yaml` file with your TGI endpoint.
```
inference:
  - provider_id: tgi0
    provider_type: remote::tgi
    config:
      url: <ENTER_YOUR_TGI_HOSTED_ENDPOINT>
```

### (Alternative) TGI server + llama stack run (Single Node GPU)

If you wish to separately spin up a TGI server and connect it with Llama Stack, you may use the following commands.

#### (optional) Start TGI server locally
- Please check the [TGI Getting Started Guide](https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#get-started) to get a TGI endpoint.

```
docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all ghcr.io/huggingface/text-generation-inference:latest --dtype bfloat16 --usage-stats on --sharded false --model-id meta-llama/Llama-3.1-8B-Instruct --port 5009
```

#### Start Llama Stack server pointing to TGI server

```
docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-tgi --yaml_config /root/my-run.yaml
```

Make sure the inference provider in your `run.yaml` file points to the correct TGI server endpoint, e.g.
```
inference:
  - provider_id: tgi0
    provider_type: remote::tgi
    config:
      url: http://127.0.0.1:5009
```

**Via Conda**

```bash
llama stack build --template tgi --image-type conda
# -- start a TGI server endpoint
llama stack run ./gpu/run.yaml
```

### Model Serving
To serve a new model with `tgi`, change the docker command flag `--model-id <model-to-serve>`.

This can be done by editing the `command` args in `compose.yaml`, e.g. replace "Llama-3.2-1B-Instruct" with the model you want to serve.

```
command: ["--dtype", "bfloat16", "--usage-stats", "on", "--sharded", "false", "--model-id", "meta-llama/Llama-3.2-1B-Instruct", "--port", "5009", "--cuda-memory-fraction", "0.3"]
```

or by changing the docker run command's `--model-id` flag:
```
docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all ghcr.io/huggingface/text-generation-inference:latest --dtype bfloat16 --usage-stats on --sharded false --model-id meta-llama/Llama-3.2-1B-Instruct --port 5009
```

In `run.yaml`, make sure the inference provider points to the TGI server endpoint serving your model.
```
inference:
  - provider_id: tgi0
    provider_type: remote::tgi
    config:
      url: http://127.0.0.1:5009
```

@@ -17,7 +17,7 @@ services:
     depends_on:
       text-generation-inference:
         condition: service_healthy
-    image: llamastack/llamastack-local-cpu
+    image: llamastack/llamastack-tgi
     network_mode: "host"
     volumes:
       - ~/.llama:/root/.llama

@@ -11,7 +11,7 @@ The `llamastack/distribution-together` distribution consists of the following pr
 | **Provider(s)** | remote::together | meta-reference | meta-reference, remote::weaviate | meta-reference | meta-reference |
 
 
-### Start the Distribution (Single Node CPU)
+### Docker: Start the Distribution (Single Node CPU)
 
 > [!NOTE]
 > This assumes you have an hosted endpoint at Together with API Key.
@@ -33,23 +33,7 @@ inference:
       api_key: <optional api key>
 ```
 
-### (Alternative) llama stack run (Single Node CPU)
-
-```
-docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-together --yaml_config /root/my-run.yaml
-```
-
-Make sure in you `run.yaml` file, you inference provider is pointing to the correct Together URL server endpoint. E.g.
-```
-inference:
-  - provider_id: together
-    provider_type: remote::together
-    config:
-      url: https://api.together.xyz/v1
-      api_key: <optional api key>
-```
-
-**Via Conda**
+### Conda llama stack run (Single Node CPU)
 
 ```bash
 llama stack build --template together --image-type conda
@@ -57,7 +41,7 @@ llama stack build --template together --image-type conda
 llama stack run ./run.yaml
 ```
 
-### Model Serving
+### (Optional) Update Model Serving Configuration
 
 Use `llama-stack-client models list` to check the available models served by together.
 