forked from phoenix-oss/llama-stack-mirror
distro readmes with model serving instructions (#339)

* readme updates
* quantized compose
* dell tgi
* config update
* readme
* update model serving readmes
* update
* update
* config
parent a70a4706fc
commit ae671eaf7a

8 changed files with 136 additions and 4 deletions
@@ -43,7 +43,7 @@ inference:
     provider_type: remote::fireworks
     config:
       url: https://api.fireworks.ai/inference
-      api_key: <optional api key>
+      api_key: <enter your api key>
 ```

 **Via Conda**
@@ -53,3 +53,27 @@ llama stack build --template fireworks --image-type conda
 # -- modify run.yaml to a valid Fireworks server endpoint
 llama stack run ./run.yaml
 ```
+
+### Model Serving
+
+Use `llama-stack-client models list` to check the available models served by Fireworks.
+```
+$ llama-stack-client models list
++------------------------------+------------------------------+---------------+------------+
+| identifier                   | llama_model                  | provider_id   | metadata   |
++==============================+==============================+===============+============+
+| Llama3.1-8B-Instruct         | Llama3.1-8B-Instruct         | fireworks0    | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.1-70B-Instruct        | Llama3.1-70B-Instruct        | fireworks0    | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.1-405B-Instruct       | Llama3.1-405B-Instruct       | fireworks0    | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.2-1B-Instruct         | Llama3.2-1B-Instruct         | fireworks0    | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.2-3B-Instruct         | Llama3.2-3B-Instruct         | fireworks0    | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.2-11B-Vision-Instruct | Llama3.2-11B-Vision-Instruct | fireworks0    | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.2-90B-Vision-Instruct | Llama3.2-90B-Vision-Instruct | fireworks0    | {}         |
++------------------------------+------------------------------+---------------+------------+
+```
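A quick sanity check of the Conda flow above, chaining the commands from this README; the only thing you supply is your own Fireworks API key in `run.yaml`:

```
# End-to-end sketch of the Fireworks distro via conda, using the commands above.
llama stack build --template fireworks --image-type conda

# Edit run.yaml: keep url as https://api.fireworks.ai/inference and paste your
# Fireworks API key into the api_key field, then start the server.
llama stack run ./run.yaml

# From another terminal, confirm the models listed in the table above appear
# with provider_id "fireworks0".
llama-stack-client models list
```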
@@ -17,6 +17,7 @@ providers:
     provider_type: remote::fireworks
     config:
       url: https://api.fireworks.ai/inference
+      # api_key: <ENTER_YOUR_API_KEY>
   safety:
   - provider_id: meta0
     provider_type: meta-reference
@@ -32,6 +33,10 @@ providers:
   - provider_id: meta0
     provider_type: meta-reference
     config: {}
+  # Uncomment to use weaviate memory provider
+  # - provider_id: weaviate0
+  #   provider_type: remote::weaviate
+  #   config: {}
   agents:
   - provider_id: meta0
     provider_type: meta-reference
@@ -84,3 +84,19 @@ memory:
 ```

 3. Run `docker compose up` with the updated `run.yaml` file.
+
+### Serving a new model
+You may change the `config.model` in `run.yaml` to update the model currently being served by the distribution. Make sure you have the model checkpoint downloaded in your `~/.llama` directory.
+```
+inference:
+  - provider_id: meta0
+    provider_type: meta-reference
+    config:
+      model: Llama3.2-11B-Vision-Instruct
+      quantization: null
+      torch_seed: null
+      max_seq_len: 4096
+      max_batch_size: 1
+```
+
+Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
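As a companion to the `config.model` example above, a hedged sketch of fetching the checkpoint first; the `--source`/`--model-id` flags are an assumption about the `llama model download` CLI and may differ by version:

```
# See which models can be downloaded.
llama model list

# Download the checkpoint referenced by config.model into ~/.llama.
# Flags are assumed; check `llama model download --help` for your version.
llama model download --source meta --model-id Llama3.2-11B-Vision-Instruct

# Restart the distribution so the updated run.yaml takes effect.
docker compose up
```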
@@ -36,6 +36,15 @@ providers:
   - provider_id: meta0
     provider_type: meta-reference
     config: {}
+  # Uncomment to use pgvector
+  # - provider_id: pgvector
+  #   provider_type: remote::pgvector
+  #   config:
+  #     host: 127.0.0.1
+  #     port: 5432
+  #     db: postgres
+  #     user: postgres
+  #     password: mysecretpassword
   agents:
   - provider_id: meta0
     provider_type: meta-reference
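If the pgvector block above is uncommented, a Postgres instance with the pgvector extension has to be reachable at the configured host and port. A sketch using the community `pgvector/pgvector` image; the image name and tag are assumptions, not something this repo ships:

```
# Start a local Postgres with pgvector matching the commented-out config above
# (127.0.0.1:5432, db/user "postgres", password "mysecretpassword").
# Image name/tag assumed; any Postgres build with the pgvector extension works.
docker run -d --name pgvector \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=mysecretpassword \
  -e POSTGRES_DB=postgres \
  -p 5432:5432 \
  pgvector/pgvector:pg16
```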
@@ -89,3 +89,28 @@ inference:
 llama stack build --template ollama --image-type conda
 llama stack run ./gpu/run.yaml
 ```
+
+### Model Serving
+
+To serve a new model with `ollama`:
+```
+ollama run <model_name>
+```
+
+To make sure that the model is being served correctly, run `ollama ps` to get the list of models being served by `ollama`.
+```
+$ ollama ps
+
+NAME                         ID              SIZE     PROCESSOR    UNTIL
+llama3.1:8b-instruct-fp16    4aacac419454    17 GB    100% GPU     4 minutes from now
+```
+
+To verify that the model served by `ollama` is correctly connected to the Llama Stack server:
+```
+$ llama-stack-client models list
++----------------------+----------------------+---------------+-----------------------------------------------+
+| identifier           | llama_model          | provider_id   | metadata                                      |
++======================+======================+===============+===============================================+
+| Llama3.1-8B-Instruct | Llama3.1-8B-Instruct | ollama0       | {'ollama_model': 'llama3.1:8b-instruct-fp16'} |
++----------------------+----------------------+---------------+-----------------------------------------------+
+```
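For example, to load the fp16 Llama 3.1 8B tag shown in the `ollama ps` output above (a sketch; `ollama run` opens an interactive prompt, and ollama keeps the model resident for a few minutes after you exit):

```
# Pull and load the tag that the ollama provider maps Llama3.1-8B-Instruct to.
# Exit the interactive prompt with /bye; the model stays loaded briefly.
ollama run llama3.1:8b-instruct-fp16

# Confirm it is being served, then check it is visible through Llama Stack.
ollama ps
llama-stack-client models list
```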
@@ -92,3 +92,26 @@ llama stack build --template tgi --image-type conda
 # -- start a TGI server endpoint
 llama stack run ./gpu/run.yaml
 ```
+
+### Model Serving
+To serve a new model with `tgi`, change the docker command flag `--model-id <model-to-serve>`.
+
+This can be done by editing the `command` args in `compose.yaml`, e.g. replace "Llama-3.2-1B-Instruct" with the model you want to serve.
+
+```
+command: ["--dtype", "bfloat16", "--usage-stats", "on", "--sharded", "false", "--model-id", "meta-llama/Llama-3.2-1B-Instruct", "--port", "5009", "--cuda-memory-fraction", "0.3"]
+```
+
+or by changing the `docker run` command's `--model-id` flag:
+```
+docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all ghcr.io/huggingface/text-generation-inference:latest --dtype bfloat16 --usage-stats on --sharded false --model-id meta-llama/Llama-3.2-1B-Instruct --port 5009
+```
+
+In `run.yaml`, make sure the inference provider points to the TGI server endpoint serving your model.
+```
+inference:
+  - provider_id: tgi0
+    provider_type: remote::tgi
+    config:
+      url: http://127.0.0.1:5009
+```
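For instance, swapping the served model from the 1B to the 3B instruct checkpoint looks like this (a sketch; the 3B model id is illustrative and needs correspondingly more GPU memory):

```
# Same TGI invocation as above with only --model-id changed; keep run.yaml's
# url pointed at http://127.0.0.1:5009.
docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all \
  ghcr.io/huggingface/text-generation-inference:latest \
  --dtype bfloat16 --usage-stats on --sharded false \
  --model-id meta-llama/Llama-3.2-3B-Instruct --port 5009
```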
@@ -56,3 +56,26 @@ llama stack build --template together --image-type conda
 # -- modify run.yaml to a valid Together server endpoint
 llama stack run ./run.yaml
 ```
+
+### Model Serving
+
+Use `llama-stack-client models list` to check the available models served by Together.
+
+```
+$ llama-stack-client models list
++------------------------------+------------------------------+---------------+------------+
+| identifier                   | llama_model                  | provider_id   | metadata   |
++==============================+==============================+===============+============+
+| Llama3.1-8B-Instruct         | Llama3.1-8B-Instruct         | together0     | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.1-70B-Instruct        | Llama3.1-70B-Instruct        | together0     | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.1-405B-Instruct       | Llama3.1-405B-Instruct       | together0     | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.2-3B-Instruct         | Llama3.2-3B-Instruct         | together0     | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.2-11B-Vision-Instruct | Llama3.2-11B-Vision-Instruct | together0     | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.2-90B-Vision-Instruct | Llama3.2-90B-Vision-Instruct | together0     | {}         |
++------------------------------+------------------------------+---------------+------------+
+```
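The same verification flow as the Fireworks distro applies here; a brief sketch using the commands from this README, with only your Together API key left to fill in:

```
# Build and run the Together-backed stack, then confirm the served models.
llama stack build --template together --image-type conda

# Edit run.yaml: keep url as https://api.together.xyz/v1 and set your API key.
llama stack run ./run.yaml

# Expect the together0 rows shown in the table above.
llama-stack-client models list
```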
@@ -17,11 +17,18 @@ providers:
     provider_type: remote::together
     config:
       url: https://api.together.xyz/v1
+      # api_key: <ENTER_YOUR_API_KEY>
   safety:
-  - provider_id: together0
-    provider_type: remote::together
+  - provider_id: meta0
+    provider_type: meta-reference
     config:
-      url: https://api.together.xyz/v1
+      llama_guard_shield:
+        model: Llama-Guard-3-1B
+        excluded_categories: []
+        disable_input_check: false
+        disable_output_check: false
+      prompt_guard_shield:
+        model: Prompt-Guard-86M
   memory:
   - provider_id: meta0
     provider_type: remote::weaviate
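Since the safety provider now uses `meta-reference` shields, the Llama Guard and Prompt Guard checkpoints referenced above need to be available locally under `~/.llama`; a hedged sketch, with the `llama model download` flags assumed rather than taken from this diff:

```
# Download the shield checkpoints referenced by the safety config.
# Flags assumed; verify with `llama model download --help`.
llama model download --source meta --model-id Llama-Guard-3-1B
llama model download --source meta --model-id Prompt-Guard-86M
```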