forked from phoenix-oss/llama-stack-mirror

docs

This commit is contained in: parent e605d57fb7, commit f78200b189

2 changed files with 9 additions and 397 deletions

@ -23,5 +23,6 @@ tgi
dell-tgi
together
fireworks
remote-vllm
bedrock
```

@ -53,9 +53,9 @@ Please see our pages in detail for the types of distributions we offer:

3. [On-device Distribution](./distributions/ondevice_distro/index.md): If you want to run Llama Stack inference on your iOS / Android device.

### Quick Start Commands

Once you have decided on the inference provider and distribution to use, use the following quick start commands to get started.

##### 1.0 Prerequisite

@ -109,421 +109,32 @@ Access to Single-Node CPU with Fireworks hosted endpoint via API_KEY from [firew

##### 1.1. Start the distribution

**(Option 1) Via Docker**

::::{tab-set}

:::{tab-item} meta-reference-gpu
[Start Meta Reference GPU Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/meta-reference-gpu.html)
```
$ cd llama-stack/distributions/meta-reference-gpu && docker compose up
```

This will download and start a pre-built Docker container. Alternatively, you may use the following command:

```
docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml
```
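
To confirm the container came up, you can inspect it with standard Docker commands; a minimal sketch (the container name below is whatever `docker ps` reports for the distribution image, so treat it as a placeholder):

```
docker ps                               # list running containers and their port mappings
docker logs -f <container-name-or-id>   # follow the distribution server logs
```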
:::

:::{tab-item} vLLM
[Start vLLM Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/remote-vllm.html)
```
$ cd llama-stack/distributions/remote-vllm && docker compose up
```

The script will first start the vLLM server on port 8000, then start the Llama Stack distribution server connected to it for inference. You should see output like the following:

```
<TO BE FILLED>
```
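
To watch both services while they come up, `docker compose logs -f` works from the same directory. As a rough connectivity check, you can probe vLLM directly; a sketch, assuming the vLLM OpenAI-compatible server exposes its usual health endpoint and your compose file publishes port 8000 on localhost:

```
docker compose logs -f              # follow vLLM and Llama Stack logs together
curl http://localhost:8000/health   # assumption: returns HTTP 200 once vLLM is ready
```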

To kill the server:
```
docker compose down
```
:::

:::{tab-item} tgi
[Start TGI Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/tgi.html)
```
$ cd llama-stack/distributions/tgi && docker compose up
```

The script will first start the TGI server, then start the Llama Stack distribution server connected to the remote TGI provider for inference. You should see output like the following:

```
[text-generation-inference] | 2024-10-15T18:56:33.810397Z INFO text_generation_router::server: router/src/server.rs:1813: Using config Some(Llama)
[text-generation-inference] | 2024-10-15T18:56:33.810448Z WARN text_generation_router::server: router/src/server.rs:1960: Invalid hostname, defaulting to 0.0.0.0
[text-generation-inference] | 2024-10-15T18:56:33.864143Z INFO text_generation_router::server: router/src/server.rs:2353: Connected
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
```
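
If the distribution does not come up, `docker compose ps` shows which service failed, and you can follow an individual service's logs by name; a minimal sketch (the service name is assumed to match the `[text-generation-inference]` prefix in the logs above, so check your `compose.yaml` if it differs):

```
docker compose ps                                   # status of the TGI and Llama Stack services
docker compose logs -f text-generation-inference    # follow only the TGI logs
```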

To kill the server:
```
docker compose down
```
:::

:::{tab-item} ollama
```
$ cd llama-stack/distributions/ollama && docker compose up

# OR

$ cd llama-stack/distributions/ollama-gpu && docker compose up
```

You will see output similar to the following:

```
[ollama] | [GIN] 2024/10/18 - 21:19:41 | 200 | 226.841µs | ::1 | GET "/api/ps"
[ollama] | [GIN] 2024/10/18 - 21:19:42 | 200 | 60.908µs | ::1 | GET "/api/ps"
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
[llamastack] | Resolved 12 providers
[llamastack] | inner-inference => ollama0
[llamastack] | models => __routing_table__
[llamastack] | inference => __autorouted__
```
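
The `GET "/api/ps"` lines show the Llama Stack server polling Ollama. You can run the same check by hand against the Ollama port; a minimal sketch, assuming Ollama is published on its default port 11434 as in the compose files here:

```
curl http://localhost:11434/api/ps   # JSON list of models Ollama currently has loaded
```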

To kill the server:
```
docker compose down
```
:::

:::{tab-item} fireworks
```
$ cd llama-stack/distributions/fireworks && docker compose up
```

Make sure your `run.yaml` file has the inference provider pointing to the correct Fireworks server endpoint, e.g.:
```
inference:
  - provider_id: fireworks
    provider_type: remote::fireworks
    config:
      url: https://api.fireworks.ai/inference
      api_key: <optional api key>
```
:::

:::{tab-item} together
```
$ cd distributions/together && docker compose up
```

Make sure your `run.yaml` file has the inference provider pointing to the correct Together server endpoint, e.g.:
```
inference:
  - provider_id: together
    provider_type: remote::together
    config:
      url: https://api.together.xyz/v1
      api_key: <optional api key>
```
:::

::::

**(Option 2) Via Conda**

::::{tab-set}

:::{tab-item} meta-reference-gpu
1. Install the `llama` CLI. See [CLI Reference](https://llama-stack.readthedocs.io/en/latest/cli_reference/index.html)

2. Build the `meta-reference-gpu` distribution

```
$ llama stack build --template meta-reference-gpu --image-type conda
```

3. Start running the distribution
```
$ llama stack run ~/.llama/distributions/llamastack-meta-reference-gpu/meta-reference-gpu-run.yaml
```
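
If port 5000 is already taken on your machine, the server port can be overridden at run time with the same `--port` flag mentioned in the Troubleshooting section below; a minimal sketch:

```
$ llama stack run ~/.llama/distributions/llamastack-meta-reference-gpu/meta-reference-gpu-run.yaml --port 5001
```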

Note: If you wish to use pgvector or chromadb as the memory provider, you may need to update the generated `run.yaml` file to point to the desired memory provider; see [Memory Providers](https://llama-stack.readthedocs.io/en/latest/api_providers/memory_api.html) for more details. Alternatively, comment out the pgvector or chromadb memory provider in the `run.yaml` file to use the default inline memory provider, keeping only the following section:
```
memory:
  - provider_id: faiss-0
    provider_type: faiss
    config:
      kvstore:
        namespace: null
        type: sqlite
        db_path: ~/.llama/runtime/faiss_store.db
```
:::

:::{tab-item} tgi
1. Install the `llama` CLI. See [CLI Reference](https://llama-stack.readthedocs.io/en/latest/cli_reference/index.html)

2. Build the `tgi` distribution

```bash
llama stack build --template tgi --image-type conda
```

3. Start a TGI server endpoint
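
One way to start a local TGI endpoint is the same Docker command used in the model-serving section below (shown here as a sketch; adjust the model ID, port, and GPU flags to your setup):

```bash
docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all ghcr.io/huggingface/text-generation-inference:latest --dtype bfloat16 --usage-stats on --sharded false --model-id meta-llama/Llama-3.2-1B-Instruct --port 5009
```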

4. Make sure that in your `run.yaml` file, `conda_env` points to the conda environment and the inference provider points to the correct TGI server endpoint, e.g.:
```
conda_env: llamastack-tgi
...
inference:
  - provider_id: tgi0
    provider_type: remote::tgi
    config:
      url: http://127.0.0.1:5009
```

5. Start the Llama Stack server
```bash
$ llama stack run ~/.llama/distributions/llamastack-tgi/tgi-run.yaml
```

Note: If you wish to use pgvector or chromadb as the memory provider, you may need to update the generated `run.yaml` file to point to the desired memory provider; see [Memory Providers](https://llama-stack.readthedocs.io/en/latest/api_providers/memory_api.html) for more details. Alternatively, comment out the pgvector or chromadb memory provider in the `run.yaml` file to use the default inline memory provider, keeping only the following section:
```
memory:
  - provider_id: faiss-0
    provider_type: faiss
    config:
      kvstore:
        namespace: null
        type: sqlite
        db_path: ~/.llama/runtime/faiss_store.db
```
:::

:::{tab-item} ollama
[Start Ollama Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/ollama.html)

If you wish to separately spin up an Ollama server and connect it to Llama Stack, you may use the following commands.

#### Start the Ollama server
- Please check the [Ollama documentation](https://github.com/ollama/ollama) for more details.

**Via Docker**
```
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

**Via CLI**
```
ollama run <model_id>
```

#### Start the Llama Stack server pointing to the Ollama server

Make sure your `run.yaml` file has the inference provider pointing to the correct Ollama endpoint, e.g.:
```
conda_env: llamastack-ollama
...
inference:
  - provider_id: ollama0
    provider_type: remote::ollama
    config:
      url: http://127.0.0.1:11434
```

```
llama stack build --template ollama --image-type conda
llama stack run ~/.llama/distributions/llamastack-ollama/ollama-run.yaml
```
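
Once both servers are running, you can confirm that Llama Stack is talking to Ollama by listing the registered models from the client, as shown in the model-serving section below; a minimal check:

```
llama-stack-client models list
```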

Note: If you wish to use pgvector or chromadb as the memory provider, you may need to update the generated `run.yaml` file to point to the desired memory provider; see [Memory Providers](https://llama-stack.readthedocs.io/en/latest/api_providers/memory_api.html) for more details. Alternatively, comment out the pgvector or chromadb memory provider in the `run.yaml` file to use the default inline memory provider, keeping only the following section:
```
memory:
  - provider_id: faiss-0
    provider_type: faiss
    config:
      kvstore:
        namespace: null
        type: sqlite
        db_path: ~/.llama/runtime/faiss_store.db
```
:::

:::{tab-item} fireworks
[Start Fireworks Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/fireworks.html)

```bash
llama stack build --template fireworks --image-type conda
# -- modify run.yaml to a valid Fireworks server endpoint
llama stack run ./run.yaml
```

Make sure your `run.yaml` file has the inference provider pointing to the correct Fireworks server endpoint, e.g.:
```
conda_env: llamastack-fireworks
...
inference:
  - provider_id: fireworks
    provider_type: remote::fireworks
    config:
      url: https://api.fireworks.ai/inference
      api_key: <optional api key>
```
:::

:::{tab-item} together
[Start Together Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/together.html)

```bash
llama stack build --template together --image-type conda
# -- modify run.yaml to a valid Together server endpoint
llama stack run ~/.llama/distributions/llamastack-together/together-run.yaml
```

Make sure your `run.yaml` file has the inference provider pointing to the correct Together server endpoint, e.g.:
```
conda_env: llamastack-together
...
inference:
  - provider_id: together
    provider_type: remote::together
    config:
      url: https://api.together.xyz/v1
      api_key: <optional api key>
```
:::

::::

##### 1.2 (Optional) Update Model Serving Configuration

::::{tab-set}

:::{tab-item} meta-reference-gpu
You may change the `config.model` in `run.yaml` to update the model currently being served by the distribution. Make sure you have the model checkpoint downloaded in your `~/.llama`.
```
inference:
  - provider_id: meta0
    provider_type: inline::meta-reference
    config:
      model: Llama3.2-11B-Vision-Instruct
      quantization: null
      torch_seed: null
      max_seq_len: 4096
      max_batch_size: 1
```

Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
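
For example, to fetch the checkpoint referenced in the config above (a sketch; the exact `llama model download` flags may differ across versions, so check `llama model download --help`):

```
llama model list
# assumption: --source and --model-id are the supported flags in your CLI version
llama model download --source meta --model-id Llama3.2-11B-Vision-Instruct
```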
:::

:::{tab-item} tgi
To serve a new model with `tgi`, change the docker command flag `--model-id <model-to-serve>`.

This can be done by editing the `command` args in `compose.yaml`, e.g. replacing "Llama-3.2-1B-Instruct" with the model you want to serve:

```
command: ["--dtype", "bfloat16", "--usage-stats", "on", "--sharded", "false", "--model-id", "meta-llama/Llama-3.2-1B-Instruct", "--port", "5009", "--cuda-memory-fraction", "0.3"]
```

or by changing the docker run command's `--model-id` flag:
```
docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all ghcr.io/huggingface/text-generation-inference:latest --dtype bfloat16 --usage-stats on --sharded false --model-id meta-llama/Llama-3.2-1B-Instruct --port 5009
```

Make sure your `run.yaml` file has the inference provider pointing to the TGI server endpoint serving your model.
```
inference:
  - provider_id: tgi0
    provider_type: remote::tgi
    config:
      url: http://127.0.0.1:5009
```

Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
:::

:::{tab-item} ollama
You can use Ollama to manage model downloads.

```
ollama pull llama3.1:8b-instruct-fp16
ollama pull llama3.1:70b-instruct-fp16
```

> Please check the [OLLAMA_SUPPORTED_MODELS](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers.remote/inference/ollama/ollama.py) for the supported Ollama models.

To serve a new model with `ollama`:
```
ollama run <model_name>
```

To make sure that the model is being served correctly, run `ollama ps` to get a list of models being served by Ollama.
```
$ ollama ps

NAME                         ID              SIZE     PROCESSOR    UNTIL
llama3.1:8b-instruct-fp16    4aacac419454    17 GB    100% GPU     4 minutes from now
```

To verify that the model served by Ollama is correctly connected to the Llama Stack server:
```
$ llama-stack-client models list
+----------------------+----------------------+---------------+-----------------------------------------------+
| identifier           | llama_model          | provider_id   | metadata                                      |
+======================+======================+===============+===============================================+
| Llama3.1-8B-Instruct | Llama3.1-8B-Instruct | ollama0       | {'ollama_model': 'llama3.1:8b-instruct-fp16'} |
+----------------------+----------------------+---------------+-----------------------------------------------+
```
:::

:::{tab-item} together
Use `llama-stack-client models list` to check the available models served by Together.

```
$ llama-stack-client models list
+------------------------------+------------------------------+---------------+------------+
| identifier                   | llama_model                  | provider_id   | metadata   |
+==============================+==============================+===============+============+
| Llama3.1-8B-Instruct         | Llama3.1-8B-Instruct         | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-70B-Instruct        | Llama3.1-70B-Instruct        | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-405B-Instruct       | Llama3.1-405B-Instruct       | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-3B-Instruct         | Llama3.2-3B-Instruct         | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-11B-Vision-Instruct | Llama3.2-11B-Vision-Instruct | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-90B-Vision-Instruct | Llama3.2-90B-Vision-Instruct | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
```
:::

:::{tab-item} fireworks
Use `llama-stack-client models list` to check the available models served by Fireworks.

```
$ llama-stack-client models list
+------------------------------+------------------------------+---------------+------------+
| identifier                   | llama_model                  | provider_id   | metadata   |
+==============================+==============================+===============+============+
| Llama3.1-8B-Instruct         | Llama3.1-8B-Instruct         | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-70B-Instruct        | Llama3.1-70B-Instruct        | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-405B-Instruct       | Llama3.1-405B-Instruct       | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-1B-Instruct         | Llama3.2-1B-Instruct         | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-3B-Instruct         | Llama3.2-3B-Instruct         | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-11B-Vision-Instruct | Llama3.2-11B-Vision-Instruct | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-90B-Vision-Instruct | Llama3.2-90B-Vision-Instruct | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
```
:::

::::

##### Troubleshooting
- If you encounter any issues, search through our [GitHub Issues](https://github.com/meta-llama/llama-stack/issues), or file a new issue.
- Use the `--port <PORT>` flag to use a different port number. For `docker run`, update the `-p <PORT>:<PORT>` flag accordingly, as shown in the example below.
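
For example, to serve on port 5001 instead of the default (a sketch; substitute your own `run.yaml` path and container arguments):

```
llama stack run ./run.yaml --port 5001
# or, when running via Docker, map the matching host port:
docker run -it -p 5001:5001 ... --yaml_config /root/my-run.yaml
```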