From c94fae5ab1a93a2721d8387c5ede4ec7dc7d85e9 Mon Sep 17 00:00:00 2001
From: Xi Yan
Date: Wed, 30 Oct 2024 11:13:01 -0700
Subject: [PATCH] tabs

---
 .../distributions/fireworks.md                |   9 +-
 .../getting_started/distributions/together.md |   5 +-
 docs/source/getting_started/index.md          | 230 +++++++++++++++++-
 3 files changed, 224 insertions(+), 20 deletions(-)

diff --git a/docs/source/getting_started/distributions/fireworks.md b/docs/source/getting_started/distributions/fireworks.md
index 100a10794..ee46cd18d 100644
--- a/docs/source/getting_started/distributions/fireworks.md
+++ b/docs/source/getting_started/distributions/fireworks.md
@@ -12,15 +12,12 @@ The `llamastack/distribution-fireworks` distribution consists of the following provider configurations.
 
 ### Step 1. Start the Distribution (Single Node CPU)
 
-#### (Option 1) Start Distribution Via Conda
+#### (Option 1) Start Distribution Via Docker
 
 > [!NOTE]
 > This assumes you have a hosted endpoint at Fireworks with an API key.
 
 ```
-$ cd distributions/fireworks
-$ ls
-compose.yaml  run.yaml
-$ docker compose up
+$ cd distributions/fireworks && docker compose up
 ```
 
 Make sure the inference provider in your `run.yaml` file points to the correct Fireworks server endpoint. E.g.
@@ -44,7 +41,7 @@ llama stack run ./run.yaml
 
 ### (Optional) Model Serving
 
-Use `llama-stack-client models list` to chekc the available models served by Fireworks.
+Use `llama-stack-client models list` to check the available models served by Fireworks.
 ```
 $ llama-stack-client models list
 +------------------------------+------------------------------+---------------+------------+
diff --git a/docs/source/getting_started/distributions/together.md b/docs/source/getting_started/distributions/together.md
index 5f9c90071..6a4142361 100644
--- a/docs/source/getting_started/distributions/together.md
+++ b/docs/source/getting_started/distributions/together.md
@@ -17,10 +17,7 @@ The `llamastack/distribution-together` distribution consists of the following provider configurations.
 
 > This assumes you have a hosted endpoint at Together with an API key.
 
 ```
-$ cd distributions/together
-$ ls
-compose.yaml  run.yaml
-$ docker compose up
+$ cd distributions/together && docker compose up
 ```
 
 Make sure the inference provider in your `run.yaml` file points to the correct Together server endpoint. E.g.
diff --git a/docs/source/getting_started/index.md b/docs/source/getting_started/index.md
index 6d6e953e8..1aa974e11 100644
--- a/docs/source/getting_started/index.md
+++ b/docs/source/getting_started/index.md
@@ -62,10 +62,24 @@ $ ls ~/.llama/checkpoints
 Llama3.1-8B           Llama3.2-11B-Vision-Instruct  Llama3.2-1B-Instruct  Llama3.2-90B-Vision-Instruct  Llama-Guard-3-8B
 Llama3.1-8B-Instruct  Llama3.2-1B                   Llama3.2-3B-Instruct  Llama-Guard-3-1B              Prompt-Guard-86M
 ```
+
+> This assumes you have access to a GPU to run the server locally.
 :::
 
 :::{tab-item} tgi
-This assumes you have access to GPU to start a TGI server with access to your GPU.
+Access to a single-node GPU to start a TGI server.
 :::
+
+:::{tab-item} ollama
+Access to a single-node CPU machine able to run ollama.
+:::
+
+:::{tab-item} together
+Access to a single-node CPU machine, plus a Together hosted endpoint with an API key from [together.ai](https://api.together.xyz/signin).
+:::
+
+:::{tab-item} fireworks
+Access to a single-node CPU machine, plus a Fireworks hosted endpoint with an API key from [fireworks.ai](https://fireworks.ai/).
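+
+If you'd like to sanity-check the key before wiring it into Llama Stack, one option is a direct call to Fireworks' OpenAI-compatible endpoint. This is just a sketch — the model id is illustrative, and the key is assumed to be exported as `FIREWORKS_API_KEY`:
+```
+# expects your key in the FIREWORKS_API_KEY environment variable
+curl https://api.fireworks.ai/inference/v1/chat/completions \
+  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{"model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
+       "messages": [{"role": "user", "content": "Hello"}]}'
+```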
+:::
 
 ::::
 
@@ -80,14 +94,6 @@ $ cd distributions/meta-reference-gpu && docker compose up
 ```
 
-> [!NOTE]
-> This assumes you have access to GPU to start a local server with access to your GPU.
-
-
-> [!NOTE]
-> `~/.llama` should be the path containing downloaded weights of Llama models.
-
-
 This will download and start running a pre-built docker container. Alternatively, you may use the following commands:
 
 ```
@@ -117,6 +123,65 @@ docker compose down
 ```
 :::
+
+:::{tab-item} ollama
+```
+$ cd distributions/ollama/cpu && docker compose up
+```
+
+You will see output similar to the following:
+```
+[ollama]     | [GIN] 2024/10/18 - 21:19:41 | 200 |  226.841µs |  ::1 | GET  "/api/ps"
+[ollama]     | [GIN] 2024/10/18 - 21:19:42 | 200 |   60.908µs |  ::1 | GET  "/api/ps"
+INFO:     Started server process [1]
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
+[llamastack] | Resolved 12 providers
+[llamastack] |  inner-inference => ollama0
+[llamastack] |  models => __routing_table__
+[llamastack] |  inference => __autorouted__
+```
+
+To kill the server:
+```
+docker compose down
+```
+:::
+
+:::{tab-item} fireworks
+```
+$ cd distributions/fireworks && docker compose up
+```
+
+Make sure the inference provider in your `run.yaml` file points to the correct Fireworks server endpoint. E.g.
+```
+inference:
+  - provider_id: fireworks
+    provider_type: remote::fireworks
+    config:
+      url: https://api.fireworks.ai/inference
+      api_key: <your api key>
+```
+:::
+
+:::{tab-item} together
+```
+$ cd distributions/together && docker compose up
+```
+
+Make sure the inference provider in your `run.yaml` file points to the correct Together server endpoint. E.g.
+```
+inference:
+  - provider_id: together
+    provider_type: remote::together
+    config:
+      url: https://api.together.xyz/v1
+      api_key: <your api key>
+```
+:::
+
 ::::
 
 **Via Conda**
 
@@ -147,6 +212,78 @@ llama stack run ./gpu/run.yaml
 ```
 :::
 
+:::{tab-item} ollama
+
+If you wish to separately spin up an Ollama server and connect it to Llama Stack, you may use the following commands.
+
+#### Start the Ollama server
+- Please check the [Ollama Documentation](https://github.com/ollama/ollama) for more details.
+
+**Via Docker**
+```
+docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
+```
+
+**Via CLI**
+```
+ollama run <model_id>
+```
+
+#### Start the Llama Stack server pointing to the Ollama server
+
+Make sure the inference provider in your `run.yaml` file points to the correct Ollama endpoint. E.g.
+```
+inference:
+  - provider_id: ollama0
+    provider_type: remote::ollama
+    config:
+      url: http://127.0.0.1:11434
+```
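+
+Before starting Llama Stack, it's worth confirming the Ollama server is actually reachable at that URL. A minimal check, assuming the default port from the Docker command above (`/api/ps` is the same route that appears in the server logs earlier on this page):
+```
+# should return JSON describing the currently loaded models
+curl http://127.0.0.1:11434/api/ps
+```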
+
+```
+llama stack build --template ollama --image-type conda
+llama stack run ./gpu/run.yaml
+```
+
+:::
+
+:::{tab-item} fireworks
+
+```bash
+llama stack build --template fireworks --image-type conda
+# -- modify run.yaml to a valid Fireworks server endpoint
+llama stack run ./run.yaml
+```
+
+Make sure the inference provider in your `run.yaml` file points to the correct Fireworks server endpoint. E.g.
+```
+inference:
+  - provider_id: fireworks
+    provider_type: remote::fireworks
+    config:
+      url: https://api.fireworks.ai/inference
+      api_key: <your api key>
+```
+:::
+
+:::{tab-item} together
+
+```bash
+llama stack build --template together --image-type conda
+# -- modify run.yaml to a valid Together server endpoint
+llama stack run ./run.yaml
+```
+
+Make sure the inference provider in your `run.yaml` file points to the correct Together server endpoint. E.g.
+```
+inference:
+  - provider_id: together
+    provider_type: remote::together
+    config:
+      url: https://api.together.xyz/v1
+      api_key: <your api key>
+```
+:::
+
 ::::
 
@@ -170,6 +307,33 @@ inference:
 
 Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
 :::
 
+:::{tab-item} tgi
+To serve a new model with `tgi`, change the docker command flag `--model-id <model_id>`.
+
+This can be done by editing the `command` args in `compose.yaml`. E.g., replace "Llama-3.2-1B-Instruct" with the model you want to serve.
+
+```
+command: ["--dtype", "bfloat16", "--usage-stats", "on", "--sharded", "false", "--model-id", "meta-llama/Llama-3.2-1B-Instruct", "--port", "5009", "--cuda-memory-fraction", "0.3"]
+```
+
+or by changing the docker run command's `--model-id` flag:
+```
+docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all ghcr.io/huggingface/text-generation-inference:latest --dtype bfloat16 --usage-stats on --sharded false --model-id meta-llama/Llama-3.2-1B-Instruct --port 5009
+```
+
+In `run.yaml`, make sure the inference provider points to the TGI server endpoint serving your model.
+```
+inference:
+  - provider_id: tgi0
+    provider_type: remote::tgi
+    config:
+      url: http://127.0.0.1:5009
+```
+
+Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
+:::
+
 :::{tab-item} ollama
 You can use ollama for managing model downloads.
 
 ```
 ollama pull llama3.1:8b-instruct-fp16
 ollama pull llama3.1:70b-instruct-fp16
 ```
 
-> [!NOTE]
 > Please check the [OLLAMA_SUPPORTED_MODELS](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/adapters/inference/ollama/ollama.py) for the supported Ollama models.
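+
+Once the pulls complete, a quick optional check confirms the models are available locally before you query them through Llama Stack:
+```
+# lists locally downloaded models with their tags and sizes
+ollama list
+```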
 
@@ -206,6 +369,53 @@ $ llama-stack-client models list
 ```
 :::
 
+:::{tab-item} together
+Use `llama-stack-client models list` to check the available models served by Together.
+
+```
+$ llama-stack-client models list
++------------------------------+------------------------------+---------------+------------+
+| identifier                   | llama_model                  | provider_id   | metadata   |
++==============================+==============================+===============+============+
+| Llama3.1-8B-Instruct         | Llama3.1-8B-Instruct         | together0     | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.1-70B-Instruct        | Llama3.1-70B-Instruct        | together0     | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.1-405B-Instruct       | Llama3.1-405B-Instruct       | together0     | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.2-3B-Instruct         | Llama3.2-3B-Instruct         | together0     | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.2-11B-Vision-Instruct | Llama3.2-11B-Vision-Instruct | together0     | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.2-90B-Vision-Instruct | Llama3.2-90B-Vision-Instruct | together0     | {}         |
++------------------------------+------------------------------+---------------+------------+
+```
+:::
+
+:::{tab-item} fireworks
+Use `llama-stack-client models list` to check the available models served by Fireworks.
+```
+$ llama-stack-client models list
++------------------------------+------------------------------+---------------+------------+
+| identifier                   | llama_model                  | provider_id   | metadata   |
++==============================+==============================+===============+============+
+| Llama3.1-8B-Instruct         | Llama3.1-8B-Instruct         | fireworks0    | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.1-70B-Instruct        | Llama3.1-70B-Instruct        | fireworks0    | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.1-405B-Instruct       | Llama3.1-405B-Instruct       | fireworks0    | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.2-1B-Instruct         | Llama3.2-1B-Instruct         | fireworks0    | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.2-3B-Instruct         | Llama3.2-3B-Instruct         | fireworks0    | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.2-11B-Vision-Instruct | Llama3.2-11B-Vision-Instruct | fireworks0    | {}         |
++------------------------------+------------------------------+---------------+------------+
+| Llama3.2-90B-Vision-Instruct | Llama3.2-90B-Vision-Instruct | fireworks0    | {}         |
++------------------------------+------------------------------+---------------+------------+
+```
+:::
+
 ::::
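+
+With models listed, you can smoke-test inference end to end against the local Llama Stack server. This is a sketch, not part of the distribution setup: it assumes the server is listening on port 5000 (as in the logs above) and that your build exposes the `/inference/chat_completion` route — adjust the route, model name, and payload to your version:
+```
+# send a single non-streaming chat completion request to the local server
+curl http://localhost:5000/inference/chat_completion \
+  -H "Content-Type: application/json" \
+  -d '{"model": "Llama3.1-8B-Instruct",
+       "messages": [{"role": "user", "content": "Hello"}],
+       "stream": false}'
+```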