commit c94fae5ab1 (parent c2195a0b5c): tabs
3 changed files with 224 additions and 20 deletions

@@ -12,15 +12,12 @@ The `llamastack/distribution-fireworks` distribution consists of the following p

### Step 1. Start the Distribution (Single Node CPU)

#### (Option 1) Start Distribution Via Conda
#### (Option 1) Start Distribution Via Docker

> [!NOTE]
> This assumes you have a hosted endpoint at Fireworks with an API key.

```
$ cd distributions/fireworks
$ ls
compose.yaml run.yaml
$ docker compose up
$ cd distributions/fireworks && docker compose up
```

Make sure that in your `run.yaml` file, your inference provider points to the correct Fireworks server endpoint, e.g.

@@ -44,7 +41,7 @@ llama stack run ./run.yaml

### (Optional) Model Serving

Use `llama-stack-client models list` to chekc the available models served by Fireworks.
Use `llama-stack-client models list` to check the available models served by Fireworks.
```
$ llama-stack-client models list
+------------------------------+------------------------------+---------------+------------+

@@ -17,10 +17,7 @@ The `llamastack/distribution-together` distribution consists of the following pr
> This assumes you have a hosted endpoint at Together with an API key.

```
$ cd distributions/together
$ ls
compose.yaml run.yaml
$ docker compose up
$ cd distributions/together && docker compose up
```

Make sure that in your `run.yaml` file, your inference provider points to the correct Together server endpoint, e.g.

@@ -62,10 +62,24 @@ $ ls ~/.llama/checkpoints
Llama3.1-8B Llama3.2-11B-Vision-Instruct Llama3.2-1B-Instruct Llama3.2-90B-Vision-Instruct Llama-Guard-3-8B
Llama3.1-8B-Instruct Llama3.2-1B Llama3.2-3B-Instruct Llama-Guard-3-1B Prompt-Guard-86M
```

> This assumes you have access to a GPU to start a local server.
:::

:::{tab-item} tgi
This assumes you have access to a GPU to start a TGI server.
Access to a GPU to start a TGI server.
:::

:::{tab-item} ollama
Access to a Single-Node CPU able to run ollama.
:::

:::{tab-item} together
Access to a Single-Node CPU with a Together-hosted endpoint via API_KEY from [together.ai](https://api.together.xyz/signin).
:::

:::{tab-item} fireworks
Access to a Single-Node CPU with a Fireworks-hosted endpoint via API_KEY from [fireworks.ai](https://fireworks.ai/).
:::

::::

@@ -80,14 +94,6 @@ This assumes you have access to GPU to start a TGI server with access to your GP
$ cd distributions/meta-reference-gpu && docker compose up
```

> [!NOTE]
> This assumes you have access to a GPU to start a local server.

> [!NOTE]
> `~/.llama` should be the path containing downloaded weights of Llama models.

This will download and start running a pre-built docker container. Alternatively, you may use the following commands:

```

@@ -117,6 +123,65 @@ docker compose down
```
:::

:::{tab-item} ollama
```
$ cd distributions/ollama/cpu && docker compose up
```

You will see output similar to the following:
```
[ollama] | [GIN] 2024/10/18 - 21:19:41 | 200 | 226.841µs | ::1 | GET "/api/ps"
[ollama] | [GIN] 2024/10/18 - 21:19:42 | 200 | 60.908µs | ::1 | GET "/api/ps"
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
[llamastack] | Resolved 12 providers
[llamastack] | inner-inference => ollama0
[llamastack] | models => __routing_table__
[llamastack] | inference => __autorouted__
```

To stop the server:
```
docker compose down
```
:::

:::{tab-item} fireworks
```
$ cd distributions/fireworks && docker compose up
```

Make sure that in your `run.yaml` file, your inference provider points to the correct Fireworks server endpoint, e.g.
```
inference:
  - provider_id: fireworks
    provider_type: remote::fireworks
    config:
      url: https://api.fireworks.ai/inference
      api_key: <optional api key>
```
:::

:::{tab-item} together
```
$ cd distributions/together && docker compose up
```

Make sure that in your `run.yaml` file, your inference provider points to the correct Together server endpoint, e.g.
```
inference:
  - provider_id: together
    provider_type: remote::together
    config:
      url: https://api.together.xyz/v1
      api_key: <optional api key>
```
:::

::::

**Via Conda**

@@ -147,6 +212,78 @@ llama stack run ./gpu/run.yaml
```
:::

:::{tab-item} ollama

If you wish to separately spin up an Ollama server and connect it with Llama Stack, you may use the following commands.

#### Start Ollama server
- Please check the [Ollama documentation](https://github.com/ollama/ollama) for more details.

**Via Docker**
```
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```
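
Optionally, you can verify the Ollama server is reachable before proceeding. This is a quick check assuming the default port mapping `11434:11434` used above; Ollama's root endpoint returns a short status message:
```
$ curl http://localhost:11434
Ollama is running
```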

**Via CLI**
```
ollama run <model_id>
```
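
You can optionally confirm the model is loaded (`ollama ps` lists the models the Ollama server is currently running):
```
$ ollama ps
```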

#### Start Llama Stack server pointing to Ollama server

Make sure that in your `run.yaml` file, your inference provider points to the correct Ollama endpoint, e.g.
```
inference:
  - provider_id: ollama0
    provider_type: remote::ollama
    config:
      url: http://127.0.0.1:11434
```

```
llama stack build --template ollama --image-type conda
llama stack run ./gpu/run.yaml
```

:::

:::{tab-item} fireworks

```bash
llama stack build --template fireworks --image-type conda
# -- modify run.yaml to a valid Fireworks server endpoint
llama stack run ./run.yaml
```

Make sure that in your `run.yaml` file, your inference provider points to the correct Fireworks server endpoint, e.g.
```
inference:
  - provider_id: fireworks
    provider_type: remote::fireworks
    config:
      url: https://api.fireworks.ai/inference
      api_key: <optional api key>
```
:::

:::{tab-item} together

```bash
llama stack build --template together --image-type conda
# -- modify run.yaml to a valid Together server endpoint
llama stack run ./run.yaml
```

Make sure that in your `run.yaml` file, your inference provider points to the correct Together server endpoint, e.g.
```
inference:
  - provider_id: together
    provider_type: remote::together
    config:
      url: https://api.together.xyz/v1
      api_key: <optional api key>
```
:::

::::

@@ -170,6 +307,33 @@ inference:
Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
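
For example, to fetch one of the smaller checkpoints (an illustrative invocation; the exact `--source` and `--model-id` values depend on how you obtained access):
```
llama model download --source meta --model-id Llama3.2-3B-Instruct
```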
:::

:::{tab-item} tgi
To serve a new model with `tgi`, change the docker command flag `--model-id <model-to-serve>`.

This can be done by editing the `command` args in `compose.yaml`, e.g. replace "Llama-3.2-1B-Instruct" with the model you want to serve.

```
command: ["--dtype", "bfloat16", "--usage-stats", "on", "--sharded", "false", "--model-id", "meta-llama/Llama-3.2-1B-Instruct", "--port", "5009", "--cuda-memory-fraction", "0.3"]
```

or by changing the docker run command's `--model-id` flag:
```
docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all ghcr.io/huggingface/text-generation-inference:latest --dtype bfloat16 --usage-stats on --sharded false --model-id meta-llama/Llama-3.2-1B-Instruct --port 5009
```
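
As an optional sanity check before wiring TGI into Llama Stack (assuming TGI is listening on port 5009 as configured above), you can query its `/generate` endpoint directly:
```
curl http://127.0.0.1:5009/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Hello, my name is", "parameters": {"max_new_tokens": 16}}'
```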

In `run.yaml`, make sure the inference provider points to the TGI server endpoint serving your model, e.g.
```
inference:
  - provider_id: tgi0
    provider_type: remote::tgi
    config:
      url: http://127.0.0.1:5009
```

Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
:::

:::{tab-item} ollama
You can use ollama for managing model downloads.

@@ -178,7 +342,6 @@ ollama pull llama3.1:8b-instruct-fp16
ollama pull llama3.1:70b-instruct-fp16
```

> [!NOTE]
> Please check the [OLLAMA_SUPPORTED_MODELS](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/adapters/inference/ollama/ollama.py) for the supported Ollama models.
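
After pulling, you can optionally confirm the models are available locally:
```
ollama list
```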

@@ -206,6 +369,53 @@ $ llama-stack-client models list
```
:::

:::{tab-item} together
Use `llama-stack-client models list` to check the available models served by Together.

```
$ llama-stack-client models list
+------------------------------+------------------------------+---------------+------------+
| identifier                   | llama_model                  | provider_id   | metadata   |
+==============================+==============================+===============+============+
| Llama3.1-8B-Instruct         | Llama3.1-8B-Instruct         | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-70B-Instruct        | Llama3.1-70B-Instruct        | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-405B-Instruct       | Llama3.1-405B-Instruct       | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-3B-Instruct         | Llama3.2-3B-Instruct         | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-11B-Vision-Instruct | Llama3.2-11B-Vision-Instruct | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-90B-Vision-Instruct | Llama3.2-90B-Vision-Instruct | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
```
:::

:::{tab-item} fireworks
Use `llama-stack-client models list` to check the available models served by Fireworks.
```
$ llama-stack-client models list
+------------------------------+------------------------------+---------------+------------+
| identifier                   | llama_model                  | provider_id   | metadata   |
+==============================+==============================+===============+============+
| Llama3.1-8B-Instruct         | Llama3.1-8B-Instruct         | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-70B-Instruct        | Llama3.1-70B-Instruct        | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-405B-Instruct       | Llama3.1-405B-Instruct       | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-1B-Instruct         | Llama3.2-1B-Instruct         | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-3B-Instruct         | Llama3.2-3B-Instruct         | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-11B-Vision-Instruct | Llama3.2-11B-Vision-Instruct | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-90B-Vision-Instruct | Llama3.2-90B-Vision-Instruct | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
```
:::

::::