forked from phoenix-oss/llama-stack-mirror

docs

This commit is contained in: parent e605d57fb7, commit f78200b189

2 changed files with 9 additions and 397 deletions

@ -23,5 +23,6 @@ tgi
dell-tgi
together
fireworks
remote-vllm
bedrock
```

@ -53,9 +53,9 @@ Please see our pages in detail for the types of distributions we offer:

3. [On-device Distribution](./distributions/ondevice_distro/index.md): If you want to run Llama Stack inference on your iOS / Android device.

### Quick Start Commands

Once you have decided on the inference provider and distribution to use, use the following quick start commands to get started.

##### 1.0 Prerequisite

@ -109,421 +109,32 @@ Access to Single-Node CPU with Fireworks hosted endpoint via API_KEY from [firew

##### 1.1. Start the distribution

**(Option 1) Via Docker**

::::{tab-set}

:::{tab-item} meta-reference-gpu
[Start Meta Reference GPU Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/meta-reference-gpu.html)
```
$ cd llama-stack/distributions/meta-reference-gpu && docker compose up
```

This will download and start a pre-built Docker container. Alternatively, you may use the following command:

```
docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml
```
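
To confirm the container came up, you can inspect it with standard Docker commands; a minimal sketch (the container name below is whatever `docker ps` reports for the distribution image, so treat it as a placeholder):

```
docker ps                               # list running containers and their port mappings
docker logs -f <container-name-or-id>   # follow the distribution server logs
```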
:::

:::{tab-item} vLLM
[Start vLLM Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/remote-vllm.html)
```
$ cd llama-stack/distributions/remote-vllm && docker compose up
```

The script will first start the vLLM server on port 8000, then start the Llama Stack distribution server connected to it for inference. You should see output like the following:

```
<TO BE FILLED>
```
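
To watch both services while they come up, `docker compose logs -f` works from the same directory. As a rough connectivity check, you can probe vLLM directly; a sketch, assuming the vLLM OpenAI-compatible server exposes its usual health endpoint and your compose file publishes port 8000 on localhost:

```
docker compose logs -f              # follow vLLM and Llama Stack logs together
curl http://localhost:8000/health   # assumption: returns HTTP 200 once vLLM is ready
```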

To kill the server:
```
docker compose down
```
:::

:::{tab-item} tgi
[Start TGI Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/tgi.html)
```
$ cd llama-stack/distributions/tgi && docker compose up
```

The script will first start the TGI server, then start the Llama Stack distribution server connected to the remote TGI provider for inference. You should see output like the following:

```
[text-generation-inference] | 2024-10-15T18:56:33.810397Z INFO text_generation_router::server: router/src/server.rs:1813: Using config Some(Llama)
[text-generation-inference] | 2024-10-15T18:56:33.810448Z WARN text_generation_router::server: router/src/server.rs:1960: Invalid hostname, defaulting to 0.0.0.0
[text-generation-inference] | 2024-10-15T18:56:33.864143Z INFO text_generation_router::server: router/src/server.rs:2353: Connected
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
```
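
If the distribution does not come up, `docker compose ps` shows which service failed, and you can follow an individual service's logs by name; a minimal sketch (the service name is assumed to match the `[text-generation-inference]` prefix in the logs above, so check your `compose.yaml` if it differs):

```
docker compose ps                                   # status of the TGI and Llama Stack services
docker compose logs -f text-generation-inference    # follow only the TGI logs
```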

To kill the server:
```
docker compose down
```
:::

:::{tab-item} ollama
```
$ cd llama-stack/distributions/ollama && docker compose up

# OR

$ cd llama-stack/distributions/ollama-gpu && docker compose up
```

You will see output similar to the following:

```
[ollama] | [GIN] 2024/10/18 - 21:19:41 | 200 | 226.841µs | ::1 | GET "/api/ps"
[ollama] | [GIN] 2024/10/18 - 21:19:42 | 200 | 60.908µs | ::1 | GET "/api/ps"
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
[llamastack] | Resolved 12 providers
[llamastack] | inner-inference => ollama0
[llamastack] | models => __routing_table__
[llamastack] | inference => __autorouted__
```
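
The `GET "/api/ps"` lines show the Llama Stack server polling Ollama. You can run the same check by hand against the Ollama port; a minimal sketch, assuming Ollama is published on its default port 11434 as in the compose files here:

```
curl http://localhost:11434/api/ps   # JSON list of models Ollama currently has loaded
```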

To kill the server:
```
docker compose down
```
:::

:::{tab-item} fireworks
```
$ cd llama-stack/distributions/fireworks && docker compose up
```

Make sure your `run.yaml` file has the inference provider pointing to the correct Fireworks server endpoint, e.g.:
```
inference:
  - provider_id: fireworks
    provider_type: remote::fireworks
    config:
      url: https://api.fireworks.ai/inference
      api_key: <optional api key>
```
:::

:::{tab-item} together
```
$ cd distributions/together && docker compose up
```

Make sure your `run.yaml` file has the inference provider pointing to the correct Together server endpoint, e.g.:
```
inference:
  - provider_id: together
    provider_type: remote::together
    config:
      url: https://api.together.xyz/v1
      api_key: <optional api key>
```
:::

::::

**(Option 2) Via Conda**

::::{tab-set}

:::{tab-item} meta-reference-gpu
1. Install the `llama` CLI. See [CLI Reference](https://llama-stack.readthedocs.io/en/latest/cli_reference/index.html)

2. Build the `meta-reference-gpu` distribution

```
$ llama stack build --template meta-reference-gpu --image-type conda
```

3. Start running the distribution
```
$ llama stack run ~/.llama/distributions/llamastack-meta-reference-gpu/meta-reference-gpu-run.yaml
```
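
If port 5000 is already taken on your machine, the server port can be overridden at run time with the same `--port` flag mentioned in the Troubleshooting section below; a minimal sketch:

```
$ llama stack run ~/.llama/distributions/llamastack-meta-reference-gpu/meta-reference-gpu-run.yaml --port 5001
```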

Note: If you wish to use pgvector or chromadb as the memory provider, you may need to update the generated `run.yaml` file to point to the desired memory provider; see [Memory Providers](https://llama-stack.readthedocs.io/en/latest/api_providers/memory_api.html) for more details. Alternatively, comment out the pgvector or chromadb memory provider in the `run.yaml` file to use the default inline memory provider, keeping only the following section:
```
memory:
  - provider_id: faiss-0
    provider_type: faiss
    config:
      kvstore:
        namespace: null
        type: sqlite
        db_path: ~/.llama/runtime/faiss_store.db
```
:::

:::{tab-item} tgi
1. Install the `llama` CLI. See [CLI Reference](https://llama-stack.readthedocs.io/en/latest/cli_reference/index.html)

2. Build the `tgi` distribution

```bash
llama stack build --template tgi --image-type conda
```

3. Start a TGI server endpoint
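
One way to start a local TGI endpoint is the same Docker command used in the model-serving section below (shown here as a sketch; adjust the model ID, port, and GPU flags to your setup):

```bash
docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all ghcr.io/huggingface/text-generation-inference:latest --dtype bfloat16 --usage-stats on --sharded false --model-id meta-llama/Llama-3.2-1B-Instruct --port 5009
```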

4. Make sure that in your `run.yaml` file, `conda_env` points to the conda environment and the inference provider points to the correct TGI server endpoint, e.g.:
```
conda_env: llamastack-tgi
...
inference:
  - provider_id: tgi0
    provider_type: remote::tgi
    config:
      url: http://127.0.0.1:5009
```

5. Start the Llama Stack server
```bash
$ llama stack run ~/.llama/distributions/llamastack-tgi/tgi-run.yaml
```

Note: If you wish to use pgvector or chromadb as the memory provider, you may need to update the generated `run.yaml` file to point to the desired memory provider; see [Memory Providers](https://llama-stack.readthedocs.io/en/latest/api_providers/memory_api.html) for more details. Alternatively, comment out the pgvector or chromadb memory provider in the `run.yaml` file to use the default inline memory provider, keeping only the following section:
```
memory:
  - provider_id: faiss-0
    provider_type: faiss
    config:
      kvstore:
        namespace: null
        type: sqlite
        db_path: ~/.llama/runtime/faiss_store.db
```
:::

:::{tab-item} ollama
[Start Ollama Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/ollama.html)

If you wish to separately spin up an Ollama server and connect it to Llama Stack, you may use the following commands.

#### Start the Ollama server
- Please check the [Ollama documentation](https://github.com/ollama/ollama) for more details.

**Via Docker**
```
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

**Via CLI**
```
ollama run <model_id>
```

#### Start the Llama Stack server pointing to the Ollama server

Make sure your `run.yaml` file has the inference provider pointing to the correct Ollama endpoint, e.g.:
```
conda_env: llamastack-ollama
...
inference:
  - provider_id: ollama0
    provider_type: remote::ollama
    config:
      url: http://127.0.0.1:11434
```

```
llama stack build --template ollama --image-type conda
llama stack run ~/.llama/distributions/llamastack-ollama/ollama-run.yaml
```
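
Once both servers are running, you can confirm that Llama Stack is talking to Ollama by listing the registered models from the client, as shown in the model-serving section below; a minimal check:

```
llama-stack-client models list
```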

Note: If you wish to use pgvector or chromadb as the memory provider, you may need to update the generated `run.yaml` file to point to the desired memory provider; see [Memory Providers](https://llama-stack.readthedocs.io/en/latest/api_providers/memory_api.html) for more details. Alternatively, comment out the pgvector or chromadb memory provider in the `run.yaml` file to use the default inline memory provider, keeping only the following section:
```
memory:
  - provider_id: faiss-0
    provider_type: faiss
    config:
      kvstore:
        namespace: null
        type: sqlite
        db_path: ~/.llama/runtime/faiss_store.db
```
:::

:::{tab-item} fireworks
[Start Fireworks Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/fireworks.html)

```bash
llama stack build --template fireworks --image-type conda
# -- modify run.yaml to a valid Fireworks server endpoint
llama stack run ./run.yaml
```

Make sure your `run.yaml` file has the inference provider pointing to the correct Fireworks server endpoint, e.g.:
```
conda_env: llamastack-fireworks
...
inference:
  - provider_id: fireworks
    provider_type: remote::fireworks
    config:
      url: https://api.fireworks.ai/inference
      api_key: <optional api key>
```
:::

:::{tab-item} together
[Start Together Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/together.html)

```bash
llama stack build --template together --image-type conda
# -- modify run.yaml to a valid Together server endpoint
llama stack run ~/.llama/distributions/llamastack-together/together-run.yaml
```

Make sure your `run.yaml` file has the inference provider pointing to the correct Together server endpoint, e.g.:
```
conda_env: llamastack-together
...
inference:
  - provider_id: together
    provider_type: remote::together
    config:
      url: https://api.together.xyz/v1
      api_key: <optional api key>
```
:::

::::

##### 1.2 (Optional) Update Model Serving Configuration

::::{tab-set}

:::{tab-item} meta-reference-gpu
You may change the `config.model` in `run.yaml` to update the model currently being served by the distribution. Make sure you have the model checkpoint downloaded in your `~/.llama`.
```
inference:
  - provider_id: meta0
    provider_type: inline::meta-reference
    config:
      model: Llama3.2-11B-Vision-Instruct
      quantization: null
      torch_seed: null
      max_seq_len: 4096
      max_batch_size: 1
```

Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
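
For example, to fetch the checkpoint referenced in the config above (a sketch; the exact `llama model download` flags may differ across versions, so check `llama model download --help`):

```
llama model list
# assumption: --source and --model-id are the supported flags in your CLI version
llama model download --source meta --model-id Llama3.2-11B-Vision-Instruct
```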
:::

:::{tab-item} tgi
To serve a new model with `tgi`, change the docker command flag `--model-id <model-to-serve>`.

This can be done by editing the `command` args in `compose.yaml`, e.g. replacing "Llama-3.2-1B-Instruct" with the model you want to serve:

```
command: ["--dtype", "bfloat16", "--usage-stats", "on", "--sharded", "false", "--model-id", "meta-llama/Llama-3.2-1B-Instruct", "--port", "5009", "--cuda-memory-fraction", "0.3"]
```

or by changing the docker run command's `--model-id` flag:
```
docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all ghcr.io/huggingface/text-generation-inference:latest --dtype bfloat16 --usage-stats on --sharded false --model-id meta-llama/Llama-3.2-1B-Instruct --port 5009
```

Make sure your `run.yaml` file has the inference provider pointing to the TGI server endpoint serving your model.
```
inference:
  - provider_id: tgi0
    provider_type: remote::tgi
    config:
      url: http://127.0.0.1:5009
```

Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
:::

:::{tab-item} ollama
You can use Ollama to manage model downloads.

```
ollama pull llama3.1:8b-instruct-fp16
ollama pull llama3.1:70b-instruct-fp16
```

> Please check the [OLLAMA_SUPPORTED_MODELS](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers.remote/inference/ollama/ollama.py) for the supported Ollama models.

To serve a new model with `ollama`:
```
ollama run <model_name>
```

To make sure that the model is being served correctly, run `ollama ps` to get a list of models being served by Ollama.
```
$ ollama ps

NAME                         ID              SIZE     PROCESSOR    UNTIL
llama3.1:8b-instruct-fp16    4aacac419454    17 GB    100% GPU     4 minutes from now
```

To verify that the model served by Ollama is correctly connected to the Llama Stack server:
```
$ llama-stack-client models list
+----------------------+----------------------+---------------+-----------------------------------------------+
| identifier           | llama_model          | provider_id   | metadata                                      |
+======================+======================+===============+===============================================+
| Llama3.1-8B-Instruct | Llama3.1-8B-Instruct | ollama0       | {'ollama_model': 'llama3.1:8b-instruct-fp16'} |
+----------------------+----------------------+---------------+-----------------------------------------------+
```
:::

:::{tab-item} together
Use `llama-stack-client models list` to check the available models served by Together.

```
$ llama-stack-client models list
+------------------------------+------------------------------+---------------+------------+
| identifier                   | llama_model                  | provider_id   | metadata   |
+==============================+==============================+===============+============+
| Llama3.1-8B-Instruct         | Llama3.1-8B-Instruct         | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-70B-Instruct        | Llama3.1-70B-Instruct        | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-405B-Instruct       | Llama3.1-405B-Instruct       | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-3B-Instruct         | Llama3.2-3B-Instruct         | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-11B-Vision-Instruct | Llama3.2-11B-Vision-Instruct | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-90B-Vision-Instruct | Llama3.2-90B-Vision-Instruct | together0     | {}         |
+------------------------------+------------------------------+---------------+------------+
```
:::

:::{tab-item} fireworks
Use `llama-stack-client models list` to check the available models served by Fireworks.

```
$ llama-stack-client models list
+------------------------------+------------------------------+---------------+------------+
| identifier                   | llama_model                  | provider_id   | metadata   |
+==============================+==============================+===============+============+
| Llama3.1-8B-Instruct         | Llama3.1-8B-Instruct         | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-70B-Instruct        | Llama3.1-70B-Instruct        | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-405B-Instruct       | Llama3.1-405B-Instruct       | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-1B-Instruct         | Llama3.2-1B-Instruct         | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-3B-Instruct         | Llama3.2-3B-Instruct         | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-11B-Vision-Instruct | Llama3.2-11B-Vision-Instruct | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
| Llama3.2-90B-Vision-Instruct | Llama3.2-90B-Vision-Instruct | fireworks0    | {}         |
+------------------------------+------------------------------+---------------+------------+
```
:::

::::

##### Troubleshooting
- If you encounter any issues, search through our [GitHub Issues](https://github.com/meta-llama/llama-stack/issues), or file a new issue.
- Use the `--port <PORT>` flag to use a different port number. For `docker run`, update the `-p <PORT>:<PORT>` flag accordingly, as shown in the example below.
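
For example, to serve on port 5001 instead of the default (a sketch; substitute your own `run.yaml` path and container arguments):

```
llama stack run ./run.yaml --port 5001
# or, when running via Docker, map the matching host port:
docker run -it -p 5001:5001 ... --yaml_config /root/my-run.yaml
```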