diff --git a/docs/source/getting_started/distributions/self_hosted_distro/index.md b/docs/source/getting_started/distributions/self_hosted_distro/index.md index ed6ab5d7f..502b95cb4 100644 --- a/docs/source/getting_started/distributions/self_hosted_distro/index.md +++ b/docs/source/getting_started/distributions/self_hosted_distro/index.md @@ -23,5 +23,6 @@ tgi dell-tgi together fireworks +remote-vllm bedrock ``` diff --git a/docs/source/getting_started/index.md b/docs/source/getting_started/index.md index 189bd6cb5..6400fb285 100644 --- a/docs/source/getting_started/index.md +++ b/docs/source/getting_started/index.md @@ -53,9 +53,9 @@ Please see our pages in detail for the types of distributions we offer: 3. [On-device Distribution](./distributions/ondevice_distro/index.md): If you want to run Llama Stack inference on your iOS / Android device. -### Quick Start Commands +### Table of Contents -Once you have decided on the inference provider and distribution to use, use the following quick start commands to get started. +Once you have decided on the inference provider and distribution to use, use the following guides to get started. ##### 1.0 Prerequisite @@ -109,421 +109,32 @@ Access to Single-Node CPU with Fireworks hosted endpoint via API_KEY from [firew ##### 1.1. Start the distribution -**(Option 1) Via Docker** -::::{tab-set} - :::{tab-item} meta-reference-gpu -``` -$ cd llama-stack/distributions/meta-reference-gpu && docker compose up -``` - -This will download and start running a pre-built Docker container. Alternatively, you may use the following commands: - -``` -docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml -``` +[Start Meta Reference GPU Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/meta-reference-gpu.html) ::: :::{tab-item} vLLM -``` -$ cd llama-stack/distributions/remote-vllm && docker compose up -``` - -The script will first start up vLLM server on port 8000, then start up Llama Stack distribution server hooking up to it for inference. You should see the following outputs -- -``` - -``` - -To kill the server -``` -docker compose down -``` +[Start vLLM Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/remote-vllm.html) ::: :::{tab-item} tgi -``` -$ cd llama-stack/distributions/tgi && docker compose up -``` - -The script will first start up TGI server, then start up Llama Stack distribution server hooking up to the remote TGI provider for inference. You should see the following outputs -- -``` -[text-generation-inference] | 2024-10-15T18:56:33.810397Z INFO text_generation_router::server: router/src/server.rs:1813: Using config Some(Llama) -[text-generation-inference] | 2024-10-15T18:56:33.810448Z WARN text_generation_router::server: router/src/server.rs:1960: Invalid hostname, defaulting to 0.0.0.0 -[text-generation-inference] | 2024-10-15T18:56:33.864143Z INFO text_generation_router::server: router/src/server.rs:2353: Connected -INFO: Started server process [1] -INFO: Waiting for application startup. -INFO: Application startup complete. 
-INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit) -``` - -To kill the server -``` -docker compose down -``` -::: - - -:::{tab-item} ollama -``` -$ cd llama-stack/distributions/ollama && docker compose up - -# OR - -$ cd llama-stack/distributions/ollama-gpu && docker compose up -``` - -You will see outputs similar to following --- -``` -[ollama] | [GIN] 2024/10/18 - 21:19:41 | 200 | 226.841µs | ::1 | GET "/api/ps" -[ollama] | [GIN] 2024/10/18 - 21:19:42 | 200 | 60.908µs | ::1 | GET "/api/ps" -INFO: Started server process [1] -INFO: Waiting for application startup. -INFO: Application startup complete. -INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit) -[llamastack] | Resolved 12 providers -[llamastack] | inner-inference => ollama0 -[llamastack] | models => __routing_table__ -[llamastack] | inference => __autorouted__ -``` - -To kill the server -``` -docker compose down -``` -::: - -:::{tab-item} fireworks -``` -$ cd llama-stack/distributions/fireworks && docker compose up -``` - -Make sure your `run.yaml` file has the inference provider pointing to the correct Fireworks URL server endpoint. E.g. -``` -inference: - - provider_id: fireworks - provider_type: remote::fireworks - config: - url: https://api.fireworks.ai/inference - api_key: -``` -::: - -:::{tab-item} together -``` -$ cd distributions/together && docker compose up -``` - -Make sure your `run.yaml` file has the inference provider pointing to the correct Together URL server endpoint. E.g. -``` -inference: - - provider_id: together - provider_type: remote::together - config: - url: https://api.together.xyz/v1 - api_key: -``` -::: - - -:::: - -**(Option 2) Via Conda** - -::::{tab-set} - -:::{tab-item} meta-reference-gpu -1. Install the `llama` CLI. See [CLI Reference](https://llama-stack.readthedocs.io/en/latest/cli_reference/index.html) - -2. Build the `meta-reference-gpu` distribution - -``` -$ llama stack build --template meta-reference-gpu --image-type conda -``` - -3. Start running distribution -``` -$ llama stack run ~/.llama/distributions/llamastack-meta-reference-gpu/meta-reference-gpu-run.yaml -``` - -Note: If you wish to use pgvector or chromadb as memory provider. You may need to update generated `run.yaml` file to point to the desired memory provider. See [Memory Providers](https://llama-stack.readthedocs.io/en/latest/api_providers/memory_api.html) for more details. Or comment out the pgvector or chromadb memory provider in `run.yaml` file to use the default inline memory provider, keeping only the following section: -``` -memory: - - provider_id: faiss-0 - provider_type: faiss - config: - kvstore: - namespace: null - type: sqlite - db_path: ~/.llama/runtime/faiss_store.db -``` - -::: - -:::{tab-item} tgi -1. Install the `llama` CLI. See [CLI Reference](https://llama-stack.readthedocs.io/en/latest/cli_reference/index.html) - -2. Build the `tgi` distribution - -```bash -llama stack build --template tgi --image-type conda -``` - -3. Start a TGI server endpoint - -4. Make sure in your `run.yaml` file, your `conda_env` is pointing to the conda environment and inference provider is pointing to the correct TGI server endpoint. E.g. -``` -conda_env: llamastack-tgi -... -inference: - - provider_id: tgi0 - provider_type: remote::tgi - config: - url: http://127.0.0.1:5009 -``` - -5. Start Llama Stack server -```bash -$ llama stack run ~/.llama/distributions/llamastack-tgi/tgi-run.yaml -``` - -Note: If you wish to use pgvector or chromadb as memory provider. 
You may need to update generated `run.yaml` file to point to the desired memory provider. See [Memory Providers](https://llama-stack.readthedocs.io/en/latest/api_providers/memory_api.html) for more details. Or comment out the pgvector or chromadb memory provider in `run.yaml` file to use the default inline memory provider, keeping only the following section: -``` -memory: - - provider_id: faiss-0 - provider_type: faiss - config: - kvstore: - namespace: null - type: sqlite - db_path: ~/.llama/runtime/faiss_store.db -``` +[Start TGI Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/tgi.html) ::: :::{tab-item} ollama - -If you wish to separately spin up a Ollama server, and connect with Llama Stack, you may use the following commands. - -#### Start Ollama server. -- Please check the [Ollama Documentations](https://github.com/ollama/ollama) for more details. - -**Via Docker** -``` -docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama -``` - -**Via CLI** -``` -ollama run -``` - -#### Start Llama Stack server pointing to Ollama server - -Make sure your `run.yaml` file has the inference provider pointing to the correct Ollama endpoint. E.g. -``` -conda_env: llamastack-ollama -... -inference: - - provider_id: ollama0 - provider_type: remote::ollama - config: - url: http://127.0.0.1:11434 -``` - -``` -llama stack build --template ollama --image-type conda -llama stack run ~/.llama/distributions/llamastack-ollama/ollama-run.yaml -``` - -Note: If you wish to use pgvector or chromadb as memory provider. You may need to update generated `run.yaml` file to point to the desired memory provider. See [Memory Providers](https://llama-stack.readthedocs.io/en/latest/api_providers/memory_api.html) for more details. Or comment out the pgvector or chromadb memory provider in `run.yaml` file to use the default inline memory provider, keeping only the following section: -``` -memory: - - provider_id: faiss-0 - provider_type: faiss - config: - kvstore: - namespace: null - type: sqlite - db_path: ~/.llama/runtime/faiss_store.db -``` - -::: - -:::{tab-item} fireworks - -```bash -llama stack build --template fireworks --image-type conda -# -- modify run.yaml to a valid Fireworks server endpoint -llama stack run ./run.yaml -``` - -Make sure your `run.yaml` file has the inference provider pointing to the correct Fireworks URL server endpoint. E.g. -``` -conda_env: llamastack-fireworks -... -inference: - - provider_id: fireworks - provider_type: remote::fireworks - config: - url: https://api.fireworks.ai/inference - api_key: -``` +[Start Ollama Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/ollama.html) ::: :::{tab-item} together - -```bash -llama stack build --template together --image-type conda -# -- modify run.yaml to a valid Together server endpoint -llama stack run ~/.llama/distributions/llamastack-together/together-run.yaml -``` - -Make sure your `run.yaml` file has the inference provider pointing to the correct Together URL server endpoint. E.g. -``` -conda_env: llamastack-together -... -inference: - - provider_id: together - provider_type: remote::together - config: - url: https://api.together.xyz/v1 - api_key: -``` -::: - -:::: - -##### 1.2 (Optional) Update Model Serving Configuration -::::{tab-set} - -:::{tab-item} meta-reference-gpu -You may change the `config.model` in `run.yaml` to update the model currently being served by the distribution. 
Make sure you have the model checkpoint downloaded in your `~/.llama`. -``` -inference: - - provider_id: meta0 - provider_type: inline::meta-reference - config: - model: Llama3.2-11B-Vision-Instruct - quantization: null - torch_seed: null - max_seq_len: 4096 - max_batch_size: 1 -``` - -Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints. -::: - -:::{tab-item} tgi -To serve a new model with `tgi`, change the docker command flag `--model-id `. - -This can be done by edit the `command` args in `compose.yaml`. E.g. Replace "Llama-3.2-1B-Instruct" with the model you want to serve. - -``` -command: ["--dtype", "bfloat16", "--usage-stats", "on", "--sharded", "false", "--model-id", "meta-llama/Llama-3.2-1B-Instruct", "--port", "5009", "--cuda-memory-fraction", "0.3"] -``` - -or by changing the docker run command's `--model-id` flag -``` -docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all ghcr.io/huggingface/text-generation-inference:latest --dtype bfloat16 --usage-stats on --sharded false --model-id meta-llama/Llama-3.2-1B-Instruct --port 5009 -``` - -Make sure your `run.yaml` file has the inference provider pointing to the TGI server endpoint serving your model. -``` -inference: - - provider_id: tgi0 - provider_type: remote::tgi - config: - url: http://127.0.0.1:5009 -``` -``` - -Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints. -::: - -:::{tab-item} ollama -You can use ollama for managing model downloads. - -``` -ollama pull llama3.1:8b-instruct-fp16 -ollama pull llama3.1:70b-instruct-fp16 -``` - -> Please check the [OLLAMA_SUPPORTED_MODELS](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers.remote/inference/ollama/ollama.py) for the supported Ollama models. - - -To serve a new model with `ollama` -``` -ollama run -``` - -To make sure that the model is being served correctly, run `ollama ps` to get a list of models being served by ollama. -``` -$ ollama ps - -NAME ID SIZE PROCESSOR UNTIL -llama3.1:8b-instruct-fp16 4aacac419454 17 GB 100% GPU 4 minutes from now -``` - -To verify that the model served by ollama is correctly connected to Llama Stack server -``` -$ llama-stack-client models list -+----------------------+----------------------+---------------+-----------------------------------------------+ -| identifier | llama_model | provider_id | metadata | -+======================+======================+===============+===============================================+ -| Llama3.1-8B-Instruct | Llama3.1-8B-Instruct | ollama0 | {'ollama_model': 'llama3.1:8b-instruct-fp16'} | -+----------------------+----------------------+---------------+-----------------------------------------------+ -``` -::: - -:::{tab-item} together -Use `llama-stack-client models list` to check the available models served by together. 
-
-```
-$ llama-stack-client models list
-+------------------------------+------------------------------+---------------+------------+
-| identifier | llama_model | provider_id | metadata |
-+==============================+==============================+===============+============+
-| Llama3.1-8B-Instruct | Llama3.1-8B-Instruct | together0 | {} |
-+------------------------------+------------------------------+---------------+------------+
-| Llama3.1-70B-Instruct | Llama3.1-70B-Instruct | together0 | {} |
-+------------------------------+------------------------------+---------------+------------+
-| Llama3.1-405B-Instruct | Llama3.1-405B-Instruct | together0 | {} |
-+------------------------------+------------------------------+---------------+------------+
-| Llama3.2-3B-Instruct | Llama3.2-3B-Instruct | together0 | {} |
-+------------------------------+------------------------------+---------------+------------+
-| Llama3.2-11B-Vision-Instruct | Llama3.2-11B-Vision-Instruct | together0 | {} |
-+------------------------------+------------------------------+---------------+------------+
-| Llama3.2-90B-Vision-Instruct | Llama3.2-90B-Vision-Instruct | together0 | {} |
-+------------------------------+------------------------------+---------------+------------+
-```
+[Start Together Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/together.html)
:::

:::{tab-item} fireworks
-Use `llama-stack-client models list` to check the available models served by Fireworks.
-```
-$ llama-stack-client models list
-+------------------------------+------------------------------+---------------+------------+
-| identifier | llama_model | provider_id | metadata |
-+==============================+==============================+===============+============+
-| Llama3.1-8B-Instruct | Llama3.1-8B-Instruct | fireworks0 | {} |
-+------------------------------+------------------------------+---------------+------------+
-| Llama3.1-70B-Instruct | Llama3.1-70B-Instruct | fireworks0 | {} |
-+------------------------------+------------------------------+---------------+------------+
-| Llama3.1-405B-Instruct | Llama3.1-405B-Instruct | fireworks0 | {} |
-+------------------------------+------------------------------+---------------+------------+
-| Llama3.2-1B-Instruct | Llama3.2-1B-Instruct | fireworks0 | {} |
-+------------------------------+------------------------------+---------------+------------+
-| Llama3.2-3B-Instruct | Llama3.2-3B-Instruct | fireworks0 | {} |
-+------------------------------+------------------------------+---------------+------------+
-| Llama3.2-11B-Vision-Instruct | Llama3.2-11B-Vision-Instruct | fireworks0 | {} |
-+------------------------------+------------------------------+---------------+------------+
-| Llama3.2-90B-Vision-Instruct | Llama3.2-90B-Vision-Instruct | fireworks0 | {} |
-+------------------------------+------------------------------+---------------+------------+
-```
+[Start Fireworks Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/fireworks.html)
:::

::::

-
##### Troubleshooting
- If you encounter any issues, search through our [GitHub Issues](https://github.com/meta-llama/llama-stack/issues), or file a new issue.
- Use the `--port <PORT>` flag to use a different port number. For docker run, update the `-p <PORT>:<PORT>` flag accordingly (see the example below).
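+
+For example, to serve on port 5001 instead of the default 5000 (an illustrative sketch: the run config path and distribution image below are placeholders, so substitute the ones for your distribution):
+```
+# Conda: pass --port to `llama stack run`
+llama stack run ./run.yaml --port 5001
+
+# Docker: pass --port to the server and publish the matching host port
+# (image name and mounted run.yaml are examples only)
+docker run -it -p 5001:5001 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml distribution-meta-reference-gpu --yaml_config /root/my-run.yaml --port 5001
+```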