[docs] update documentations (#356)

* move docs -> source * Add files via upload * mv image * Add files via upload * colocate iOS setup doc * delete image * Add files via upload * fix * delete image * Add files via upload * Update developer_cookbook.md * toctree * wip subfolder * docs update * subfolder * updates * name * updates * index * updates * refactor structure * depth * docs * content * docs * getting started * distributions * fireworks * fireworks * update * theme * theme * theme * pdj theme * pytorch theme * css * theme * agents example * format * index * headers * copy button * test tabs * test tabs * fix * tabs * tab * tabs * sphinx_design * quick start commands * size * width * css * css * download models * asthetic fix * tab format * update * css * width * css * docs * tab based * tab * tabs * docs * style * image * css * color * typo * update docs * missing links * list templates * links * links update * troubleshooting * fix * distributions * docs * fix table * kill llamastack-local-gpu/cpu * Update index.md * Update index.md * mv ios_setup.md * Update ios_setup.md * Add remote_or_local.gif * Update ios_setup.md * release notes * typos * Add ios_setup to index * nav bar * hide torctree * ios image * links update * rename * rename * docs * rename * links * distributions * distributions * distributions * distributions * remove release * remote --------- Co-authored-by: dltn <6599399+dltn@users.noreply.github.com> Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2024-11-04 16:52:38 -08:00 · 2024-11-04 16:52:38 -08:00 · c810a4184d
commit c810a4184d
parent ac93dd89cf
37 changed files with 1777 additions and 2154 deletions
--- a/docs/source/getting_started/distributions/self_hosted_distro/dell-tgi.md
+++ b/docs/source/getting_started/distributions/self_hosted_distro/dell-tgi.md
@ -0,0 +1,68 @@
+# Dell-TGI Distribution
+
+The `llamastack/distribution-tgi` distribution consists of the following provider configurations.
+
+
+| **API**         	| **Inference** 	| **Agents**     	| **Memory**                                       	| **Safety**     	| **Telemetry**  	|
+|-----------------	|---------------	|----------------	|--------------------------------------------------	|----------------	|----------------	|
+| **Provider(s)** 	| remote::tgi   	| meta-reference 	| meta-reference, remote::pgvector, remote::chroma 	| meta-reference 	| meta-reference 	|
+
+
+The only difference vs. the `tgi` distribution is that it runs the Dell-TGI server for inference.
+
+
+### Start the Distribution (Single Node GPU)
+
+> [!NOTE]
+> This assumes you have access to GPU to start a TGI server with access to your GPU.
+
+```
+$ cd distributions/dell-tgi/
+$ ls
+compose.yaml  README.md  run.yaml
+$ docker compose up
+```
+
+The script will first start up TGI server, then start up Llama Stack distribution server hooking up to the remote TGI provider for inference. You should be able to see the following outputs --
+```
+[text-generation-inference] | 2024-10-15T18:56:33.810397Z  INFO text_generation_router::server: router/src/server.rs:1813: Using config Some(Llama)
+[text-generation-inference] | 2024-10-15T18:56:33.810448Z  WARN text_generation_router::server: router/src/server.rs:1960: Invalid hostname, defaulting to 0.0.0.0
+[text-generation-inference] | 2024-10-15T18:56:33.864143Z  INFO text_generation_router::server: router/src/server.rs:2353: Connected
+INFO:     Started server process [1]
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
+```
+
+To kill the server
+```
+docker compose down
+```
+
+### (Alternative) Dell-TGI server + llama stack run (Single Node GPU)
+
+#### Start Dell-TGI server locally
+```
+docker run -it --shm-size 1g -p 80:80 --gpus 4 \
+-e NUM_SHARD=4
+-e MAX_BATCH_PREFILL_TOKENS=32768 \
+-e MAX_INPUT_TOKENS=8000 \
+-e MAX_TOTAL_TOKENS=8192 \
+registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct
+```
+
+
+#### Start Llama Stack server pointing to TGI server
+
+```
+docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-tgi --yaml_config /root/my-run.yaml
+```
+
+Make sure in you `run.yaml` file, you inference provider is pointing to the correct TGI server endpoint. E.g.
+```
+inference:
+  - provider_id: tgi0
+    provider_type: remote::tgi
+    config:
+      url: http://127.0.0.1:5009
+```
--- a/docs/source/getting_started/distributions/self_hosted_distro/index.md
+++ b/docs/source/getting_started/distributions/self_hosted_distro/index.md
@ -0,0 +1,20 @@
+# Self-Hosted Distribution
+
+We offer deployable distributions where you can host your own Llama Stack server using local inference.
+
+| **Distribution** 	|           **Llama Stack Docker**           	| Start This Distribution 	|    **Inference**   	|     **Agents**     	|     **Memory**     	|     **Safety**     	|    **Telemetry**   	|
+|:----------------:	|:------------------------------------------:	|:-----------------------:	|:------------------:	|:------------------:	|:------------------:	|:------------------:	|:------------------:	|
+|  Meta Reference  	| [llamastack/distribution-meta-reference-gpu](https://hub.docker.com/repository/docker/llamastack/distribution-meta-reference-gpu/general) 	|       [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/meta-reference-gpu.html)       	| meta-reference 	| meta-reference 	| meta-reference; remote::pgvector; remote::chromadb	| meta-reference 	| meta-reference	|
+|  Meta Reference Quantized  	| [llamastack/distribution-meta-reference-quantized-gpu](https://hub.docker.com/repository/docker/llamastack/distribution-meta-reference-quantized-gpu/general) 	|       [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/meta-reference-quantized-gpu.html)       	| meta-reference-quantized 	| meta-reference 	| meta-reference; remote::pgvector; remote::chromadb	| meta-reference 	| meta-reference	|
+|      Ollama      	|       [llamastack/distribution-ollama](https://hub.docker.com/repository/docker/llamastack/distribution-ollama/general)       	|       [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/ollama.html)       	| remote::ollama	| meta-reference 	| remote::pgvector; remote::chromadb 	|  meta-reference 	| meta-reference 	|
+|        TGI       	|         [llamastack/distribution-tgi](https://hub.docker.com/repository/docker/llamastack/distribution-tgi/general)        	|       [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/tgi.html)       	| remote::tgi	| meta-reference 	| meta-reference; remote::pgvector; remote::chromadb 	| meta-reference 	| meta-reference 	|
+
+```{toctree}
+:maxdepth: 1
+
+meta-reference-gpu
+meta-reference-quantized-gpu
+ollama
+tgi
+dell-tgi
+```
--- a/docs/source/getting_started/distributions/self_hosted_distro/meta-reference-gpu.md
+++ b/docs/source/getting_started/distributions/self_hosted_distro/meta-reference-gpu.md
@ -0,0 +1,71 @@
+# Meta Reference Distribution
+
+The `llamastack/distribution-meta-reference-gpu` distribution consists of the following provider configurations.
+
+
+| **API**         	| **Inference** 	| **Agents**     	| **Memory**                                       	| **Safety**     	| **Telemetry**  	|
+|-----------------	|---------------	|----------------	|--------------------------------------------------	|----------------	|----------------	|
+| **Provider(s)** 	| meta-reference  	| meta-reference 	| meta-reference, remote::pgvector, remote::chroma 	| meta-reference 	| meta-reference 	|
+
+
+### Step 0. Prerequisite - Downloading Models
+Please make sure you have llama model checkpoints downloaded in `~/.llama` before proceeding. See [installation guide](https://llama-stack.readthedocs.io/en/latest/cli_reference/download_models.html) here to download the models.
+
+```
+$ ls ~/.llama/checkpoints
+Llama3.1-8B           Llama3.2-11B-Vision-Instruct  Llama3.2-1B-Instruct  Llama3.2-90B-Vision-Instruct  Llama-Guard-3-8B
+Llama3.1-8B-Instruct  Llama3.2-1B                   Llama3.2-3B-Instruct  Llama-Guard-3-1B              Prompt-Guard-86M
+```
+
+### Step 1. Start the Distribution
+
+#### (Option 1) Start with Docker
+```
+$ cd distributions/meta-reference-gpu && docker compose up
+```
+
+> [!NOTE]
+> This assumes you have access to GPU to start a local server with access to your GPU.
+
+
+> [!NOTE]
+> `~/.llama` should be the path containing downloaded weights of Llama models.
+
+
+This will download and start running a pre-built docker container. Alternatively, you may use the following commands:
+
+```
+docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml
+```
+
+#### (Option 2) Start with Conda
+
+1. Install the `llama` CLI. See [CLI Reference](https://llama-stack.readthedocs.io/en/latest/cli_reference/index.html)
+
+2. Build the `meta-reference-gpu` distribution
+
+```
+$ llama stack build --template meta-reference-gpu --image-type conda
+```
+
+3. Start running distribution
+```
+$ cd distributions/meta-reference-gpu
+$ llama stack run ./run.yaml
+```
+
+### (Optional) Serving a new model
+You may change the `config.model` in `run.yaml` to update the model currently being served by the distribution. Make sure you have the model checkpoint downloaded in your `~/.llama`.
+```
+inference:
+  - provider_id: meta0
+    provider_type: meta-reference
+    config:
+      model: Llama3.2-11B-Vision-Instruct
+      quantization: null
+      torch_seed: null
+      max_seq_len: 4096
+      max_batch_size: 1
+```
+
+Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
--- a/docs/source/getting_started/distributions/self_hosted_distro/meta-reference-quantized-gpu.md
+++ b/docs/source/getting_started/distributions/self_hosted_distro/meta-reference-quantized-gpu.md
@ -0,0 +1,34 @@
+# Meta Reference Quantized Distribution
+
+The `llamastack/distribution-meta-reference-quantized-gpu` distribution consists of the following provider configurations.
+
+
+| **API**         	| **Inference**            	| **Agents**     	| **Memory**                                       	| **Safety**     	| **Telemetry**  	|
+|-----------------	|------------------------  	|----------------	|--------------------------------------------------	|----------------	|----------------	|
+| **Provider(s)** 	| meta-reference-quantized  | meta-reference 	| meta-reference, remote::pgvector, remote::chroma 	| meta-reference 	| meta-reference 	|
+
+The only difference vs. the `meta-reference-gpu` distribution is that it has support for more efficient inference -- with fp8, int4 quantization, etc.
+
+### Start the Distribution (Single Node GPU)
+
+> [!NOTE]
+> This assumes you have access to GPU to start a local server with access to your GPU.
+
+
+> [!NOTE]
+> `~/.llama` should be the path containing downloaded weights of Llama models.
+
+
+To download and start running a pre-built docker container, you may use the following commands:
+
+```
+docker run -it -p 5000:5000 -v ~/.llama:/root/.llama \
+  -v ./run.yaml:/root/my-run.yaml \
+  --gpus=all \
+  distribution-meta-reference-quantized-gpu \
+  --yaml_config /root/my-run.yaml
+```
+
+### Alternative (Build and start distribution locally via conda)
+
+- You may checkout the [Getting Started](../../docs/getting_started.md) for more details on building locally via conda and starting up the distribution.
--- a/docs/source/getting_started/distributions/self_hosted_distro/ollama.md
+++ b/docs/source/getting_started/distributions/self_hosted_distro/ollama.md
@ -0,0 +1,129 @@
+# Ollama Distribution
+
+The `llamastack/distribution-ollama` distribution consists of the following provider configurations.
+
+| **API**         	| **Inference**  	| **Agents**     	| **Memory**                       	| **Safety**     	| **Telemetry**  	|
+|-----------------	|----------------	|----------------	|----------------------------------	|----------------	|----------------	|
+| **Provider(s)** 	| remote::ollama 	| meta-reference 	| remote::pgvector, remote::chroma 	| remote::ollama 	| meta-reference 	|
+
+
+### Docker: Start a Distribution (Single Node GPU)
+
+> [!NOTE]
+> This assumes you have access to GPU to start a Ollama server with access to your GPU.
+
+```
+$ cd distributions/ollama/gpu
+$ ls
+compose.yaml  run.yaml
+$ docker compose up
+```
+
+You will see outputs similar to following ---
+```
+[ollama]               | [GIN] 2024/10/18 - 21:19:41 | 200 |     226.841µs |             ::1 | GET      "/api/ps"
+[ollama]               | [GIN] 2024/10/18 - 21:19:42 | 200 |      60.908µs |             ::1 | GET      "/api/ps"
+INFO:     Started server process [1]
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
+[llamastack] | Resolved 12 providers
+[llamastack] |  inner-inference => ollama0
+[llamastack] |  models => __routing_table__
+[llamastack] |  inference => __autorouted__
+```
+
+To kill the server
+```
+docker compose down
+```
+
+### Docker: Start the Distribution (Single Node CPU)
+
+> [!NOTE]
+> This will start an ollama server with CPU only, please see [Ollama Documentations](https://github.com/ollama/ollama) for serving models on CPU only.
+
+```
+$ cd distributions/ollama/cpu
+$ ls
+compose.yaml  run.yaml
+$ docker compose up
+```
+
+### Conda: ollama run + llama stack run
+
+If you wish to separately spin up a Ollama server, and connect with Llama Stack, you may use the following commands.
+
+#### Start Ollama server.
+- Please check the [Ollama Documentations](https://github.com/ollama/ollama) for more details.
+
+**Via Docker**
+```
+docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
+```
+
+**Via CLI**
+```
+ollama run <model_id>
+```
+
+#### Start Llama Stack server pointing to Ollama server
+
+**Via Conda**
+
+```
+llama stack build --template ollama --image-type conda
+llama stack run ./gpu/run.yaml
+```
+
+**Via Docker**
+```
+docker run --network host -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./gpu/run.yaml:/root/llamastack-run-ollama.yaml --gpus=all llamastack/distribution-ollama --yaml_config /root/llamastack-run-ollama.yaml
+```
+
+Make sure in your `run.yaml` file, your inference provider is pointing to the correct Ollama endpoint. E.g.
+```
+inference:
+  - provider_id: ollama0
+    provider_type: remote::ollama
+    config:
+      url: http://127.0.0.1:14343
+```
+
+### (Optional) Update Model Serving Configuration
+
+#### Downloading model via Ollama
+
+You can use ollama for managing model downloads.
+
+```
+ollama pull llama3.1:8b-instruct-fp16
+ollama pull llama3.1:70b-instruct-fp16
+```
+
+> [!NOTE]
+> Please check the [OLLAMA_SUPPORTED_MODELS](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/adapters/inference/ollama/ollama.py) for the supported Ollama models.
+
+
+To serve a new model with `ollama`
+```
+ollama run <model_name>
+```
+
+To make sure that the model is being served correctly, run `ollama ps` to get a list of models being served by ollama.
+```
+$ ollama ps
+
+NAME                         ID              SIZE     PROCESSOR    UNTIL
+llama3.1:8b-instruct-fp16    4aacac419454    17 GB    100% GPU     4 minutes from now
+```
+
+To verify that the model served by ollama is correctly connected to Llama Stack server
+```
+$ llama-stack-client models list
+----------------------+----------------------+---------------+-----------------------------------------------+
+| identifier           | llama_model          | provider_id   | metadata                                      |
+======================+======================+===============+===============================================+
+| Llama3.1-8B-Instruct | Llama3.1-8B-Instruct | ollama0       | {'ollama_model': 'llama3.1:8b-instruct-fp16'} |
+----------------------+----------------------+---------------+-----------------------------------------------+
+```
--- a/docs/source/getting_started/distributions/self_hosted_distro/tgi.md
+++ b/docs/source/getting_started/distributions/self_hosted_distro/tgi.md
@ -0,0 +1,112 @@
+# TGI Distribution
+
+The `llamastack/distribution-tgi` distribution consists of the following provider configurations.
+
+
+| **API**         	| **Inference** 	| **Agents**     	| **Memory**                                       	| **Safety**     	| **Telemetry**  	|
+|-----------------	|---------------	|----------------	|--------------------------------------------------	|----------------	|----------------	|
+| **Provider(s)** 	| remote::tgi   	| meta-reference 	| meta-reference, remote::pgvector, remote::chroma 	| meta-reference 	| meta-reference 	|
+
+
+### Docker: Start the Distribution (Single Node GPU)
+
+> [!NOTE]
+> This assumes you have access to GPU to start a TGI server with access to your GPU.
+
+
+```
+$ cd distributions/tgi/gpu && docker compose up
+```
+
+The script will first start up TGI server, then start up Llama Stack distribution server hooking up to the remote TGI provider for inference. You should be able to see the following outputs --
+```
+[text-generation-inference] | 2024-10-15T18:56:33.810397Z  INFO text_generation_router::server: router/src/server.rs:1813: Using config Some(Llama)
+[text-generation-inference] | 2024-10-15T18:56:33.810448Z  WARN text_generation_router::server: router/src/server.rs:1960: Invalid hostname, defaulting to 0.0.0.0
+[text-generation-inference] | 2024-10-15T18:56:33.864143Z  INFO text_generation_router::server: router/src/server.rs:2353: Connected
+INFO:     Started server process [1]
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
+```
+
+To kill the server
+```
+docker compose down
+```
+
+### Docker: Start the Distribution (Single Node CPU)
+
+> [!NOTE]
+> This assumes you have an hosted endpoint compatible with TGI server.
+
+```
+$ cd distributions/tgi/cpu && docker compose up
+```
+
+Replace <ENTER_YOUR_TGI_HOSTED_ENDPOINT> in `run.yaml` file with your TGI endpoint.
+```
+inference:
+  - provider_id: tgi0
+    provider_type: remote::tgi
+    config:
+      url: <ENTER_YOUR_TGI_HOSTED_ENDPOINT>
+```
+
+### Conda: TGI server + llama stack run
+
+If you wish to separately spin up a TGI server, and connect with Llama Stack, you may use the following commands.
+
+#### Start TGI server locally
+- Please check the [TGI Getting Started Guide](https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#get-started) to get a TGI endpoint.
+
+```
+docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all ghcr.io/huggingface/text-generation-inference:latest --dtype bfloat16 --usage-stats on --sharded false --model-id meta-llama/Llama-3.1-8B-Instruct --port 5009
+```
+
+#### Start Llama Stack server pointing to TGI server
+
+**Via Conda**
+
+```bash
+llama stack build --template tgi --image-type conda
+# -- start a TGI server endpoint
+llama stack run ./gpu/run.yaml
+```
+
+**Via Docker**
+```
+docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-tgi --yaml_config /root/my-run.yaml
+```
+
+Make sure in you `run.yaml` file, you inference provider is pointing to the correct TGI server endpoint. E.g.
+```
+inference:
+  - provider_id: tgi0
+    provider_type: remote::tgi
+    config:
+      url: http://127.0.0.1:5009
+```
+
+
+### (Optional) Update Model Serving Configuration
+To serve a new model with `tgi`, change the docker command flag `--model-id <model-to-serve>`.
+
+This can be done by edit the `command` args in `compose.yaml`. E.g. Replace "Llama-3.2-1B-Instruct" with the model you want to serve.
+
+```
+command: ["--dtype", "bfloat16", "--usage-stats", "on", "--sharded", "false", "--model-id", "meta-llama/Llama-3.2-1B-Instruct", "--port", "5009", "--cuda-memory-fraction", "0.3"]
+```
+
+or by changing the docker run command's `--model-id` flag
+```
+docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all ghcr.io/huggingface/text-generation-inference:latest --dtype bfloat16 --usage-stats on --sharded false --model-id meta-llama/Llama-3.2-1B-Instruct --port 5009
+```
+
+In `run.yaml`, make sure you point the correct server endpoint to the TGI server endpoint serving your model.
+```
+inference:
+  - provider_id: tgi0
+    provider_type: remote::tgi
+    config:
+      url: http://127.0.0.1:5009
+```