Run the script to produce vllm outputs
This commit is contained in:
parent 0218e68849, commit 9bb07ce298
10 changed files with 109 additions and 71 deletions
# Remote vLLM Distribution

The `llamastack/distribution-remote-vllm` distribution consists of the following provider configurations:

| API       | Provider(s)                                              |
|-----------|----------------------------------------------------------|
| agents    | `inline::meta-reference`                                 |
| inference | `remote::vllm`                                           |
| memory    | `inline::faiss`, `remote::chromadb`, `remote::pgvector`  |
| safety    | `inline::llama-guard`                                    |
| telemetry | `inline::meta-reference`                                 |

You can use this distribution if you have GPUs and want to run an independent vLLM server container for running inference.

### Environment Variables

The following environment variables can be configured (a usage sketch follows the list):
- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
- `INFERENCE_MODEL`: Inference model loaded into the vLLM server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `VLLM_URL`: URL of the vLLM server with the main inference model (default: `http://host.docker.internal:5100/v1`)
- `MAX_TOKENS`: Maximum number of tokens for generation (default: `4096`)
- `SAFETY_VLLM_URL`: URL of the vLLM server with the safety model (default: `http://host.docker.internal:5101/v1`)
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
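
As a rough sketch, assuming the compose file and the generated `run.yaml` pick these values up from the environment, you can export them in your shell before starting the stack; the values below are illustrative only:

```bash
# Illustrative overrides of the documented defaults; adjust to your setup.
export LLAMASTACK_PORT=5001
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export VLLM_URL=http://host.docker.internal:5100/v1
export MAX_TOKENS=4096
export SAFETY_VLLM_URL=http://host.docker.internal:5101/v1
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
```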
### Models

The following models are configured by default:

- `${env.INFERENCE_MODEL}`
- `${env.SAFETY_MODEL}`
## Using Docker Compose

You can use `docker compose` to start a vLLM container and a Llama Stack server container together.

> [!NOTE]
> This assumes you have access to a GPU to start the vLLM server.

```bash
cd distributions/remote-vllm; docker compose up
```
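
If you prefer to keep the containers in the background, a common pattern (generic `docker compose` usage, not specific to this distribution) is:

```bash
# Start the vLLM and Llama Stack containers detached, then follow their logs.
cd distributions/remote-vllm
docker compose up -d
docker compose logs -f
```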

## Starting vLLM and Llama Stack separately

You can also start a vLLM server and connect it to Llama Stack manually. There are two ways to do this.

#### Start vLLM server

```bash
docker run --runtime nvidia --gpus all \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.2-3B-Instruct
```
Please check the [vLLM Documentation](https://docs.vllm.ai/en/v0.5.5/serving/deploying_with_docker.html) for more details.
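
Once the vLLM container is up, you can sanity-check its OpenAI-compatible API on the port mapped above (8000 in this example):

```bash
# List the models served by the vLLM OpenAI-compatible endpoint.
curl http://localhost:8000/v1/models
```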

If you are using Conda, you can build and run the Llama Stack server with the following commands:

```bash
cd distributions/remote-vllm
llama stack build --template remote-vllm --image-type conda
llama stack run run.yaml
```
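
If you want the server to listen on the `LLAMASTACK_PORT` described above rather than the default, a small sketch (assuming `llama stack run` accepts a `--port` flag; check `llama stack run --help` for your version):

```bash
# Hypothetical: align the server port with the LLAMASTACK_PORT variable from above.
llama stack run run.yaml --port "${LLAMASTACK_PORT:-5001}"
```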