Run the script to produce vllm outputs

This commit is contained in:
Ashwin Bharambe 2024-11-17 14:09:36 -08:00
parent 0218e68849
commit 9bb07ce298
10 changed files with 109 additions and 71 deletions


@@ -1,20 +1,39 @@
# Remote vLLM Distribution
The `llamastack/distribution-remote-vllm` distribution consists of the following provider configurations.
The `llamastack/distribution-remote-vllm` distribution consists of the following provider configurations:
| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |---------------- |---------------- |------------------------------------ |---------------- |---------------- |
| **Provider(s)** | remote::vllm | meta-reference | remote::pgvector, remote::chromadb | meta-reference | meta-reference |
Provider Configuration
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ API       ┃ Provider(s)                                             ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ agents    │ `inline::meta-reference`                                │
│ inference │ `remote::vllm`                                          │
│ memory    │ `inline::faiss`, `remote::chromadb`, `remote::pgvector` │
│ safety    │ `inline::llama-guard`                                   │
│ telemetry │ `inline::meta-reference`                                │
└───────────┴─────────────────────────────────────────────────────────┘
You can use this distribution if you have GPUs and want to run an independent vLLM server container for running inference.

### Environment Variables
The following environment variables can be configured:
- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
- `INFERENCE_MODEL`: Inference model loaded into the vLLM server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `VLLM_URL`: URL of the vLLM server with the main inference model (default: `http://host.docker.internal:5100/v1`)
- `MAX_TOKENS`: Maximum number of tokens for generation (default: `4096`)
- `SAFETY_VLLM_URL`: URL of the vLLM server with the safety model (default: `http://host.docker.internal:5101/v1`)
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
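Any of these can be overridden before starting the distribution. As a rough sketch (the values below simply restate the defaults; substitute your own model names, ports, and URLs):

```bash
# Sketch: override the defaults listed above before launching the stack.
# The values shown are illustrative; point VLLM_URL / SAFETY_VLLM_URL at
# wherever your vLLM servers are actually reachable.
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export VLLM_URL=http://host.docker.internal:5100/v1
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
export SAFETY_VLLM_URL=http://host.docker.internal:5101/v1
export LLAMASTACK_PORT=5001
```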
### Models
The following models are configured by default:
- `${env.INFERENCE_MODEL}`
- `${env.SAFETY_MODEL}`
## Using Docker Compose
You can use `docker compose` to start a vLLM container and Llama Stack server container together.
> [!NOTE]
> This assumes you have a GPU available for the vLLM server container to use.
```bash
$ cd distributions/remote-vllm; docker compose up
```
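Once the containers are up, you can sanity-check the vLLM side through its OpenAI-compatible API. This is only a sketch; it assumes the compose file publishes vLLM on port 5100, matching the default `VLLM_URL` above.

```bash
# Sketch: list the models served by the vLLM container (port 5100 is an
# assumption based on the default VLLM_URL; adjust if your compose file differs).
curl http://localhost:5100/v1/models
```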
@@ -31,8 +50,7 @@ docker compose down
## Starting vLLM and Llama Stack separately
You may want to start a vLLM server and connect with Llama Stack manually. There are two ways to start a vLLM server and connect with Llama Stack.
You can also start a vLLM server yourself and connect it to Llama Stack manually. There are two ways to do this.
#### Start vLLM server.
@@ -43,7 +61,7 @@ docker run --runtime nvidia --gpus all \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct
--model meta-llama/Llama-3.2-3B-Instruct
```
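If you also want to run the safety (Llama Guard) model, a second vLLM server can be started the same way on a separate port. This is only a sketch mirroring the command above, with the port chosen to match the default `SAFETY_VLLM_URL`:

```bash
# Sketch: a second vLLM container serving the safety model. Port 5101 matches
# the default SAFETY_VLLM_URL above; change the model and port to fit your setup.
docker run --runtime nvidia --gpus all \
    -p 5101:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-Guard-3-1B
```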
Please check the [vLLM Documentation](https://docs.vllm.ai/en/v0.5.5/serving/deploying_with_docker.html) for more details.
@@ -66,7 +84,7 @@ inference:
If you are using Conda, you can build and run the Llama Stack server with the following commands:
```bash
cd distributions/remote-vllm
llama stack build --template remote_vllm --image-type conda
llama stack build --template remote-vllm --image-type conda
llama stack run run.yaml
```
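Since `run.yaml` pulls its settings from the `${env.…}` placeholders described earlier, you can point the Conda-built server at the vLLM container started above. A minimal sketch, assuming that container is published on port 8000 as in the `docker run` example:

```bash
# Sketch: point the stack at the standalone vLLM server started earlier
# (port 8000 per the docker run example above), then launch Llama Stack.
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export VLLM_URL=http://localhost:8000/v1
llama stack run run.yaml
```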