---
orphan: true
---

<!-- This file was auto-generated by distro_codegen.py, please edit source -->

# Remote vLLM Distribution

```{toctree}
:maxdepth: 2
:hidden:

self
```

The `llamastack/distribution-remote-vllm` distribution consists of the following provider configurations:

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `remote::vllm`, `inline::sentence-transformers` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol`, `remote::wolfram-alpha` |
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |

You can use this distribution if you want to run an independent vLLM server for inference.
### Environment Variables

The following environment variables can be configured:

- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
- `INFERENCE_MODEL`: Inference model loaded into the vLLM server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `VLLM_URL`: URL of the vLLM server with the main inference model (default: `http://host.docker.internal:5100/v1`)
- `MAX_TOKENS`: Maximum number of tokens for generation (default: `4096`)
- `SAFETY_VLLM_URL`: URL of the vLLM server with the safety model (default: `http://host.docker.internal:5101/v1`)
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
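
For example, to point the distribution at vLLM servers running on other hosts, you can override the URL defaults before starting the Stack (a minimal sketch; the hostnames below are illustrative, not defaults). The `docker run` and `llama stack run` commands later in this page show how these values are passed to the server via `--env`.

```bash
# Illustrative overrides -- replace the hosts/ports with your own vLLM endpoints
export VLLM_URL=http://gpu-node-1.internal:8000/v1
export SAFETY_VLLM_URL=http://gpu-node-2.internal:8000/v1
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
```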
## Setting up vLLM server

In the following sections, we'll use AMD, NVIDIA or Intel GPUs as hardware accelerators for the vLLM server, which acts as both the LLM inference provider and the safety provider. Note that vLLM also [supports many other hardware accelerators](https://docs.vllm.ai/en/latest/getting_started/installation.html) and that we only use GPUs here for demonstration purposes. If you run into issues, you can add `--env VLLM_DEBUG_LOG_API_SERVER_RESPONSE=true` (available in vLLM v0.8.3 and above) to the `docker run` command to enable logging of API server responses for debugging.
### Setting up vLLM server on AMD GPU

AMD provides two main vLLM container options:

- rocm/vllm: Production-ready container
- rocm/vllm-dev: Development container with the latest vLLM features

Please check the [Blog about ROCm vLLM Usage](https://rocm.blogs.amd.com/software-tools-optimization/vllm-container/README.html) for more details.

Here is a sample script to start a ROCm vLLM server locally via Docker:

```bash
export INFERENCE_PORT=8000
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export CUDA_VISIBLE_DEVICES=0
export VLLM_DIMG="rocm/vllm-dev:main"

docker run \
    --pull always \
    --ipc=host \
    --privileged \
    --shm-size 16g \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --cap-add=CAP_SYS_ADMIN \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    --env "HIP_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" \
    -p $INFERENCE_PORT:$INFERENCE_PORT \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    $VLLM_DIMG \
    python -m vllm.entrypoints.openai.api_server \
    --model $INFERENCE_MODEL \
    --port $INFERENCE_PORT
```
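
Once the container is up, a quick way to confirm the endpoint is serving (assuming it is reachable on `localhost`) is to query vLLM's OpenAI-compatible model listing route:

```bash
# Should return a JSON list that includes $INFERENCE_MODEL
curl http://localhost:$INFERENCE_PORT/v1/models
```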
Note that you'll also need to set `--enable-auto-tool-choice` and `--tool-call-parser` to [enable tool calling in vLLM](https://docs.vllm.ai/en/latest/features/tool_calling.html).
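
Concretely, the server command (the part after the image name in the `docker run` above) would gain two extra flags, roughly like this (a sketch only; the parser name is model-dependent, so check the linked vLLM docs for the value matching your model):

```bash
# The same vLLM server command as above, with tool calling enabled.
# NOTE: `llama3_json` is the parser vLLM's docs suggest for Llama 3.x models;
# this is an assumption here -- pick the parser that matches your model.
python -m vllm.entrypoints.openai.api_server \
    --model $INFERENCE_MODEL \
    --port $INFERENCE_PORT \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json
```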
If you are using Llama Stack Safety / Shield APIs, then you will need to also run another instance of vLLM with a corresponding safety model like `meta-llama/Llama-Guard-3-1B` using a script like:

```bash
export SAFETY_PORT=8081
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
export CUDA_VISIBLE_DEVICES=1
export VLLM_DIMG="rocm/vllm-dev:main"

docker run \
    --pull always \
    --ipc=host \
    --privileged \
    --shm-size 16g \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --cap-add=CAP_SYS_ADMIN \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    --env "HIP_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" \
    -p $SAFETY_PORT:$SAFETY_PORT \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    $VLLM_DIMG \
    python -m vllm.entrypoints.openai.api_server \
    --model $SAFETY_MODEL \
    --port $SAFETY_PORT
```
### Setting up vLLM server on NVIDIA GPU

Please check the [vLLM Documentation](https://docs.vllm.ai/en/v0.5.5/serving/deploying_with_docker.html) to get a vLLM endpoint. Here is a sample script to start a vLLM server locally via Docker:

```bash
export INFERENCE_PORT=8000
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export CUDA_VISIBLE_DEVICES=0

docker run \
    --pull always \
    --runtime nvidia \
    --gpus $CUDA_VISIBLE_DEVICES \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    -p $INFERENCE_PORT:$INFERENCE_PORT \
    --ipc=host \
    vllm/vllm-openai:latest \
    --gpu-memory-utilization 0.7 \
    --model $INFERENCE_MODEL \
    --port $INFERENCE_PORT
```
Note that you'll also need to set `--enable-auto-tool-choice` and `--tool-call-parser` to [enable tool calling in vLLM](https://docs.vllm.ai/en/latest/features/tool_calling.html), as shown in the AMD example above.

If you are using Llama Stack Safety / Shield APIs, then you will need to also run another instance of vLLM with a corresponding safety model like `meta-llama/Llama-Guard-3-1B` using a script like:

```bash
export SAFETY_PORT=8081
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
export CUDA_VISIBLE_DEVICES=1

docker run \
    --pull always \
    --runtime nvidia \
    --gpus $CUDA_VISIBLE_DEVICES \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    -p $SAFETY_PORT:$SAFETY_PORT \
    --ipc=host \
    vllm/vllm-openai:latest \
    --gpu-memory-utilization 0.7 \
    --model $SAFETY_MODEL \
    --port $SAFETY_PORT
```
### Setting up vLLM server on Intel GPU

Refer to the [vLLM Documentation for XPU](https://docs.vllm.ai/en/v0.8.2/getting_started/installation/gpu.html?device=xpu) to get a vLLM endpoint. The vLLM documentation guides you through installing vLLM from source or building your own vLLM Docker container; in addition, Intel provides a prebuilt vLLM container for systems with Intel GPUs supported by the PyTorch XPU backend:

- [intel/vllm](https://hub.docker.com/r/intel/vllm)

Here is a sample script to start a vLLM server locally via Docker using the Intel-provided container:

```bash
export INFERENCE_PORT=8000
export INFERENCE_MODEL=meta-llama/Llama-3.2-1B-Instruct
export ZE_AFFINITY_MASK=0

docker run \
    --pull always \
    --device /dev/dri \
    -v /dev/dri/by-path:/dev/dri/by-path \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    --env ZE_AFFINITY_MASK=$ZE_AFFINITY_MASK \
    -p $INFERENCE_PORT:$INFERENCE_PORT \
    --ipc=host \
    intel/vllm:xpu \
    --gpu-memory-utilization 0.7 \
    --model $INFERENCE_MODEL \
    --port $INFERENCE_PORT
```
If you are using Llama Stack Safety / Shield APIs, then you will need to also run another instance of vLLM with a corresponding safety model like `meta-llama/Llama-Guard-3-1B` using a script like:

```bash
export SAFETY_PORT=8081
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
export ZE_AFFINITY_MASK=1

docker run \
    --pull always \
    --device /dev/dri \
    -v /dev/dri/by-path:/dev/dri/by-path \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    --env ZE_AFFINITY_MASK=$ZE_AFFINITY_MASK \
    -p $SAFETY_PORT:$SAFETY_PORT \
    --ipc=host \
    intel/vllm:xpu \
    --gpu-memory-utilization 0.7 \
    --model $SAFETY_MODEL \
    --port $SAFETY_PORT
```
## Running Llama Stack

Now you are ready to run Llama Stack with vLLM as the inference provider. You can do this via Conda (building the code yourself) or via Docker, which has a pre-built image.

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
export INFERENCE_PORT=8000
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export LLAMA_STACK_PORT=8321

# You need a local checkout of llama-stack to run this, get it using
# git clone https://github.com/meta-llama/llama-stack.git
cd /path/to/llama-stack

docker run \
    --pull always \
    -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
    -v ./llama_stack/templates/remote-vllm/run.yaml:/root/my-run.yaml \
    llamastack/distribution-remote-vllm \
    --yaml-config /root/my-run.yaml \
    --port $LLAMA_STACK_PORT \
    --env INFERENCE_MODEL=$INFERENCE_MODEL \
    --env VLLM_URL=http://host.docker.internal:$INFERENCE_PORT/v1
```
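
Once the container is running, you can sanity-check the Stack with the `llama-stack-client` CLI (a sketch, assuming you have installed the client on the host, e.g. via `uv pip install llama-stack-client`):

```bash
# Point the client at the local Stack and list the models it has registered
llama-stack-client configure --endpoint http://localhost:$LLAMA_STACK_PORT
llama-stack-client models list
```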
If you are using Llama Stack Safety / Shield APIs, use:

```bash
export SAFETY_PORT=8081
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B

# You need a local checkout of llama-stack to run this, get it using
# git clone https://github.com/meta-llama/llama-stack.git
cd /path/to/llama-stack

docker run \
    --pull always \
    -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
    -v ~/.llama:/root/.llama \
    -v ./llama_stack/templates/remote-vllm/run-with-safety.yaml:/root/my-run.yaml \
    llamastack/distribution-remote-vllm \
    --yaml-config /root/my-run.yaml \
    --port $LLAMA_STACK_PORT \
    --env INFERENCE_MODEL=$INFERENCE_MODEL \
    --env VLLM_URL=http://host.docker.internal:$INFERENCE_PORT/v1 \
    --env SAFETY_MODEL=$SAFETY_MODEL \
    --env SAFETY_VLLM_URL=http://host.docker.internal:$SAFETY_PORT/v1
```
### Via Conda

Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.

```bash
export INFERENCE_PORT=8000
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export LLAMA_STACK_PORT=8321

cd distributions/remote-vllm
llama stack build --template remote-vllm --image-type conda

llama stack run ./run.yaml \
    --port $LLAMA_STACK_PORT \
    --env INFERENCE_MODEL=$INFERENCE_MODEL \
    --env VLLM_URL=http://localhost:$INFERENCE_PORT/v1
```
If you are using Llama Stack Safety / Shield APIs, use:

```bash
export SAFETY_PORT=8081
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B

llama stack run ./run-with-safety.yaml \
    --port $LLAMA_STACK_PORT \
    --env INFERENCE_MODEL=$INFERENCE_MODEL \
    --env VLLM_URL=http://localhost:$INFERENCE_PORT/v1 \
    --env SAFETY_MODEL=$SAFETY_MODEL \
    --env SAFETY_VLLM_URL=http://localhost:$SAFETY_PORT/v1
```
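
As a final check, you can route a test prompt through the Stack with the client CLI (a sketch; this assumes `llama-stack-client` is installed and configured to point at your server as shown above, and the exact flags may vary between client versions):

```bash
# Send a simple chat completion through the remote vLLM inference provider
llama-stack-client inference chat-completion \
    --message "Hello, which model are you?"
```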