# What does this PR do?

Automatically generates:
- build.yaml
- run.yaml
- run-with-safety.yaml
- parts of markdown docs for the distributions

## Test Plan

At this point, this only updates the YAMLs and the docs. Some testing (especially with ollama and vllm) has been performed, but much more testing is needed.
Remote vLLM Distribution
The llamastack/distribution-remote-vllm distribution consists of the following provider configurations:
| API | Provider(s) |
|---|---|
| agents | inline::meta-reference |
| inference | remote::vllm |
| memory | inline::faiss, remote::chromadb, remote::pgvector |
| safety | inline::llama-guard |
| telemetry | inline::meta-reference |
You can use this distribution if you have GPUs and want to run an independent vLLM server container for running inference.
Setting up vLLM server
Please check the vLLM Documentation to get a vLLM endpoint. Here is a sample script to start a vLLM server locally via Docker:
export INFERENCE_PORT=8000
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export CUDA_VISIBLE_DEVICES=0
docker run \
--runtime nvidia \
--gpus "device=$CUDA_VISIBLE_DEVICES" \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
-p $INFERENCE_PORT:$INFERENCE_PORT \
--ipc=host \
vllm/vllm-openai:latest \
--model $INFERENCE_MODEL \
--port $INFERENCE_PORT
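Model loading can take a while, so it may help to wait until the server answers before moving on. The following is a minimal sketch, assuming the container is reachable on localhost at the published port and that `curl` is installed. `vllm_models_url` and `wait_for_vllm` are hypothetical helper names, not part of vLLM or Llama Stack; `/v1/models` is the model-listing route of vLLM's OpenAI-compatible API.

```shell
# Build the model-listing URL of the OpenAI-compatible API served by vLLM.
# Assumption: the server is reachable on localhost at the published port.
vllm_models_url() {
  printf 'http://localhost:%s/v1/models\n' "${1:-8000}"
}

# Poll the endpoint until it answers or the attempts run out.
# Hypothetical helper: $1 is the port, $2 the number of attempts (default 30).
wait_for_vllm() {
  url="$(vllm_models_url "$1")"
  attempts="${2:-30}"
  while [ "$attempts" -gt 0 ]; do
    # -s silences progress output, -f turns HTTP errors into a non-zero exit.
    if curl -sf "$url" > /dev/null; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 2
  done
  return 1
}
```

For example, `wait_for_vllm "$INFERENCE_PORT" 60 && echo "vLLM is up"` polls for roughly two minutes before giving up.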
If you are using the Llama Stack Safety / Shield APIs, you will also need to run a second vLLM instance with a corresponding safety model, such as meta-llama/Llama-Guard-3-1B, using a script like:
export SAFETY_PORT=8081
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
export CUDA_VISIBLE_DEVICES=1
docker run \
--runtime nvidia \
--gpus "device=$CUDA_VISIBLE_DEVICES" \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
-p $SAFETY_PORT:$SAFETY_PORT \
--ipc=host \
vllm/vllm-openai:latest \
--model $SAFETY_MODEL \
--port $SAFETY_PORT
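Both scripts read `$HF_TOKEN`, which is not exported anywhere in this document, so an unset variable would pass an empty token to the container. A small pre-flight check can catch this before `docker run`. This is only a sketch: `require_vars` is a hypothetical helper, and the list of variables to check is an assumption based on the scripts above.

```shell
# Return non-zero and report on stderr if any named variable is unset or empty.
# Hypothetical helper; pass variable names (without the leading $) as arguments.
require_vars() {
  missing=""
  for name in "$@"; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing variables:$missing" >&2
    return 1
  fi
  return 0
}
```

A possible use is `require_vars HF_TOKEN SAFETY_MODEL SAFETY_PORT || exit 1` placed just before the `docker run` command.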
Running Llama Stack
Now you are ready to run Llama Stack with vLLM as the inference provider. You can do this via Conda, which builds the distribution code locally, or via Docker, which uses a pre-built image.
Via Docker
This method allows you to get started quickly without having to build the distribution code.
LLAMA_STACK_PORT=5001
docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ./run.yaml:/root/my-run.yaml \
llamastack/distribution-remote-vllm \
/root/my-run.yaml \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env VLLM_URL=http://host.docker.internal:$INFERENCE_PORT
If you are using Llama Stack Safety / Shield APIs, use:
docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ./run-with-safety.yaml:/root/my-run.yaml \
llamastack/distribution-remote-vllm \
/root/my-run.yaml \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env VLLM_URL=http://host.docker.internal:$INFERENCE_PORT \
--env SAFETY_MODEL=$SAFETY_MODEL \
--env VLLM_SAFETY_URL=http://host.docker.internal:$SAFETY_PORT
Via Conda
Make sure you have run pip install llama-stack and have the Llama Stack CLI available.
llama stack build --template remote-vllm --image-type conda
llama stack run ./run.yaml \
--port 5001 \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env VLLM_URL=http://127.0.0.1:$INFERENCE_PORT
If you are using Llama Stack Safety / Shield APIs, use:
llama stack run ./run-with-safety.yaml \
--port 5001 \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env VLLM_URL=http://127.0.0.1:$INFERENCE_PORT \
--env SAFETY_MODEL=$SAFETY_MODEL \
--env VLLM_SAFETY_URL=http://127.0.0.1:$SAFETY_PORT
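The two Conda invocations above differ only in the config file and the two extra safety `--env` flags, so one could choose between them based on whether the safety variables are set. The sketch below only prints the command rather than running it. `select_run_config` and `print_stack_run_command` are hypothetical helper names, and falling back to port 5001 when `LLAMA_STACK_PORT` is unset is an assumption taken from the examples above.

```shell
# Pick the config file: the safety variant only when both safety variables are set.
select_run_config() {
  if [ -n "${SAFETY_MODEL:-}" ] && [ -n "${SAFETY_PORT:-}" ]; then
    echo "./run-with-safety.yaml"
  else
    echo "./run.yaml"
  fi
}

# Print (not execute) the matching `llama stack run` command on one line.
print_stack_run_command() {
  config="$(select_run_config)"
  cmd="llama stack run $config --port ${LLAMA_STACK_PORT:-5001}"
  cmd="$cmd --env INFERENCE_MODEL=$INFERENCE_MODEL"
  cmd="$cmd --env VLLM_URL=http://127.0.0.1:$INFERENCE_PORT"
  if [ "$config" = "./run-with-safety.yaml" ]; then
    # Safety flags are appended only for the safety-enabled config.
    cmd="$cmd --env SAFETY_MODEL=$SAFETY_MODEL"
    cmd="$cmd --env VLLM_SAFETY_URL=http://127.0.0.1:$SAFETY_PORT"
  fi
  printf '%s\n' "$cmd"
}
```

Printing the command first makes it easy to inspect which config and flags would be used before actually starting the server.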