
# Remote vLLM Distribution

The `llamastack/distribution-{{ name }}` distribution consists of the following provider configurations:

{{ providers_table }}

You can use this distribution if you have GPUs and want to run an independent vLLM server container for running inference.

{%- if docker_compose_env_vars %}

## Environment Variables

The following environment variables can be configured:

{% for var, (default_value, description) in docker_compose_env_vars.items() %}

- `{{ var }}`: {{ description }} (default: `{{ default_value }}`)
{% endfor %}
{% endif %}

{% if default_models %}

## Models

The following models are configured by default:

{% for model in default_models %}
- `{{ model.model_id }}`
{% endfor %}
{% endif %}

## Using Docker Compose

You can use docker compose to start a vLLM container and Llama Stack server container together.

```bash
$ cd distributions/{{ name }}; docker compose up
```

You will see output similar to the following:

<TO BE FILLED>

To stop the containers:

```bash
docker compose down
```
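
For reference, here is a simplified sketch of what such a compose file wires together. The compose file shipped in `distributions/{{ name }}` is the source of truth; the service names, ports, and volume paths below are illustrative and mirror the standalone `docker run` commands shown later in this guide:

```yaml
# Illustrative sketch only -- see distributions/{{ name }} for the actual compose file.
services:
  vllm:
    image: vllm/vllm-openai:latest
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
    ports:
      - "8000:8000"
    ipc: host
    # Arguments appended to the vLLM OpenAI server entrypoint
    command: --model meta-llama/Llama-3.2-3B-Instruct
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  llamastack:
    image: llamastack/distribution-remote-vllm
    depends_on:
      - vllm
    volumes:
      - ~/.llama:/root/.llama
      - ./run.yaml:/root/llamastack-run-remote-vllm.yaml
    ports:
      - "5000:5000"
    command: --yaml_config /root/llamastack-run-remote-vllm.yaml
```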

## Starting vLLM and Llama Stack separately

You can also start a vLLM server yourself and connect Llama Stack to it manually. Once the vLLM server is running, there are two ways to start the Llama Stack server and point it at vLLM: via Conda or via Docker.

### Start the vLLM server

```bash
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.2-3B-Instruct
```

Please check the [vLLM Documentation](https://docs.vllm.ai/en/latest/) for more details.
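
Once the container is up, you can do a quick sanity check that the OpenAI-compatible API is serving (assuming the default port 8000 used above):

```bash
# List the models served by vLLM; the output should include meta-llama/Llama-3.2-3B-Instruct.
curl http://127.0.0.1:8000/v1/models
```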

### Start the Llama Stack server pointing to your vLLM server

We have provided a template `run.yaml` file in the `distributions/remote-vllm` directory. Please make sure to modify the `url` in the vLLM inference provider config to point to your vLLM server endpoint. For example, if your vLLM server is running on `http://127.0.0.1:8000`, your `run.yaml` file should look like the following:

```yaml
inference:
  - provider_id: vllm0
    provider_type: remote::vllm
    config:
      url: http://127.0.0.1:8000
```

#### Via Conda

If you are using Conda, you can build and run the Llama Stack server with the following commands:

```bash
cd distributions/remote-vllm
llama stack build --template remote_vllm --image-type conda
llama stack run run.yaml
```

#### Via Docker

You can use the Llama Stack Docker image to start the server with the following command:

```bash
docker run --network host -it -p 5000:5000 \
  -v ~/.llama:/root/.llama \
  -v ./gpu/run.yaml:/root/llamastack-run-remote-vllm.yaml \
  --gpus=all \
  llamastack/distribution-remote-vllm \
  --yaml_config /root/llamastack-run-remote-vllm.yaml
```
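
Once the Llama Stack server is running (via Conda or Docker), you can do a quick end-to-end check with the `llama-stack-client` CLI. This is a minimal sketch, assuming the server is listening on port 5000 as configured above and that the `llama-stack-client` Python package (which provides the CLI of the same name) is installed:

```bash
# Point the client at the running Llama Stack server, then list the registered models.
llama-stack-client configure --endpoint http://localhost:5000
llama-stack-client models list
```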