llama-stack/docs/source/getting_started/distributions/self_hosted_distro/tgi.md
2024-11-18 23:51:35 -08:00

TGI Distribution

The llamastack/distribution-tgi distribution consists of the following provider configurations.

API         Provider(s)
agents      inline::meta-reference
inference   remote::tgi
memory      inline::faiss, remote::chromadb, remote::pgvector
safety      inline::llama-guard
telemetry   inline::meta-reference

Use this distribution if you have GPUs and want to serve inference from a separate TGI server container.

Environment Variables

The following environment variables can be configured:

  • LLAMASTACK_PORT: Port for the Llama Stack distribution server (default: 5001)
  • INFERENCE_MODEL: Inference model loaded into the TGI server (default: meta-llama/Llama-3.2-3B-Instruct)
  • TGI_URL: URL of the TGI server with the main inference model (default: http://127.0.0.1:8080/v1)
  • TGI_SAFETY_URL: URL of the TGI server with the safety model (default: http://127.0.0.1:8081/v1)
  • SAFETY_MODEL: Name of the safety (Llama-Guard) model to use (default: meta-llama/Llama-Guard-3-1B)
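
As a rough sketch, the defaults listed above can be captured in the shell so that later commands can reference them. It only sets shell variables to the documented default values when nothing is set already; how the distribution consumes them (for example through the --env flags in the run commands on this page) is unchanged.

```shell
# Sketch only: apply the documented defaults unless a value is already set.
export LLAMASTACK_PORT="${LLAMASTACK_PORT:-5001}"
export INFERENCE_MODEL="${INFERENCE_MODEL:-meta-llama/Llama-3.2-3B-Instruct}"
export TGI_URL="${TGI_URL:-http://127.0.0.1:8080/v1}"
export TGI_SAFETY_URL="${TGI_SAFETY_URL:-http://127.0.0.1:8081/v1}"
export SAFETY_MODEL="${SAFETY_MODEL:-meta-llama/Llama-Guard-3-1B}"

# Print the effective configuration for a quick visual check.
printf '%s=%s\n' \
  LLAMASTACK_PORT "$LLAMASTACK_PORT" \
  INFERENCE_MODEL "$INFERENCE_MODEL" \
  TGI_URL "$TGI_URL" \
  TGI_SAFETY_URL "$TGI_SAFETY_URL" \
  SAFETY_MODEL "$SAFETY_MODEL"
```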

Setting up TGI server

Please check the TGI Getting Started Guide to get a TGI endpoint. Here is a sample script to start a TGI server locally via Docker:

export INFERENCE_PORT=8080
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export CUDA_VISIBLE_DEVICES=0

docker run --rm -it \
  -v $HOME/.cache/huggingface:/data \
  -p $INFERENCE_PORT:$INFERENCE_PORT \
  --gpus "device=$CUDA_VISIBLE_DEVICES" \
  ghcr.io/huggingface/text-generation-inference:2.3.1 \
  --dtype bfloat16 \
  --usage-stats off \
  --sharded false \
  --cuda-memory-fraction 0.7 \
  --model-id $INFERENCE_MODEL \
  --port $INFERENCE_PORT
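
The container can take a while to download and load the model, so it may help to wait until the server answers before starting Llama Stack. The following is a minimal sketch that polls TGI's /health route with curl. The helper names tgi_health_url and wait_for_tgi, the retry count, and the delay are assumptions made for illustration, and the sketch assumes the server is reachable on 127.0.0.1.

```shell
# Hypothetical helper: build the health-check URL for a local TGI server.
# Argument: port (defaults to 8080).
tgi_health_url() {
  printf 'http://127.0.0.1:%s/health\n' "${1:-8080}"
}

# Hypothetical helper: poll a URL until it answers or the attempts run out.
# Arguments: URL, number of attempts (default 30), delay in seconds (default 5).
wait_for_tgi() {
  url=$1
  attempts=${2:-30}
  delay=${3:-5}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors; -s silences progress output.
    if curl -fs -o /dev/null "$url"; then
      echo "TGI is ready at $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "TGI did not become ready at $url" >&2
  return 1
}

# Usage (needs a running server):
#   wait_for_tgi "$(tgi_health_url "$INFERENCE_PORT")"
```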

If you are using the Llama Stack Safety / Shield APIs, you also need to run a second TGI instance with a corresponding safety model such as meta-llama/Llama-Guard-3-1B, using a script like:

export SAFETY_PORT=8081
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
export CUDA_VISIBLE_DEVICES=1

docker run --rm -it \
  -v $HOME/.cache/huggingface:/data \
  -p $SAFETY_PORT:$SAFETY_PORT \
  --gpus "device=$CUDA_VISIBLE_DEVICES" \
  ghcr.io/huggingface/text-generation-inference:2.3.1 \
  --dtype bfloat16 \
  --usage-stats off \
  --sharded false \
  --model-id $SAFETY_MODEL \
  --port $SAFETY_PORT

Running Llama Stack

Now you are ready to run Llama Stack with TGI as the inference provider. You can do this via Conda, which builds the distribution code, or via Docker, which uses a pre-built image.

Via Docker

This method allows you to get started quickly without having to build the distribution code.

LLAMA_STACK_PORT=5001
docker run \
  -it \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ./run.yaml:/root/my-run.yaml \
  llamastack/distribution-tgi \
  --yaml-config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env TGI_URL=http://host.docker.internal:$INFERENCE_PORT

If you are using Llama Stack Safety / Shield APIs, use:

docker run \
  -it \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ./run-with-safety.yaml:/root/my-run.yaml \
  llamastack/distribution-tgi \
  --yaml-config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env TGI_URL=http://host.docker.internal:$INFERENCE_PORT \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env TGI_SAFETY_URL=http://host.docker.internal:$SAFETY_PORT
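
The Docker commands above reach the TGI servers through host.docker.internal. Docker Desktop provides that hostname; on Linux with plain Docker Engine (20.10 or later) it usually has to be mapped explicitly with --add-host=host.docker.internal:host-gateway. A minimal sketch, assuming a hypothetical helper named host_gateway_args, that prints the extra flag only on Linux:

```shell
# Hypothetical helper: print the extra `docker run` flag needed so that
# host.docker.internal resolves inside the container on Linux.
# Argument: kernel name as reported by `uname -s` (defaults to the current one).
host_gateway_args() {
  case "${1:-$(uname -s)}" in
    Linux) printf '%s\n' '--add-host=host.docker.internal:host-gateway' ;;
    *)     ;;  # assumed to be Docker Desktop, which already provides the name
  esac
}
```

The result can be spliced into either command, for example docker run -it $(host_gateway_args) -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT followed by the remaining arguments. The substitution is left unquoted on purpose so that an empty result adds no argument.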

Via Conda

Make sure you have run pip install llama-stack and that the Llama Stack CLI is available.

llama stack build --template tgi --image-type conda
llama stack run ./run.yaml \
  --port 5001 \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env TGI_URL=http://127.0.0.1:$INFERENCE_PORT

If you are using Llama Stack Safety / Shield APIs, use:

llama stack run ./run-with-safety.yaml \
  --port 5001 \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env TGI_URL=http://127.0.0.1:$INFERENCE_PORT \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env TGI_SAFETY_URL=http://127.0.0.1:$SAFETY_PORT
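
The run commands above expand $INFERENCE_MODEL, $INFERENCE_PORT, $SAFETY_MODEL and $SAFETY_PORT from the current shell. If one of them is unset, for instance in a fresh terminal, the corresponding flag silently ends up with an empty value or a URL without a port. A small pre-flight check can catch that; the helper name require_vars is an assumption made for this sketch.

```shell
# Hypothetical helper: fail early if any named variable is unset or empty.
# Arguments: one or more variable names (trusted input, since eval is used).
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup via eval keeps the sketch POSIX sh compatible.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   require_vars INFERENCE_MODEL INFERENCE_PORT SAFETY_MODEL SAFETY_PORT \
#     && echo "environment looks complete"
```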