# What does this PR do?

`start_stack.sh` was using `--yaml-config`, which is deprecated. A bunch of distro docs also mentioned `--yaml-config`. This replaces all instances and logic for `--yaml-config` with `--config`.

Resolves #2189

Signed-off-by: Charlie Doern <cdoern@redhat.com>
---
orphan: true
---

# Dell Distribution of Llama Stack

```{toctree}
:maxdepth: 2
:hidden:

self
```
The `llamastack/distribution-dell` distribution consists of the following provider configurations.
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `remote::tgi`, `inline::sentence-transformers` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime` |
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
You can use this distribution if you have GPUs and want to run an independent TGI or Dell Enterprise Hub container for running inference.
## Environment Variables
The following environment variables can be configured:
- `DEH_URL`: URL for the Dell inference server (default: `http://0.0.0.0:8181`)
- `DEH_SAFETY_URL`: URL for the Dell safety inference server (default: `http://0.0.0.0:8282`)
- `CHROMA_URL`: URL for the Chroma server (default: `http://localhost:6601`)
- `INFERENCE_MODEL`: Inference model loaded into the TGI server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
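For example, if your Dell inference server or ChromaDB instance does not run on the defaults above, you can export overrides before launching the stack. The hostnames below are placeholders for your own environment:

```bash
# Placeholder hosts/ports -- substitute your own Dell inference and Chroma endpoints
export DEH_URL=http://my-dell-inference-host:8181
export DEH_SAFETY_URL=http://my-dell-inference-host:8282
export CHROMA_URL=http://my-chroma-host:6601
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
```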
## Setting up Inference server using Dell Enterprise Hub's custom TGI container

NOTE: This is a placeholder to run inference with TGI. This will be updated to use Dell Enterprise Hub's containers once verified.
```bash
export INFERENCE_PORT=8181
export DEH_URL=http://0.0.0.0:$INFERENCE_PORT
export INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
export CHROMADB_HOST=localhost
export CHROMADB_PORT=6601
export CHROMA_URL=http://$CHROMADB_HOST:$CHROMADB_PORT
export CUDA_VISIBLE_DEVICES=0
export LLAMA_STACK_PORT=8321

docker run --rm -it \
  --pull always \
  --network host \
  -v $HOME/.cache/huggingface:/data \
  -e HF_TOKEN=$HF_TOKEN \
  -p $INFERENCE_PORT:$INFERENCE_PORT \
  --gpus $CUDA_VISIBLE_DEVICES \
  ghcr.io/huggingface/text-generation-inference \
  --dtype bfloat16 \
  --usage-stats off \
  --sharded false \
  --cuda-memory-fraction 0.7 \
  --model-id $INFERENCE_MODEL \
  --port $INFERENCE_PORT --hostname 0.0.0.0
```
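Before wiring the stack to it, you can optionally confirm the TGI container is serving. The `/info` and `/generate` routes used below are standard TGI endpoints; adjust if your container image differs:

```bash
# Optional smoke test against the TGI server started above
curl http://0.0.0.0:$INFERENCE_PORT/info
curl http://0.0.0.0:$INFERENCE_PORT/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 16}}'
```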
If you are using Llama Stack Safety / Shield APIs, you will also need to run another TGI instance with a corresponding safety model such as `meta-llama/Llama-Guard-3-1B`, using a script like:
```bash
export SAFETY_INFERENCE_PORT=8282
export DEH_SAFETY_URL=http://0.0.0.0:$SAFETY_INFERENCE_PORT
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
export CUDA_VISIBLE_DEVICES=1

docker run --rm -it \
  --pull always \
  --network host \
  -v $HOME/.cache/huggingface:/data \
  -e HF_TOKEN=$HF_TOKEN \
  -p $SAFETY_INFERENCE_PORT:$SAFETY_INFERENCE_PORT \
  --gpus $CUDA_VISIBLE_DEVICES \
  ghcr.io/huggingface/text-generation-inference \
  --dtype bfloat16 \
  --usage-stats off \
  --sharded false \
  --cuda-memory-fraction 0.7 \
  --model-id $SAFETY_MODEL \
  --hostname 0.0.0.0 \
  --port $SAFETY_INFERENCE_PORT
```
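As with the inference server, you can optionally check that the safety TGI instance has finished loading; `/health` is a standard TGI route that returns HTTP 200 once the model is ready:

```bash
# Optional: expect HTTP 200 once Llama Guard is loaded
curl -i http://0.0.0.0:$SAFETY_INFERENCE_PORT/health
```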
## Dell distribution relies on ChromaDB for vector database usage

You can start a ChromaDB server easily using Docker (the command below uses Podman, which accepts the same flags).
```bash
# This is where the indices are persisted
mkdir -p $HOME/chromadb

podman run --rm -it \
  --network host \
  --name chromadb \
  -v $HOME/chromadb:/chroma/chroma \
  -e IS_PERSISTENT=TRUE \
  chromadb/chroma:latest \
  --port $CHROMADB_PORT \
  --host $CHROMADB_HOST
```
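To confirm the ChromaDB server is reachable, you can hit its heartbeat route (shown here for the v1 REST API; newer Chroma releases expose it under `/api/v2/heartbeat`):

```bash
# Returns a nanosecond heartbeat value when the server is healthy
curl http://$CHROMADB_HOST:$CHROMADB_PORT/api/v1/heartbeat
```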
## Running Llama Stack
Now you are ready to run Llama Stack with TGI as the inference provider. You can do this via Conda (build the distribution from source) or Docker (which uses a pre-built image).
### Via Docker

This method allows you to get started quickly without having to build the distribution code.
```bash
# NOTE: mount the llama-stack / llama-models source directories only if testing
# local changes; otherwise the two source -v mounts below can be dropped.
# Use localhost/distribution-dell:dev instead of llamastack/distribution-dell
# if building / testing the image locally.
docker run -it \
  --pull always \
  --network host \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v $HOME/.llama:/root/.llama \
  -v /home/hjshah/git/llama-stack:/app/llama-stack-source \
  -v /home/hjshah/git/llama-models:/app/llama-models-source \
  llamastack/distribution-dell \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env DEH_URL=$DEH_URL \
  --env CHROMA_URL=$CHROMA_URL
```
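Once the container is up, a quick sanity check is to list the models the stack has registered (assuming the default `/v1/models` route of the Llama Stack REST API); the response should include `$INFERENCE_MODEL`:

```bash
# Expect $INFERENCE_MODEL among the registered models
curl http://localhost:$LLAMA_STACK_PORT/v1/models
```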
If you are using Llama Stack Safety / Shield APIs, use:
```bash
# You need a local checkout of llama-stack to run this, get it using
# git clone https://github.com/meta-llama/llama-stack.git
cd /path/to/llama-stack

export SAFETY_INFERENCE_PORT=8282
export DEH_SAFETY_URL=http://0.0.0.0:$SAFETY_INFERENCE_PORT
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B

docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v $HOME/.llama:/root/.llama \
  -v ./llama_stack/templates/tgi/run-with-safety.yaml:/root/my-run.yaml \
  llamastack/distribution-dell \
  --config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env DEH_URL=$DEH_URL \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env DEH_SAFETY_URL=$DEH_SAFETY_URL \
  --env CHROMA_URL=$CHROMA_URL
```
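With the safety configuration, you can additionally check that the Llama Guard shield was registered (assuming the default `/v1/shields` route):

```bash
# Expect a shield backed by $SAFETY_MODEL
curl http://localhost:$LLAMA_STACK_PORT/v1/shields
```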
### Via Conda

Make sure you have done `pip install llama-stack` and have the Llama Stack CLI available.
```bash
llama stack build --template dell --image-type conda

llama stack run dell \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env DEH_URL=$DEH_URL \
  --env CHROMA_URL=$CHROMA_URL
```
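If you have the `llama-stack-client` package installed, you can point it at the server as a quick check (the commands below assume the stock client CLI):

```bash
llama-stack-client configure --endpoint http://localhost:$LLAMA_STACK_PORT
llama-stack-client models list
```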
If you are using Llama Stack Safety / Shield APIs, use:
```bash
llama stack run ./run-with-safety.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env DEH_URL=$DEH_URL \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env DEH_SAFETY_URL=$DEH_SAFETY_URL \
  --env CHROMA_URL=$CHROMA_URL
```
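As in the Docker flow, you can verify the shield registration once the server is up; this assumes the `llama-stack-client` CLI from the previous step:

```bash
# Expect a shield backed by $SAFETY_MODEL
llama-stack-client shields list
```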