point to DEH for inference

Hardik Shah 2025-02-06 11:45:54 -08:00
parent aa6138e0ca
commit b281ae343a
2 changed files with 68 additions and 70 deletions


@@ -39,9 +39,9 @@ The following environment variables can be configured:
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
## Setting up TGI server
## Setting up Inference server using Dell Enterprise Hub's custom TGI container
Please check the [TGI Getting Started Guide](https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#get-started) to get a TGI endpoint. Here is a sample script to start a TGI server locally via Docker:
You can start the inference server locally with Docker using Dell Enterprise Hub's custom TGI container:
```bash
export INFERENCE_PORT=8181
@@ -53,19 +53,18 @@ export CHROMA_URL=http://$CHROMADB_HOST:$CHROMADB_PORT
export CUDA_VISIBLE_DEVICES=0
export LLAMA_STACK_PORT=8321
docker run --rm -it \
docker run \
-it \
--network host \
-v $HOME/.cache/huggingface:/data \
-e HF_TOKEN=$HF_TOKEN \
--shm-size 1g \
-p $INFERENCE_PORT:$INFERENCE_PORT \
--gpus $CUDA_VISIBLE_DEVICES \
ghcr.io/huggingface/text-generation-inference \
--dtype bfloat16 \
--usage-stats off \
--sharded false \
--cuda-memory-fraction 0.7 \
--model-id $INFERENCE_MODEL \
--port $INFERENCE_PORT --hostname 0.0.0.0
-e NUM_SHARD=1 \
-e MAX_BATCH_PREFILL_TOKENS=32768 \
-e MAX_INPUT_TOKENS=8000 \
-e MAX_TOTAL_TOKENS=8192 \
-e RUST_BACKTRACE=full \
registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct
```
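Once the container is up (the first run downloads and loads the model, which can take a few minutes), it is worth sanity-checking the endpoint before pointing Llama Stack at it. A minimal check, assuming the Dell Enterprise Hub container exposes the standard TGI `/health` and `/generate` routes:
```bash
# Returns HTTP 200 once the model has finished loading
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:$INFERENCE_PORT/health

# Small test request using the standard TGI /generate schema
curl -s http://localhost:$INFERENCE_PORT/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is the capital of France?", "parameters": {"max_new_tokens": 32}}'
```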
If you are using the Llama Stack Safety / Shield APIs, you will also need to run a second TGI instance with a corresponding safety model like `meta-llama/Llama-Guard-3-1B`, using a script like:
@@ -92,9 +91,9 @@ ghcr.io/huggingface/text-generation-inference \
--port $SAFETY_INFERENCE_PORT
```
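For reference, below is a minimal sketch of such a second instance. It mirrors the inference command above and the upstream TGI image shown in the snippet; the export values are examples only and should be adjusted to your environment:
```bash
export SAFETY_INFERENCE_PORT=8282   # example value; pick any free port
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B

docker run --rm -it \
  --network host \
  -v $HOME/.cache/huggingface:/data \
  -e HF_TOKEN=$HF_TOKEN \
  --shm-size 1g \
  --gpus $CUDA_VISIBLE_DEVICES \
  ghcr.io/huggingface/text-generation-inference \
  --model-id $SAFETY_MODEL \
  --hostname 0.0.0.0 \
  --port $SAFETY_INFERENCE_PORT
```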
## Dell distribution relies on ChromDB for vector database usage
## Dell distribution relies on ChromaDB for vector database usage
You can start a chrom-db easily using docker.
You can start a chroma-db easily using docker.
```bash
# This is where the indices are persisted
mkdir -p chromadb
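
# A minimal sketch (an assumption, not taken from this commit): start the Chroma
# server with the official chromadb/chroma image, persisting indices into ./chromadb.
# The mount path and --port/--host flags may differ across image versions.
docker run --rm -it \
  --network host \
  -v $(pwd)/chromadb:/chroma/chroma \
  -e IS_PERSISTENT=TRUE \
  chromadb/chroma:latest \
  --port $CHROMADB_PORT --host 0.0.0.0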


@@ -28,9 +28,9 @@ The following environment variables can be configured:
{% endif %}
## Setting up TGI server
## Setting up Inference server using Dell Enterprise Hub's custom TGI container
Please check the [TGI Getting Started Guide](https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#get-started) to get a TGI endpoint. Here is a sample script to start a TGI server locally via Docker:
You can start the inference server locally with Docker using Dell Enterprise Hub's custom TGI container:
```bash
export INFERENCE_PORT=8181
@@ -42,19 +42,18 @@ export CHROMA_URL=http://$CHROMADB_HOST:$CHROMADB_PORT
export CUDA_VISIBLE_DEVICES=0
export LLAMA_STACK_PORT=8321
docker run --rm -it \
docker run \
-it \
--network host \
-v $HOME/.cache/huggingface:/data \
-e HF_TOKEN=$HF_TOKEN \
--shm-size 1g \
-p $INFERENCE_PORT:$INFERENCE_PORT \
--gpus $CUDA_VISIBLE_DEVICES \
ghcr.io/huggingface/text-generation-inference \
--dtype bfloat16 \
--usage-stats off \
--sharded false \
--cuda-memory-fraction 0.7 \
--model-id $INFERENCE_MODEL \
--port $INFERENCE_PORT --hostname 0.0.0.0
-e NUM_SHARD=1 \
-e MAX_BATCH_PREFILL_TOKENS=32768 \
-e MAX_INPUT_TOKENS=8000 \
-e MAX_TOTAL_TOKENS=8192 \
-e RUST_BACKTRACE=full \
registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct
```
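After the container starts (model weights are downloaded on the first run), a quick smoke test helps confirm the endpoint is reachable before wiring it into the stack. This assumes the container exposes the standard TGI `/health` and `/generate` routes:
```bash
# Health check: returns HTTP 200 once the model is loaded
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:$INFERENCE_PORT/health

# Minimal generation request using the standard TGI /generate schema
curl -s http://localhost:$INFERENCE_PORT/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 16}}'
```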
If you are using the Llama Stack Safety / Shield APIs, you will also need to run a second TGI instance with a corresponding safety model like `meta-llama/Llama-Guard-3-1B`, using a script like:
@@ -81,9 +80,9 @@ ghcr.io/huggingface/text-generation-inference \
--port $SAFETY_INFERENCE_PORT
```
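As a reference point, a minimal sketch of that second instance is shown below; it mirrors the inference command above and uses the upstream TGI image from the snippet, with example export values you should adapt to your setup:
```bash
export SAFETY_INFERENCE_PORT=8282   # example value; pick any free port
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B

docker run --rm -it \
  --network host \
  -v $HOME/.cache/huggingface:/data \
  -e HF_TOKEN=$HF_TOKEN \
  --shm-size 1g \
  --gpus $CUDA_VISIBLE_DEVICES \
  ghcr.io/huggingface/text-generation-inference \
  --model-id $SAFETY_MODEL \
  --hostname 0.0.0.0 \
  --port $SAFETY_INFERENCE_PORT
```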
## Dell distribution relies on ChromDB for vector database usage
## Dell distribution relies on ChromaDB for vector database usage
You can start a chrom-db easily using docker.
You can start a chroma-db easily using docker.
```bash
# This is where the indices are persisted
mkdir -p chromadb
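
# A minimal sketch (an assumption, not part of this commit): run the Chroma server
# with the official chromadb/chroma image, persisting indices into ./chromadb.
# The mount path and --port/--host flags may vary between image versions.
docker run --rm -it \
  --network host \
  -v $(pwd)/chromadb:/chroma/chroma \
  -e IS_PERSISTENT=TRUE \
  chromadb/chroma:latest \
  --port $CHROMADB_PORT --host 0.0.0.0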