docs: Simplify vLLM deployment in K8s deployment guide (#1655)

# What does this PR do?

* Removes the use of `huggingface-cli` 
* Simplifies HF cache mount path
* Simplifies vLLM server startup command
* Separates PVC/secret creation from deployment/service
* Fixes a typo: "pod" should be "deployment"
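The "simplifies vLLM server startup command" item relies on how Kubernetes joins `command` and `args`: the container now runs `/bin/sh -c` with the whole `vllm serve …` invocation passed as a single script string. A minimal local sketch of that mechanism, with `echo` standing in for the `vllm` binary (which is not assumed to be installed here):

```shell
# Kubernetes concatenates command + args, so the container effectively runs:
#   /bin/sh -c "vllm serve meta-llama/Llama-3.2-1B-Instruct"
# `echo` stands in for `vllm` to show the single-string script form.
/bin/sh -c 'echo vllm serve meta-llama/Llama-3.2-1B-Instruct'
```

This replaces the earlier multi-line `bash -c` block that pre-downloaded the model with `huggingface-cli`: `vllm serve` fetches the model from Hugging Face on first start and caches it under `~/.cache/huggingface`, hence the new PVC mount path.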

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Yuan Tang committed 2025-03-24 12:08:50 -04:00 (via GitHub)
parent 9e1ddf2b53
commit 9ff82036f7
GPG key ID: B5690EEEBB952194


@@ -8,7 +8,7 @@ First, create a local Kubernetes cluster via Kind:
 kind create cluster --image kindest/node:v1.32.0 --name llama-stack-test
 ```
 
-Start vLLM server as a Kubernetes Pod and Service:
+First, create a Kubernetes PVC and Secret for downloading and storing Hugging Face model:
 
 ```bash
 cat <<EOF |kubectl apply -f -
@@ -31,7 +31,12 @@ metadata:
 type: Opaque
 data:
   token: $(HF_TOKEN)
----
+```
+
+Next, start the vLLM server as a Kubernetes Deployment and Service:
+
+```bash
+cat <<EOF |kubectl apply -f -
 apiVersion: apps/v1
 kind: Deployment
 metadata:
@@ -47,28 +52,23 @@ spec:
         app.kubernetes.io/name: vllm
     spec:
       containers:
-      - name: llama-stack
-        image: $(VLLM_IMAGE)
-        command:
-          - bash
-          - -c
-          - |
-            MODEL="meta-llama/Llama-3.2-1B-Instruct"
-            MODEL_PATH=/app/model/$(basename $MODEL)
-            huggingface-cli login --token $HUGGING_FACE_HUB_TOKEN
-            huggingface-cli download $MODEL --local-dir $MODEL_PATH --cache-dir $MODEL_PATH
-            python3 -m vllm.entrypoints.openai.api_server --model $MODEL_PATH --served-model-name $MODEL --port 8000
+      - name: vllm
+        image: vllm/vllm-openai:latest
+        command: ["/bin/sh", "-c"]
+        args: [
+          "vllm serve meta-llama/Llama-3.2-1B-Instruct"
+        ]
+        env:
+        - name: HUGGING_FACE_HUB_TOKEN
+          valueFrom:
+            secretKeyRef:
+              name: hf-token-secret
+              key: token
         ports:
         - containerPort: 8000
         volumeMounts:
        - name: llama-storage
-          mountPath: /app/model
-        env:
-        - name: HUGGING_FACE_HUB_TOKEN
-          valueFrom:
-            secretKeyRef:
-              name: hf-token-secret
-              key: token
+          mountPath: /root/.cache/huggingface
       volumes:
       - name: llama-storage
         persistentVolumeClaim:
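One caveat when filling in `token: $(HF_TOKEN)` in the Secret above: a Kubernetes Secret's `data` field expects base64-encoded values (raw strings belong under `stringData` instead). A small sketch, using a made-up placeholder rather than a real Hugging Face token:

```shell
# Secret `data:` values must be base64-encoded; `stringData:` accepts raw text.
# "hf_example_token" is a placeholder, not a real token.
printf 'hf_example_token' | base64   # → aGZfZXhhbXBsZV90b2tlbg==
```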