Mirror of https://github.com/meta-llama/llama-stack.git (synced 2025-06-28 02:53:30 +00:00)

Add Kubernetes deployment guide (#899)

This PR moves some content from [the recent blog post](https://blog.vllm.ai/2025/01/27/intro-to-llama-stack-with-vllm.html) here as a more official guide for users who'd like to deploy Llama Stack on Kubernetes.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

Parent: a25e3b405c
Commit: 09ed0e9c9f
2 changed files with 214 additions and 1 deletion

````diff
@@ -14,7 +14,12 @@ Another simple way to start interacting with Llama Stack is to just spin up a co
 
 **Conda**:
 
-Lastly, if you have a custom or an advanced setup or you are developing on Llama Stack you can also build a custom Llama Stack server. Using `llama stack build` and `llama stack run` you can build/run a custom Llama Stack server containing the exact combination of providers you wish. We have also provided various templates to make getting started easier. See [Building a Custom Distribution](building_distro) for more details.
+If you have a custom or an advanced setup or you are developing on Llama Stack you can also build a custom Llama Stack server. Using `llama stack build` and `llama stack run` you can build/run a custom Llama Stack server containing the exact combination of providers you wish. We have also provided various templates to make getting started easier. See [Building a Custom Distribution](building_distro) for more details.
+
+**Kubernetes**:
+
+If you have built a container image and want to deploy it in a Kubernetes cluster instead of starting the Llama Stack server locally, see the [Kubernetes Deployment Guide](kubernetes_deployment) for more details.
 
 ```{toctree}
@@ -25,4 +30,5 @@ importing_as_library
 building_distro
 configuration
 selection
+kubernetes_deployment
 ```
````

docs/source/distributions/kubernetes_deployment.md (new file, 207 lines):

# Kubernetes Deployment Guide

Instead of starting the Llama Stack and vLLM servers locally, we can deploy them in a Kubernetes cluster. In this guide, we'll use a local [Kind](https://kind.sigs.k8s.io/) cluster and a vLLM inference service in the same cluster for demonstration purposes.

First, create a local Kubernetes cluster via Kind:

```bash
kind create cluster --image kindest/node:v1.32.0 --name llama-stack-test
```
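
If you want to confirm the cluster is up before continuing, a quick sanity check (assuming Kind's default `kind-<cluster-name>` kubeconfig context) is:

```bash
# Kind registers the cluster under the context name "kind-llama-stack-test".
kubectl cluster-info --context kind-llama-stack-test
kubectl get nodes
```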

Start the vLLM server as a Kubernetes Deployment and Service. `$(HF_TOKEN)` and `$(VLLM_IMAGE)` in the manifest are placeholders: replace them with a base64-encoded Hugging Face token (the Secret's `data` field expects base64-encoded values) and the vLLM container image you want to run:

```bash
cat <<EOF |kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
data:
  token: $(HF_TOKEN)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: vllm
  template:
    metadata:
      labels:
        app.kubernetes.io/name: vllm
    spec:
      containers:
      - name: llama-stack
        image: $(VLLM_IMAGE)
        command:
        - bash
        - -c
        - |
          MODEL="meta-llama/Llama-3.2-1B-Instruct"
          MODEL_PATH=/app/model/$(basename $MODEL)
          huggingface-cli login --token $HUGGING_FACE_HUB_TOKEN
          huggingface-cli download $MODEL --local-dir $MODEL_PATH --cache-dir $MODEL_PATH
          python3 -m vllm.entrypoints.openai.api_server --model $MODEL_PATH --served-model-name $MODEL --port 8000
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: llama-storage
          mountPath: /app/model
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
      volumes:
      - name: llama-storage
        persistentVolumeClaim:
          claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app.kubernetes.io/name: vllm
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
EOF
```

We can verify that the vLLM server has started successfully via the logs (the model download may take a couple of minutes):

```bash
$ kubectl logs -l app.kubernetes.io/name=vllm
...
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
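
As an extra check, you can port-forward the vLLM Service and query its OpenAI-compatible API directly (a small sketch; run the `port-forward` in a separate terminal, or background it as shown here):

```bash
# Forward the in-cluster Service to localhost and list the models vLLM serves.
kubectl port-forward service/vllm-server 8000:8000 &
curl http://localhost:8000/v1/models
```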

Then we can modify the Llama Stack run configuration YAML with the following inference provider:

```yaml
providers:
  inference:
  - provider_id: vllm
    provider_type: remote::vllm
    config:
      url: http://vllm-server.default.svc.cluster.local:8000/v1
      max_tokens: 4096
      api_token: fake
```
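
To confirm that the `url` above is reachable from inside the cluster, one option is a throwaway curl Pod; the Pod name and curl image below are illustrative and not part of the original guide:

```bash
# One-off Pod that resolves the Service's in-cluster DNS name, queries the
# vLLM API, and is deleted once the command exits.
kubectl run vllm-url-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://vllm-server.default.svc.cluster.local:8000/v1/models
```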

Once we have defined the run configuration for Llama Stack, we can build an image with that configuration and the server source code. Save the full run configuration as `vllm-llama-stack-run-k8s.yaml` inside the `/tmp/test-vllm-llama-stack` build context so that the `ADD` instruction below can copy it into the image:

```bash
cat >/tmp/test-vllm-llama-stack/Containerfile.llama-stack-run-k8s <<EOF
FROM distribution-myenv:dev

RUN apt-get update && apt-get install -y git
RUN git clone https://github.com/meta-llama/llama-stack.git /app/llama-stack-source

ADD ./vllm-llama-stack-run-k8s.yaml /app/config.yaml
EOF
podman build -f /tmp/test-vllm-llama-stack/Containerfile.llama-stack-run-k8s -t llama-stack-run-k8s /tmp/test-vllm-llama-stack
```
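
The Deployment below references this image as `localhost/llama-stack-run-k8s:latest` with `imagePullPolicy: IfNotPresent`, so the image needs to be visible to the Kind nodes. One way to get it there with podman is sketched here, assuming the image and cluster names used earlier in this guide:

```bash
# Export the locally built image to a tar archive and load it into the
# Kind cluster's nodes so the Deployment can use it without a registry.
podman save -o /tmp/llama-stack-run-k8s.tar localhost/llama-stack-run-k8s:latest
kind load image-archive /tmp/llama-stack-run-k8s.tar --name llama-stack-test
```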

We can then start the Llama Stack server by deploying a Kubernetes Deployment and Service:

```bash
cat <<EOF |kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-stack-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: llama-stack
  template:
    metadata:
      labels:
        app.kubernetes.io/name: llama-stack
    spec:
      containers:
      - name: llama-stack
        image: localhost/llama-stack-run-k8s:latest
        imagePullPolicy: IfNotPresent
        command: ["python", "-m", "llama_stack.distribution.server.server", "--yaml-config", "/app/config.yaml"]
        ports:
        - containerPort: 5000
        volumeMounts:
        - name: llama-storage
          mountPath: /root/.llama
      volumes:
      - name: llama-storage
        persistentVolumeClaim:
          claimName: llama-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: llama-stack-service
spec:
  selector:
    app.kubernetes.io/name: llama-stack
  ports:
  - protocol: TCP
    port: 5000
    targetPort: 5000
  type: ClusterIP
EOF
```
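
If you prefer to block until the Deployment has finished rolling out before looking at the logs, one option is:

```bash
# Wait for the llama-stack-server Deployment to report a successful rollout.
kubectl rollout status deployment/llama-stack-server --timeout=5m
```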

We can check that the Llama Stack server has started:

```bash
$ kubectl logs -l app.kubernetes.io/name=llama-stack
...
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: ASGI 'lifespan' protocol appears unsupported.
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit)
```

Finally, we forward the Kubernetes service to a local port and test some inference requests against it via the Llama Stack Client (run the `port-forward` command in a separate terminal, since it keeps running in the foreground):

```bash
kubectl port-forward service/llama-stack-service 5000:5000
llama-stack-client --endpoint http://localhost:5000 inference chat-completion --message "hello, what model are you?"
```
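
When you are done experimenting, the whole demo environment can be removed by deleting the Kind cluster (this assumes the cluster name used at the beginning of this guide):

```bash
# Tear down the local demo cluster and everything deployed into it.
kind delete cluster --name llama-stack-test
```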