# Kubernetes Deployment Guide

Instead of starting the Llama Stack and vLLM servers locally, we can deploy them in a Kubernetes cluster.

## Prerequisites

In this guide, we'll use a local Kind cluster and a vLLM inference service in the same cluster for demonstration purposes.

Note: You can also deploy the Llama Stack server in an AWS EKS cluster. See the Deploying Llama Stack Server in AWS EKS section below for more details.

First, create a local Kubernetes cluster via Kind:

```bash
kind create cluster --image kindest/node:v1.32.0 --name llama-stack-test
```
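As an optional sanity check, Kind creates a kubeconfig context named after the cluster (`kind-llama-stack-test` here), so you can confirm the cluster is reachable before proceeding:

```bash
# Verify the control plane is up and the node is Ready
kubectl cluster-info --context kind-llama-stack-test
kubectl get nodes
```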

Next, set your Hugging Face token as an environment variable. The value is base64-encoded here because the Kubernetes Secret created below expects base64-encoded data in its `data` field:

```bash
export HF_TOKEN=$(echo -n "your-hf-token" | base64)
```

Now create a Kubernetes PVC and Secret for downloading and storing the Hugging Face model:

```bash
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
data:
  token: $HF_TOKEN
EOF
```
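Optionally, confirm that both objects exist before moving on. As an aside, `kubectl create secret generic` can create the same Secret directly from the raw token and handle the base64 encoding for you:

```bash
# Both resources should be listed after the manifests are applied
kubectl get pvc vllm-models
kubectl get secret hf-token-secret

# Alternative to the manual base64 step above:
# kubectl create secret generic hf-token-secret --from-literal=token="your-hf-token"
```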

Next, start the vLLM server as a Kubernetes Deployment and Service:

```bash
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: vllm
  template:
    metadata:
      labels:
        app.kubernetes.io/name: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve meta-llama/Llama-3.2-1B-Instruct"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
          - containerPort: 8000
        volumeMounts:
          - name: llama-storage
            mountPath: /root/.cache/huggingface
      volumes:
      - name: llama-storage
        persistentVolumeClaim:
          claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app.kubernetes.io/name: vllm
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
EOF
```
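Optionally, wait for the Deployment to become available before inspecting the logs. The timeout below is an arbitrary choice; the first start can be slow because the model weights are downloaded into the PVC:

```bash
# Blocks until the vllm-server Deployment reports the Available condition
kubectl wait --for=condition=Available deployment/vllm-server --timeout=900s
```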

We can verify that the vLLM server has started successfully via the logs (this might take a couple of minutes to download the model):

```bash
$ kubectl logs -l app.kubernetes.io/name=vllm
...
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
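As an additional sanity check, you can query vLLM's OpenAI-compatible API directly. The commands below assume the Service name and port from the manifest above; run the port-forward in a separate terminal:

```bash
# Forward the vLLM Service to localhost
kubectl port-forward service/vllm-server 8000:8000

# The model list should include meta-llama/Llama-3.2-1B-Instruct
curl http://localhost:8000/v1/models
```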

Then we can modify the Llama Stack run configuration YAML with the following inference provider:

```yaml
providers:
  inference:
  - provider_id: vllm
    provider_type: remote::vllm
    config:
      url: http://vllm-server.default.svc.cluster.local:8000/v1
      max_tokens: 4096
      api_token: fake
```

Once we have defined the run configuration for Llama Stack, save it as `vllm-llama-stack-run-k8s.yaml` (the Containerfile below copies it into the image under that name) and build an image containing the configuration and the server source code:

```bash
tmp_dir=$(mktemp -d) && cat >$tmp_dir/Containerfile.llama-stack-run-k8s <<EOF
FROM distribution-myenv:dev

RUN apt-get update && apt-get install -y git
RUN git clone https://github.com/meta-llama/llama-stack.git /app/llama-stack-source

ADD ./vllm-llama-stack-run-k8s.yaml /app/config.yaml
EOF
podman build -f $tmp_dir/Containerfile.llama-stack-run-k8s -t llama-stack-run-k8s $tmp_dir
```
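Because the Deployment below references `image: localhost/llama-stack-run-k8s:latest` with `imagePullPolicy: IfNotPresent`, the image must be present on the Kind node. One way to get it there (assuming the Podman-built image and the Kind cluster name used earlier) is to export the image and load the archive:

```bash
# Export the image from Podman and load it into the Kind node's image store
podman save localhost/llama-stack-run-k8s:latest -o /tmp/llama-stack-run-k8s.tar
kind load image-archive /tmp/llama-stack-run-k8s.tar --name llama-stack-test
```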

## Deploying Llama Stack Server in Kubernetes

We can then start the Llama Stack server by creating a Kubernetes PVC, Deployment, and Service:

```bash
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-stack-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: llama-stack
  template:
    metadata:
      labels:
        app.kubernetes.io/name: llama-stack
    spec:
      containers:
      - name: llama-stack
        image: localhost/llama-stack-run-k8s:latest
        imagePullPolicy: IfNotPresent
        command: ["python", "-m", "llama_stack.distribution.server.server", "--config", "/app/config.yaml"]
        ports:
          - containerPort: 5000
        volumeMounts:
          - name: llama-storage
            mountPath: /root/.llama
      volumes:
      - name: llama-storage
        persistentVolumeClaim:
          claimName: llama-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: llama-stack-service
spec:
  selector:
    app.kubernetes.io/name: llama-stack
  ports:
  - protocol: TCP
    port: 5000
    targetPort: 5000
  type: ClusterIP
EOF
```
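Optionally, wait for the rollout to finish before moving on to verification:

```bash
# Blocks until the llama-stack-server Deployment has finished rolling out
kubectl rollout status deployment/llama-stack-server
```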

## Verifying the Deployment

We can check that the Llama Stack server has started:

```bash
$ kubectl logs -l app.kubernetes.io/name=llama-stack
...
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     ASGI 'lifespan' protocol appears unsupported.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit)
```

Finally, we forward the Kubernetes service to a local port and test some inference requests against it via the Llama Stack Client:

```bash
# Run the port-forward in one terminal (it blocks)...
kubectl port-forward service/llama-stack-service 5000:5000

# ...and issue the request from another
llama-stack-client --endpoint http://localhost:5000 inference chat-completion --message "hello, what model are you?"
```
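As a further check, the CLI can also list the models the stack has registered; the `models list` subcommand is assumed to be available in your installed `llama-stack-client` version:

```bash
# Should show the vLLM-served model, e.g. meta-llama/Llama-3.2-1B-Instruct
llama-stack-client --endpoint http://localhost:5000 models list
```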

## Deploying Llama Stack Server in AWS EKS

We've also provided a script to deploy the Llama Stack server in an AWS EKS cluster. Once you have an EKS cluster, run:

```bash
cd docs/source/distributions/eks
./apply.sh
```

This script will:

- Set up a default storage class for AWS EKS
- Deploy the Llama Stack server in a Kubernetes Pod and Service