# Kubernetes Deployment Guide

Instead of starting the Llama Stack and vLLM servers locally, we can deploy them in a Kubernetes cluster.

## Prerequisites

In this guide, we'll use a local Kind cluster and a vLLM inference service in the same cluster for demonstration purposes.

Note: You can also deploy the Llama Stack server in an AWS EKS cluster. See the Deploying Llama Stack Server in AWS EKS section below for more details.

First, create a local Kubernetes cluster via Kind:

```bash
kind create cluster --image kindest/node:v1.32.0 --name llama-stack-test
```
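
To confirm the cluster is reachable before continuing, you can query it through the context Kind creates (a quick sanity check; Kind prefixes the context name with `kind-`):

```bash
# Verify the new cluster responds; the context follows Kind's kind-<cluster-name> convention.
kubectl cluster-info --context kind-llama-stack-test
```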

Next, set your Hugging Face token as an environment variable. It is base64-encoded here because the Kubernetes Secret created below stores the value in its `data` field:

```bash
export HF_TOKEN=$(echo -n "your-hf-token" | base64)
```

Now create a Kubernetes PVC and Secret for downloading and storing the Hugging Face model:

```bash
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
data:
  token: $HF_TOKEN
EOF
```
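
Equivalently, you can let `kubectl` create the Secret straight from the raw token, which handles the base64 encoding for you (a minimal sketch; skip it if you applied the manifest above):

```bash
# Creates the same hf-token-secret as above from the plain-text token.
kubectl create secret generic hf-token-secret --from-literal=token="your-hf-token"
```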

Next, start the vLLM server as a Kubernetes Deployment and Service:

```bash
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: vllm
  template:
    metadata:
      labels:
        app.kubernetes.io/name: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve meta-llama/Llama-3.2-1B-Instruct"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
          - containerPort: 8000
        volumeMounts:
          - name: llama-storage
            mountPath: /root/.cache/huggingface
      volumes:
      - name: llama-storage
        persistentVolumeClaim:
          claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app.kubernetes.io/name: vllm
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
EOF
```
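
If your cluster exposes NVIDIA GPUs through the device plugin, you will likely want to reserve one for the vLLM container. A minimal sketch of the extra fields, assuming the GPUs are advertised as the `nvidia.com/gpu` extended resource:

```yaml
# Sketch: add under the vllm container in the Deployment above.
# Assumes the NVIDIA device plugin is installed and advertises nvidia.com/gpu.
resources:
  limits:
    nvidia.com/gpu: 1
```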

We can verify that the vLLM server has started successfully via the logs (this might take a couple of minutes to download the model):

```bash
$ kubectl logs -l app.kubernetes.io/name=vllm
...
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
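
You can also spot-check vLLM's OpenAI-compatible API directly through a temporary port-forward (a quick sketch; assumes local port 8000 is free):

```bash
# Forward the vLLM Service locally (or run this in a separate terminal),
# then list the models served by the OpenAI-compatible API.
kubectl port-forward service/vllm-server 8000:8000 &
sleep 2
curl http://localhost:8000/v1/models
```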

Then we can modify the Llama Stack run configuration YAML with the following inference provider:

```yaml
providers:
  inference:
  - provider_id: vllm
    provider_type: remote::vllm
    config:
      url: http://vllm-server.default.svc.cluster.local:8000/v1
      max_tokens: 4096
      api_token: fake
```

Once we have defined the run configuration for Llama Stack, we can build an image with that configuration and the server source code. The Containerfile below expects the configuration to be available as `vllm-llama-stack-run-k8s.yaml` in the build context:

```bash
tmp_dir=$(mktemp -d) && cat >$tmp_dir/Containerfile.llama-stack-run-k8s <<EOF
FROM distribution-myenv:dev

RUN apt-get update && apt-get install -y git
RUN git clone https://github.com/meta-llama/llama-stack.git /app/llama-stack-source

ADD ./vllm-llama-stack-run-k8s.yaml /app/config.yaml
EOF
podman build -f $tmp_dir/Containerfile.llama-stack-run-k8s -t llama-stack-run-k8s $tmp_dir
```
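
Because the Deployment below pulls `localhost/llama-stack-run-k8s:latest` with `imagePullPolicy: IfNotPresent`, the image has to be present on the Kind node. One way to get it there when building with Podman is to export the image and load the archive into the cluster (a sketch; adjust the archive path to taste):

```bash
# Export the locally built image and load it into the Kind cluster.
podman save localhost/llama-stack-run-k8s:latest -o /tmp/llama-stack-run-k8s.tar
kind load image-archive /tmp/llama-stack-run-k8s.tar --name llama-stack-test
```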

## Deploying Llama Stack Server in Kubernetes

We can then start the Llama Stack server by applying a Kubernetes PersistentVolumeClaim, Deployment, and Service:

```bash
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-stack-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: llama-stack
  template:
    metadata:
      labels:
        app.kubernetes.io/name: llama-stack
    spec:
      containers:
      - name: llama-stack
        image: localhost/llama-stack-run-k8s:latest
        imagePullPolicy: IfNotPresent
        command: ["python", "-m", "llama_stack.distribution.server.server", "--config", "/app/config.yaml"]
        ports:
          - containerPort: 5000
        volumeMounts:
          - name: llama-storage
            mountPath: /root/.llama
      volumes:
      - name: llama-storage
        persistentVolumeClaim:
          claimName: llama-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: llama-stack-service
spec:
  selector:
    app.kubernetes.io/name: llama-stack
  ports:
  - protocol: TCP
    port: 5000
    targetPort: 5000
  type: ClusterIP
EOF
```

## Verifying the Deployment

We can check that the Llama Stack server has started:

```bash
$ kubectl logs -l app.kubernetes.io/name=llama-stack
...
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     ASGI 'lifespan' protocol appears unsupported.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit)
```
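
If the logs are not there yet, it can help to wait for the pod to become Ready first (standard kubectl, shown as a sketch):

```bash
# Show the Llama Stack pod, then wait up to five minutes for it to report Ready.
kubectl get pods -l app.kubernetes.io/name=llama-stack
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=llama-stack --timeout=300s
```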

Finally, we forward the Kubernetes service to a local port and test some inference requests against it via the Llama Stack Client:

```bash
kubectl port-forward service/llama-stack-service 5000:5000
llama-stack-client --endpoint http://localhost:5000 inference chat-completion --message "hello, what model are you?"
```
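
With the port-forward running (for example, in another terminal), you can also ask the server which models it has registered (hedged: the exact subcommand may vary across client versions):

```bash
# List models registered with the Llama Stack server; assumes this subcommand exists in your client version.
llama-stack-client --endpoint http://localhost:5000 models list
```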

## Deploying Llama Stack Server in AWS EKS

We also provide a script for deploying the Llama Stack server in an AWS EKS cluster. Once you have an EKS cluster, run the following to deploy the server:

```bash
cd docs/source/distributions/eks
./apply.sh
```

This script will:

- Set up a default storage class for AWS EKS
- Deploy the Llama Stack server in a Kubernetes Pod and Service
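
After the script finishes, you can inspect what it created with ordinary kubectl commands (resource names depend on the templates under `docs/source/distributions/eks`, so treat this as a rough check rather than an exact listing):

```bash
# Rough check of the deployed resources; adjust selectors and namespaces to match the templates.
kubectl get pods,svc,pvc
```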