
Llama Stack Kubernetes Deployment Guide

This guide explains how to deploy Llama Stack on Kubernetes using the files in this directory.

Prerequisites

Before you begin, ensure you have:

  • A Kubernetes cluster up and running
  • kubectl installed and configured to access your cluster
  • envsubst command available (part of the gettext package)
  • Hugging Face API token (required for downloading models)
  • NVIDIA NGC API key (required for NIM models)

For the cluster setup, also do the following:

  1. Install the NVIDIA device plugin to enable GPU scheduling (you can verify GPU visibility as shown below):
     kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml
  2. Install Prometheus and Grafana for GPU monitoring following this guide.
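
After installing the device plugin, you can confirm that the nodes advertise GPUs to the scheduler (a quick sanity check; output will vary by cluster):

# Each GPU node should report a nonzero nvidia.com/gpu count
kubectl describe nodes | grep -i "nvidia.com/gpu"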

Environment Setup

The deployment requires several environment variables to be set:

# Required environment variables
export HF_TOKEN=your_hugging_face_token  # Required for vLLM to download models
export NGC_API_KEY=your_ngc_api_key      # Required for NIM to download models

# Optional environment variables with defaults
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct  # Default inference model
export CODE_MODEL=bigcode/starcoder2-7b                  # Default code model
export OLLAMA_MODEL=llama-guard3:1b                      # Default safety model
export USE_EBS=false                                     # Use EBS storage (true/false)
export TAVILY_SEARCH_API_KEY=your_tavily_api_key         # Optional for search functionality
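
Before deploying, a small preflight check can catch a missing variable or a missing envsubst early. This is a minimal sketch (bash), not part of apply.sh:

# Warn if a required variable is unset
for var in HF_TOKEN NGC_API_KEY; do
  [ -n "${!var}" ] || echo "WARNING: $var is not set" >&2
done

# envsubst ships with the gettext package
command -v envsubst >/dev/null || echo "WARNING: envsubst not found (install gettext)" >&2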

Deployment Steps

  1. Clone the repository (if you haven't already):

    git clone https://github.com/meta-llama/llama-stack.git
    cd llama-stack
    git checkout k8s_demo
    cd docs/source/distributions/k8s
    
  2. Deploy the stack:

    export NGC_API_KEY=your_ngc_api_key
    export HF_TOKEN=your_hugging_face_token
    ./apply.sh
    

The deployment process does the following (a simplified sketch follows the list):

  1. Creates Kubernetes secrets for authentication
  2. Deploys all components:
    • vLLM server (inference)
    • Ollama safety service
    • Llama NIM (code model)
    • PostgreSQL database
    • Chroma vector database
    • Llama Stack server
    • UI service
    • Ingress configuration
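
For reference, the core pattern behind this process is: create secrets from your environment, render each template with envsubst, and apply the result. The sketch below is illustrative; the secret name and ordering are assumptions, so check apply.sh for the actual details:

# Illustrative only; names and ordering may differ from apply.sh
kubectl create secret generic hf-token-secret \
  --from-literal=token="$HF_TOKEN" --dry-run=client -o yaml | kubectl apply -f -

for t in *.yaml.template; do
  envsubst < "$t" | kubectl apply -f -
done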

Storage Options

The deployment supports two storage options:

  1. EBS Storage (persistent):

    • Set USE_EBS=true for persistent storage
    • Data will persist across pod restarts
    • Requires the EBS CSI driver in your cluster (see the check after this list)
  2. emptyDir Storage (non-persistent):

    • Default option (USE_EBS=false)
    • Data will be lost when pods restart
    • Useful for testing or when EBS is not available
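
If you enable EBS storage, verify that the EBS CSI driver and a usable StorageClass exist before deploying; otherwise the persistent volume claims will sit in Pending:

# Both should return results on an EBS-ready cluster
kubectl get csidrivers ebs.csi.aws.com
kubectl get storageclass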

Accessing the Services

After deployment, you can access the services:

  1. Check the available service endpoints:

    kubectl get svc
    kubectl get svc -n prometheus
    
  2. Port forward to access the services locally (a quick health check follows below):

    # Llama Stack server
    kubectl port-forward svc/llama-stack-service 8321:8321

    # Grafana (the numeric suffix in the release name varies per install)
    kubectl port-forward svc/kube-prometheus-stack-1754164871-grafana 31509:80 -n prometheus
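With the port forward in place, you can sanity-check the stack server over its health endpoint (assuming the default /v1/health route):

# Expect a small JSON status payload
curl -s http://localhost:8321/v1/health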

Configuration

Model Configuration

You can customize the models used by changing the environment variables in apply.sh:

export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct  # Change to your preferred model
export CODE_MODEL=bigcode/starcoder2-7b                  # Change to your preferred code model
export OLLAMA_MODEL=llama-guard3:1b                      # Change to your preferred safety model
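
After changing a model and re-running ./apply.sh, you can confirm what the server actually registered. Assuming a port forward to 8321 as shown earlier and the standard models route:

# List the models known to the stack server
curl -s http://localhost:8321/v1/models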

Stack Configuration

The stack configuration is defined in stack_run_config.yaml. This file configures:

  • API providers
  • Models
  • Database connections
  • Tool integrations

If you need to modify this configuration, edit the file before running apply.sh.
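
A quick way to catch YAML mistakes before they reach the cluster is to parse the file locally. This example uses Python's PyYAML, assuming it is installed:

# Exits non-zero on malformed YAML
python3 -c "import yaml; yaml.safe_load(open('stack_run_config.yaml'))"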

Monitoring

The deployment includes Prometheus monitoring capabilities:

# Install Prometheus monitoring
./install-prometheus.sh
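
After installation, confirm that the monitoring components come up in their namespace:

# Prometheus and Grafana pods should reach Running state
kubectl get pods -n prometheus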

Cleanup

To remove all deployed resources:

./delete.sh

This will:

  1. Delete all deployments, services, and configmaps
  2. Remove persistent volume claims
  3. Delete secrets
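
You can confirm the cleanup completed by listing what remains:

# Should show no llama-stack related resources once cleanup finishes
kubectl get deployments,services,pvc,secrets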

Troubleshooting

Common Issues

  1. Secret creation fails:

    • Ensure your HF_TOKEN and NGC_API_KEY are correctly set
    • Check for any existing secrets that might conflict
  2. Pods stuck in pending state:

    • Check if your cluster has enough resources
    • For GPU-based deployments, ensure GPU nodes are available (diagnostic commands follow this list)
  3. Models fail to download:

    • Verify your HF_TOKEN and NGC_API_KEY are valid
    • Check pod logs for specific error messages:
      kubectl logs -f deployment/vllm-server
      kubectl logs -f deployment/llm-nim-code
      
  4. Services not accessible:

    • Verify all pods are running:
      kubectl get pods
      
    • Check service endpoints:
      kubectl get endpoints
      
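For pods stuck in Pending (issue 2 above), the scheduler's reasoning usually appears in the pod's events, and for GPU workloads it helps to check what the nodes advertise. Substitute a real pod name below:

# Recent events at the end of the output explain why scheduling failed
kubectl describe pod <pod-name>

# GPU capacity the nodes expose to the scheduler
kubectl describe nodes | grep -i "nvidia.com/gpu"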

Viewing Logs

# View logs for specific components
kubectl logs -f deployment/llama-stack-server
kubectl logs -f deployment/vllm-server
kubectl logs -f deployment/llama-stack-ui
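
Cluster events often explain failures that never reach container logs, such as image pull errors or failed volume mounts:

# Most recent events appear last
kubectl get events --sort-by=.lastTimestamp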

Advanced Configuration

Custom Resource Limits

You can modify the resource limits in the YAML template files before deployment:

  • vllm-k8s.yaml.template: vLLM server resources
  • stack-k8s.yaml.template: Llama Stack server resources
  • llama-nim.yaml.template: NIM server resources
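
To review what each template currently requests before editing, you can search for the resources stanzas directly:

# Show the resources section of each template
grep -n -A 6 "resources:" vllm-k8s.yaml.template stack-k8s.yaml.template llama-nim.yaml.template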

Additional Resources