
Llama Stack Kubernetes Deployment Guide

This guide explains how to deploy Llama Stack on Kubernetes using the files in this directory.

Prerequisites

Before you begin, ensure you have:

  • A Kubernetes cluster up and running
  • kubectl installed and configured to access your cluster
  • envsubst command available (part of the gettext package)
  • Hugging Face API token (required for downloading models)
  • NVIDIA NGC API key (required for NIM models)

For the cluster setup, please do the following:

  1. Install the NVIDIA device plugin to enable GPU scheduling (a quick verification is sketched after this list):

    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml

  2. Install Prometheus and Grafana for GPU monitoring (see the Monitoring and Telemetry section below).
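A minimal check that the device plugin has registered GPUs with the scheduler (assumes at least one GPU node; nvidia.com/gpu is the resource name the plugin advertises):

# Each GPU node should report a non-zero nvidia.com/gpu capacity
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.capacity.nvidia\.com/gpu'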

Environment Setup

The deployment requires several environment variables to be set:

# Required environment variables
export HF_TOKEN=your_hugging_face_token  # Required for vLLM to download models
export NGC_API_KEY=your_ngc_api_key      # Required for NIM to download models

# Optional environment variables with defaults
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct  # Default inference model
export CODE_MODEL=bigcode/starcoder2-7b                  # Default code model
export OLLAMA_MODEL=llama-guard3:1b                      # Default safety model
export USE_EBS=false                                     # Use EBS storage (true/false)
export TAVILY_SEARCH_API_KEY=your_tavily_api_key         # Optional for search functionality
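The *.yaml.template files in this directory are rendered with envsubst, which substitutes these exported variables before the manifests are applied. Conceptually, apply.sh does something like the following for each template (a sketch, not the script's exact contents):

# Render a template with the current environment and apply it
envsubst < vllm-k8s.yaml.template | kubectl apply -f -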

Deployment Steps

  1. Clone the repository (if you haven't already):

    git clone https://github.com/meta-llama/llama-stack.git
    cd llama-stack
    git checkout k8s_demo
    cd docs/source/distributions/k8s
    
  2. Deploy the stack:

    export NGC_API_KEY=your_ngc_api_key
    export HF_TOKEN=your_hugging_face_token
    ./apply.sh
    

The deployment process:

  1. Creates Kubernetes secrets for authentication
  2. Deploys all components:
    • vLLM server (inference)
    • Ollama safety service
    • Llama NIM (code model)
    • PostgreSQL database
    • Chroma vector database
    • Jaeger (distributed tracing)
    • Llama Stack server
    • UI service
    • Ingress configuration
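Once apply.sh finishes, you can watch the pods come up; large model images can take several minutes to pull:

# Watch pods until they all reach Running
kubectl get pods -w

# Or block until a specific deployment finishes rolling out
kubectl rollout status deployment/llama-stack-server
kubectl rollout status deployment/vllm-server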

Storage Options

The deployment supports two storage options:

  1. EBS Storage (persistent):

    • Set USE_EBS=true for persistent storage
    • Data will persist across pod restarts
    • Requires EBS CSI driver in your cluster
  2. emptyDir Storage (non-persistent):

    • Default option (USE_EBS=false)
    • Data will be lost when pods restart
    • Useful for testing or when EBS is not available
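For example, to deploy with persistent storage and confirm that the volume claims were bound:

# Deploy with EBS-backed persistent volumes
export USE_EBS=true
./apply.sh

# All persistent volume claims should show STATUS Bound
kubectl get pvc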

Accessing the Services

After deployment, you can access the services:

  1. Check the available service endpoints:

    kubectl get svc
    kubectl get svc -n prometheus
    
  2. Port forward to access locally:

    # Llama Stack API
    kubectl port-forward svc/llama-stack-service 8321:8321

    # Grafana dashboard (the service name includes a release-specific suffix;
    # run kubectl get svc -n prometheus to find the name in your cluster)
    kubectl port-forward svc/kube-prometheus-stack-1754164871-grafana 31509:80 -n prometheus
    
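With the llama-stack port-forward running, a quick connectivity check (assuming the server exposes Llama Stack's standard health route):

# Should return a small JSON status payload if the server is up
curl http://localhost:8321/v1/health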

Configuration

Model Configuration

You can customize the models used by changing the environment variables in apply.sh:

export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct  # Change to your preferred model
export CODE_MODEL=bigcode/starcoder2-7b                  # Change to your preferred code model
export OLLAMA_MODEL=llama-guard3:1b                      # Change to your preferred safety model
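With the stack deployed and port-forwarded (see Accessing the Services above), you can confirm which models were actually registered (assuming the standard Llama Stack models route):

# List the models registered with the stack
curl http://localhost:8321/v1/models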

Stack Configuration

The stack configuration is defined in stack_run_config.yaml. This file configures:

  • API providers
  • Models
  • Database connections
  • Tool integrations

If you need to modify this configuration, edit the file before running apply.sh.
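A quick way to skim the file's top-level sections before deploying (plain grep; the section names are whatever the file defines):

# Show the top-level keys (providers, models, etc.) of the run config
grep -E '^[a-zA-Z_]+:' stack_run_config.yaml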

Monitoring and Telemetry

Prometheus Monitoring

The deployment includes Prometheus monitoring capabilities:

# Install Prometheus monitoring
./install-prometheus.sh
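The script installs the monitoring stack into the prometheus namespace (the same namespace the port-forward commands above assume). Verify it came up with:

# All monitoring pods should reach Running
kubectl get pods -n prometheus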

Jaeger Tracing

The deployment includes Jaeger for distributed tracing:

  1. Access the Jaeger UI:

    kubectl port-forward svc/jaeger 16686:16686
    

    Then open http://localhost:16686 in your browser.

  2. Trace Configuration:

    • Traces are automatically sent from llama-stack to Jaeger
    • The service name is set to "llama-stack" by default
    • Traces include spans for API calls, model inference, and other operations
  3. Troubleshooting Traces:

    • If traces are not appearing in Jaeger:
      • Verify Jaeger is running: kubectl get pods | grep jaeger
      • Check llama-stack logs: kubectl logs -f deployment/llama-stack-server
      • Ensure the OTLP endpoint is correctly configured in the stack configuration
      • Verify network connectivity between llama-stack and Jaeger
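With the Jaeger UI port-forwarded, you can also query Jaeger's HTTP API directly to see which services have reported spans:

# llama-stack should appear in the returned list once traces arrive
curl http://localhost:16686/api/services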

Cleanup

To remove all deployed resources:

./delete.sh

This will:

  1. Delete all deployments, services, and configmaps
  2. Remove persistent volume claims
  3. Delete secrets
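To confirm the teardown completed:

# Only the default kubernetes service should remain
kubectl get pods,svc,pvc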

Troubleshooting

Common Issues

  1. Secret creation fails:

    • Ensure your HF_TOKEN and NGC_API_KEY are correctly set
    • Check for any existing secrets that might conflict
  2. Pods stuck in pending state:

    • Check if your cluster has enough resources
    • For GPU-based deployments, ensure GPU nodes are available
  3. Models fail to download:

    • Verify your HF_TOKEN and NGC_API_KEY are valid
    • Check pod logs for specific error messages:
      kubectl logs -f deployment/vllm-server
      kubectl logs -f deployment/llm-nim-code
      
  4. Services not accessible:

    • Verify all pods are running:
      kubectl get pods
      
    • Check service endpoints:
      kubectl get endpoints
      
  5. Traces not appearing in Jaeger:

    • Check if the Jaeger pod is running: kubectl get pods | grep jaeger
    • Verify that the llama-stack server waited for Jaeger to be ready before starting
    • Check the telemetry configuration in stack_run_config.yaml
    • Ensure the OTLP endpoint is correctly set to http://jaeger.default.svc.cluster.local:4318
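To test in-cluster connectivity to that endpoint, you can run a throwaway curl pod (curlimages/curl is a public image used here for illustration; any HTTP status in reply, even an error, proves the port is reachable):

# An HTTP response code (even 4xx for this empty request) means the OTLP port is reachable
kubectl run otlp-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s -o /dev/null -w "%{http_code}\n" http://jaeger.default.svc.cluster.local:4318/v1/traces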

Viewing Logs

# View logs for specific components
kubectl logs -f deployment/llama-stack-server
kubectl logs -f deployment/vllm-server
kubectl logs -f deployment/llama-stack-ui
kubectl logs -f deployment/jaeger

Advanced Configuration

Custom Resource Limits

You can modify the resource limits in the YAML template files before deployment:

  • vllm-k8s.yaml.template: vLLM server resources
  • stack-k8s.yaml.template: Llama Stack server resources
  • llama-nim.yaml.template: NIM server resources
  • jaeger-k8s.yaml.template: Jaeger server resources
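To see where those limits live before editing (a simple grep over the templates):

# Show the resource requests/limits currently set in each template
grep -n -A4 'resources:' vllm-k8s.yaml.template stack-k8s.yaml.template llama-nim.yaml.template jaeger-k8s.yaml.template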

Additional Resources