# Llama Stack Kubernetes Deployment Guide

This guide explains how to deploy Llama Stack on Kubernetes using the files in this directory.

## Prerequisites

Before you begin, ensure you have:

- A Kubernetes cluster up and running
- `kubectl` installed and configured to access your cluster
- The `envsubst` command available (part of the `gettext` package)
- A Hugging Face API token (required for downloading models)
- An NVIDIA NGC API key (required for NIM models)

To set up the cluster:

1. Install the NVIDIA device plugin, which enables GPU support (see the sanity check after this list):

   ```bash
   kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml
   ```

2. Install Prometheus and Grafana for GPU monitoring by following [this guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/kube-prometheus.html).

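A quick sanity check (not part of the original setup scripts): after installing the device plugin, your GPU nodes should advertise an `nvidia.com/gpu` resource.

```bash
# Nodes should list nvidia.com/gpu under Capacity and Allocatable once the plugin is running
kubectl describe nodes | grep -B 2 -A 2 "nvidia.com/gpu"
```
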
## Environment Setup

The deployment requires several environment variables to be set:

```bash
# Required environment variables
export HF_TOKEN=your_hugging_face_token   # Required for vLLM to download models
export NGC_API_KEY=your_ngc_api_key       # Required for NIM to download models

# Optional environment variables with defaults
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct   # Default inference model
export CODE_MODEL=bigcode/starcoder2-7b                   # Default code model
export OLLAMA_MODEL=llama-guard3:1b                       # Default safety model
export USE_EBS=false                                      # Use EBS storage (true/false)
export TAVILY_SEARCH_API_KEY=your_tavily_api_key          # Optional, enables search functionality
```

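The `*.yaml.template` files in this directory are rendered with `envsubst` (which is why it appears in the prerequisites). A minimal sketch of rendering a single template by hand, assuming you want to bypass `apply.sh` for one component (the template name is one of the files listed under Advanced Configuration):

```bash
# Render a template with the exported variables and apply it to the cluster
envsubst < vllm-k8s.yaml.template | kubectl apply -f -
```
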
## Deployment Steps

1. **Clone the repository** (if you haven't already):

   ```bash
   git clone https://github.com/meta-llama/llama-stack.git
   cd llama-stack
   git checkout k8s_demo
   cd docs/source/distributions/k8s
   ```

2. **Deploy the stack**:

   ```bash
   export NGC_API_KEY=your_ngc_api_key
   export HF_TOKEN=your_hugging_face_token
   ./apply.sh
   ```

The deployment process:

1. Creates Kubernetes secrets for authentication (a sketch of this step follows the list)
2. Deploys all components:
   - vLLM server (inference)
   - Ollama safety service
   - Llama NIM (code model)
   - PostgreSQL database
   - Chroma vector database
   - Jaeger (distributed tracing)
   - Llama Stack server
   - UI service
   - Ingress configuration

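Step 1 above typically boils down to a couple of `kubectl create secret` calls. The exact secret and key names are defined in `apply.sh` and the templates; the ones below are illustrative only:

```bash
# Illustrative sketch -- the real secret/key names live in apply.sh and the YAML templates
kubectl create secret generic hf-token-secret --from-literal=token="${HF_TOKEN}"
kubectl create secret generic ngc-api-key-secret --from-literal=key="${NGC_API_KEY}"
```
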
## Storage Options

The deployment supports two storage options:

1. **EBS Storage** (persistent):
   - Set `USE_EBS=true` for persistent storage
   - Data will persist across pod restarts
   - Requires the EBS CSI driver in your cluster

2. **emptyDir Storage** (non-persistent):
   - Default option (`USE_EBS=false`)
   - Data will be lost when pods restart
   - Useful for testing or when EBS is not available

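If you deploy with `USE_EBS=true`, it is worth confirming that the persistent volume claims actually bind (a quick check, not part of the deployment scripts):

```bash
# PVCs should show STATUS "Bound"; also confirm an EBS-backed StorageClass exists
kubectl get pvc
kubectl get storageclass
```
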
## Accessing the Services

After deployment, you can access the services:

1. **Check available service endpoints**:

   ```bash
   kubectl get svc
   kubectl get svc -n prometheus
   ```

2. **Port forward to access locally**:

   - To access the UI at http://localhost:8322:

     ```bash
     # The UI is a separate service from the API server; confirm the exact
     # service name and port with `kubectl get svc` if this one differs
     kubectl port-forward svc/llama-stack-ui-service 8322:8322
     ```

   - To use the llama-stack endpoint at http://localhost:8321:

     ```bash
     kubectl port-forward svc/llama-stack-service 8321:8321
     ```

   - To reach the Grafana endpoint at http://localhost:31509:

     ```bash
     # The Helm release suffix (1754164871 here) is install-specific;
     # find yours with `kubectl get svc -n prometheus`
     kubectl port-forward svc/kube-prometheus-stack-1754164871-grafana 31509:80 -n prometheus
     ```

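With the llama-stack port-forward running, you can spot-check that the server responds. The path below assumes the standard Llama Stack HTTP API; adjust it if your version exposes a different route:

```bash
# Should return the list of models registered with the stack
curl http://localhost:8321/v1/models
```
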
## Configuration

### Model Configuration

You can customize the models used by changing the environment variables in `apply.sh`:

```bash
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct   # Change to your preferred model
export CODE_MODEL=bigcode/starcoder2-7b                   # Change to your preferred code model
export OLLAMA_MODEL=llama-guard3:1b                       # Change to your preferred safety model
```

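For example, to switch the inference model you would export the new value (or edit it in `apply.sh`) and re-run the deployment script; the model name below is just an illustration:

```bash
export INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct   # illustrative alternative model
./apply.sh
```
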
### Stack Configuration

The stack configuration is defined in `stack_run_config.yaml`. This file configures:

- API providers
- Models
- Database connections
- Tool integrations

If you need to modify this configuration, edit the file before running `apply.sh`.

## Monitoring and Telemetry

### Prometheus Monitoring

The deployment includes Prometheus monitoring capabilities:

```bash
# Install Prometheus monitoring
./install-prometheus.sh
```

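Once the install script finishes, you can verify that the monitoring stack came up (the namespace name is assumed from the `kubectl get svc -n prometheus` commands used elsewhere in this guide):

```bash
# All Prometheus/Grafana pods should eventually reach Running
kubectl get pods -n prometheus
```
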
### Jaeger Tracing

The deployment includes Jaeger for distributed tracing:

1. **Access the Jaeger UI**:

   ```bash
   kubectl port-forward svc/jaeger 16686:16686
   ```

   Then open http://localhost:16686 in your browser.

2. **Trace Configuration**:
   - Traces are automatically sent from llama-stack to Jaeger
   - The service name is set to "llama-stack" by default
   - Traces include spans for API calls, model inference, and other operations

3. **Troubleshooting Traces**:
   - If traces are not appearing in Jaeger:
     - Verify Jaeger is running: `kubectl get pods | grep jaeger`
     - Check llama-stack logs: `kubectl logs -f deployment/llama-stack-server`
     - Ensure the OTLP endpoint is correctly configured in the stack configuration
     - Verify network connectivity between llama-stack and Jaeger (see the check after this list)

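One way to check that connectivity is to hit the OTLP/HTTP endpoint from a throwaway pod inside the cluster. The endpoint URL matches the one given under Troubleshooting; everything else here is just an illustrative diagnostic:

```bash
# A 2xx/4xx response proves the Jaeger OTLP endpoint is reachable in-cluster;
# a timeout or DNS error points to a networking problem
kubectl run otlp-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sv -X POST -H 'Content-Type: application/json' -d '{}' \
  http://jaeger.default.svc.cluster.local:4318/v1/traces
```
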
## Cleanup

To remove all deployed resources:

```bash
./delete.sh
```

This will:

1. Delete all deployments, services, and configmaps
2. Remove persistent volume claims
3. Delete secrets

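If you want to double-check that nothing was left behind (purely a convenience check, not part of `delete.sh`):

```bash
# The stack's deployments, services, PVCs, and secrets should no longer be listed
kubectl get pods,svc,pvc,secrets
```
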
## Troubleshooting

### Common Issues

1. **Secret creation fails**:
   - Ensure your HF_TOKEN and NGC_API_KEY are correctly set
   - Check for any existing secrets that might conflict

2. **Pods stuck in pending state**:
   - Check if your cluster has enough resources
   - For GPU-based deployments, ensure GPU nodes are available

3. **Models fail to download**:
   - Verify your HF_TOKEN and NGC_API_KEY are valid
   - Check pod logs for specific error messages:

     ```bash
     kubectl logs -f deployment/vllm-server
     kubectl logs -f deployment/llm-nim-code
     ```

4. **Services not accessible**:
   - Verify all pods are running:

     ```bash
     kubectl get pods
     ```

   - Check service endpoints:

     ```bash
     kubectl get endpoints
     ```

5. **Traces not appearing in Jaeger**:
   - Check that the Jaeger pod is running: `kubectl get pods | grep jaeger`
   - Verify that the llama-stack server waited for Jaeger to be ready before starting
   - Check the telemetry configuration in `stack_run_config.yaml`
   - Ensure the OTLP endpoint is correctly set to `http://jaeger.default.svc.cluster.local:4318`

### Viewing Logs

```bash
# View logs for specific components
kubectl logs -f deployment/llama-stack-server
kubectl logs -f deployment/vllm-server
kubectl logs -f deployment/llama-stack-ui
kubectl logs -f deployment/jaeger
```

## Advanced Configuration

### Custom Resource Limits

You can modify the resource limits in the YAML template files before deployment:

- `vllm-k8s.yaml.template`: vLLM server resources
- `stack-k8s.yaml.template`: Llama Stack server resources
- `llama-nim.yaml.template`: NIM server resources
- `jaeger-k8s.yaml.template`: Jaeger server resources

## Additional Resources

- [Llama Stack Documentation](https://github.com/meta-llama/llama-stack)
- [vLLM Documentation](https://docs.vllm.ai/)
- [Kubernetes Documentation](https://kubernetes.io/docs/)
- [Jaeger Tracing Documentation](https://www.jaegertracing.io/docs/)