# Production Deployment: Multiple Llama Stack Servers

A production-focused guide for deploying multiple Llama Stack servers using container images and systemd services.

## Table of Contents

1. [Production Use Cases](#production-use-cases)
2. [Container-Based Deployment](#container-based-deployment)
3. [Systemd Service Deployment](#systemd-service-deployment)
4. [Docker Compose Deployment](#docker-compose-deployment)
5. [Kubernetes Deployment](#kubernetes-deployment)
6. [Load Balancing & High Availability](#load-balancing--high-availability)
7. [Monitoring & Logging](#monitoring--logging)
8. [Production Best Practices](#production-best-practices)

---

## Production Use Cases

### When to Deploy Multiple Llama Stack Servers

**Provider Isolation**: Separate servers for different AI providers (local vs. cloud)
```
Server 1: Ollama + local models (internal traffic)
Server 2: OpenAI + Anthropic (external API traffic)
Server 3: Enterprise providers (Bedrock, Azure)
```

**Workload Segmentation**: Different servers for different workloads
```
Server 1: Real-time inference (low latency)
Server 2: Batch processing (high throughput)
Server 3: Embeddings & vector operations
```

**Multi-Tenancy**: Isolated servers per tenant/environment
```
Server 1: Production tenant A
Server 2: Production tenant B
Server 3: Staging environment
```

**High Availability**: Load-balanced instances for fault tolerance
```
Server 1-3: Same config, load balanced
Server 4-6: Backup cluster
```

---

## Container-Based Deployment

### Method 1: Docker Containers

#### Container Image Options

**Option 1: Use Starter Distribution with Container Runtime**

```dockerfile
# Simple Dockerfile leveraging the starter distribution
FROM python:3.12-slim

# Install system dependencies
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

# Install LlamaStack
RUN pip install --no-cache-dir llama-stack

# Create non-root user
RUN useradd -r -s /bin/false -m llamastack
USER llamastack

# Initialize the starter distribution as the runtime user so the generated
# config lands under /home/llamastack/.llama
RUN llama stack build --template starter --name production-server

WORKDIR /app

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8321/v1/health || exit 1

# Use the starter distribution config (exec-form CMD does not expand "~",
# so the path must be absolute)
CMD ["llama", "stack", "run", "/home/llamastack/.llama/distributions/starter/starter-run.yaml"]
```

**Option 2: Use Standard Python Base Image (Recommended)**

Since LlamaStack doesn't provide a Dockerfile, use the standard Python installation approach:

```bash
# Use the simple approach with standard Python image
# No need to clone and build - just use pip install directly in containers
```

**Option 2b: Check for Official Images (Future)**

```bash
# Check if official images become available
docker search meta-llama/llama-stack
docker search llamastack

# For now, use the pip-based approach in Option 1 or 3
```

**Option 3: Lightweight Container with Starter Distribution**

```dockerfile
FROM python:3.12-alpine

# Install dependencies
RUN apk add --no-cache curl gcc musl-dev linux-headers

# Install LlamaStack
RUN pip install --no-cache-dir llama-stack

# Create non-root user
RUN adduser -D llamastack
USER llamastack

# Initialize the starter distribution as the runtime user so the config
# is created under /home/llamastack/.llama
RUN llama stack build --template starter --name starter

WORKDIR /app

HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8321/v1/health || exit 1

# Use CLI port override instead of modifying YAML
CMD ["llama", "stack", "run", "/home/llamastack/.llama/distributions/starter/starter-run.yaml", "--port", "8321"]
```

#### Prepare Server Configurations

**Server 1 Config (server1.yaml):**
```yaml
version: 2
image_name: production-server1
providers:
  inference:
  - provider_id: ollama
    provider_type: remote::ollama
    config:
      url: http://ollama-service:11434
  vector_io:
  - provider_id: faiss
    provider_type: inline::faiss
    config:
      kvstore:
        type: sqlite
        db_path: /data/server1/faiss_store.db
metadata_store:
  type: sqlite
  db_path: /data/server1/registry.db
server:
  port: 8321
```

**Server 2 Config (server2.yaml):**
```yaml
version: 2
image_name: production-server2
providers:
  inference:
  - provider_id: openai
    provider_type: remote::openai
    config:
      api_key: ${OPENAI_API_KEY}
  - provider_id: anthropic
    provider_type: remote::anthropic
    config:
      api_key: ${ANTHROPIC_API_KEY}
metadata_store:
  type: sqlite
  db_path: /data/server2/registry.db
server:
  port: 8322
```

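Before baking a config into an image, you can sanity-check it directly on the host; a minimal sketch, assuming `llama-stack` is installed locally and the config files above are saved under `./configs/`:

```bash
# Run Server 1's config directly and confirm it comes up
llama stack run ./configs/server1.yaml --port 8321

# From another shell, confirm the server answers
curl -f http://localhost:8321/v1/health
```
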
#### Build and Run Containers

```bash
# Build images
docker build -t llamastack-server1 -f Dockerfile.server1 .
docker build -t llamastack-server2 -f Dockerfile.server2 .

# Create a shared network and volumes for persistent data
docker network create llamastack-network
docker volume create llamastack-server1-data
docker volume create llamastack-server2-data

# Run Server 1
docker run -d \
  --name llamastack-server1 \
  --restart unless-stopped \
  -p 8321:8321 \
  -v llamastack-server1-data:/data/server1 \
  -e GROQ_API_KEY="${GROQ_API_KEY}" \
  --network llamastack-network \
  llamastack-server1

# Run Server 2
docker run -d \
  --name llamastack-server2 \
  --restart unless-stopped \
  -p 8322:8322 \
  -v llamastack-server2-data:/data/server2 \
  -e OPENAI_API_KEY="${OPENAI_API_KEY}" \
  -e ANTHROPIC_API_KEY="${ANTHROPIC_API_KEY}" \
  --network llamastack-network \
  llamastack-server2
```

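Once both containers are up, a quick check from the host confirms each server answers on its own port; `/v1/health` is the same endpoint the health checks above use, and `/v1/models` should list whatever models the configured providers expose:

```bash
# Verify both servers respond
curl -f http://localhost:8321/v1/health
curl -f http://localhost:8322/v1/health

# List the models each server exposes (output depends on configured providers)
curl -s http://localhost:8321/v1/models
curl -s http://localhost:8322/v1/models
```
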
---

## Docker Compose Deployment

### Method 2: Docker Compose (Recommended)

**docker-compose.yml:**
```yaml
# Note: Version specification is optional in modern Docker Compose
# Using latest Docker Compose format
services:
  # Ollama service for local models
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-service
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    networks:
      - llamastack-network

  # Server 1: Local providers
  llamastack-server1:
    build:
      context: .
      dockerfile: Dockerfile.server1
    container_name: llamastack-server1
    restart: unless-stopped
    ports:
      - "8321:8321"
    volumes:
      - server1-data:/data/server1
      - ./configs/server1.yaml:/app/configs/server.yaml:ro
    environment:
      - OLLAMA_URL=http://ollama:11434
      - GROQ_API_KEY=${GROQ_API_KEY}
    depends_on:
      - ollama
    networks:
      - llamastack-network
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8321/v1/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

  # Server 2: Cloud providers
  llamastack-server2:
    build:
      context: .
      dockerfile: Dockerfile.server2
    container_name: llamastack-server2
    restart: unless-stopped
    ports:
      - "8322:8322"
    volumes:
      - server2-data:/data/server2
      - ./configs/server2.yaml:/app/configs/server.yaml:ro
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - GROQ_API_KEY=${GROQ_API_KEY}
    networks:
      - llamastack-network
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8322/v1/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

  # Load balancer (optional)
  nginx:
    image: nginx:alpine
    container_name: llamastack-lb
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/nginx/ssl:ro
    depends_on:
      - llamastack-server1
      - llamastack-server2
    networks:
      - llamastack-network

volumes:
  ollama-data:
  server1-data:
  server2-data:

networks:
  llamastack-network:
    driver: bridge
```

**Environment file (.env):**
```bash
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
GROQ_API_KEY=your_groq_key_here
```

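Because the `.env` file holds live API keys, it is worth restricting who can read it and keeping it out of version control; for example, in a git-managed project:

```bash
# Restrict access to the .env file and keep it out of version control
chmod 600 .env
echo ".env" >> .gitignore
```
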
**Deploy with Docker Compose:**
```bash
# Start all services
docker-compose up -d

# Scale services (remove the fixed container_name and host port mapping
# for a service before scaling it, otherwise the replicas will conflict)
docker-compose up -d --scale llamastack-server1=3

# View logs
docker-compose logs -f llamastack-server1
docker-compose logs -f llamastack-server2

# Stop services
docker-compose down

# Update services
docker-compose pull && docker-compose up -d
```

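After starting the stack, you can confirm service health without leaving the compose project; for example:

```bash
# Show container status and health
docker-compose ps

# Probe each server's health endpoint from inside its container
docker-compose exec llamastack-server1 curl -f http://localhost:8321/v1/health
docker-compose exec llamastack-server2 curl -f http://localhost:8322/v1/health
```
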
---

## Kubernetes Deployment

### Method 3: Kubernetes

**ConfigMap (llamastack-configs.yaml):**
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: llamastack-configs
data:
  server1.yaml: |
    version: 2
    image_name: k8s-server1
    providers:
      inference:
      - provider_id: ollama
        provider_type: remote::ollama
        config:
          url: http://ollama-service:11434
    metadata_store:
      type: sqlite
      db_path: /data/registry.db
    server:
      port: 8321

  server2.yaml: |
    version: 2
    image_name: k8s-server2
    providers:
      inference:
      - provider_id: openai
        provider_type: remote::openai
        config:
          api_key: ${OPENAI_API_KEY}
    metadata_store:
      type: sqlite
      db_path: /data/registry.db
    server:
      port: 8322
```

**Deployments (llamastack-deployments.yaml):**
```yaml
# Server 1 Deployment - Local Providers
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llamastack-server1
  labels:
    app.kubernetes.io/name: llamastack-server1
    app.kubernetes.io/component: inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llamastack-server1
  template:
    metadata:
      labels:
        app: llamastack-server1
    spec:
      containers:
      - name: llamastack
        image: llamastack-server1:latest
        ports:
        - containerPort: 8321
          name: http-api
        env:
        - name: GROQ_API_KEY
          valueFrom:
            secretKeyRef:
              name: llamastack-secrets
              key: groq-api-key
        - name: OLLAMA_URL
          value: "http://ollama-service:11434"
        volumeMounts:
        - name: config
          mountPath: /app/configs/server.yaml
          subPath: server1.yaml
        - name: data
          mountPath: /data
        livenessProbe:
          httpGet:
            path: /v1/health
            port: 8321
          initialDelaySeconds: 60
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /v1/health
            port: 8321
          initialDelaySeconds: 10
          periodSeconds: 5
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
      volumes:
      - name: config
        configMap:
          name: llamastack-configs
      - name: data
        persistentVolumeClaim:
          claimName: llamastack-server1-pvc

---
# Server 2 Deployment - Cloud Providers
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llamastack-server2
  labels:
    app.kubernetes.io/name: llamastack-server2
    app.kubernetes.io/component: inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llamastack-server2
  template:
    metadata:
      labels:
        app: llamastack-server2
    spec:
      containers:
      - name: llamastack
        image: llamastack-server2:latest
        ports:
        - containerPort: 8322
          name: http-api
        env:
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: llamastack-secrets
              key: openai-api-key
        - name: ANTHROPIC_API_KEY
          valueFrom:
            secretKeyRef:
              name: llamastack-secrets
              key: anthropic-api-key
        - name: GROQ_API_KEY
          valueFrom:
            secretKeyRef:
              name: llamastack-secrets
              key: groq-api-key
        volumeMounts:
        - name: config
          mountPath: /app/configs/server.yaml
          subPath: server2.yaml
        - name: data
          mountPath: /data
        livenessProbe:
          httpGet:
            path: /v1/health
            port: 8322
          initialDelaySeconds: 60
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /v1/health
            port: 8322
          initialDelaySeconds: 10
          periodSeconds: 5
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "250m"
      volumes:
      - name: config
        configMap:
          name: llamastack-configs
      - name: data
        persistentVolumeClaim:
          claimName: llamastack-server2-pvc
```

**Services (llamastack-services.yaml):**
```yaml
# Server 1 Service
apiVersion: v1
kind: Service
metadata:
  name: llamastack-server1-service
  labels:
    app.kubernetes.io/name: llamastack-server1
spec:
  selector:
    app: llamastack-server1
  ports:
  - port: 8321
    targetPort: 8321
    protocol: TCP
    name: http-api
  type: LoadBalancer

---
# Server 2 Service
apiVersion: v1
kind: Service
metadata:
  name: llamastack-server2-service
  labels:
    app.kubernetes.io/name: llamastack-server2
spec:
  selector:
    app: llamastack-server2
  ports:
  - port: 8322
    targetPort: 8322
    protocol: TCP
    name: http-api
  type: LoadBalancer
```

**Persistent Volume Claims (llamastack-pvc.yaml):**
```yaml
# PVC for Server 1
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llamastack-server1-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: fast-ssd

---
# PVC for Server 2
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llamastack-server2-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: fast-ssd
```

**Secrets (llamastack-secrets.yaml):**
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: llamastack-secrets
type: Opaque
stringData:
  groq-api-key: "your_groq_api_key_here"
  openai-api-key: "your_openai_api_key_here"
  anthropic-api-key: "your_anthropic_api_key_here"
```

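If you prefer not to keep plaintext keys in a manifest, an equivalent Secret can be created from the command line instead (the values below are placeholders):

```bash
# Create the secret without writing keys to a file
kubectl create secret generic llamastack-secrets \
  --from-literal=groq-api-key="your_groq_api_key_here" \
  --from-literal=openai-api-key="your_openai_api_key_here" \
  --from-literal=anthropic-api-key="your_anthropic_api_key_here"
```
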
**Deploy to Kubernetes:**
```bash
# Apply configurations
kubectl apply -f llamastack-configs.yaml
kubectl apply -f llamastack-secrets.yaml
kubectl apply -f llamastack-pvc.yaml
kubectl apply -f llamastack-deployments.yaml
kubectl apply -f llamastack-services.yaml

# Check status
kubectl get pods -l app=llamastack-server1
kubectl get services

# Scale deployment
kubectl scale deployment llamastack-server1 --replicas=5

# View logs
kubectl logs -f deployment/llamastack-server1
```

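For a quick smoke test before exposing the services externally, port-forward one of the Services and hit its health endpoint:

```bash
# Forward the Server 1 Service to the local machine
kubectl port-forward service/llamastack-server1-service 8321:8321 &

# Confirm the pods behind the Service respond
curl -f http://localhost:8321/v1/health
```
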
---

## Load Balancing & High Availability

### NGINX Load Balancer Configuration

**nginx.conf:**
```nginx
# This file is mounted as the main /etc/nginx/nginx.conf, so the upstream and
# server blocks must live inside the http context.
events {}

http {
    upstream llamastack_local {
        least_conn;
        server llamastack-server1:8321 max_fails=3 fail_timeout=30s;
        server llamastack-server1-2:8321 max_fails=3 fail_timeout=30s;
        server llamastack-server1-3:8321 max_fails=3 fail_timeout=30s;
    }

    upstream llamastack_cloud {
        least_conn;
        server llamastack-server2:8322 max_fails=3 fail_timeout=30s;
        server llamastack-server2-2:8322 max_fails=3 fail_timeout=30s;
    }

    server {
        listen 80;
        server_name localhost;

        # Health check endpoint
        location /health {
            access_log off;
            return 200 "healthy\n";
            add_header Content-Type text/plain;
        }

        # Route to local providers
        location /v1/local/ {
            rewrite ^/v1/local/(.*) /v1/$1 break;
            proxy_pass http://llamastack_local;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_connect_timeout 30s;
            proxy_send_timeout 30s;
            proxy_read_timeout 30s;
        }

        # Route to cloud providers
        location /v1/cloud/ {
            rewrite ^/v1/cloud/(.*) /v1/$1 break;
            proxy_pass http://llamastack_cloud;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_connect_timeout 30s;
            proxy_send_timeout 30s;
            proxy_read_timeout 30s;
        }

        # Default routing
        location /v1/ {
            proxy_pass http://llamastack_local;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }
}
```

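With the load balancer in place, clients address a single host and pick a pool via the path prefix, which NGINX strips before proxying. Assuming the compose setup above publishes the load balancer on port 80:

```bash
# Routed to the local-provider pool (rewritten to /v1/health upstream)
curl -f http://localhost/v1/local/health

# Routed to the cloud-provider pool
curl -f http://localhost/v1/cloud/health

# No prefix: falls through to the default route (local pool)
curl -f http://localhost/v1/health
```
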
---

## Monitoring & Logging

### Prometheus Monitoring

**prometheus.yml:**
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'llamastack-servers'
    static_configs:
      - targets: ['llamastack-server1:8321', 'llamastack-server2:8322']
    metrics_path: /metrics
    scrape_interval: 30s
```

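One straightforward way to run Prometheus against this config is as another container on the same Docker network, so that the `llamastack-server1` and `llamastack-server2` target names resolve. This is a sketch; whether the servers actually serve `/metrics` depends on how telemetry is configured:

```bash
# Run Prometheus alongside the stack, mounting the config above
docker run -d \
  --name prometheus \
  --network llamastack-network \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml:ro" \
  prom/prometheus
```
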
### Grafana Dashboard

**Key metrics to monitor:**
- Request latency (p50, p95, p99)
- Request rate (requests/second)
- Error rate (4xx, 5xx responses)
- Container resource usage (CPU, memory)
- Provider-specific metrics (API quotas, rate limits)

### Centralized Logging

**docker-compose.yml addition:**
```yaml
# ELK Stack for logging (add these under the existing `services:` key and
# declare elasticsearch-data under the top-level `volumes:` key)
elasticsearch:
  image: docker.elastic.co/elasticsearch/elasticsearch:8.8.0
  environment:
    - discovery.type=single-node
  volumes:
    - elasticsearch-data:/usr/share/elasticsearch/data

logstash:
  image: docker.elastic.co/logstash/logstash:8.8.0
  volumes:
    - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf

kibana:
  image: docker.elastic.co/kibana/kibana:8.8.0
  ports:
    - "5601:5601"
```

*This guide focuses on production deployments and operational best practices for multiple Llama Stack servers.*