commit c005f50d16
Omar Abdelwahab, 2025-10-03 05:03:25 -07:00 (committed via GitHub)
2 changed files with 744 additions and 0 deletions


@@ -10,5 +10,6 @@ import TabItem from '@theme/TabItem';
# Deploying Llama Stack
[**→ Multiple Llama Stack Servers Guide**](./Launching_Multiple_LlamaStack_Servers.md)
[**→ Kubernetes Deployment Guide**](./kubernetes_deployment.mdx)
[**→ AWS EKS Deployment Guide**](./aws_eks_deployment.mdx)

New file: Launching_Multiple_LlamaStack_Servers.md

@@ -0,0 +1,743 @@
# Production Deployment: Multiple Llama Stack Servers
A production-focused guide for deploying multiple Llama Stack servers using container images and systemd services.
## Table of Contents
1. [Production Use Cases](#production-use-cases)
2. [Container-Based Deployment](#container-based-deployment)
3. [Systemd Service Deployment](#systemd-service-deployment)
4. [Docker Compose Deployment](#docker-compose-deployment)
5. [Kubernetes Deployment](#kubernetes-deployment)
6. [Load Balancing & High Availability](#load-balancing--high-availability)
7. [Monitoring & Logging](#monitoring--logging)
8. [Production Best Practices](#production-best-practices)
---
## Production Use Cases
### When to Deploy Multiple Llama Stack Servers
**Provider Isolation**: Separate servers for different AI providers (local vs. cloud)
```
Server 1: Ollama + local models (internal traffic)
Server 2: OpenAI + Anthropic (external API traffic)
Server 3: Enterprise providers (Bedrock, Azure)
```
**Workload Segmentation**: Different servers for different workloads
```
Server 1: Real-time inference (low latency)
Server 2: Batch processing (high throughput)
Server 3: Embeddings & vector operations
```
**Multi-Tenancy**: Isolated servers per tenant/environment
```
Server 1: Production tenant A
Server 2: Production tenant B
Server 3: Staging environment
```
**High Availability**: Load-balanced instances for fault tolerance
```
Server 1-3: Same config, load balanced
Server 4-6: Backup cluster
```
---
## Container-Based Deployment
### Method 1: Docker Containers
#### Container Image Options
**Option 1: Use Starter Distribution with Container Runtime**
```dockerfile
# Simple Dockerfile leveraging the starter distribution
FROM python:3.12-slim
# Install system dependencies
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
# Install LlamaStack
RUN pip install --no-cache-dir llama-stack
# Create non-root user first so the distribution is built in its home directory
RUN useradd -r -s /bin/false -m llamastack
USER llamastack
WORKDIR /app
# Initialize starter distribution (written under /home/llamastack/.llama)
RUN llama stack build --template starter --name production-server
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8321/v1/health || exit 1
# Use starter distribution config (exec-form CMD does not expand "~", so use the absolute path)
CMD ["llama", "stack", "run", "/home/llamastack/.llama/distributions/starter/starter-run.yaml"]
```
**Option 2: Use Standard Python Base Image (Recommended)**
Since LlamaStack does not ship an official Dockerfile, the standard approach is simply to install it with pip inside a stock Python base image, exactly as Options 1 and 3 do:
```bash
# Use the simple approach with standard Python image
# No need to clone and build - just use pip install directly in containers
```
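For a quick test without containers, the same flow also works directly on a host or VM; a minimal sketch, reusing the build and run commands from Options 1 and 3 (paths assume the starter distribution's default layout):
```bash
# Host-based sketch of the pip approach (assumes Python 3.12 is available)
python3 -m venv ~/llamastack-venv && source ~/llamastack-venv/bin/activate
pip install --no-cache-dir llama-stack
# Build the starter distribution once, then run it on the port you need
llama stack build --template starter --name starter
llama stack run ~/.llama/distributions/starter/starter-run.yaml --port 8321
```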
**Option 2b: Check for Official Images (Future)**
```bash
# Check if official images become available
docker search meta-llama/llama-stack
docker search llamastack
# For now, use the pip-based approach in Option 1 or 3
```
**Option 3: Lightweight Container with Starter Distribution**
```dockerfile
FROM python:3.12-alpine
# Install dependencies
RUN apk add --no-cache curl gcc musl-dev linux-headers
# Install LlamaStack
RUN pip install --no-cache-dir llama-stack
# Create non-root user first so the distribution lands under /home/llamastack
RUN adduser -D llamastack
USER llamastack
WORKDIR /app
# Initialize starter distribution
RUN llama stack build --template starter --name starter
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8321/v1/health || exit 1
# Use CLI port override instead of modifying YAML
CMD ["llama", "stack", "run", "/home/llamastack/.llama/distributions/starter/starter-run.yaml", "--port", "8321"]
```
#### Prepare Server Configurations
**Server 1 Config (server1.yaml):**
```yaml
version: 2
image_name: production-server1
providers:
inference:
- provider_id: ollama
provider_type: remote::ollama
config:
url: http://ollama-service:11434
vector_io:
- provider_id: faiss
provider_type: inline::faiss
config:
kvstore:
type: sqlite
db_path: /data/server1/faiss_store.db
metadata_store:
type: sqlite
db_path: /data/server1/registry.db
server:
port: 8321
```
**Server 2 Config (server2.yaml):**
```yaml
version: 2
image_name: production-server2
providers:
inference:
- provider_id: openai
provider_type: remote::openai
config:
api_key: ${OPENAI_API_KEY}
- provider_id: anthropic
provider_type: remote::anthropic
config:
api_key: ${ANTHROPIC_API_KEY}
metadata_store:
type: sqlite
db_path: /data/server2/registry.db
server:
port: 8322
```
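Before baking these configs into images, they can be smoke-tested with a local LlamaStack install; a rough sketch, assuming the CLI accepts a run-config path as in the container CMDs above and that the API keys are exported in the shell:
```bash
# Optional local smoke test of server2.yaml before containerizing
export OPENAI_API_KEY=... ANTHROPIC_API_KEY=...
llama stack run ./server2.yaml --port 8322 &
sleep 10 && curl -f http://localhost:8322/v1/health
```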
#### Build and Run Containers
```bash
# Build images
docker build -t llamastack-server1 -f Dockerfile.server1 .
docker build -t llamastack-server2 -f Dockerfile.server2 .
# Create the shared network and volumes for persistent data
docker network create llamastack-network
docker volume create llamastack-server1-data
docker volume create llamastack-server2-data
# Run Server 1
docker run -d \
--name llamastack-server1 \
--restart unless-stopped \
-p 8321:8321 \
-v llamastack-server1-data:/data/server1 \
-e GROQ_API_KEY="${GROQ_API_KEY}" \
--network llamastack-network \
llamastack-server1
# Run Server 2
docker run -d \
--name llamastack-server2 \
--restart unless-stopped \
-p 8322:8322 \
-v llamastack-server2-data:/data/server2 \
-e OPENAI_API_KEY="${OPENAI_API_KEY}" \
-e ANTHROPIC_API_KEY="${ANTHROPIC_API_KEY}" \
--network llamastack-network \
llamastack-server2
```
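Once both containers are running, verify them against the same health endpoint used by the HEALTHCHECK instructions:
```bash
# Confirm both servers respond
curl -f http://localhost:8321/v1/health
curl -f http://localhost:8322/v1/health
# Inspect container state and recent logs if a check fails
docker ps --filter "name=llamastack-server"
docker logs --tail 50 llamastack-server1
```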
---
## Docker Compose Deployment
### Method 2: Docker Compose (Recommended)
**docker-compose.yml:**
```yaml
# The top-level "version" key is optional with the modern Compose specification
services:
# Ollama service for local models
ollama:
image: ollama/ollama:latest
container_name: ollama-service
restart: unless-stopped
ports:
- "11434:11434"
volumes:
- ollama-data:/root/.ollama
networks:
- llamastack-network
# Server 1: Local providers
llamastack-server1:
build:
context: .
dockerfile: Dockerfile.server1
container_name: llamastack-server1
restart: unless-stopped
ports:
- "8321:8321"
volumes:
- server1-data:/data/server1
- ./configs/server1.yaml:/app/configs/server.yaml:ro
environment:
- OLLAMA_URL=http://ollama:11434
- GROQ_API_KEY=${GROQ_API_KEY}
depends_on:
- ollama
networks:
- llamastack-network
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8321/v1/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
# Server 2: Cloud providers
llamastack-server2:
build:
context: .
dockerfile: Dockerfile.server2
container_name: llamastack-server2
restart: unless-stopped
ports:
- "8322:8322"
volumes:
- server2-data:/data/server2
- ./configs/server2.yaml:/app/configs/server.yaml:ro
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY}
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
- GROQ_API_KEY=${GROQ_API_KEY}
networks:
- llamastack-network
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8322/v1/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
# Load balancer (optional)
nginx:
image: nginx:alpine
container_name: llamastack-lb
restart: unless-stopped
ports:
- "80:80"
- "443:443"
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
- ./ssl:/etc/nginx/ssl:ro
depends_on:
- llamastack-server1
- llamastack-server2
networks:
- llamastack-network
volumes:
ollama-data:
server1-data:
server2-data:
networks:
llamastack-network:
driver: bridge
```
**Environment file (.env):**
```bash
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
GROQ_API_KEY=your_groq_key_here
```
**Deploy with Docker Compose:**
```bash
# Start all services
docker-compose up -d
# Scale services (first remove the fixed container_name and host port mapping
# for that service in docker-compose.yml so replicas do not collide)
docker-compose up -d --scale llamastack-server1=3
# View logs
docker-compose logs -f llamastack-server1
docker-compose logs -f llamastack-server2
# Stop services
docker-compose down
# Update services
docker-compose pull && docker-compose up -d
```
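A quick post-deploy smoke test for the Compose stack (the `/health` route on port 80 assumes the optional nginx service is enabled):
```bash
docker-compose ps
curl -f http://localhost:8321/v1/health
curl -f http://localhost:8322/v1/health
curl -f http://localhost/health   # via the optional nginx load balancer
```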
---
## Kubernetes Deployment
### Method 3: Kubernetes
**ConfigMap (llamastack-configs.yaml):**
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: llamastack-configs
data:
server1.yaml: |
version: 2
image_name: k8s-server1
providers:
inference:
- provider_id: ollama
provider_type: remote::ollama
config:
url: http://ollama-service:11434
metadata_store:
type: sqlite
db_path: /data/registry.db
server:
port: 8321
server2.yaml: |
version: 2
image_name: k8s-server2
providers:
inference:
- provider_id: openai
provider_type: remote::openai
config:
api_key: ${OPENAI_API_KEY}
metadata_store:
type: sqlite
db_path: /data/registry.db
server:
port: 8322
```
**Deployments (llamastack-deployments.yaml):**
```yaml
# Server 1 Deployment - Local Providers
apiVersion: apps/v1
kind: Deployment
metadata:
name: llamastack-server1
labels:
app.kubernetes.io/name: llamastack-server1
app.kubernetes.io/component: inference
spec:
replicas: 3
selector:
matchLabels:
app: llamastack-server1
template:
metadata:
labels:
app: llamastack-server1
spec:
containers:
- name: llamastack
image: llamastack-server1:latest
ports:
- containerPort: 8321
name: http-api
env:
- name: GROQ_API_KEY
valueFrom:
secretKeyRef:
name: llamastack-secrets
key: groq-api-key
- name: OLLAMA_URL
value: "http://ollama-service:11434"
volumeMounts:
- name: config
mountPath: /app/configs
subPath: server1.yaml
- name: data
mountPath: /data
livenessProbe:
httpGet:
path: /v1/health
port: 8321
initialDelaySeconds: 60
periodSeconds: 30
readinessProbe:
httpGet:
path: /v1/health
port: 8321
initialDelaySeconds: 10
periodSeconds: 5
resources:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "1Gi"
cpu: "500m"
volumes:
- name: config
configMap:
name: llamastack-configs
- name: data
persistentVolumeClaim:
claimName: llamastack-server1-pvc
---
# Server 2 Deployment - Cloud Providers
apiVersion: apps/v1
kind: Deployment
metadata:
name: llamastack-server2
labels:
app.kubernetes.io/name: llamastack-server2
app.kubernetes.io/component: inference
spec:
replicas: 2
selector:
matchLabels:
app: llamastack-server2
template:
metadata:
labels:
app: llamastack-server2
spec:
containers:
- name: llamastack
image: llamastack-server2:latest
ports:
- containerPort: 8322
name: http-api
env:
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: llamastack-secrets
key: openai-api-key
- name: ANTHROPIC_API_KEY
valueFrom:
secretKeyRef:
name: llamastack-secrets
key: anthropic-api-key
- name: GROQ_API_KEY
valueFrom:
secretKeyRef:
name: llamastack-secrets
key: groq-api-key
volumeMounts:
- name: config
mountPath: /app/configs
subPath: server2.yaml
- name: data
mountPath: /data
livenessProbe:
httpGet:
path: /v1/health
port: 8322
initialDelaySeconds: 60
periodSeconds: 30
readinessProbe:
httpGet:
path: /v1/health
port: 8322
initialDelaySeconds: 10
periodSeconds: 5
resources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "250m"
volumes:
- name: config
configMap:
name: llamastack-configs
- name: data
persistentVolumeClaim:
claimName: llamastack-server2-pvc
```
**Services (llamastack-services.yaml):**
```yaml
# Server 1 Service
apiVersion: v1
kind: Service
metadata:
name: llamastack-server1-service
labels:
app.kubernetes.io/name: llamastack-server1
spec:
selector:
app: llamastack-server1
ports:
- port: 8321
targetPort: 8321
protocol: TCP
name: http-api
type: LoadBalancer
---
# Server 2 Service
apiVersion: v1
kind: Service
metadata:
name: llamastack-server2-service
labels:
app.kubernetes.io/name: llamastack-server2
spec:
selector:
app: llamastack-server2
ports:
- port: 8322
targetPort: 8322
protocol: TCP
name: http-api
type: LoadBalancer
```
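If your cluster does not provision external load balancers, the services can still be reached for testing with a port-forward, for example:
```bash
# Temporary local access to Server 1 without a cloud LoadBalancer
kubectl port-forward svc/llamastack-server1-service 8321:8321 &
curl -f http://localhost:8321/v1/health
```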
**Persistent Volume Claims (llamastack-pvc.yaml):**
```yaml
# PVC for Server 1
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: llamastack-server1-pvc
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
storageClassName: fast-ssd
---
# PVC for Server 2
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: llamastack-server2-pvc
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
storageClassName: fast-ssd
```
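Note that `ReadWriteOnce` claims attach to a single node, so with `replicas: 3` the server1 pods may fail to start if they are scheduled across nodes while sharing one claim. If your cluster offers an RWX-capable storage class, a shared claim is one option; a sketch (the storage class name below is a placeholder):
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llamastack-server1-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs-rwx   # placeholder: any ReadWriteMany-capable class
```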
**Secrets (llamastack-secrets.yaml):**
```yaml
apiVersion: v1
kind: Secret
metadata:
name: llamastack-secrets
type: Opaque
stringData:
groq-api-key: "your_groq_api_key_here"
openai-api-key: "your_openai_api_key_here"
anthropic-api-key: "your_anthropic_api_key_here"
```
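To keep API keys out of YAML files checked into version control, the same Secret can instead be created from the command line:
```bash
# Create the secret directly from environment variables
kubectl create secret generic llamastack-secrets \
  --from-literal=groq-api-key="$GROQ_API_KEY" \
  --from-literal=openai-api-key="$OPENAI_API_KEY" \
  --from-literal=anthropic-api-key="$ANTHROPIC_API_KEY"
```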
**Deploy to Kubernetes:**
```bash
# Apply configurations
kubectl apply -f llamastack-configs.yaml
kubectl apply -f llamastack-secrets.yaml
kubectl apply -f llamastack-pvc.yaml
kubectl apply -f llamastack-deployments.yaml
kubectl apply -f llamastack-services.yaml
# Check status
kubectl get pods -l app=llamastack-server1
kubectl get services
# Scale deployment
kubectl scale deployment llamastack-server1 --replicas=5
# View logs
kubectl logs -f deployment/llamastack-server1
```
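Instead of scaling manually, a HorizontalPodAutoscaler can adjust replicas based on load; a minimal sketch targeting the server1 Deployment (the CPU threshold and replica bounds are assumptions to tune for your workload):
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llamastack-server1-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llamastack-server1
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```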
---
## Load Balancing & High Availability
### NGINX Load Balancer Configuration
**nginx.conf:**
```nginx
upstream llamastack_local {
least_conn;
server llamastack-server1:8321 max_fails=3 fail_timeout=30s;
server llamastack-server1-2:8321 max_fails=3 fail_timeout=30s;
server llamastack-server1-3:8321 max_fails=3 fail_timeout=30s;
}
upstream llamastack_cloud {
least_conn;
server llamastack-server2:8322 max_fails=3 fail_timeout=30s;
server llamastack-server2-2:8322 max_fails=3 fail_timeout=30s;
}
server {
listen 80;
server_name localhost;
# Health check endpoint
location /health {
access_log off;
return 200 "healthy\n";
add_header Content-Type text/plain;
}
# Route to local providers
location /v1/local/ {
rewrite ^/v1/local/(.*) /v1/$1 break;
proxy_pass http://llamastack_local;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_connect_timeout 30s;
proxy_send_timeout 30s;
proxy_read_timeout 30s;
}
# Route to cloud providers
location /v1/cloud/ {
rewrite ^/v1/cloud/(.*) /v1/$1 break;
proxy_pass http://llamastack_cloud;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_connect_timeout 30s;
proxy_send_timeout 30s;
proxy_read_timeout 30s;
}
# Default routing
location /v1/ {
proxy_pass http://llamastack_local;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
}
```
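The upstream hostnames above (`llamastack-server1-2`, `llamastack-server2-2`, and so on) assume multiple replicas are resolvable by name; adjust them to match how you actually scale. Once nginx is up, the routing rules can be exercised with curl:
```bash
curl -f http://localhost/health            # answered by nginx itself
curl -f http://localhost/v1/local/health   # rewritten to /v1/health on the local pool
curl -f http://localhost/v1/cloud/health   # rewritten to /v1/health on the cloud pool
```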
---
## Monitoring & Logging
### Prometheus Monitoring
**prometheus.yml:**
```yaml
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'llamastack-servers'
static_configs:
- targets: ['llamastack-server1:8321', 'llamastack-server2:8322']
metrics_path: /metrics
scrape_interval: 30s
```
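To run Prometheus alongside the servers, it can be added to the Compose stack; a sketch assuming the network and file layout from the docker-compose.yml above (whether the servers actually expose `/metrics` depends on your build and telemetry configuration):
```yaml
# Add under the existing services: key in docker-compose.yml
prometheus:
  image: prom/prometheus:latest
  container_name: llamastack-prometheus
  restart: unless-stopped
  ports:
    - "9090:9090"
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
  networks:
    - llamastack-network
```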
### Grafana Dashboard
**Key metrics to monitor:**
- Request latency (p50, p95, p99)
- Request rate (requests/second)
- Error rate (4xx, 5xx responses)
- Container resource usage (CPU, memory)
- Provider-specific metrics (API quotas, rate limits)
### Centralized Logging
**docker-compose.yml addition:**
```yaml
# ELK Stack for logging
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.8.0
environment:
- discovery.type=single-node
volumes:
- elasticsearch-data:/usr/share/elasticsearch/data
logstash:
image: docker.elastic.co/logstash/logstash:8.8.0
volumes:
- ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
kibana:
image: docker.elastic.co/kibana/kibana:8.8.0
ports:
- "5601:5601"
```
*This guide focuses on production deployments and operational best practices for multiple Llama Stack servers.*