From 3e77ebf7725175c02b0c7ef64e9a15adbc0b784a Mon Sep 17 00:00:00 2001 From: Omar Abdelwahab Date: Wed, 1 Oct 2025 17:54:35 -0700 Subject: [PATCH 1/4] Added documentation for launching multiple llama stack servers --- .../Launching_Multiple_LlamaStack_Servers.md | 468 ++++++++++++++++++ docs/docs/deploying/index.mdx | 1 + 2 files changed, 469 insertions(+) create mode 100644 docs/docs/deploying/Launching_Multiple_LlamaStack_Servers.md diff --git a/docs/docs/deploying/Launching_Multiple_LlamaStack_Servers.md b/docs/docs/deploying/Launching_Multiple_LlamaStack_Servers.md new file mode 100644 index 000000000..729b14d77 --- /dev/null +++ b/docs/docs/deploying/Launching_Multiple_LlamaStack_Servers.md @@ -0,0 +1,468 @@ +# Multiple Llama Stack Servers: Starter Distro Guide + +A complete guide to running multiple Llama Stack servers using the **starter distribution** for first-time users. + +## Table of Contents + +1. [System Requirements](#system-requirements) +2. [Verify Llama Stack](#Verify-that-llama-stack-is-installed) +3. [Initialize Starter Distribution](#initialize-starter-distribution) +4. [Set Up Multiple Servers](#set-up-multiple-servers) +5. [Configure API Keys](#configure-api-keys) +6. [Start the Servers](#start-the-servers) +7. [Test Your Setup](#test-your-setup) +8. [Manage Your Servers](#manage-your-servers) +9. [Troubleshooting](#troubleshooting) + +--- + +## System Requirements + +### Minimum Requirements +- **Operating System**: Linux, macOS, or Windows with WSL2 +- **Python**: Version 3.12 or higher +- **RAM**: 8GB minimum (16GB recommended) +- **Storage**: 10GB free space minimum +- **Network**: Stable internet connection + +### Check Your System +```bash +# Check Python version +python3 --version + +# Check available RAM +free -h + +# Check disk space +df -h +``` + +--- + +## Verify Llama Stack + +### Step 1: Verify that llama stack is installed +```bash +# Verify installation +llama stack --help +``` + +### Step 2: Initialize Starter Distribution +```bash +# Initialize the starter distribution +llama stack build --template starter --name starter + +# This creates ~/.llama/distributions/starter/ +``` + +--- + +## Set Up Multiple Servers + +The starter distribution provides a comprehensive configuration with multiple providers. We'll create **2 servers** based on this starter config: + +- **Server 1** (Port 8321): Full starter config with all providers +- **Server 2** (Port 8322): Same config with different database paths (using CLI port override) + +### Step 1: Examine the Base Configuration + +```bash +# View the starter configuration +cat ~/.llama/distributions/starter/starter-run.yaml +``` + +### Step 2: Create Server 1 Configuration (Full Starter) + +```bash +# Copy the starter config for Server 1 +cp ~/.llama/distributions/starter/starter-run.yaml ~/server1-starter.yaml +``` + +### Step 3: Create Server 2 Configuration (Same Config, Different Databases) + +```bash +# Copy starter config for Server 2 +cp ~/.llama/distributions/starter/starter-run.yaml ~/server2-starter.yaml + +# Change the database paths to avoid conflicts (only change needed!) +sed -i 's|~/.llama/distributions/starter|~/.llama/distributions/starter2|g' ~/server2-starter.yaml +``` + +### Step 4: Create Separate Database Directories +```bash +# Create separate directories for Server 2 +mkdir -p ~/.llama/distributions/starter2 +``` + +**That's it!** No need to modify ports in YAML files - we'll use the CLI `--port` flag instead. 
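As a quick sanity check, assuming the copies were made exactly as above, confirm that the two files differ only in their storage paths (on macOS, note that BSD `sed` needs `-i ''` in place of `-i`):

```bash
# The only differences reported should be database/storage paths
diff ~/server1-starter.yaml ~/server2-starter.yaml

# Spot-check that Server 2 now points at the starter2 directory;
# if nothing matches, the config may store absolute paths, so adjust the sed pattern
grep -n "starter2" ~/server2-starter.yaml
```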
+ +--- + +## Configure API Keys + +The starter configuration supports many providers. Set up the API keys you need: + +### Essential API Keys + +```bash +# Groq (fast inference) +export GROQ_API_KEY="your_groq_api_key_here" + +# OpenAI (if you want to use GPT models) +export OPENAI_API_KEY="your_openai_api_key_here" + +# Anthropic (if you want Claude models) +export ANTHROPIC_API_KEY="your_anthropic_api_key_here" + +# Ollama (for local models) +export OLLAMA_URL="http://localhost:11434" +``` + +### Optional API Keys (Set only if you plan to use these providers) + +```bash +# Fireworks AI +export FIREWORKS_API_KEY="your_fireworks_api_key" + +# Together AI +export TOGETHER_API_KEY="your_together_api_key" + +# Gemini +export GEMINI_API_KEY="your_gemini_api_key" + +# NVIDIA +export NVIDIA_API_KEY="your_nvidia_api_key" +``` + +--- + +## Set Up Ollama (Optional) + +If you want to use local models through Ollama: + +### Install and Start Ollama + +**Linux:** +```bash +curl -fsSL https://ollama.com/install.sh | sh +ollama serve +``` + +**macOS:** +```bash +brew install ollama +ollama serve +``` + +### Download Models (in a new terminal) + +```bash +# Download popular models +ollama pull llama3.1:8b +ollama pull llama-guard3:8b +ollama pull all-minilm:l6-v2 + +# Verify models +ollama list +``` + +--- + +## Start the Servers + +### Method 1: Run in Separate Terminals (Recommended for Development) + +**Terminal 1 - Server 1:** +```bash +cd ~ +llama stack run ~/server1-starter.yaml --port 8321 +``` + +**Terminal 2 - Server 2 (Uses CLI port override!):** +```bash +cd ~ +llama stack run ~/server2-starter.yaml --port 8322 +``` + +### Method 2: Run in Background + +```bash +# Start Server 1 in background +cd ~ +nohup llama stack run ~/server1-starter.yaml --port 8321 > server1.log 2>&1 & + +# Start Server 2 in background with port override +nohup llama stack run ~/server2-starter.yaml --port 8322 > server2.log 2>&1 & +``` + +### Method 3: Alternative - Use Environment Variable + +```bash +# You can also set port via environment variable +export LLAMA_STACK_PORT=8322 +llama stack run ~/server2-starter.yaml + +# Or inline +LLAMA_STACK_PORT=8322 llama stack run ~/server2-starter.yaml +``` + +### Expected Output + +Both servers should start successfully: +``` +Starting server on port 8321... +Server is running at http://localhost:8321 +``` + +``` +Starting server on port 8322... +Server is running at http://localhost:8322 +``` + +--- + +## Test Your Setup + +### Step 1: Health Check + +```bash +# Test both servers +curl http://localhost:8321/v1/health +curl http://localhost:8322/v1/health +``` + +**Expected Response:** +```json +{"status": "OK"} +``` + +### Step 2: List Available Models + +```bash +# Check models on Server 1 +curl -s http://localhost:8321/v1/models | python3 -m json.tool + +# Check models on Server 2 +curl -s http://localhost:8322/v1/models | python3 -m json.tool +``` + +### Step 3: Test Inference with Different Providers + +**Test Groq on Server 1:** +```bash +curl -X POST http://localhost:8321/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [{"role": "user", "content": "Hello! 
How are you?"}], + "model": "groq/llama-3.1-8b-instant" + }' +``` + +**Test OpenAI on Server 2 (if you have OpenAI API key):** +```bash +curl -X POST http://localhost:8322/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [{"role": "user", "content": "Hello from server 2!"}], + "model": "openai/gpt-4o-mini" + }' +``` + +**Test Ollama (if you set it up):** +```bash +curl -X POST http://localhost:8321/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [{"role": "user", "content": "Hello from Ollama!"}], + "model": "ollama/llama3.1:8b" + }' +``` + +### Step 4: Test Embeddings + +```bash +curl -X POST http://localhost:8321/v1/embeddings \ + -H "Content-Type: application/json" \ + -d '{ + "input": "Hello world", + "model": "sentence-transformers/all-MiniLM-L6-v2" + }' +``` + +--- + +## Manage Your Servers + +### Check What's Running + +```bash +# Check server processes +lsof -i :8321 -i :8322 + +# Check all llama stack processes +ps aux | grep "llama.*stack" +``` + +### Stop Servers + +**Stop individual servers:** +```bash +# Stop Server 1 +kill $(lsof -t -i:8321) + +# Stop Server 2 +kill $(lsof -t -i:8322) +``` + +**Stop all servers:** +```bash +pkill -f "llama.*stack.*run" +``` + +### View Logs (if running in background) + +```bash +# Watch Server 1 logs +tail -f server1.log + +# Watch Server 2 logs +tail -f server2.log +``` + +### Restart Servers + +```bash +# Stop all servers first +pkill -f "llama.*stack.*run" +sleep 3 + +# Restart both servers +cd ~ +nohup llama stack run ~/server1-starter.yaml > server1.log 2>&1 & +nohup llama stack run ~/server2-starter.yaml > server2.log 2>&1 & +``` + +--- + +## Troubleshooting + +### Problem: "Port already in use" + +```bash +# Find what's using the ports +lsof -i :8321 -i :8322 + +# Kill processes using the ports +kill $(lsof -t -i:8321) +kill $(lsof -t -i:8322) +``` + +### Problem: "Provider not available" + +The starter config includes many providers that may not have API keys set. 
This is normal behavior: + +```bash +# Check which environment variables are set +env | grep -E "(GROQ|OPENAI|ANTHROPIC|OLLAMA)_" + +# Set missing API keys you want to use +export GROQ_API_KEY="your_key_here" +``` + +### Problem: "No models available" + +```bash +# Check available models +curl -s http://localhost:8321/v1/models | python3 -m json.tool + +# If empty, check your API keys are set correctly +echo $GROQ_API_KEY +echo $OPENAI_API_KEY +``` + +### Problem: Ollama connection issues + +```bash +# Check if Ollama is running +curl http://localhost:11434/api/version + +# If not running, start it +ollama serve + +# Verify OLLAMA_URL is set +echo $OLLAMA_URL +``` + +--- + +## Advanced Usage + +### Customize Provider Selection + +You can modify the YAML files to enable/disable specific providers: + +```yaml +# In your server config, comment out providers you don't want +providers: + inference: + # - provider_id: openai # Disabled + # provider_type: remote::openai + # config: + # api_key: ${env.OPENAI_API_KEY:=} + + - provider_id: groq # Enabled + provider_type: remote::groq + config: + api_key: ${env.GROQ_API_KEY:=} +``` + +### You Can Use Different Providers on Different Servers + +**Server 1 - Local providers** +- Enable: Ollama, vllm, other local providers +- Disable: OpenAI, Anthropic, Groq, Fireworks + +**Server 2 - Remote providers:** +- Enable: OpenAI, Anthropic, Gemini +- Disable: Ollama, vllm and local providers +--- + +## Summary + +You now have **2 Llama Stack servers** running with the starter distribution: + +### Server Configuration +- **Server 1**: `http://localhost:8321` (Full starter config) +- **Server 2**: `http://localhost:8322` (Modified starter config) + +### Key Files +- `~/server1-starter.yaml` - Server 1 configuration +- `~/server2-starter.yaml` - Server 2 configuration +- `server1.log` - Server 1 logs (if background) +- `server2.log` - Server 2 logs (if background) + +### Key Commands +```bash +# Health check +curl http://localhost:8321/v1/health +curl http://localhost:8322/v1/health + +# Stop servers +kill $(lsof -t -i:8321) +kill $(lsof -t -i:8322) + +# Check processes +lsof -i :8321 -i :8322 +``` + +### Next Steps +1. Create more servers with different configurations if needed. +2. Set up API keys for providers you want to use. +3. Test different models and providers. +4. Customize configurations for your specific needs. +5. Set up monitoring and logging for production use. 
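To script these checks, a small loop over both ports covers the health and model-listing calls shown above (a sketch that assumes both servers are running locally on the default ports):

```bash
# Smoke-test both servers using the endpoints from this guide
for port in 8321 8322; do
  echo "--- Server on port ${port} ---"
  curl -sf "http://localhost:${port}/v1/health" && echo
  curl -sf "http://localhost:${port}/v1/models" | python3 -m json.tool | head -n 20
done
```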
+ + +--- + +*This guide uses the official Llama Stack starter distribution for maximum compatibility and feature coverage.* diff --git a/docs/docs/deploying/index.mdx b/docs/docs/deploying/index.mdx index eaa0e2612..25fcab682 100644 --- a/docs/docs/deploying/index.mdx +++ b/docs/docs/deploying/index.mdx @@ -10,5 +10,6 @@ import TabItem from '@theme/TabItem'; # Deploying Llama Stack +[**→ Multiple Llama Stack Servers Guide**](./Launching_Multiple_LlamaStack_Servers.md) [**→ Kubernetes Deployment Guide**](./kubernetes_deployment.mdx) [**→ AWS EKS Deployment Guide**](./aws_eks_deployment.mdx) From a8ec9081abeac85032262e5f9cd0f233c77335c5 Mon Sep 17 00:00:00 2001 From: Omar Abdelwahab Date: Wed, 1 Oct 2025 21:59:57 -0700 Subject: [PATCH 2/4] lowercased the file name --- ...aStack_Servers.md => launching_multiple_llamastack_servers.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/docs/deploying/{Launching_Multiple_LlamaStack_Servers.md => launching_multiple_llamastack_servers.md} (100%) diff --git a/docs/docs/deploying/Launching_Multiple_LlamaStack_Servers.md b/docs/docs/deploying/launching_multiple_llamastack_servers.md similarity index 100% rename from docs/docs/deploying/Launching_Multiple_LlamaStack_Servers.md rename to docs/docs/deploying/launching_multiple_llamastack_servers.md From 3a999f4082e21216b40eb5aef588195982893d73 Mon Sep 17 00:00:00 2001 From: Omar Abdelwahab Date: Thu, 2 Oct 2025 09:30:26 -0700 Subject: [PATCH 3/4] updated the documentation for containerized production use cases --- .../launching_multiple_llamastack_servers.md | 1154 +++++++++++------ 1 file changed, 732 insertions(+), 422 deletions(-) diff --git a/docs/docs/deploying/launching_multiple_llamastack_servers.md b/docs/docs/deploying/launching_multiple_llamastack_servers.md index 729b14d77..1cf5eda6c 100644 --- a/docs/docs/deploying/launching_multiple_llamastack_servers.md +++ b/docs/docs/deploying/launching_multiple_llamastack_servers.md @@ -1,468 +1,778 @@ -# Multiple Llama Stack Servers: Starter Distro Guide +# Production Deployment: Multiple Llama Stack Servers -A complete guide to running multiple Llama Stack servers using the **starter distribution** for first-time users. +A production-focused guide for deploying multiple Llama Stack servers using container images and systemd services. ## Table of Contents -1. [System Requirements](#system-requirements) -2. [Verify Llama Stack](#Verify-that-llama-stack-is-installed) -3. [Initialize Starter Distribution](#initialize-starter-distribution) -4. [Set Up Multiple Servers](#set-up-multiple-servers) -5. [Configure API Keys](#configure-api-keys) -6. [Start the Servers](#start-the-servers) -7. [Test Your Setup](#test-your-setup) -8. [Manage Your Servers](#manage-your-servers) -9. [Troubleshooting](#troubleshooting) +1. [Production Use Cases](#production-use-cases) +2. [Container-Based Deployment](#container-based-deployment) +3. [Systemd Service Deployment](#systemd-service-deployment) +4. [Docker Compose Deployment](#docker-compose-deployment) +5. [Kubernetes Deployment](#kubernetes-deployment) +6. [Load Balancing & High Availability](#load-balancing--high-availability) +7. [Monitoring & Logging](#monitoring--logging) +8. 
[Production Best Practices](#production-best-practices) --- -## System Requirements +## Production Use Cases -### Minimum Requirements -- **Operating System**: Linux, macOS, or Windows with WSL2 -- **Python**: Version 3.12 or higher -- **RAM**: 8GB minimum (16GB recommended) -- **Storage**: 10GB free space minimum -- **Network**: Stable internet connection +### When to Deploy Multiple Llama Stack Servers -### Check Your System -```bash -# Check Python version -python3 --version +**Provider Isolation**: Separate servers for different AI providers (local vs. cloud) +``` +Server 1: Ollama + local models (internal traffic) +Server 2: OpenAI + Anthropic (external API traffic) +Server 3: Enterprise providers (Bedrock, Azure) +``` -# Check available RAM -free -h +**Workload Segmentation**: Different servers for different workloads +``` +Server 1: Real-time inference (low latency) +Server 2: Batch processing (high throughput) +Server 3: Embeddings & vector operations +``` -# Check disk space -df -h +**Multi-Tenancy**: Isolated servers per tenant/environment +``` +Server 1: Production tenant A +Server 2: Production tenant B +Server 3: Staging environment +``` + +**High Availability**: Load-balanced instances for fault tolerance +``` +Server 1-3: Same config, load balanced +Server 4-6: Backup cluster ``` --- -## Verify Llama Stack +## Container-Based Deployment -### Step 1: Verify that llama stack is installed -```bash -# Verify installation -llama stack --help +### Method 1: Docker Containers + +#### Build Custom Container Images + +**Create Dockerfile for Llama Stack:** + +```dockerfile +# Option 1: Recommended - Python slim (balanced size/compatibility) +FROM python:3.12-slim + +# Install system dependencies +RUN apt-get update && apt-get install -y \ + curl \ + && rm -rf /var/lib/apt/lists/* \ + && apt-get clean + +# Install Llama Stack +RUN pip install --no-cache-dir llama-stack + +# Create app directory +WORKDIR /app + +# Create non-root user for security +RUN useradd -r -s /bin/false -m llamastack +USER llamastack + +# Copy configuration +COPY --chown=llamastack:llamastack configs/ /app/configs/ + +# Expose port +EXPOSE 8321 + +# Health check +HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \ + CMD curl -f http://localhost:8321/v1/health || exit 1 + +# Default command +CMD ["llama", "stack", "run", "/app/configs/server.yaml"] ``` -### Step 2: Initialize Starter Distribution -```bash -# Initialize the starter distribution -llama stack build --template starter --name starter +**Alternative: Ultra-lightweight Alpine version:** +```dockerfile +# Option 2: Alpine - Smallest size (~50MB total) +FROM python:3.12-alpine -# This creates ~/.llama/distributions/starter/ +# Install system dependencies +RUN apk add --no-cache curl gcc musl-dev linux-headers + +# Install Llama Stack +RUN pip install --no-cache-dir llama-stack + +# Create app directory +WORKDIR /app + +# Create non-root user +RUN adduser -D -s /bin/sh llamastack +USER llamastack + +# Copy configuration +COPY --chown=llamastack:llamastack configs/ /app/configs/ + +# Expose port +EXPOSE 8321 + +# Health check +HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \ + CMD curl -f http://localhost:8321/v1/health || exit 1 + +# Default command +CMD ["llama", "stack", "run", "/app/configs/server.yaml"] ``` ---- +**Alternative: Multi-stage build for production:** +```dockerfile +# Option 3: Multi-stage - Minimal runtime image +FROM python:3.12-slim as builder -## Set Up Multiple Servers +# Install build 
dependencies +RUN apt-get update && apt-get install -y \ + build-essential \ + && rm -rf /var/lib/apt/lists/* -The starter distribution provides a comprehensive configuration with multiple providers. We'll create **2 servers** based on this starter config: +# Install Python packages +RUN pip install --user --no-cache-dir llama-stack -- **Server 1** (Port 8321): Full starter config with all providers -- **Server 2** (Port 8322): Same config with different database paths (using CLI port override) +# Runtime stage +FROM python:3.12-slim -### Step 1: Examine the Base Configuration +# Install only runtime dependencies +RUN apt-get update && apt-get install -y \ + curl \ + && rm -rf /var/lib/apt/lists/* \ + && apt-get clean -```bash -# View the starter configuration -cat ~/.llama/distributions/starter/starter-run.yaml +# Copy installed packages from builder +COPY --from=builder /root/.local /root/.local + +# Create non-root user +RUN useradd -r -s /bin/false -m llamastack +USER llamastack + +# Set PATH +ENV PATH="/root/.local/bin:$PATH" + +WORKDIR /app +COPY --chown=llamastack:llamastack configs/ /app/configs/ + +EXPOSE 8321 + +HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \ + CMD curl -f http://localhost:8321/v1/health || exit 1 + +CMD ["llama", "stack", "run", "/app/configs/server.yaml"] ``` -### Step 2: Create Server 1 Configuration (Full Starter) - -```bash -# Copy the starter config for Server 1 -cp ~/.llama/distributions/starter/starter-run.yaml ~/server1-starter.yaml -``` - -### Step 3: Create Server 2 Configuration (Same Config, Different Databases) - -```bash -# Copy starter config for Server 2 -cp ~/.llama/distributions/starter/starter-run.yaml ~/server2-starter.yaml - -# Change the database paths to avoid conflicts (only change needed!) -sed -i 's|~/.llama/distributions/starter|~/.llama/distributions/starter2|g' ~/server2-starter.yaml -``` - -### Step 4: Create Separate Database Directories -```bash -# Create separate directories for Server 2 -mkdir -p ~/.llama/distributions/starter2 -``` - -**That's it!** No need to modify ports in YAML files - we'll use the CLI `--port` flag instead. - ---- - -## Configure API Keys - -The starter configuration supports many providers. 
Set up the API keys you need: - -### Essential API Keys - -```bash -# Groq (fast inference) -export GROQ_API_KEY="your_groq_api_key_here" - -# OpenAI (if you want to use GPT models) -export OPENAI_API_KEY="your_openai_api_key_here" - -# Anthropic (if you want Claude models) -export ANTHROPIC_API_KEY="your_anthropic_api_key_here" - -# Ollama (for local models) -export OLLAMA_URL="http://localhost:11434" -``` - -### Optional API Keys (Set only if you plan to use these providers) - -```bash -# Fireworks AI -export FIREWORKS_API_KEY="your_fireworks_api_key" - -# Together AI -export TOGETHER_API_KEY="your_together_api_key" - -# Gemini -export GEMINI_API_KEY="your_gemini_api_key" - -# NVIDIA -export NVIDIA_API_KEY="your_nvidia_api_key" -``` - ---- - -## Set Up Ollama (Optional) - -If you want to use local models through Ollama: - -### Install and Start Ollama - -**Linux:** -```bash -curl -fsSL https://ollama.com/install.sh | sh -ollama serve -``` - -**macOS:** -```bash -brew install ollama -ollama serve -``` - -### Download Models (in a new terminal) - -```bash -# Download popular models -ollama pull llama3.1:8b -ollama pull llama-guard3:8b -ollama pull all-minilm:l6-v2 - -# Verify models -ollama list -``` - ---- - -## Start the Servers - -### Method 1: Run in Separate Terminals (Recommended for Development) - -**Terminal 1 - Server 1:** -```bash -cd ~ -llama stack run ~/server1-starter.yaml --port 8321 -``` - -**Terminal 2 - Server 2 (Uses CLI port override!):** -```bash -cd ~ -llama stack run ~/server2-starter.yaml --port 8322 -``` - -### Method 2: Run in Background - -```bash -# Start Server 1 in background -cd ~ -nohup llama stack run ~/server1-starter.yaml --port 8321 > server1.log 2>&1 & - -# Start Server 2 in background with port override -nohup llama stack run ~/server2-starter.yaml --port 8322 > server2.log 2>&1 & -``` - -### Method 3: Alternative - Use Environment Variable - -```bash -# You can also set port via environment variable -export LLAMA_STACK_PORT=8322 -llama stack run ~/server2-starter.yaml - -# Or inline -LLAMA_STACK_PORT=8322 llama stack run ~/server2-starter.yaml -``` - -### Expected Output - -Both servers should start successfully: -``` -Starting server on port 8321... -Server is running at http://localhost:8321 -``` - -``` -Starting server on port 8322... -Server is running at http://localhost:8322 -``` - ---- - -## Test Your Setup - -### Step 1: Health Check - -```bash -# Test both servers -curl http://localhost:8321/v1/health -curl http://localhost:8322/v1/health -``` - -**Expected Response:** -```json -{"status": "OK"} -``` - -### Step 2: List Available Models - -```bash -# Check models on Server 1 -curl -s http://localhost:8321/v1/models | python3 -m json.tool - -# Check models on Server 2 -curl -s http://localhost:8322/v1/models | python3 -m json.tool -``` - -### Step 3: Test Inference with Different Providers - -**Test Groq on Server 1:** -```bash -curl -X POST http://localhost:8321/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "messages": [{"role": "user", "content": "Hello! 
How are you?"}], - "model": "groq/llama-3.1-8b-instant" - }' -``` - -**Test OpenAI on Server 2 (if you have OpenAI API key):** -```bash -curl -X POST http://localhost:8322/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "messages": [{"role": "user", "content": "Hello from server 2!"}], - "model": "openai/gpt-4o-mini" - }' -``` - -**Test Ollama (if you set it up):** -```bash -curl -X POST http://localhost:8321/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "messages": [{"role": "user", "content": "Hello from Ollama!"}], - "model": "ollama/llama3.1:8b" - }' -``` - -### Step 4: Test Embeddings - -```bash -curl -X POST http://localhost:8321/v1/embeddings \ - -H "Content-Type: application/json" \ - -d '{ - "input": "Hello world", - "model": "sentence-transformers/all-MiniLM-L6-v2" - }' -``` - ---- - -## Manage Your Servers - -### Check What's Running - -```bash -# Check server processes -lsof -i :8321 -i :8322 - -# Check all llama stack processes -ps aux | grep "llama.*stack" -``` - -### Stop Servers - -**Stop individual servers:** -```bash -# Stop Server 1 -kill $(lsof -t -i:8321) - -# Stop Server 2 -kill $(lsof -t -i:8322) -``` - -**Stop all servers:** -```bash -pkill -f "llama.*stack.*run" -``` - -### View Logs (if running in background) - -```bash -# Watch Server 1 logs -tail -f server1.log - -# Watch Server 2 logs -tail -f server2.log -``` - -### Restart Servers - -```bash -# Stop all servers first -pkill -f "llama.*stack.*run" -sleep 3 - -# Restart both servers -cd ~ -nohup llama stack run ~/server1-starter.yaml > server1.log 2>&1 & -nohup llama stack run ~/server2-starter.yaml > server2.log 2>&1 & -``` - ---- - -## Troubleshooting - -### Problem: "Port already in use" - -```bash -# Find what's using the ports -lsof -i :8321 -i :8322 - -# Kill processes using the ports -kill $(lsof -t -i:8321) -kill $(lsof -t -i:8322) -``` - -### Problem: "Provider not available" - -The starter config includes many providers that may not have API keys set. 
This is normal behavior: - -```bash -# Check which environment variables are set -env | grep -E "(GROQ|OPENAI|ANTHROPIC|OLLAMA)_" - -# Set missing API keys you want to use -export GROQ_API_KEY="your_key_here" -``` - -### Problem: "No models available" - -```bash -# Check available models -curl -s http://localhost:8321/v1/models | python3 -m json.tool - -# If empty, check your API keys are set correctly -echo $GROQ_API_KEY -echo $OPENAI_API_KEY -``` - -### Problem: Ollama connection issues - -```bash -# Check if Ollama is running -curl http://localhost:11434/api/version - -# If not running, start it -ollama serve - -# Verify OLLAMA_URL is set -echo $OLLAMA_URL -``` - ---- - -## Advanced Usage - -### Customize Provider Selection - -You can modify the YAML files to enable/disable specific providers: +#### Prepare Server Configurations +**Server 1 Config (server1.yaml):** ```yaml -# In your server config, comment out providers you don't want +version: 2 +image_name: production-server1 providers: inference: - # - provider_id: openai # Disabled - # provider_type: remote::openai - # config: - # api_key: ${env.OPENAI_API_KEY:=} - - - provider_id: groq # Enabled - provider_type: remote::groq + - provider_id: ollama + provider_type: remote::ollama config: - api_key: ${env.GROQ_API_KEY:=} + url: http://ollama-service:11434 + vector_io: + - provider_id: faiss + provider_type: inline::faiss + config: + kvstore: + type: sqlite + db_path: /data/server1/faiss_store.db +metadata_store: + type: sqlite + db_path: /data/server1/registry.db +server: + port: 8321 ``` -### You Can Use Different Providers on Different Servers +**Server 2 Config (server2.yaml):** +```yaml +version: 2 +image_name: production-server2 +providers: + inference: + - provider_id: openai + provider_type: remote::openai + config: + api_key: ${OPENAI_API_KEY} + - provider_id: anthropic + provider_type: remote::anthropic + config: + api_key: ${ANTHROPIC_API_KEY} +metadata_store: + type: sqlite + db_path: /data/server2/registry.db +server: + port: 8322 +``` -**Server 1 - Local providers** -- Enable: Ollama, vllm, other local providers -- Disable: OpenAI, Anthropic, Groq, Fireworks +#### Build and Run Containers -**Server 2 - Remote providers:** -- Enable: OpenAI, Anthropic, Gemini -- Disable: Ollama, vllm and local providers ---- - -## Summary - -You now have **2 Llama Stack servers** running with the starter distribution: - -### Server Configuration -- **Server 1**: `http://localhost:8321` (Full starter config) -- **Server 2**: `http://localhost:8322` (Modified starter config) - -### Key Files -- `~/server1-starter.yaml` - Server 1 configuration -- `~/server2-starter.yaml` - Server 2 configuration -- `server1.log` - Server 1 logs (if background) -- `server2.log` - Server 2 logs (if background) - -### Key Commands ```bash -# Health check -curl http://localhost:8321/v1/health -curl http://localhost:8322/v1/health +# Build images +docker build -t llamastack-server1 -f Dockerfile.server1 . +docker build -t llamastack-server2 -f Dockerfile.server2 . 
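# The `docker run` commands below attach both containers to a user-defined bridge
# network named llamastack-network, and Dockerfile.server1 / Dockerfile.server2 are
# assumed to be variants of the Dockerfiles shown earlier in this guide.
# Create the network first if it does not already exist:
docker network create llamastack-network 2>/dev/null || true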
-# Stop servers -kill $(lsof -t -i:8321) -kill $(lsof -t -i:8322) +# Create volumes for persistent data +docker volume create llamastack-server1-data +docker volume create llamastack-server2-data -# Check processes -lsof -i :8321 -i :8322 +# Run Server 1 +docker run -d \ + --name llamastack-server1 \ + --restart unless-stopped \ + -p 8321:8321 \ + -v llamastack-server1-data:/data/server1 \ + -e GROQ_API_KEY="${GROQ_API_KEY}" \ + --network llamastack-network \ + llamastack-server1 + +# Run Server 2 +docker run -d \ + --name llamastack-server2 \ + --restart unless-stopped \ + -p 8322:8322 \ + -v llamastack-server2-data:/data/server2 \ + -e OPENAI_API_KEY="${OPENAI_API_KEY}" \ + -e ANTHROPIC_API_KEY="${ANTHROPIC_API_KEY}" \ + --network llamastack-network \ + llamastack-server2 ``` -### Next Steps -1. Create more servers with different configurations if needed. -2. Set up API keys for providers you want to use. -3. Test different models and providers. -4. Customize configurations for your specific needs. -5. Set up monitoring and logging for production use. +## Docker Compose Deployment + +### Method 2: Docker Compose (Recommended) + +**docker-compose.yml:** +```yaml +# Note: Version specification is optional in modern Docker Compose +# Using latest Docker Compose format +services: + # Ollama service for local models + ollama: + image: ollama/ollama:latest + container_name: ollama-service + restart: unless-stopped + ports: + - "11434:11434" + volumes: + - ollama-data:/root/.ollama + networks: + - llamastack-network + + # Server 1: Local providers + llamastack-server1: + build: + context: . + dockerfile: Dockerfile.server1 + container_name: llamastack-server1 + restart: unless-stopped + ports: + - "8321:8321" + volumes: + - server1-data:/data/server1 + - ./configs/server1.yaml:/app/configs/server.yaml:ro + environment: + - OLLAMA_URL=http://ollama:11434 + - GROQ_API_KEY=${GROQ_API_KEY} + depends_on: + - ollama + networks: + - llamastack-network + healthcheck: + test: ["CMD", "curl", "-f", "http://localhost:8321/v1/health"] + interval: 30s + timeout: 10s + retries: 3 + start_period: 60s + + # Server 2: Cloud providers + llamastack-server2: + build: + context: . 
+ dockerfile: Dockerfile.server2 + container_name: llamastack-server2 + restart: unless-stopped + ports: + - "8322:8322" + volumes: + - server2-data:/data/server2 + - ./configs/server2.yaml:/app/configs/server.yaml:ro + environment: + - OPENAI_API_KEY=${OPENAI_API_KEY} + - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY} + - GROQ_API_KEY=${GROQ_API_KEY} + networks: + - llamastack-network + healthcheck: + test: ["CMD", "curl", "-f", "http://localhost:8322/v1/health"] + interval: 30s + timeout: 10s + retries: 3 + start_period: 60s + + # Load balancer (optional) + nginx: + image: nginx:alpine + container_name: llamastack-lb + restart: unless-stopped + ports: + - "80:80" + - "443:443" + volumes: + - ./nginx.conf:/etc/nginx/nginx.conf:ro + - ./ssl:/etc/nginx/ssl:ro + depends_on: + - llamastack-server1 + - llamastack-server2 + networks: + - llamastack-network + +volumes: + ollama-data: + server1-data: + server2-data: + +networks: + llamastack-network: + driver: bridge +``` + +**Environment file (.env):** +```bash +OPENAI_API_KEY=your_openai_key_here +ANTHROPIC_API_KEY=your_anthropic_key_here +GROQ_API_KEY=your_groq_key_here +``` + +**Deploy with Docker Compose:** +```bash +# Start all services +docker-compose up -d + +# Scale services +docker-compose up -d --scale llamastack-server1=3 + +# View logs +docker-compose logs -f llamastack-server1 +docker-compose logs -f llamastack-server2 + +# Stop services +docker-compose down + +# Update services +docker-compose pull && docker-compose up -d +``` --- -*This guide uses the official Llama Stack starter distribution for maximum compatibility and feature coverage.* +## Kubernetes Deployment + +### Method 3: Kubernetes + +**ConfigMap (llamastack-configs.yaml):** +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: llamastack-configs +data: + server1.yaml: | + version: 2 + image_name: k8s-server1 + providers: + inference: + - provider_id: ollama + provider_type: remote::ollama + config: + url: http://ollama-service:11434 + metadata_store: + type: sqlite + db_path: /data/registry.db + server: + port: 8321 + + server2.yaml: | + version: 2 + image_name: k8s-server2 + providers: + inference: + - provider_id: openai + provider_type: remote::openai + config: + api_key: ${OPENAI_API_KEY} + metadata_store: + type: sqlite + db_path: /data/registry.db + server: + port: 8322 +``` + +**Deployments (llamastack-deployments.yaml):** +```yaml +# Server 1 Deployment - Local Providers +apiVersion: apps/v1 +kind: Deployment +metadata: + name: llamastack-server1 + labels: + app.kubernetes.io/name: llamastack-server1 + app.kubernetes.io/component: inference +spec: + replicas: 3 + selector: + matchLabels: + app: llamastack-server1 + template: + metadata: + labels: + app: llamastack-server1 + spec: + containers: + - name: llamastack + image: llamastack-server1:latest + ports: + - containerPort: 8321 + name: http-api + env: + - name: GROQ_API_KEY + valueFrom: + secretKeyRef: + name: llamastack-secrets + key: groq-api-key + - name: OLLAMA_URL + value: "http://ollama-service:11434" + volumeMounts: + - name: config + mountPath: /app/configs + subPath: server1.yaml + - name: data + mountPath: /data + livenessProbe: + httpGet: + path: /v1/health + port: 8321 + initialDelaySeconds: 60 + periodSeconds: 30 + readinessProbe: + httpGet: + path: /v1/health + port: 8321 + initialDelaySeconds: 10 + periodSeconds: 5 + resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "500m" + volumes: + - name: config + configMap: + name: llamastack-configs + - name: data + 
persistentVolumeClaim: + claimName: llamastack-server1-pvc + +--- +# Server 2 Deployment - Cloud Providers +apiVersion: apps/v1 +kind: Deployment +metadata: + name: llamastack-server2 + labels: + app.kubernetes.io/name: llamastack-server2 + app.kubernetes.io/component: inference +spec: + replicas: 2 + selector: + matchLabels: + app: llamastack-server2 + template: + metadata: + labels: + app: llamastack-server2 + spec: + containers: + - name: llamastack + image: llamastack-server2:latest + ports: + - containerPort: 8322 + name: http-api + env: + - name: OPENAI_API_KEY + valueFrom: + secretKeyRef: + name: llamastack-secrets + key: openai-api-key + - name: ANTHROPIC_API_KEY + valueFrom: + secretKeyRef: + name: llamastack-secrets + key: anthropic-api-key + - name: GROQ_API_KEY + valueFrom: + secretKeyRef: + name: llamastack-secrets + key: groq-api-key + volumeMounts: + - name: config + mountPath: /app/configs + subPath: server2.yaml + - name: data + mountPath: /data + livenessProbe: + httpGet: + path: /v1/health + port: 8322 + initialDelaySeconds: 60 + periodSeconds: 30 + readinessProbe: + httpGet: + path: /v1/health + port: 8322 + initialDelaySeconds: 10 + periodSeconds: 5 + resources: + requests: + memory: "256Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "250m" + volumes: + - name: config + configMap: + name: llamastack-configs + - name: data + persistentVolumeClaim: + claimName: llamastack-server2-pvc +``` + +**Services (llamastack-services.yaml):** +```yaml +# Server 1 Service +apiVersion: v1 +kind: Service +metadata: + name: llamastack-server1-service + labels: + app.kubernetes.io/name: llamastack-server1 +spec: + selector: + app: llamastack-server1 + ports: + - port: 8321 + targetPort: 8321 + protocol: TCP + name: http-api + type: LoadBalancer + +--- +# Server 2 Service +apiVersion: v1 +kind: Service +metadata: + name: llamastack-server2-service + labels: + app.kubernetes.io/name: llamastack-server2 +spec: + selector: + app: llamastack-server2 + ports: + - port: 8322 + targetPort: 8322 + protocol: TCP + name: http-api + type: LoadBalancer +``` + +**Persistent Volume Claims (llamastack-pvc.yaml):** +```yaml +# PVC for Server 1 +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: llamastack-server1-pvc +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 10Gi + storageClassName: fast-ssd + +--- +# PVC for Server 2 +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: llamastack-server2-pvc +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 5Gi + storageClassName: fast-ssd +``` + +**Secrets (llamastack-secrets.yaml):** +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: llamastack-secrets +type: Opaque +stringData: + groq-api-key: "your_groq_api_key_here" + openai-api-key: "your_openai_api_key_here" + anthropic-api-key: "your_anthropic_api_key_here" +``` + +**Deploy to Kubernetes:** +```bash +# Apply configurations +kubectl apply -f llamastack-configs.yaml +kubectl apply -f llamastack-secrets.yaml +kubectl apply -f llamastack-pvc.yaml +kubectl apply -f llamastack-deployment.yaml +kubectl apply -f llamastack-service.yaml + +# Check status +kubectl get pods -l app=llamastack-server1 +kubectl get services + +# Scale deployment +kubectl scale deployment llamastack-server1 --replicas=5 + +# View logs +kubectl logs -f deployment/llamastack-server1 +``` + +--- + +## Load Balancing & High Availability + +### NGINX Load Balancer Configuration + +**nginx.conf:** +```nginx +upstream llamastack_local { + least_conn; + 
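    # least_conn sends each request to the upstream server with the fewest active
    # connections; max_fails/fail_timeout below temporarily remove an instance from
    # rotation after repeated failures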
server llamastack-server1:8321 max_fails=3 fail_timeout=30s; + server llamastack-server1-2:8321 max_fails=3 fail_timeout=30s; + server llamastack-server1-3:8321 max_fails=3 fail_timeout=30s; +} + +upstream llamastack_cloud { + least_conn; + server llamastack-server2:8322 max_fails=3 fail_timeout=30s; + server llamastack-server2-2:8322 max_fails=3 fail_timeout=30s; +} + +server { + listen 80; + server_name localhost; + + # Health check endpoint + location /health { + access_log off; + return 200 "healthy\n"; + add_header Content-Type text/plain; + } + + # Route to local providers + location /v1/local/ { + rewrite ^/v1/local/(.*) /v1/$1 break; + proxy_pass http://llamastack_local; + proxy_set_header Host $host; + proxy_set_header X-Real-IP $remote_addr; + proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; + proxy_connect_timeout 30s; + proxy_send_timeout 30s; + proxy_read_timeout 30s; + } + + # Route to cloud providers + location /v1/cloud/ { + rewrite ^/v1/cloud/(.*) /v1/$1 break; + proxy_pass http://llamastack_cloud; + proxy_set_header Host $host; + proxy_set_header X-Real-IP $remote_addr; + proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; + proxy_connect_timeout 30s; + proxy_send_timeout 30s; + proxy_read_timeout 30s; + } + + # Default routing + location /v1/ { + proxy_pass http://llamastack_local; + proxy_set_header Host $host; + proxy_set_header X-Real-IP $remote_addr; + proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; + } +} +``` + +--- + +## Monitoring & Logging + +### Prometheus Monitoring + +**prometheus.yml:** +```yaml +global: + scrape_interval: 15s + +scrape_configs: + - job_name: 'llamastack-servers' + static_configs: + - targets: ['llamastack-server1:8321', 'llamastack-server2:8322'] + metrics_path: /metrics + scrape_interval: 30s +``` + +### Grafana Dashboard + +**Key metrics to monitor:** +- Request latency (p50, p95, p99) +- Request rate (requests/second) +- Error rate (4xx, 5xx responses) +- Container resource usage (CPU, memory) +- Provider-specific metrics (API quotas, rate limits) + +### Centralized Logging + +**docker-compose.yml addition:** +```yaml + # ELK Stack for logging + elasticsearch: + image: docker.elastic.co/elasticsearch/elasticsearch:8.8.0 + environment: + - discovery.type=single-node + volumes: + - elasticsearch-data:/usr/share/elasticsearch/data + + logstash: + image: docker.elastic.co/logstash/logstash:8.8.0 + volumes: + - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf + + kibana: + image: docker.elastic.co/kibana/kibana:8.8.0 + ports: + - "5601:5601" +``` + +*This guide focuses on production deployments and operational best practices for multiple Llama Stack servers.* From dc2743912aa8a3639d587d0b6ab589902ca055a7 Mon Sep 17 00:00:00 2001 From: Omar Abdelwahab Date: Thu, 2 Oct 2025 15:16:31 -0700 Subject: [PATCH 4/4] Added minor changes --- .../launching_multiple_llamastack_servers.md | 115 ++++++------------ 1 file changed, 40 insertions(+), 75 deletions(-) diff --git a/docs/docs/deploying/launching_multiple_llamastack_servers.md b/docs/docs/deploying/launching_multiple_llamastack_servers.md index 1cf5eda6c..3a715a132 100644 --- a/docs/docs/deploying/launching_multiple_llamastack_servers.md +++ b/docs/docs/deploying/launching_multiple_llamastack_servers.md @@ -52,117 +52,82 @@ Server 4-6: Backup cluster ### Method 1: Docker Containers -#### Build Custom Container Images +#### Use Official LlamaStack Container Approach + +**Option 1: Use Starter Distribution with Container Runtime** -**Create Dockerfile for 
Llama Stack:** ```dockerfile -# Option 1: Recommended - Python slim (balanced size/compatibility) +# Simple Dockerfile leveraging the starter distribution FROM python:3.12-slim # Install system dependencies -RUN apt-get update && apt-get install -y \ - curl \ - && rm -rf /var/lib/apt/lists/* \ - && apt-get clean +RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/* -# Install Llama Stack +# Install LlamaStack RUN pip install --no-cache-dir llama-stack -# Create app directory -WORKDIR /app +# Initialize starter distribution +RUN llama stack build --template starter --name production-server -# Create non-root user for security +# Create non-root user RUN useradd -r -s /bin/false -m llamastack USER llamastack -# Copy configuration -COPY --chown=llamastack:llamastack configs/ /app/configs/ - -# Expose port -EXPOSE 8321 +WORKDIR /app # Health check HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \ CMD curl -f http://localhost:8321/v1/health || exit 1 -# Default command -CMD ["llama", "stack", "run", "/app/configs/server.yaml"] +# Use starter distribution configs +CMD ["llama", "stack", "run", "~/.llama/distributions/starter/starter-run.yaml"] ``` -**Alternative: Ultra-lightweight Alpine version:** +**Option 2: Use Standard Python Base Image (Recommended)** + +Since LlamaStack doesn't provide a Dockerfile, use the standard Python installation approach: + +```bash +# Use the simple approach with standard Python image +# No need to clone and build - just use pip install directly in containers +``` + +**Option 2b: Check for Official Images (Future)** + +```bash +# Check if official images become available +docker search meta-llama/llama-stack +docker search llamastack + +# For now, use the pip-based approach in Option 1 or 3 +``` + +**Option 3: Lightweight Container with Starter Distribution** + ```dockerfile -# Option 2: Alpine - Smallest size (~50MB total) FROM python:3.12-alpine -# Install system dependencies +# Install dependencies RUN apk add --no-cache curl gcc musl-dev linux-headers -# Install Llama Stack +# Install LlamaStack RUN pip install --no-cache-dir llama-stack -# Create app directory -WORKDIR /app +# Initialize starter distribution +RUN llama stack build --template starter --name starter # Create non-root user -RUN adduser -D -s /bin/sh llamastack +RUN adduser -D llamastack USER llamastack -# Copy configuration -COPY --chown=llamastack:llamastack configs/ /app/configs/ - -# Expose port -EXPOSE 8321 - -# Health check -HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \ - CMD curl -f http://localhost:8321/v1/health || exit 1 - -# Default command -CMD ["llama", "stack", "run", "/app/configs/server.yaml"] -``` - -**Alternative: Multi-stage build for production:** -```dockerfile -# Option 3: Multi-stage - Minimal runtime image -FROM python:3.12-slim as builder - -# Install build dependencies -RUN apt-get update && apt-get install -y \ - build-essential \ - && rm -rf /var/lib/apt/lists/* - -# Install Python packages -RUN pip install --user --no-cache-dir llama-stack - -# Runtime stage -FROM python:3.12-slim - -# Install only runtime dependencies -RUN apt-get update && apt-get install -y \ - curl \ - && rm -rf /var/lib/apt/lists/* \ - && apt-get clean - -# Copy installed packages from builder -COPY --from=builder /root/.local /root/.local - -# Create non-root user -RUN useradd -r -s /bin/false -m llamastack -USER llamastack - -# Set PATH -ENV PATH="/root/.local/bin:$PATH" - WORKDIR /app -COPY 
--chown=llamastack:llamastack configs/ /app/configs/ - -EXPOSE 8321 HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \ CMD curl -f http://localhost:8321/v1/health || exit 1 -CMD ["llama", "stack", "run", "/app/configs/server.yaml"] +# Use CLI port override instead of modifying YAML +CMD ["llama", "stack", "run", "/home/llamastack/.llama/distributions/starter/starter-run.yaml", "--port", "8321"] ``` #### Prepare Server Configurations
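The Docker Compose examples later in this guide mount these files from a local `configs/` directory (for example `./configs/server1.yaml`), so a minimal layout, assuming the file names used here, looks like:

```bash
# Layout assumed by the docker-compose.yml volume mounts shown later
#   ./docker-compose.yml
#   ./configs/server1.yaml   # local providers, port 8321
#   ./configs/server2.yaml   # cloud providers, port 8322
mkdir -p configs
```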