# Llama Stack Benchmark Suite on Kubernetes

## Motivation

Performance benchmarking is critical for understanding the overhead and characteristics of the Llama Stack abstraction layer compared to direct inference engines like vLLM.

### Why This Benchmark Suite Exists

**Performance Validation**: Llama Stack provides a unified API layer across multiple inference providers, but this abstraction introduces potential overhead. This benchmark suite quantifies the performance impact by comparing:
- Llama Stack inference (with a vLLM backend)
- Direct vLLM inference calls
- Both under identical Kubernetes deployment conditions

**Production Readiness Assessment**: Real-world deployments require understanding performance characteristics under load. This suite simulates concurrent user scenarios with configurable parameters (duration, concurrency, request patterns) to validate production readiness.

**Regression Detection (TODO)**: As Llama Stack evolves, this benchmark suite provides automated regression detection for performance changes. CI/CD pipelines can leverage these benchmarks to catch performance degradations before production deployments.

**Resource Planning**: By measuring throughput, latency percentiles, and resource utilization patterns, teams can make informed decisions about:
- Kubernetes resource allocation (CPU, memory, GPU)
- Auto-scaling configurations
- Cost optimization strategies

### Key Metrics Captured

The benchmark suite measures critical performance indicators:

- **Throughput**: Requests per second under sustained load
- **Latency Distribution**: P50, P95, P99 response times
- **Time to First Token (TTFT)**: Critical for streaming applications
- **Inter-Token Latency (ITL)**: Token generation speed for streaming
- **Error Rates**: Request failures and timeout analysis

This data enables data-driven architectural decisions and performance optimization efforts.

## Setup

**1. Deploy base k8s infrastructure:**
```bash
cd ../../docs/source/distributions/k8s
./apply.sh
```

**2. Deploy benchmark components (from this directory):**
```bash
./apply.sh
```

**3. Verify deployment:**
```bash
kubectl get pods
# Should see: llama-stack-benchmark-server, vllm-server, etc.
```
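Optionally, you can confirm the Stack service answers OpenAI-compatible requests before running any benchmarks. The sketch below assumes the default benchmark service name and port used by the scripts (`llama-stack-benchmark-service:8323`) and the `/v1/openai/v1` route prefix used later in this README; adjust it if your deployment differs.

```bash
# Optional sanity check (assumes the default benchmark service name/port and
# the OpenAI-compatible route prefix used elsewhere in this README).
kubectl port-forward service/llama-stack-benchmark-service 8323:8323 &
curl -s http://localhost:8323/v1/openai/v1/models | head
```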
## Benchmark Results

We use [GuideLLM](https://github.com/neuralmagic/guidellm) against our k8s deployment for comprehensive performance testing.

### Performance - 1 vLLM Replica

We vary the number of Llama Stack replicas while keeping a single vLLM replica and compare performance below.

![Performance - 1 vLLM Replica](results/vllm_replica1_benchmark_results.png)

For full results, see the `benchmarking/k8s-benchmark/results/` directory.

## Quick Start

Follow the instructions below to run benchmarks similar to the ones above.

### Comprehensive Benchmark Suite

**Run all benchmarks with different cluster configurations:**
```bash
./scripts/run-all-benchmarks.sh
```

This script automatically:
- Scales deployments to different configurations
- Runs benchmarks for each setup
- Generates output files whose names encode the setup information

### Individual Benchmarks

**Benchmark Llama Stack (runs against the current cluster setup):**
```bash
./scripts/run-guidellm-benchmark.sh --target stack
```

**Benchmark vLLM directly (runs against the current cluster setup):**
```bash
./scripts/run-guidellm-benchmark.sh --target vllm
```

**Benchmark with custom parameters:**
```bash
./scripts/run-guidellm-benchmark.sh --target stack --max-seconds 120 --prompt-tokens 1024 --output-tokens 512
```

**Benchmark with a custom output file:**
```bash
./scripts/run-guidellm-benchmark.sh --target stack --output-file results/my-custom-benchmark.txt
```

### Generating Charts

Once the benchmarks have run, you can generate performance charts from the results:

```bash
uv run ./scripts/generate_charts.py
```

This loads the runs in the `results/` directory and creates visualizations comparing the different configurations and replica counts.

## Benchmark Workflow

The benchmark suite is organized into two main scripts with distinct responsibilities:

### 1. `run-all-benchmarks.sh` - Orchestration & Scaling

- **Purpose**: Manages different cluster configurations and orchestrates benchmark runs
- **Responsibilities**:
  - Scales Kubernetes deployments (vLLM replicas, Stack replicas, worker counts)
  - Runs benchmarks for each configuration
  - Generates meaningful output filenames with setup information
- **Use case**: Running comprehensive performance testing across multiple configurations

### 2. `run-guidellm-benchmark.sh` - Single Benchmark Execution

- **Purpose**: Executes a single benchmark against the current cluster state
- **Responsibilities**:
  - Runs a GuideLLM benchmark with configurable parameters
  - Accepts custom output file paths
  - Performs no cluster scaling; benchmarks the current deployment state
- **Use case**: Testing specific configurations or custom scenarios

### Typical Workflow

1. **Comprehensive Testing**: Use `run-all-benchmarks.sh` to automatically test multiple configurations
2. **Custom Testing**: Use `run-guidellm-benchmark.sh` for specific parameter testing or manual cluster configurations
3. **Analysis**: Use `generate_charts.py` to visualize results from either approach

## Command Reference

### run-all-benchmarks.sh

Orchestrates multiple benchmark runs with different cluster configurations. This script:
- Automatically scales deployments before each benchmark (sketched below)
- Runs benchmarks against the configured cluster setup
- Generates meaningfully named output files

```bash
./scripts/run-all-benchmarks.sh
```

**Configuration**: Edit the `configs` array in the script to customize benchmark configurations:

```bash
# Each line: (target, stack_replicas, vllm_replicas, stack_workers)
configs=(
  "stack 1 1 1"
  "stack 1 1 2"
  "stack 1 1 4"
  "vllm 1 1 -"
)
```

**Output files**: Generated with setup information in the filename:
- Stack: `guidellm-benchmark-stack-s{replicas}-sw{workers}-v{vllm_replicas}-{timestamp}.txt`
- vLLM: `guidellm-benchmark-vllm-v{vllm_replicas}-{timestamp}.txt`
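For reference, the scaling performed before each run boils down to resizing the two deployments and waiting for the rollout. The snippet below is a minimal sketch of that idea using the default deployment names from `run-guidellm-benchmark.sh` (`llama-stack-benchmark-server` and `vllm-server`); it is not the script itself, and the real script also adjusts Stack worker counts.

```bash
# Minimal sketch of the per-configuration scaling step (assumes the default
# deployment names); the actual script also sets Stack worker counts.
kubectl scale deployment/llama-stack-benchmark-server --replicas=2
kubectl scale deployment/vllm-server --replicas=1
kubectl rollout status deployment/llama-stack-benchmark-server
kubectl rollout status deployment/vllm-server
```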
### run-guidellm-benchmark.sh Options

Runs a single benchmark against the current cluster setup (no scaling).

```bash
./scripts/run-guidellm-benchmark.sh [options]

Options:
  -t, --target          Target to benchmark (default: stack)
  -s, --max-seconds     Maximum duration in seconds (default: 60)
  -p, --prompt-tokens   Number of prompt tokens (default: 512)
  -o, --output-tokens   Number of output tokens (default: 256)
  -r, --rate-type       Rate type (default: concurrent)
  -c, --rate            Rate (default: 1,2,4,8,16,32,64,128)
  --output-file         Output file path (default: auto-generated)
  --stack-deployment    Name of the stack deployment (default: llama-stack-benchmark-server)
  --vllm-deployment     Name of the vllm deployment (default: vllm-server)
  --stack-url           URL of the stack service (default: http://llama-stack-benchmark-service:8323/v1/openai)
  -h, --help            Show help message

Examples:
  ./scripts/run-guidellm-benchmark.sh --target vllm                            # Benchmark vLLM direct
  ./scripts/run-guidellm-benchmark.sh --target stack                           # Benchmark Llama Stack (default)
  ./scripts/run-guidellm-benchmark.sh -t vllm -s 60 -p 512 -o 256              # vLLM with custom parameters
  ./scripts/run-guidellm-benchmark.sh --output-file results/my-benchmark.txt   # Specify custom output file
  ./scripts/run-guidellm-benchmark.sh --stack-deployment my-stack-server       # Use custom stack deployment name
```

## Local Testing

### Running the Benchmark Locally

For local development without Kubernetes:

**1. (Optional) Start the mock OpenAI server:**

A simple mock OpenAI server is included for when no inference provider is available. `openai-mock-server.py` provides:

- **OpenAI-compatible API** for testing without real models
- **Configurable streaming delay** via the `STREAM_DELAY_SECONDS` env var
- **Consistent responses** for reproducible benchmarks
- **Lightweight testing** without GPU requirements

```bash
uv run python openai-mock-server.py --port 8080
```

**2. Start the Stack server:**

```bash
LLAMA_STACK_CONFIG=benchmarking/k8s-benchmark/stack_run_config.yaml uv run uvicorn llama_stack.core.server.server:create_app --port 8321 --workers 4 --factory
```

**3. Run the GuideLLM benchmark:**

```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" uv run guidellm benchmark run \
  --target "http://localhost:8321/v1/openai/v1" \
  --model "meta-llama/Llama-3.2-3B-Instruct" \
  --rate-type sweep \
  --max-seconds 60 \
  --data "prompt_tokens=256,output_tokens=128" \
  --output-path='output.html'
```
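For a local point of comparison analogous to the Stack-vs-vLLM runs on Kubernetes, you can point the same GuideLLM command directly at the backend the Stack is proxying. The example below reuses the flags from step 3 and assumes the mock server from step 1 serves the standard OpenAI routes under `/v1` on port 8080; adjust the target (and output path, chosen here for illustration) if your provider differs.

```bash
# Hypothetical direct-to-backend baseline; assumes the mock server exposes the
# standard OpenAI routes under /v1 on the port chosen in step 1.
GUIDELLM__PREFERRED_ROUTE="chat_completions" uv run guidellm benchmark run \
  --target "http://localhost:8080/v1" \
  --model "meta-llama/Llama-3.2-3B-Instruct" \
  --rate-type sweep \
  --max-seconds 60 \
  --data "prompt_tokens=256,output_tokens=128" \
  --output-path='output-direct.html'
```

Comparing `output.html` and `output-direct.html` then gives a rough local view of the Stack's overhead over the backend alone.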