diff --git a/benchmarking/k8s-benchmark/README.md b/benchmarking/k8s-benchmark/README.md index 3b0d0c4db..9b5e140f0 100644 --- a/benchmarking/k8s-benchmark/README.md +++ b/benchmarking/k8s-benchmark/README.md @@ -26,6 +26,7 @@ The benchmark suite measures critical performance indicators: - **Throughput**: Requests per second under sustained load - **Latency Distribution**: P50, P95, P99 response times - **Time to First Token (TTFT)**: Critical for streaming applications +- **Inter-Token Latency (ITL)**: Token generation speed for streaming - **Error Rates**: Request failures and timeout analysis This data enables data-driven architectural decisions and performance optimization efforts. @@ -49,49 +50,148 @@ kubectl get pods # Should see: llama-stack-benchmark-server, vllm-server, etc. ``` +## Benchmark Results + +We use [GuideLLM](https://github.com/neuralmagic/guidellm) against our k8s deployment for comprehensive performance testing. + + +### Performance - 1 vLLM Replica + +We vary the number of Llama Stack replicas with 1 vLLM replica and compare performance below. + +![Performance - 1 vLLM Replica](results/vllm_replica1_benchmark_results.png) + + +For full results see the `benchmarking/k8s-benchmark/results/` directory. + + ## Quick Start -### Basic Benchmarks +Follow the instructions below to run benchmarks similar to the ones above. -**Benchmark Llama Stack (default):** +### Comprehensive Benchmark Suite + +**Run all benchmarks with different cluster configurations:** ```bash -./run-benchmark.sh +./scripts/run-all-benchmarks.sh ``` -**Benchmark vLLM direct:** +This script will automatically: +- Scale deployments to different configurations +- Run benchmarks for each setup +- Generate output files with meaningful names that include setup information + +### Individual Benchmarks + +**Benchmark Llama Stack (runs against current cluster setup):** ```bash -./run-benchmark.sh --target vllm +./scripts/run-guidellm-benchmark.sh --target stack ``` -### Custom Configuration - -**Extended benchmark with high concurrency:** +**Benchmark vLLM direct (runs against current cluster setup):** ```bash -./run-benchmark.sh --target vllm --duration 120 --concurrent 20 +./scripts/run-guidellm-benchmark.sh --target vllm ``` -**Short test run:** +**Benchmark with custom parameters:** ```bash -./run-benchmark.sh --target stack --duration 30 --concurrent 5 +./scripts/run-guidellm-benchmark.sh --target stack --max-seconds 120 --prompt-tokens 1024 --output-tokens 512 ``` +**Benchmark with custom output file:** +```bash +./scripts/run-guidellm-benchmark.sh --target stack --output-file results/my-custom-benchmark.txt +``` + +### Generating Charts + +Once the benchmarks are run, you can generate performance charts from benchmark results: + +```bash +uv run ./scripts/generate_charts.py +``` + +This loads runs in the `results/` directory and creates visualizations comparing different configurations and replica counts. + +## Benchmark Workflow + +The benchmark suite is organized into two main scripts with distinct responsibilities: + +### 1. `run-all-benchmarks.sh` - Orchestration & Scaling +- **Purpose**: Manages different cluster configurations and orchestrates benchmark runs +- **Responsibilities**: + - Scales Kubernetes deployments (vLLM replicas, Stack replicas, worker counts) + - Runs benchmarks for each configuration + - Generates meaningful output filenames with setup information +- **Use case**: Running comprehensive performance testing across multiple configurations + +### 2. 
`run-guidellm-benchmark.sh` - Single Benchmark Execution +- **Purpose**: Executes a single benchmark against the current cluster state +- **Responsibilities**: + - Runs GuideLLM benchmark with configurable parameters + - Accepts custom output file paths + - No cluster scaling - benchmarks current deployment state +- **Use case**: Testing specific configurations or custom scenarios + +### Typical Workflow +1. **Comprehensive Testing**: Use `run-all-benchmarks.sh` to automatically test multiple configurations +2. **Custom Testing**: Use `run-guidellm-benchmark.sh` for specific parameter testing or manual cluster configurations +3. **Analysis**: Use `generate_charts.py` to visualize results from either approach + ## Command Reference -### run-benchmark.sh Options +### run-all-benchmarks.sh + +Orchestrates multiple benchmark runs with different cluster configurations. This script: +- Automatically scales deployments before each benchmark +- Runs benchmarks against the configured cluster setup +- Generates meaningfully named output files ```bash -./run-benchmark.sh [options] +./scripts/run-all-benchmarks.sh +``` + +**Configuration**: Edit the `configs` array in the script to customize benchmark configurations: +```bash +# Each line: (target, stack_replicas, vllm_replicas, stack_workers) +configs=( + "stack 1 1 1" + "stack 1 1 2" + "stack 1 1 4" + "vllm 1 1 -" +) +``` + +**Output files**: Generated with setup information in filename: +- Stack: `guidellm-benchmark-stack-s{replicas}-sw{workers}-v{vllm_replicas}-{timestamp}.txt` +- vLLM: `guidellm-benchmark-vllm-v{vllm_replicas}-{timestamp}.txt` + +### run-guidellm-benchmark.sh Options + +Runs a single benchmark against the current cluster setup (no scaling). + +```bash +./scripts/run-guidellm-benchmark.sh [options] Options: -t, --target Target to benchmark (default: stack) - -d, --duration Duration in seconds (default: 60) - -c, --concurrent Number of concurrent users (default: 10) + -s, --max-seconds Maximum duration in seconds (default: 60) + -p, --prompt-tokens Number of prompt tokens (default: 512) + -o, --output-tokens Number of output tokens (default: 256) + -r, --rate-type Rate type (default: concurrent) + -c, --rate Rate (default: 1,2,4,8,16,32,64,128) + --output-file Output file path (default: auto-generated) + --stack-deployment Name of the stack deployment (default: llama-stack-benchmark-server) + --vllm-deployment Name of the vllm deployment (default: vllm-server) + --stack-url URL of the stack service (default: http://llama-stack-benchmark-service:8323/v1/openai) -h, --help Show help message Examples: - ./run-benchmark.sh --target vllm # Benchmark vLLM direct - ./run-benchmark.sh --target stack # Benchmark Llama Stack - ./run-benchmark.sh -t vllm -d 120 -c 20 # vLLM with 120s, 20 users + ./scripts/run-guidellm-benchmark.sh --target vllm # Benchmark vLLM direct + ./scripts/run-guidellm-benchmark.sh --target stack # Benchmark Llama Stack (default) + ./scripts/run-guidellm-benchmark.sh -t vllm -s 60 -p 512 -o 256 # vLLM with custom parameters + ./scripts/run-guidellm-benchmark.sh --output-file results/my-benchmark.txt # Specify custom output file + ./scripts/run-guidellm-benchmark.sh --stack-deployment my-stack-server # Use custom stack deployment name ``` ## Local Testing @@ -100,55 +200,30 @@ Examples: For local development without Kubernetes: -**1. Start OpenAI mock server:** -```bash -uv run python openai-mock-server.py --port 8080 -``` - -**2. 
Run benchmark against mock server:** -```bash -uv run python benchmark.py \ - --base-url http://localhost:8080/v1 \ - --model mock-inference \ - --duration 30 \ - --concurrent 5 -``` - -**3. Test against local vLLM server:** -```bash -# If you have vLLM running locally on port 8000 -uv run python benchmark.py \ - --base-url http://localhost:8000/v1 \ - --model meta-llama/Llama-3.2-3B-Instruct \ - --duration 30 \ - --concurrent 5 -``` - -**4. Profile the running server:** -```bash -./profile_running_server.sh -``` - - - -### OpenAI Mock Server +**1. (Optional) Start Mock OpenAI server:** +There is a simple mock OpenAI server if you don't have an inference provider available. The `openai-mock-server.py` provides: - **OpenAI-compatible API** for testing without real models - **Configurable streaming delay** via `STREAM_DELAY_SECONDS` env var - **Consistent responses** for reproducible benchmarks - **Lightweight testing** without GPU requirements -**Mock server usage:** ```bash uv run python openai-mock-server.py --port 8080 ``` -The mock server is also deployed in k8s as `openai-mock-service:8080` and can be used by changing the Llama Stack configuration to use the `mock-vllm-inference` provider. +**2. Start Stack server:** +```bash +LLAMA_STACK_CONFIG=benchmarking/k8s-benchmark/stack_run_config.yaml uv run uvicorn llama_stack.core.server.server:create_app --port 8321 --workers 4 --factory +``` -## Files in this Directory - -- `benchmark.py` - Core benchmark script with async streaming support -- `run-benchmark.sh` - Main script with target selection and configuration -- `openai-mock-server.py` - Mock OpenAI API server for local testing -- `README.md` - This documentation file +**3. Run GuideLLM benchmark:** +```bash +GUIDELLM__PREFERRED_ROUTE="chat_completions" uv run guidellm benchmark run \ + --target "http://localhost:8321/v1/openai/v1" \ + --model "meta-llama/Llama-3.2-3B-Instruct" \ + --rate-type sweep \ + --max-seconds 60 \ + --data "prompt_tokens=256,output_tokens=128" --output-path='output.html' +``` diff --git a/benchmarking/k8s-benchmark/benchmark.py b/benchmarking/k8s-benchmark/benchmark.py deleted file mode 100644 index d5e34aa23..000000000 --- a/benchmarking/k8s-benchmark/benchmark.py +++ /dev/null @@ -1,265 +0,0 @@ -# Copyright (c) Meta Platforms, Inc. and affiliates. -# All rights reserved. -# -# This source code is licensed under the terms described in the LICENSE file in -# the root directory of this source tree. - -""" -Simple benchmark script for Llama Stack with OpenAI API compatibility. 
-""" - -import argparse -import asyncio -import os -import random -import statistics -import time - -import aiohttp - - -class BenchmarkStats: - def __init__(self): - self.response_times = [] - self.ttft_times = [] - self.chunks_received = [] - self.errors = [] - self.success_count = 0 - self.total_requests = 0 - self.concurrent_users = 0 - self.start_time = None - self.end_time = None - self._lock = asyncio.Lock() - - async def add_result(self, response_time: float, chunks: int, ttft: float = None, error: str = None): - async with self._lock: - self.total_requests += 1 - if error: - self.errors.append(error) - else: - self.success_count += 1 - self.response_times.append(response_time) - self.chunks_received.append(chunks) - if ttft is not None: - self.ttft_times.append(ttft) - - def print_summary(self): - if not self.response_times: - print("No successful requests to report") - if self.errors: - print(f"Total errors: {len(self.errors)}") - print("First 5 errors:") - for error in self.errors[:5]: - print(f" {error}") - return - - total_time = self.end_time - self.start_time - success_rate = (self.success_count / self.total_requests) * 100 - - print(f"\n{'=' * 60}") - print("BENCHMARK RESULTS") - - print("\nResponse Time Statistics:") - print(f" Mean: {statistics.mean(self.response_times):.3f}s") - print(f" Median: {statistics.median(self.response_times):.3f}s") - print(f" Min: {min(self.response_times):.3f}s") - print(f" Max: {max(self.response_times):.3f}s") - - if len(self.response_times) > 1: - print(f" Std Dev: {statistics.stdev(self.response_times):.3f}s") - - percentiles = [50, 90, 95, 99] - sorted_times = sorted(self.response_times) - print("\nPercentiles:") - for p in percentiles: - idx = int(len(sorted_times) * p / 100) - 1 - idx = max(0, min(idx, len(sorted_times) - 1)) - print(f" P{p}: {sorted_times[idx]:.3f}s") - - if self.ttft_times: - print("\nTime to First Token (TTFT) Statistics:") - print(f" Mean: {statistics.mean(self.ttft_times):.3f}s") - print(f" Median: {statistics.median(self.ttft_times):.3f}s") - print(f" Min: {min(self.ttft_times):.3f}s") - print(f" Max: {max(self.ttft_times):.3f}s") - - if len(self.ttft_times) > 1: - print(f" Std Dev: {statistics.stdev(self.ttft_times):.3f}s") - - sorted_ttft = sorted(self.ttft_times) - print("\nTTFT Percentiles:") - for p in percentiles: - idx = int(len(sorted_ttft) * p / 100) - 1 - idx = max(0, min(idx, len(sorted_ttft) - 1)) - print(f" P{p}: {sorted_ttft[idx]:.3f}s") - - if self.chunks_received: - print("\nStreaming Statistics:") - print(f" Mean chunks per response: {statistics.mean(self.chunks_received):.1f}") - print(f" Total chunks received: {sum(self.chunks_received)}") - - print(f"{'=' * 60}") - print(f"Total time: {total_time:.2f}s") - print(f"Concurrent users: {self.concurrent_users}") - print(f"Total requests: {self.total_requests}") - print(f"Successful requests: {self.success_count}") - print(f"Failed requests: {len(self.errors)}") - print(f"Success rate: {success_rate:.1f}%") - print(f"Requests per second: {self.success_count / total_time:.2f}") - - if self.errors: - print("\nErrors (showing first 5):") - for error in self.errors[:5]: - print(f" {error}") - - -class LlamaStackBenchmark: - def __init__(self, base_url: str, model_id: str): - self.base_url = base_url.rstrip("/") - self.model_id = model_id - self.headers = {"Content-Type": "application/json"} - self.test_messages = [ - [{"role": "user", "content": "Hi"}], - [{"role": "user", "content": "What is the capital of France?"}], - [{"role": "user", "content": 
"Explain quantum physics in simple terms."}], - [{"role": "user", "content": "Write a short story about a robot learning to paint."}], - [ - {"role": "user", "content": "What is machine learning?"}, - {"role": "assistant", "content": "Machine learning is a subset of AI..."}, - {"role": "user", "content": "Can you give me a practical example?"}, - ], - ] - - async def make_async_streaming_request(self) -> tuple[float, int, float | None, str | None]: - """Make a single async streaming chat completion request.""" - messages = random.choice(self.test_messages) - payload = {"model": self.model_id, "messages": messages, "stream": True, "max_tokens": 100} - - start_time = time.time() - chunks_received = 0 - ttft = None - error = None - - session = aiohttp.ClientSession() - - try: - async with session.post( - f"{self.base_url}/chat/completions", - headers=self.headers, - json=payload, - timeout=aiohttp.ClientTimeout(total=30), - ) as response: - if response.status == 200: - async for line in response.content: - if line: - line_str = line.decode("utf-8").strip() - if line_str.startswith("data: "): - chunks_received += 1 - if ttft is None: - ttft = time.time() - start_time - if line_str == "data: [DONE]": - break - - if chunks_received == 0: - error = "No streaming chunks received" - else: - text = await response.text() - error = f"HTTP {response.status}: {text[:100]}" - - except Exception as e: - error = f"Request error: {str(e)}" - finally: - await session.close() - - response_time = time.time() - start_time - return response_time, chunks_received, ttft, error - - async def run_benchmark(self, duration: int, concurrent_users: int) -> BenchmarkStats: - """Run benchmark using async requests for specified duration.""" - stats = BenchmarkStats() - stats.concurrent_users = concurrent_users - stats.start_time = time.time() - - print(f"Starting benchmark: {duration}s duration, {concurrent_users} concurrent users") - print(f"Target URL: {self.base_url}/chat/completions") - print(f"Model: {self.model_id}") - - connector = aiohttp.TCPConnector(limit=concurrent_users) - async with aiohttp.ClientSession(connector=connector): - - async def worker(worker_id: int): - """Worker that sends requests sequentially until canceled.""" - request_count = 0 - while True: - try: - response_time, chunks, ttft, error = await self.make_async_streaming_request() - await stats.add_result(response_time, chunks, ttft, error) - request_count += 1 - - except asyncio.CancelledError: - break - except Exception as e: - await stats.add_result(0, 0, None, f"Worker {worker_id} error: {str(e)}") - - # Progress reporting task - async def progress_reporter(): - last_report_time = time.time() - while True: - try: - await asyncio.sleep(1) # Report every second - if time.time() >= last_report_time + 10: # Report every 10 seconds - elapsed = time.time() - stats.start_time - print( - f"Completed: {stats.total_requests} requests in {elapsed:.1f}s, RPS: {stats.total_requests / elapsed:.1f}" - ) - last_report_time = time.time() - except asyncio.CancelledError: - break - - # Spawn concurrent workers - tasks = [asyncio.create_task(worker(i)) for i in range(concurrent_users)] - progress_task = asyncio.create_task(progress_reporter()) - tasks.append(progress_task) - - # Wait for duration then cancel all tasks - await asyncio.sleep(duration) - - for task in tasks: - task.cancel() - - # Wait for all tasks to complete - await asyncio.gather(*tasks, return_exceptions=True) - - stats.end_time = time.time() - return stats - - -def main(): - parser = 
argparse.ArgumentParser(description="Llama Stack Benchmark Tool") - parser.add_argument( - "--base-url", - default=os.getenv("BENCHMARK_BASE_URL", "http://localhost:8000/v1/openai/v1"), - help="Base URL for the API (default: http://localhost:8000/v1/openai/v1)", - ) - parser.add_argument( - "--model", default=os.getenv("INFERENCE_MODEL", "test-model"), help="Model ID to use for requests" - ) - parser.add_argument("--duration", type=int, default=60, help="Duration in seconds to run benchmark (default: 60)") - parser.add_argument("--concurrent", type=int, default=10, help="Number of concurrent users (default: 10)") - - args = parser.parse_args() - - benchmark = LlamaStackBenchmark(args.base_url, args.model) - - try: - stats = asyncio.run(benchmark.run_benchmark(args.duration, args.concurrent)) - stats.print_summary() - - except KeyboardInterrupt: - print("\nBenchmark interrupted by user") - except Exception as e: - print(f"Benchmark failed: {e}") - - -if __name__ == "__main__": - main() diff --git a/benchmarking/k8s-benchmark/profile_running_server.sh b/benchmarking/k8s-benchmark/profile_running_server.sh deleted file mode 100755 index 65d620583..000000000 --- a/benchmarking/k8s-benchmark/profile_running_server.sh +++ /dev/null @@ -1,52 +0,0 @@ -#!/bin/bash - -# Copyright (c) Meta Platforms, Inc. and affiliates. -# All rights reserved. -# -# This source code is licensed under the terms described in the LICENSE file in -# the root directory of this source tree. - -# Script to profile an already running Llama Stack server -# Usage: ./profile_running_server.sh [duration_seconds] [output_file] - -DURATION=${1:-60} # Default 60 seconds -OUTPUT_FILE=${2:-"llama_stack_profile"} # Default output file - -echo "Looking for running Llama Stack server..." - -# Find the server PID -SERVER_PID=$(ps aux | grep "llama_stack.core.server.server" | grep -v grep | awk '{print $2}' | head -1) - - -if [ -z "$SERVER_PID" ]; then - echo "Error: No running Llama Stack server found" - echo "Please start your server first with:" - echo "LLAMA_STACK_LOGGING=\"all=ERROR\" MOCK_INFERENCE_URL=http://localhost:8080 SAFETY_MODEL=llama-guard3:1b uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml" - exit 1 -fi - -echo "Found Llama Stack server with PID: $SERVER_PID" - -# Start py-spy profiling -echo "Starting py-spy profiling for ${DURATION} seconds..." -echo "Output will be saved to: ${OUTPUT_FILE}.svg" -echo "" -echo "You can now run your load test..." -echo "" - -# Get the full path to py-spy -PYSPY_PATH=$(which py-spy) - -# Check if running as root, if not, use sudo -if [ "$EUID" -ne 0 ]; then - echo "py-spy requires root permissions on macOS. Running with sudo..." - sudo "$PYSPY_PATH" record -o "${OUTPUT_FILE}.svg" -d ${DURATION} -p $SERVER_PID -else - "$PYSPY_PATH" record -o "${OUTPUT_FILE}.svg" -d ${DURATION} -p $SERVER_PID -fi - -echo "" -echo "Profiling completed! 
Results saved to: ${OUTPUT_FILE}.svg" -echo "" -echo "To view the flame graph:" -echo "open ${OUTPUT_FILE}.svg" diff --git a/benchmarking/k8s-benchmark/results/guidellm-benchmark-stack-s1-sw1-v1-20250922-103408.txt b/benchmarking/k8s-benchmark/results/guidellm-benchmark-stack-s1-sw1-v1-20250922-103408.txt new file mode 100644 index 000000000..0f707a968 --- /dev/null +++ b/benchmarking/k8s-benchmark/results/guidellm-benchmark-stack-s1-sw1-v1-20250922-103408.txt @@ -0,0 +1,171 @@ +Collecting uv + Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB) +Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.9 MB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 144.3 MB/s eta 0:00:00 +Installing collected packages: uv +Successfully installed uv-0.8.19 +WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv + +[notice] A new release of pip is available: 24.0 -> 25.2 +[notice] To update, run: pip install --upgrade pip +Using Python 3.11.13 environment at: /usr/local +Resolved 61 packages in 551ms +Downloading pillow (6.3MiB) +Downloading hf-xet (3.0MiB) +Downloading tokenizers (3.1MiB) +Downloading pygments (1.2MiB) +Downloading pandas (11.8MiB) +Downloading aiohttp (1.7MiB) +Downloading pydantic-core (1.9MiB) +Downloading numpy (16.2MiB) +Downloading transformers (11.1MiB) +Downloading pyarrow (40.8MiB) + Downloading pydantic-core + Downloading aiohttp + Downloading tokenizers + Downloading hf-xet + Downloading pygments + Downloading pillow + Downloading numpy + Downloading pandas + Downloading transformers + Downloading pyarrow +Prepared 61 packages in 1.23s +Installed 61 packages in 114ms + + aiohappyeyeballs==2.6.1 + + aiohttp==3.12.15 + + aiosignal==1.4.0 + + annotated-types==0.7.0 + + anyio==4.10.0 + + attrs==25.3.0 + + certifi==2025.8.3 + + charset-normalizer==3.4.3 + + click==8.1.8 + + datasets==4.1.1 + + dill==0.4.0 + + filelock==3.19.1 + + frozenlist==1.7.0 + + fsspec==2025.9.0 + + ftfy==6.3.1 + + guidellm==0.3.0 + + h11==0.16.0 + + h2==4.3.0 + + hf-xet==1.1.10 + + hpack==4.1.0 + + httpcore==1.0.9 + + httpx==0.28.1 + + huggingface-hub==0.35.0 + + hyperframe==6.1.0 + + idna==3.10 + + loguru==0.7.3 + + markdown-it-py==4.0.0 + + mdurl==0.1.2 + + multidict==6.6.4 + + multiprocess==0.70.16 + + numpy==2.3.3 + + packaging==25.0 + + pandas==2.3.2 + + pillow==11.3.0 + + propcache==0.3.2 + + protobuf==6.32.1 + + pyarrow==21.0.0 + + pydantic==2.11.9 + + pydantic-core==2.33.2 + + pydantic-settings==2.10.1 + + pygments==2.19.2 + + python-dateutil==2.9.0.post0 + + python-dotenv==1.1.1 + + pytz==2025.2 + + pyyaml==6.0.2 + + regex==2025.9.18 + + requests==2.32.5 + + rich==14.1.0 + + safetensors==0.6.2 + + six==1.17.0 + + sniffio==1.3.1 + + tokenizers==0.22.1 + + tqdm==4.67.1 + + transformers==4.56.2 + + typing-extensions==4.15.0 + + typing-inspection==0.4.1 + + tzdata==2025.2 + + urllib3==2.5.0 + + wcwidth==0.2.14 + + xxhash==3.5.0 + + yarl==1.20.1 +Using Python 3.11.13 environment at: /usr/local +Audited 1 package in 3ms +Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured. +Creating backend... +Backend openai_http connected to http://llama-stack-benchmark-service:8323/v1/openai for model meta-llama/Llama-3.2-3B-Instruct. +Creating request loader... 
+Created loader with 1000 unique requests from prompt_tokens=512,output_tokens=256. + + +╭─ Benchmarks ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ +│ [17:34:30] ⠋ 100% concurrent@1 (complete) Req: 0.3 req/s, 3.32s Lat, 1.0 Conc, 18 Comp, 1 Inc, 0 Err │ +│ Tok: 74.0 gen/s, 238.6 tot/s, 40.2ms TTFT, 13.4ms ITL, 546 Prompt, 246 Gen │ +│ [17:35:35] ⠋ 100% concurrent@2 (complete) Req: 0.6 req/s, 3.46s Lat, 2.0 Conc, 34 Comp, 2 Inc, 0 Err │ +│ Tok: 139.6 gen/s, 454.0 tot/s, 48.0ms TTFT, 14.1ms ITL, 546 Prompt, 243 Gen │ +│ [17:36:40] ⠋ 100% concurrent@4 (complete) Req: 1.1 req/s, 3.44s Lat, 3.9 Conc, 68 Comp, 4 Inc, 0 Err │ +│ Tok: 273.2 gen/s, 900.4 tot/s, 50.7ms TTFT, 14.3ms ITL, 546 Prompt, 238 Gen │ +│ [17:37:45] ⠋ 100% concurrent@8 (complete) Req: 2.2 req/s, 3.55s Lat, 7.7 Conc, 129 Comp, 8 Inc, 0 Err │ +│ Tok: 519.1 gen/s, 1699.8 tot/s, 66.0ms TTFT, 14.6ms ITL, 547 Prompt, 240 Gen │ +│ [17:38:50] ⠋ 100% concurrent@16 (complete) Req: 4.1 req/s, 3.76s Lat, 15.5 Conc, 247 Comp, 16 Inc, 0 Err │ +│ Tok: 1005.5 gen/s, 3256.7 tot/s, 101.0ms TTFT, 15.0ms ITL, 547 Prompt, 244 Gen │ +│ [17:39:56] ⠋ 100% concurrent@32 (complete) Req: 8.1 req/s, 3.84s Lat, 30.9 Conc, 483 Comp, 32 Inc, 0 Err │ +│ Tok: 1926.3 gen/s, 6327.2 tot/s, 295.7ms TTFT, 14.8ms ITL, 547 Prompt, 239 Gen │ +│ [17:41:03] ⠋ 100% concurrent@64 (complete) Req: 9.9 req/s, 6.05s Lat, 59.7 Conc, 576 Comp, 58 Inc, 0 Err │ +│ Tok: 2381.0 gen/s, 7774.5 tot/s, 1196.2ms TTFT, 20.2ms ITL, 547 Prompt, 241 Gen │ +│ [17:42:10] ⠋ 100% concurrent@128 (complete) Req: 9.2 req/s, 11.59s Lat, 107.2 Conc, 514 Comp, 117 Inc, 0 Err │ +│ Tok: 2233.4 gen/s, 7286.3 tot/s, 2403.9ms TTFT, 38.2ms ITL, 547 Prompt, 242 Gen │ +╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ +Generating... 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (8/8) [ 0:08:41 < 0:00:00 ] + +Benchmarks Metadata: + Run id:511a14fd-ba11-4ffa-92ef-7cc23db4dd38 + Duration:528.5 seconds + Profile:type=concurrent, strategies=['concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent'], streams=[1, 2, 4, 8, 16, 32, 64, 128] + Args:max_number=None, max_duration=60.0, warmup_number=None, warmup_duration=3.0, cooldown_number=None, cooldown_duration=None + Worker:type_='generative_requests_worker' backend_type='openai_http' backend_target='http://llama-stack-benchmark-service:8323/v1/openai' backend_model='meta-llama/Llama-3.2-3B-Instruct' + backend_info={'max_output_tokens': 16384, 'timeout': 300, 'http2': True, 'follow_redirects': True, 'headers': {}, 'text_completions_path': '/v1/completions', 'chat_completions_path': + '/v1/chat/completions'} + Request Loader:type_='generative_request_loader' data='prompt_tokens=512,output_tokens=256' data_args=None processor='meta-llama/Llama-3.2-3B-Instruct' processor_args=None + Extras:None + + +Benchmarks Info: +=================================================================================================================================================== +Metadata |||| Requests Made ||| Prompt Tok/Req ||| Output Tok/Req ||| Prompt Tok Total||| Output Tok Total|| + Benchmark| Start Time| End Time| Duration (s)| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err +--------------|-----------|---------|-------------|------|-----|-----|------|------|----|------|------|----|-------|------|----|-------|------|---- + concurrent@1| 17:34:35| 17:35:35| 60.0| 18| 1| 0| 546.4| 512.0| 0.0| 246.0| 14.0| 0.0| 9835| 512| 0| 4428| 14| 0 + concurrent@2| 17:35:40| 17:36:40| 60.0| 34| 2| 0| 546.4| 512.0| 0.0| 242.7| 80.0| 0.0| 18577| 1024| 0| 8253| 160| 0 + concurrent@4| 17:36:45| 17:37:45| 60.0| 68| 4| 0| 546.4| 512.0| 0.0| 238.1| 103.2| 0.0| 37156| 2048| 0| 16188| 413| 0 + concurrent@8| 17:37:50| 17:38:50| 60.0| 129| 8| 0| 546.7| 512.0| 0.0| 240.3| 180.0| 0.0| 70518| 4096| 0| 31001| 1440| 0 + concurrent@16| 17:38:55| 17:39:55| 60.0| 247| 16| 0| 546.6| 512.0| 0.0| 244.1| 142.6| 0.0| 135002| 8192| 0| 60300| 2281| 0 + concurrent@32| 17:40:01| 17:41:01| 60.0| 483| 32| 0| 546.5| 512.0| 0.0| 239.2| 123.2| 0.0| 263972| 16384| 0| 115540| 3944| 0 + concurrent@64| 17:41:08| 17:42:08| 60.0| 576| 58| 0| 546.6| 512.0| 0.0| 241.3| 13.9| 0.0| 314817| 29696| 0| 138976| 807| 0 +concurrent@128| 17:42:15| 17:43:15| 60.0| 514| 117| 0| 546.5| 512.0| 0.0| 241.6| 143.9| 0.0| 280911| 59904| 0| 124160| 16832| 0 +=================================================================================================================================================== + + +Benchmarks Stats: +======================================================================================================================================================= +Metadata | Request Stats || Out Tok/sec| Tot Tok/sec| Req Latency (sec) ||| TTFT (ms) ||| ITL (ms) ||| TPOT (ms) || + Benchmark| Per Second| Concurrency| mean| mean| mean| median| p99| mean| median| p99| mean| median| p99| mean| median| p99 +--------------|-----------|------------|------------|------------|------|-------|------|-------|-------|-------|-----|-------|-----|-----|-------|----- + concurrent@1| 0.30| 1.00| 74.0| 238.6| 3.32| 3.43| 3.61| 40.2| 39.3| 51.2| 13.4| 13.3| 14.0| 13.3| 13.2| 
13.9 + concurrent@2| 0.58| 1.99| 139.6| 454.0| 3.46| 3.64| 3.74| 48.0| 45.8| 72.0| 14.1| 14.1| 14.5| 14.0| 14.0| 14.4 + concurrent@4| 1.15| 3.95| 273.2| 900.4| 3.44| 3.69| 3.74| 50.7| 47.2| 118.6| 14.3| 14.3| 14.4| 14.2| 14.2| 14.4 + concurrent@8| 2.16| 7.67| 519.1| 1699.8| 3.55| 3.76| 3.87| 66.0| 48.8| 208.2| 14.6| 14.5| 14.8| 14.5| 14.5| 14.8 + concurrent@16| 4.12| 15.48| 1005.5| 3256.7| 3.76| 3.90| 4.18| 101.0| 65.6| 396.7| 15.0| 15.0| 15.9| 15.0| 15.0| 15.9 + concurrent@32| 8.05| 30.89| 1926.3| 6327.2| 3.84| 4.04| 4.39| 295.7| 265.6| 720.4| 14.8| 14.9| 15.5| 14.8| 14.8| 15.3 + concurrent@64| 9.87| 59.74| 2381.0| 7774.5| 6.05| 6.18| 9.94| 1196.2| 1122.5| 4295.3| 20.2| 20.0| 25.8| 20.1| 19.9| 25.8 +concurrent@128| 9.25| 107.16| 2233.4| 7286.3| 11.59| 12.04| 14.46| 2403.9| 2322.3| 4001.5| 38.2| 38.5| 53.0| 38.0| 38.3| 52.7 +======================================================================================================================================================= + +Saving benchmarks report... +Benchmarks report saved to /benchmarks.json + +Benchmarking complete. diff --git a/benchmarking/k8s-benchmark/results/guidellm-benchmark-stack-s1-sw2-v1-20250922-104457.txt b/benchmarking/k8s-benchmark/results/guidellm-benchmark-stack-s1-sw2-v1-20250922-104457.txt new file mode 100644 index 000000000..21f1ef425 --- /dev/null +++ b/benchmarking/k8s-benchmark/results/guidellm-benchmark-stack-s1-sw2-v1-20250922-104457.txt @@ -0,0 +1,171 @@ +Collecting uv + Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB) +Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.9 MB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 149.3 MB/s eta 0:00:00 +Installing collected packages: uv +Successfully installed uv-0.8.19 +WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. 
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv + +[notice] A new release of pip is available: 24.0 -> 25.2 +[notice] To update, run: pip install --upgrade pip +Using Python 3.11.13 environment at: /usr/local +Resolved 61 packages in 494ms +Downloading pandas (11.8MiB) +Downloading tokenizers (3.1MiB) +Downloading pygments (1.2MiB) +Downloading aiohttp (1.7MiB) +Downloading transformers (11.1MiB) +Downloading numpy (16.2MiB) +Downloading pillow (6.3MiB) +Downloading pydantic-core (1.9MiB) +Downloading hf-xet (3.0MiB) +Downloading pyarrow (40.8MiB) + Downloading pydantic-core + Downloading aiohttp + Downloading tokenizers + Downloading hf-xet + Downloading pillow + Downloading pygments + Downloading numpy + Downloading pandas + Downloading pyarrow + Downloading transformers +Prepared 61 packages in 1.24s +Installed 61 packages in 126ms + + aiohappyeyeballs==2.6.1 + + aiohttp==3.12.15 + + aiosignal==1.4.0 + + annotated-types==0.7.0 + + anyio==4.10.0 + + attrs==25.3.0 + + certifi==2025.8.3 + + charset-normalizer==3.4.3 + + click==8.1.8 + + datasets==4.1.1 + + dill==0.4.0 + + filelock==3.19.1 + + frozenlist==1.7.0 + + fsspec==2025.9.0 + + ftfy==6.3.1 + + guidellm==0.3.0 + + h11==0.16.0 + + h2==4.3.0 + + hf-xet==1.1.10 + + hpack==4.1.0 + + httpcore==1.0.9 + + httpx==0.28.1 + + huggingface-hub==0.35.0 + + hyperframe==6.1.0 + + idna==3.10 + + loguru==0.7.3 + + markdown-it-py==4.0.0 + + mdurl==0.1.2 + + multidict==6.6.4 + + multiprocess==0.70.16 + + numpy==2.3.3 + + packaging==25.0 + + pandas==2.3.2 + + pillow==11.3.0 + + propcache==0.3.2 + + protobuf==6.32.1 + + pyarrow==21.0.0 + + pydantic==2.11.9 + + pydantic-core==2.33.2 + + pydantic-settings==2.10.1 + + pygments==2.19.2 + + python-dateutil==2.9.0.post0 + + python-dotenv==1.1.1 + + pytz==2025.2 + + pyyaml==6.0.2 + + regex==2025.9.18 + + requests==2.32.5 + + rich==14.1.0 + + safetensors==0.6.2 + + six==1.17.0 + + sniffio==1.3.1 + + tokenizers==0.22.1 + + tqdm==4.67.1 + + transformers==4.56.2 + + typing-extensions==4.15.0 + + typing-inspection==0.4.1 + + tzdata==2025.2 + + urllib3==2.5.0 + + wcwidth==0.2.14 + + xxhash==3.5.0 + + yarl==1.20.1 +Using Python 3.11.13 environment at: /usr/local +Audited 1 package in 3ms +Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured. +Creating backend... +Backend openai_http connected to http://llama-stack-benchmark-service:8323/v1/openai for model meta-llama/Llama-3.2-3B-Instruct. +Creating request loader... +Created loader with 1000 unique requests from prompt_tokens=512,output_tokens=256. 
+ + +╭─ Benchmarks ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ +│ [17:45:18] ⠋ 100% concurrent@1 (complete) Req: 0.3 req/s, 3.42s Lat, 1.0 Conc, 17 Comp, 1 Inc, 0 Err │ +│ Tok: 73.9 gen/s, 233.7 tot/s, 50.2ms TTFT, 13.4ms ITL, 547 Prompt, 253 Gen │ +│ [17:46:23] ⠋ 100% concurrent@2 (complete) Req: 0.6 req/s, 3.42s Lat, 2.0 Conc, 34 Comp, 2 Inc, 0 Err │ +│ Tok: 134.7 gen/s, 447.4 tot/s, 50.8ms TTFT, 14.3ms ITL, 546 Prompt, 235 Gen │ +│ [17:47:28] ⠋ 100% concurrent@4 (complete) Req: 1.1 req/s, 3.55s Lat, 3.9 Conc, 66 Comp, 4 Inc, 0 Err │ +│ Tok: 268.7 gen/s, 873.1 tot/s, 54.9ms TTFT, 14.4ms ITL, 547 Prompt, 243 Gen │ +│ [17:48:33] ⠋ 100% concurrent@8 (complete) Req: 2.2 req/s, 3.56s Lat, 7.8 Conc, 130 Comp, 8 Inc, 0 Err │ +│ Tok: 526.1 gen/s, 1728.4 tot/s, 60.6ms TTFT, 14.7ms ITL, 547 Prompt, 239 Gen │ +│ [17:49:38] ⠋ 100% concurrent@16 (complete) Req: 4.1 req/s, 3.79s Lat, 15.7 Conc, 246 Comp, 16 Inc, 0 Err │ +│ Tok: 1006.9 gen/s, 3268.6 tot/s, 74.8ms TTFT, 15.3ms ITL, 547 Prompt, 243 Gen │ +│ [17:50:44] ⠋ 100% concurrent@32 (complete) Req: 7.8 req/s, 3.95s Lat, 30.9 Conc, 467 Comp, 32 Inc, 0 Err │ +│ Tok: 1912.0 gen/s, 6191.6 tot/s, 119.1ms TTFT, 15.7ms ITL, 547 Prompt, 244 Gen │ +│ [17:51:50] ⠋ 100% concurrent@64 (complete) Req: 13.0 req/s, 4.75s Lat, 61.8 Conc, 776 Comp, 64 Inc, 0 Err │ +│ Tok: 3154.3 gen/s, 10273.3 tot/s, 339.1ms TTFT, 18.3ms ITL, 547 Prompt, 242 Gen │ +│ [17:52:58] ⠋ 100% concurrent@128 (complete) Req: 15.1 req/s, 7.82s Lat, 117.7 Conc, 898 Comp, 127 Inc, 0 Err │ +│ Tok: 3617.4 gen/s, 11843.9 tot/s, 1393.8ms TTFT, 26.8ms ITL, 547 Prompt, 240 Gen │ +╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ +Generating... 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (8/8) [ 0:08:41 < 0:00:00 ] + +Benchmarks Metadata: + Run id:f73d408e-256a-4c32-aa40-05e8d7098b66 + Duration:529.2 seconds + Profile:type=concurrent, strategies=['concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent'], streams=[1, 2, 4, 8, 16, 32, 64, 128] + Args:max_number=None, max_duration=60.0, warmup_number=None, warmup_duration=3.0, cooldown_number=None, cooldown_duration=None + Worker:type_='generative_requests_worker' backend_type='openai_http' backend_target='http://llama-stack-benchmark-service:8323/v1/openai' backend_model='meta-llama/Llama-3.2-3B-Instruct' + backend_info={'max_output_tokens': 16384, 'timeout': 300, 'http2': True, 'follow_redirects': True, 'headers': {}, 'text_completions_path': '/v1/completions', 'chat_completions_path': + '/v1/chat/completions'} + Request Loader:type_='generative_request_loader' data='prompt_tokens=512,output_tokens=256' data_args=None processor='meta-llama/Llama-3.2-3B-Instruct' processor_args=None + Extras:None + + +Benchmarks Info: +===================================================================================================================================================== +Metadata |||| Requests Made ||| Prompt Tok/Req ||| Output Tok/Req ||| Prompt Tok Total||| Output Tok Total || + Benchmark| Start Time| End Time| Duration (s)| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err +--------------|-----------|---------|-------------|------|-----|-----|------|------|----|------|------|----|-------|------|----|--------|------|----- + concurrent@1| 17:45:23| 17:46:23| 60.0| 17| 1| 0| 546.6| 512.0| 0.0| 252.8| 136.0| 0.0| 9292| 512| 0| 4298| 136| 0 + concurrent@2| 17:46:28| 17:47:28| 60.0| 34| 2| 0| 546.4| 512.0| 0.0| 235.4| 130.0| 0.0| 18577| 1024| 0| 8003| 260| 0 + concurrent@4| 17:47:33| 17:48:33| 60.0| 66| 4| 0| 546.5| 512.0| 0.0| 243.0| 97.5| 0.0| 36072| 2048| 0| 16035| 390| 0 + concurrent@8| 17:48:38| 17:49:38| 60.0| 130| 8| 0| 546.6| 512.0| 0.0| 239.2| 146.0| 0.0| 71052| 4096| 0| 31090| 1168| 0 + concurrent@16| 17:49:43| 17:50:43| 60.0| 246| 16| 0| 546.6| 512.0| 0.0| 243.3| 112.3| 0.0| 134456| 8192| 0| 59862| 1797| 0 + concurrent@32| 17:50:49| 17:51:49| 60.0| 467| 32| 0| 546.6| 512.0| 0.0| 244.2| 147.3| 0.0| 255242| 16384| 0| 114038| 4714| 0 + concurrent@64| 17:51:55| 17:52:55| 60.0| 776| 64| 0| 546.5| 512.0| 0.0| 242.2| 106.1| 0.0| 424115| 32768| 0| 187916| 6788| 0 +concurrent@128| 17:53:03| 17:54:03| 60.0| 898| 127| 0| 546.5| 512.0| 0.0| 240.3| 69.8| 0.0| 490789| 65024| 0| 215810| 8864| 0 +===================================================================================================================================================== + + +Benchmarks Stats: +====================================================================================================================================================== +Metadata | Request Stats || Out Tok/sec| Tot Tok/sec| Req Latency (sec)||| TTFT (ms) ||| ITL (ms) ||| TPOT (ms) || + Benchmark| Per Second| Concurrency| mean| mean| mean| median| p99| mean| median| p99| mean| median| p99| mean| median| p99 +--------------|-----------|------------|------------|------------|-----|-------|------|-------|-------|-------|-----|-------|-----|-----|-------|----- + concurrent@1| 0.29| 1.00| 73.9| 233.7| 3.42| 3.45| 3.50| 50.2| 50.9| 62.5| 13.4| 13.4| 13.5| 13.3| 
13.3| 13.5 + concurrent@2| 0.57| 1.96| 134.7| 447.4| 3.42| 3.67| 4.12| 50.8| 49.2| 79.8| 14.3| 14.2| 15.9| 14.3| 14.2| 15.9 + concurrent@4| 1.11| 3.92| 268.7| 873.1| 3.55| 3.72| 3.80| 54.9| 51.7| 101.3| 14.4| 14.4| 14.5| 14.4| 14.4| 14.5 + concurrent@8| 2.20| 7.82| 526.1| 1728.4| 3.56| 3.78| 3.93| 60.6| 49.8| 189.5| 14.7| 14.7| 14.8| 14.6| 14.6| 14.8 + concurrent@16| 4.14| 15.66| 1006.9| 3268.6| 3.79| 3.94| 4.25| 74.8| 54.3| 328.4| 15.3| 15.3| 16.1| 15.2| 15.2| 16.0 + concurrent@32| 7.83| 30.91| 1912.0| 6191.6| 3.95| 4.07| 4.53| 119.1| 80.5| 674.0| 15.7| 15.6| 17.4| 15.7| 15.6| 17.3 + concurrent@64| 13.03| 61.85| 3154.3| 10273.3| 4.75| 4.93| 5.43| 339.1| 321.1| 1146.6| 18.3| 18.4| 19.3| 18.2| 18.3| 19.2 +concurrent@128| 15.05| 117.71| 3617.4| 11843.9| 7.82| 8.58| 13.35| 1393.8| 1453.0| 5232.2| 26.8| 26.7| 36.0| 26.7| 26.6| 35.9 +====================================================================================================================================================== + +Saving benchmarks report... +Benchmarks report saved to /benchmarks.json + +Benchmarking complete. diff --git a/benchmarking/k8s-benchmark/results/guidellm-benchmark-stack-s1-sw4-v1-20250922-105539.txt b/benchmarking/k8s-benchmark/results/guidellm-benchmark-stack-s1-sw4-v1-20250922-105539.txt new file mode 100644 index 000000000..a192f0ba3 --- /dev/null +++ b/benchmarking/k8s-benchmark/results/guidellm-benchmark-stack-s1-sw4-v1-20250922-105539.txt @@ -0,0 +1,171 @@ +Collecting uv + Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB) +Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.9 MB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 156.8 MB/s eta 0:00:00 +Installing collected packages: uv +Successfully installed uv-0.8.19 +WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. 
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv + +[notice] A new release of pip is available: 24.0 -> 25.2 +[notice] To update, run: pip install --upgrade pip +Using Python 3.11.13 environment at: /usr/local +Resolved 61 packages in 480ms +Downloading pillow (6.3MiB) +Downloading pydantic-core (1.9MiB) +Downloading pyarrow (40.8MiB) +Downloading aiohttp (1.7MiB) +Downloading numpy (16.2MiB) +Downloading pygments (1.2MiB) +Downloading transformers (11.1MiB) +Downloading pandas (11.8MiB) +Downloading tokenizers (3.1MiB) +Downloading hf-xet (3.0MiB) + Downloading pydantic-core + Downloading aiohttp + Downloading tokenizers + Downloading hf-xet + Downloading pygments + Downloading pillow + Downloading numpy + Downloading pandas + Downloading pyarrow + Downloading transformers +Prepared 61 packages in 1.25s +Installed 61 packages in 126ms + + aiohappyeyeballs==2.6.1 + + aiohttp==3.12.15 + + aiosignal==1.4.0 + + annotated-types==0.7.0 + + anyio==4.10.0 + + attrs==25.3.0 + + certifi==2025.8.3 + + charset-normalizer==3.4.3 + + click==8.1.8 + + datasets==4.1.1 + + dill==0.4.0 + + filelock==3.19.1 + + frozenlist==1.7.0 + + fsspec==2025.9.0 + + ftfy==6.3.1 + + guidellm==0.3.0 + + h11==0.16.0 + + h2==4.3.0 + + hf-xet==1.1.10 + + hpack==4.1.0 + + httpcore==1.0.9 + + httpx==0.28.1 + + huggingface-hub==0.35.0 + + hyperframe==6.1.0 + + idna==3.10 + + loguru==0.7.3 + + markdown-it-py==4.0.0 + + mdurl==0.1.2 + + multidict==6.6.4 + + multiprocess==0.70.16 + + numpy==2.3.3 + + packaging==25.0 + + pandas==2.3.2 + + pillow==11.3.0 + + propcache==0.3.2 + + protobuf==6.32.1 + + pyarrow==21.0.0 + + pydantic==2.11.9 + + pydantic-core==2.33.2 + + pydantic-settings==2.10.1 + + pygments==2.19.2 + + python-dateutil==2.9.0.post0 + + python-dotenv==1.1.1 + + pytz==2025.2 + + pyyaml==6.0.2 + + regex==2025.9.18 + + requests==2.32.5 + + rich==14.1.0 + + safetensors==0.6.2 + + six==1.17.0 + + sniffio==1.3.1 + + tokenizers==0.22.1 + + tqdm==4.67.1 + + transformers==4.56.2 + + typing-extensions==4.15.0 + + typing-inspection==0.4.1 + + tzdata==2025.2 + + urllib3==2.5.0 + + wcwidth==0.2.14 + + xxhash==3.5.0 + + yarl==1.20.1 +Using Python 3.11.13 environment at: /usr/local +Audited 1 package in 4ms +Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured. +Creating backend... +Backend openai_http connected to http://llama-stack-benchmark-service:8323/v1/openai for model meta-llama/Llama-3.2-3B-Instruct. +Creating request loader... +Created loader with 1000 unique requests from prompt_tokens=512,output_tokens=256. 
+ + +╭─ Benchmarks ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ +│ [17:55:59] ⠋ 100% concurrent@1 (complete) Req: 0.3 req/s, 3.33s Lat, 1.0 Conc, 18 Comp, 1 Inc, 0 Err │ +│ Tok: 74.0 gen/s, 238.0 tot/s, 49.6ms TTFT, 13.4ms ITL, 546 Prompt, 246 Gen │ +│ [17:57:04] ⠋ 100% concurrent@2 (complete) Req: 0.6 req/s, 3.32s Lat, 1.9 Conc, 35 Comp, 2 Inc, 0 Err │ +│ Tok: 137.1 gen/s, 457.5 tot/s, 50.6ms TTFT, 14.0ms ITL, 546 Prompt, 234 Gen │ +│ [17:58:09] ⠋ 100% concurrent@4 (complete) Req: 1.2 req/s, 3.42s Lat, 4.0 Conc, 69 Comp, 4 Inc, 0 Err │ +│ Tok: 276.7 gen/s, 907.2 tot/s, 52.7ms TTFT, 14.1ms ITL, 547 Prompt, 240 Gen │ +│ [17:59:14] ⠋ 100% concurrent@8 (complete) Req: 2.3 req/s, 3.47s Lat, 7.8 Conc, 134 Comp, 8 Inc, 0 Err │ +│ Tok: 541.4 gen/s, 1775.4 tot/s, 57.3ms TTFT, 14.3ms ITL, 547 Prompt, 240 Gen │ +│ [18:00:19] ⠋ 100% concurrent@16 (complete) Req: 4.3 req/s, 3.60s Lat, 15.6 Conc, 259 Comp, 16 Inc, 0 Err │ +│ Tok: 1034.8 gen/s, 3401.7 tot/s, 72.3ms TTFT, 14.8ms ITL, 547 Prompt, 239 Gen │ +│ [18:01:25] ⠋ 100% concurrent@32 (complete) Req: 8.4 req/s, 3.69s Lat, 31.1 Conc, 505 Comp, 32 Inc, 0 Err │ +│ Tok: 2029.7 gen/s, 6641.5 tot/s, 91.6ms TTFT, 15.0ms ITL, 547 Prompt, 241 Gen │ +│ [18:02:31] ⠋ 100% concurrent@64 (complete) Req: 13.6 req/s, 4.50s Lat, 61.4 Conc, 818 Comp, 64 Inc, 0 Err │ +│ Tok: 3333.9 gen/s, 10787.0 tot/s, 171.3ms TTFT, 17.8ms ITL, 547 Prompt, 244 Gen │ +│ [18:03:40] ⠋ 100% concurrent@128 (complete) Req: 16.1 req/s, 7.43s Lat, 119.5 Conc, 964 Comp, 122 Inc, 0 Err │ +│ Tok: 3897.0 gen/s, 12679.4 tot/s, 446.4ms TTFT, 28.9ms ITL, 547 Prompt, 243 Gen │ +╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ +Generating... 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (8/8) [ 0:08:41 < 0:00:00 ] + +Benchmarks Metadata: + Run id:5393e64f-d9f8-4548-95d8-da320bba1c24 + Duration:530.1 seconds + Profile:type=concurrent, strategies=['concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent'], streams=[1, 2, 4, 8, 16, 32, 64, 128] + Args:max_number=None, max_duration=60.0, warmup_number=None, warmup_duration=3.0, cooldown_number=None, cooldown_duration=None + Worker:type_='generative_requests_worker' backend_type='openai_http' backend_target='http://llama-stack-benchmark-service:8323/v1/openai' backend_model='meta-llama/Llama-3.2-3B-Instruct' + backend_info={'max_output_tokens': 16384, 'timeout': 300, 'http2': True, 'follow_redirects': True, 'headers': {}, 'text_completions_path': '/v1/completions', 'chat_completions_path': + '/v1/chat/completions'} + Request Loader:type_='generative_request_loader' data='prompt_tokens=512,output_tokens=256' data_args=None processor='meta-llama/Llama-3.2-3B-Instruct' processor_args=None + Extras:None + + +Benchmarks Info: +=================================================================================================================================================== +Metadata |||| Requests Made ||| Prompt Tok/Req ||| Output Tok/Req ||| Prompt Tok Total||| Output Tok Total|| + Benchmark| Start Time| End Time| Duration (s)| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err +--------------|-----------|---------|-------------|------|-----|-----|------|------|----|------|------|----|-------|------|----|-------|------|---- + concurrent@1| 17:56:04| 17:57:04| 60.0| 18| 1| 0| 546.4| 512.0| 0.0| 246.4| 256.0| 0.0| 9836| 512| 0| 4436| 256| 0 + concurrent@2| 17:57:09| 17:58:09| 60.0| 35| 2| 0| 546.4| 512.0| 0.0| 233.9| 132.0| 0.0| 19124| 1024| 0| 8188| 264| 0 + concurrent@4| 17:58:14| 17:59:14| 60.0| 69| 4| 0| 546.6| 512.0| 0.0| 239.9| 60.5| 0.0| 37715| 2048| 0| 16553| 242| 0 + concurrent@8| 17:59:19| 18:00:19| 60.0| 134| 8| 0| 546.6| 512.0| 0.0| 239.8| 126.6| 0.0| 73243| 4096| 0| 32135| 1013| 0 + concurrent@16| 18:00:24| 18:01:24| 60.0| 259| 16| 0| 546.6| 512.0| 0.0| 239.0| 115.7| 0.0| 141561| 8192| 0| 61889| 1851| 0 + concurrent@32| 18:01:30| 18:02:30| 60.0| 505| 32| 0| 546.5| 512.0| 0.0| 240.5| 113.2| 0.0| 275988| 16384| 0| 121466| 3623| 0 + concurrent@64| 18:02:37| 18:03:37| 60.0| 818| 64| 0| 546.6| 512.0| 0.0| 244.5| 132.4| 0.0| 447087| 32768| 0| 199988| 8475| 0 +concurrent@128| 18:03:45| 18:04:45| 60.0| 964| 122| 0| 546.5| 512.0| 0.0| 242.5| 133.1| 0.0| 526866| 62464| 0| 233789| 16241| 0 +=================================================================================================================================================== + + +Benchmarks Stats: +======================================================================================================================================================= +Metadata | Request Stats || Out Tok/sec| Tot Tok/sec| Req Latency (sec) ||| TTFT (ms) ||| ITL (ms) ||| TPOT (ms) || + Benchmark| Per Second| Concurrency| mean| mean| mean| median| p99| mean| median| p99| mean| median| p99| mean| median| p99 +--------------|-----------|------------|------------|------------|------|--------|------|------|-------|-------|-----|-------|-----|-----|-------|----- + concurrent@1| 0.30| 1.00| 74.0| 238.0| 3.33| 3.44| 3.63| 49.6| 47.2| 66.1| 13.4| 13.3| 14.0| 13.3| 
13.3| 14.0 + concurrent@2| 0.59| 1.95| 137.1| 457.5| 3.32| 3.61| 3.67| 50.6| 48.6| 80.4| 14.0| 14.0| 14.2| 13.9| 13.9| 14.1 + concurrent@4| 1.15| 3.95| 276.7| 907.2| 3.42| 3.61| 3.77| 52.7| 49.7| 106.9| 14.1| 14.0| 14.6| 14.0| 13.9| 14.5 + concurrent@8| 2.26| 7.83| 541.4| 1775.4| 3.47| 3.70| 3.79| 57.3| 50.9| 171.3| 14.3| 14.3| 14.4| 14.2| 14.2| 14.4 + concurrent@16| 4.33| 15.57| 1034.8| 3401.7| 3.60| 3.81| 4.22| 72.3| 52.0| 292.9| 14.8| 14.7| 16.3| 14.7| 14.7| 16.3 + concurrent@32| 8.44| 31.12| 2029.7| 6641.5| 3.69| 3.89| 4.24| 91.6| 62.6| 504.6| 15.0| 15.0| 15.4| 14.9| 14.9| 15.4 + concurrent@64| 13.64| 61.40| 3333.9| 10787.0| 4.50| 4.61| 5.67| 171.3| 101.2| 1165.6| 17.8| 17.7| 19.2| 17.7| 17.6| 19.1 +concurrent@128| 16.07| 119.45| 3897.0| 12679.4| 7.43| 7.63| 9.74| 446.4| 195.8| 2533.1| 28.9| 28.9| 31.0| 28.8| 28.8| 30.9 +======================================================================================================================================================= + +Saving benchmarks report... +Benchmarks report saved to /benchmarks.json + +Benchmarking complete. diff --git a/benchmarking/k8s-benchmark/results/guidellm-benchmark-vllm-v1-20250922-111127.txt b/benchmarking/k8s-benchmark/results/guidellm-benchmark-vllm-v1-20250922-111127.txt new file mode 100644 index 000000000..8bee7d905 --- /dev/null +++ b/benchmarking/k8s-benchmark/results/guidellm-benchmark-vllm-v1-20250922-111127.txt @@ -0,0 +1,170 @@ +Collecting uv + Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB) +Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.9 MB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 126.9 MB/s eta 0:00:00 +Installing collected packages: uv +Successfully installed uv-0.8.19 +WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. 
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv + +[notice] A new release of pip is available: 24.0 -> 25.2 +[notice] To update, run: pip install --upgrade pip +Using Python 3.11.13 environment at: /usr/local +Resolved 61 packages in 561ms +Downloading hf-xet (3.0MiB) +Downloading pillow (6.3MiB) +Downloading transformers (11.1MiB) +Downloading pyarrow (40.8MiB) +Downloading numpy (16.2MiB) +Downloading pandas (11.8MiB) +Downloading tokenizers (3.1MiB) +Downloading pydantic-core (1.9MiB) +Downloading pygments (1.2MiB) +Downloading aiohttp (1.7MiB) + Downloading pydantic-core + Downloading aiohttp + Downloading tokenizers + Downloading hf-xet + Downloading pygments + Downloading pillow + Downloading numpy + Downloading pandas + Downloading transformers + Downloading pyarrow +Prepared 61 packages in 1.25s +Installed 61 packages in 114ms + + aiohappyeyeballs==2.6.1 + + aiohttp==3.12.15 + + aiosignal==1.4.0 + + annotated-types==0.7.0 + + anyio==4.10.0 + + attrs==25.3.0 + + certifi==2025.8.3 + + charset-normalizer==3.4.3 + + click==8.1.8 + + datasets==4.1.1 + + dill==0.4.0 + + filelock==3.19.1 + + frozenlist==1.7.0 + + fsspec==2025.9.0 + + ftfy==6.3.1 + + guidellm==0.3.0 + + h11==0.16.0 + + h2==4.3.0 + + hf-xet==1.1.10 + + hpack==4.1.0 + + httpcore==1.0.9 + + httpx==0.28.1 + + huggingface-hub==0.35.0 + + hyperframe==6.1.0 + + idna==3.10 + + loguru==0.7.3 + + markdown-it-py==4.0.0 + + mdurl==0.1.2 + + multidict==6.6.4 + + multiprocess==0.70.16 + + numpy==2.3.3 + + packaging==25.0 + + pandas==2.3.2 + + pillow==11.3.0 + + propcache==0.3.2 + + protobuf==6.32.1 + + pyarrow==21.0.0 + + pydantic==2.11.9 + + pydantic-core==2.33.2 + + pydantic-settings==2.10.1 + + pygments==2.19.2 + + python-dateutil==2.9.0.post0 + + python-dotenv==1.1.1 + + pytz==2025.2 + + pyyaml==6.0.2 + + regex==2025.9.18 + + requests==2.32.5 + + rich==14.1.0 + + safetensors==0.6.2 + + six==1.17.0 + + sniffio==1.3.1 + + tokenizers==0.22.1 + + tqdm==4.67.1 + + transformers==4.56.2 + + typing-extensions==4.15.0 + + typing-inspection==0.4.1 + + tzdata==2025.2 + + urllib3==2.5.0 + + wcwidth==0.2.14 + + xxhash==3.5.0 + + yarl==1.20.1 +Using Python 3.11.13 environment at: /usr/local +Audited 1 package in 3ms +Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured. +Creating backend... +Backend openai_http connected to http://vllm-server:8000 for model meta-llama/Llama-3.2-3B-Instruct. +Creating request loader... +Created loader with 1000 unique requests from prompt_tokens=512,output_tokens=256. 
+ + +╭─ Benchmarks ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ +│ [18:11:47] ⠋ 100% concurrent@1 (complete) Req: 0.3 req/s, 3.35s Lat, 1.0 Conc, 17 Comp, 1 Inc, 0 Err │ +│ Tok: 76.4 gen/s, 239.4 tot/s, 29.6ms TTFT, 13.0ms ITL, 547 Prompt, 256 Gen │ +│ [18:12:52] ⠋ 100% concurrent@2 (complete) Req: 0.6 req/s, 3.53s Lat, 2.0 Conc, 32 Comp, 2 Inc, 0 Err │ +│ Tok: 145.0 gen/s, 454.5 tot/s, 36.9ms TTFT, 13.7ms ITL, 546 Prompt, 256 Gen │ +│ [18:13:57] ⠋ 100% concurrent@4 (complete) Req: 1.1 req/s, 3.59s Lat, 4.0 Conc, 64 Comp, 4 Inc, 0 Err │ +│ Tok: 284.8 gen/s, 892.7 tot/s, 59.0ms TTFT, 13.9ms ITL, 546 Prompt, 256 Gen │ +│ [18:15:02] ⠋ 100% concurrent@8 (complete) Req: 2.2 req/s, 3.70s Lat, 8.0 Conc, 128 Comp, 7 Inc, 0 Err │ +│ Tok: 553.5 gen/s, 1735.2 tot/s, 79.8ms TTFT, 14.2ms ITL, 547 Prompt, 256 Gen │ +│ [18:16:08] ⠋ 100% concurrent@16 (complete) Req: 4.2 req/s, 3.83s Lat, 16.0 Conc, 240 Comp, 16 Inc, 0 Err │ +│ Tok: 1066.9 gen/s, 3344.6 tot/s, 97.5ms TTFT, 14.6ms ITL, 547 Prompt, 256 Gen │ +│ [18:17:13] ⠋ 100% concurrent@32 (complete) Req: 8.1 req/s, 3.94s Lat, 31.8 Conc, 480 Comp, 31 Inc, 0 Err │ +│ Tok: 2069.7 gen/s, 6488.4 tot/s, 120.8ms TTFT, 15.0ms ITL, 547 Prompt, 256 Gen │ +│ [18:18:20] ⠋ 100% concurrent@64 (complete) Req: 13.6 req/s, 4.60s Lat, 62.3 Conc, 813 Comp, 57 Inc, 0 Err │ +│ Tok: 3472.1 gen/s, 10884.9 tot/s, 190.9ms TTFT, 17.3ms ITL, 547 Prompt, 256 Gen │ +│ [18:19:28] ⠋ 100% concurrent@128 (complete) Req: 16.8 req/s, 7.37s Lat, 123.5 Conc, 1005 Comp, 126 Inc, 0 Err │ +│ Tok: 4289.1 gen/s, 13445.8 tot/s, 356.4ms TTFT, 27.5ms ITL, 547 Prompt, 256 Gen │ +╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ +Generating... 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (8/8) [ 0:08:43 < 0:00:00 ] + +Benchmarks Metadata: + Run id:8ccb6da1-83f4-4624-8d84-07c723b0b2a5 + Duration:530.4 seconds + Profile:type=concurrent, strategies=['concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent'], streams=[1, 2, 4, 8, 16, 32, 64, 128] + Args:max_number=None, max_duration=60.0, warmup_number=None, warmup_duration=3.0, cooldown_number=None, cooldown_duration=None + Worker:type_='generative_requests_worker' backend_type='openai_http' backend_target='http://vllm-server:8000' backend_model='meta-llama/Llama-3.2-3B-Instruct' backend_info={'max_output_tokens': + 16384, 'timeout': 300, 'http2': True, 'follow_redirects': True, 'headers': {}, 'text_completions_path': '/v1/completions', 'chat_completions_path': '/v1/chat/completions'} + Request Loader:type_='generative_request_loader' data='prompt_tokens=512,output_tokens=256' data_args=None processor='meta-llama/Llama-3.2-3B-Instruct' processor_args=None + Extras:None + + +Benchmarks Info: +===================================================================================================================================================== +Metadata |||| Requests Made ||| Prompt Tok/Req ||| Output Tok/Req ||| Prompt Tok Total||| Output Tok Total || + Benchmark| Start Time| End Time| Duration (s)| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err +--------------|-----------|---------|-------------|------|-----|-----|------|------|----|------|------|----|-------|------|----|--------|------|----- + concurrent@1| 18:11:52| 18:12:52| 60.0| 17| 1| 0| 546.5| 512.0| 0.0| 256.0| 231.0| 0.0| 9291| 512| 0| 4352| 231| 0 + concurrent@2| 18:12:57| 18:13:57| 60.0| 32| 2| 0| 546.5| 512.0| 0.0| 256.0| 251.0| 0.0| 17488| 1024| 0| 8192| 502| 0 + concurrent@4| 18:14:02| 18:15:02| 60.0| 64| 4| 0| 546.4| 512.0| 0.0| 256.0| 175.2| 0.0| 34972| 2048| 0| 16384| 701| 0 + concurrent@8| 18:15:07| 18:16:07| 60.0| 128| 7| 0| 546.6| 512.0| 0.0| 256.0| 50.7| 0.0| 69966| 3584| 0| 32768| 355| 0 + concurrent@16| 18:16:13| 18:17:13| 60.0| 240| 16| 0| 546.5| 512.0| 0.0| 256.0| 166.0| 0.0| 131170| 8192| 0| 61440| 2656| 0 + concurrent@32| 18:17:18| 18:18:18| 60.0| 480| 31| 0| 546.5| 512.0| 0.0| 256.0| 47.4| 0.0| 262339| 15872| 0| 122880| 1468| 0 + concurrent@64| 18:18:25| 18:19:25| 60.0| 813| 57| 0| 546.5| 512.0| 0.0| 256.0| 110.7| 0.0| 444341| 29184| 0| 208128| 6311| 0 +concurrent@128| 18:19:33| 18:20:33| 60.0| 1005| 126| 0| 546.5| 512.0| 0.0| 256.0| 65.8| 0.0| 549264| 64512| 0| 257280| 8296| 0 +===================================================================================================================================================== + + +Benchmarks Stats: +======================================================================================================================================================= +Metadata | Request Stats || Out Tok/sec| Tot Tok/sec| Req Latency (sec) ||| TTFT (ms) ||| ITL (ms) ||| TPOT (ms) || + Benchmark| Per Second| Concurrency| mean| mean| mean| median| p99| mean| median| p99| mean| median| p99| mean| median| p99 +--------------|-----------|------------|------------|------------|------|--------|------|------|-------|-------|-----|-------|-----|-----|-------|----- + concurrent@1| 0.30| 1.00| 76.4| 239.4| 3.35| 3.35| 3.38| 29.6| 29.0| 38.9| 13.0| 13.0| 13.1| 13.0| 13.0| 13.0 + concurrent@2| 
0.57| 2.00| 145.0| 454.5| 3.53| 3.53| 3.55| 36.9| 39.0| 59.6| 13.7| 13.7| 13.8| 13.6| 13.7| 13.7 + concurrent@4| 1.11| 4.00| 284.8| 892.7| 3.59| 3.59| 3.65| 59.0| 65.7| 88.2| 13.9| 13.8| 14.1| 13.8| 13.8| 14.0 + concurrent@8| 2.16| 7.99| 553.5| 1735.2| 3.70| 3.69| 3.76| 79.8| 80.7| 152.6| 14.2| 14.2| 14.5| 14.1| 14.1| 14.4 + concurrent@16| 4.17| 15.97| 1066.9| 3344.6| 3.83| 3.82| 3.99| 97.5| 96.3| 283.9| 14.6| 14.6| 14.9| 14.6| 14.6| 14.8 + concurrent@32| 8.08| 31.84| 2069.7| 6488.4| 3.94| 3.90| 4.31| 120.8| 101.7| 564.3| 15.0| 14.9| 15.9| 14.9| 14.8| 15.9 + concurrent@64| 13.56| 62.34| 3472.1| 10884.9| 4.60| 4.54| 5.43| 190.9| 133.9| 1113.2| 17.3| 17.2| 18.2| 17.2| 17.2| 18.2 +concurrent@128| 16.75| 123.45| 4289.1| 13445.8| 7.37| 7.21| 9.21| 356.4| 161.9| 2319.9| 27.5| 27.5| 28.8| 27.4| 27.4| 28.7 +======================================================================================================================================================= + +Saving benchmarks report... +Benchmarks report saved to /benchmarks.json + +Benchmarking complete. diff --git a/benchmarking/k8s-benchmark/results/vllm_replica1_benchmark_results.png b/benchmarking/k8s-benchmark/results/vllm_replica1_benchmark_results.png new file mode 100644 index 000000000..86c6c046e Binary files /dev/null and b/benchmarking/k8s-benchmark/results/vllm_replica1_benchmark_results.png differ diff --git a/benchmarking/k8s-benchmark/run-benchmark.sh b/benchmarking/k8s-benchmark/run-benchmark.sh deleted file mode 100755 index e1c826143..000000000 --- a/benchmarking/k8s-benchmark/run-benchmark.sh +++ /dev/null @@ -1,148 +0,0 @@ -#!/usr/bin/env bash - -# Copyright (c) Meta Platforms, Inc. and affiliates. -# All rights reserved. -# -# This source code is licensed under the terms described in the LICENSE file in -# the root directory of this source tree. - -set -euo pipefail - -# Default values -TARGET="stack" -DURATION=60 -CONCURRENT=10 - -# Parse command line arguments -usage() { - echo "Usage: $0 [options]" - echo "Options:" - echo " -t, --target Target to benchmark (default: stack)" - echo " -d, --duration Duration in seconds (default: 60)" - echo " -c, --concurrent Number of concurrent users (default: 10)" - echo " -h, --help Show this help message" - echo "" - echo "Examples:" - echo " $0 --target vllm # Benchmark vLLM direct" - echo " $0 --target stack # Benchmark Llama Stack (default)" - echo " $0 -t vllm -d 120 -c 20 # vLLM with 120s duration, 20 users" -} - -while [[ $# -gt 0 ]]; do - case $1 in - -t|--target) - TARGET="$2" - shift 2 - ;; - -d|--duration) - DURATION="$2" - shift 2 - ;; - -c|--concurrent) - CONCURRENT="$2" - shift 2 - ;; - -h|--help) - usage - exit 0 - ;; - *) - echo "Unknown option: $1" - usage - exit 1 - ;; - esac -done - -# Validate target -if [[ "$TARGET" != "stack" && "$TARGET" != "vllm" ]]; then - echo "Error: Target must be 'stack' or 'vllm'" - usage - exit 1 -fi - -# Set configuration based on target -if [[ "$TARGET" == "vllm" ]]; then - BASE_URL="http://vllm-server:8000/v1" - JOB_NAME="vllm-benchmark-job" - echo "Benchmarking vLLM direct..." -else - BASE_URL="http://llama-stack-benchmark-service:8323/v1/openai/v1" - JOB_NAME="stack-benchmark-job" - echo "Benchmarking Llama Stack..." 
-fi - -echo "Configuration:" -echo " Target: $TARGET" -echo " Base URL: $BASE_URL" -echo " Duration: ${DURATION}s" -echo " Concurrent users: $CONCURRENT" -echo "" - -# Create temporary job yaml -TEMP_YAML="/tmp/benchmark-job-temp-$(date +%s).yaml" -cat > "$TEMP_YAML" << EOF -apiVersion: batch/v1 -kind: Job -metadata: - name: $JOB_NAME - namespace: default -spec: - template: - spec: - containers: - - name: benchmark - image: python:3.11-slim - command: ["/bin/bash"] - args: - - "-c" - - | - pip install aiohttp && - python3 /benchmark/benchmark.py \\ - --base-url $BASE_URL \\ - --model \${INFERENCE_MODEL} \\ - --duration $DURATION \\ - --concurrent $CONCURRENT - env: - - name: INFERENCE_MODEL - value: "meta-llama/Llama-3.2-3B-Instruct" - volumeMounts: - - name: benchmark-script - mountPath: /benchmark - resources: - requests: - memory: "256Mi" - cpu: "250m" - limits: - memory: "512Mi" - cpu: "500m" - volumes: - - name: benchmark-script - configMap: - name: benchmark-script - restartPolicy: Never - backoffLimit: 3 -EOF - -echo "Creating benchmark ConfigMap..." -kubectl create configmap benchmark-script \ - --from-file=benchmark.py=benchmark.py \ - --dry-run=client -o yaml | kubectl apply -f - - -echo "Cleaning up any existing benchmark job..." -kubectl delete job $JOB_NAME 2>/dev/null || true - -echo "Deploying benchmark Job..." -kubectl apply -f "$TEMP_YAML" - -echo "Waiting for job to start..." -kubectl wait --for=condition=Ready pod -l job-name=$JOB_NAME --timeout=60s - -echo "Following benchmark logs..." -kubectl logs -f job/$JOB_NAME - -echo "Job completed. Checking final status..." -kubectl get job $JOB_NAME - -# Clean up temporary file -rm -f "$TEMP_YAML" diff --git a/benchmarking/k8s-benchmark/scripts/generate_charts.py b/benchmarking/k8s-benchmark/scripts/generate_charts.py new file mode 100755 index 000000000..7b920fc04 --- /dev/null +++ b/benchmarking/k8s-benchmark/scripts/generate_charts.py @@ -0,0 +1,294 @@ +#!/usr/bin/env python3 +# Copyright (c) Meta Platforms, Inc. and affiliates. +# All rights reserved. +# +# This source code is licensed under the terms described in the LICENSE file in +# the root directory of this source tree. + +# /// script +# dependencies = [ +# "matplotlib", +# ] +# /// +""" +Script to generate benchmark charts from guidellm text results. +Creates 2x2 grid charts with RPS, Request Latency, TTFT, and ITL metrics against concurrent@x values. +Outputs one chart file per vLLM replica group, with each line representing one benchmark run. 
+""" + +import glob +import os +import re + +import matplotlib.pyplot as plt + + +def extract_setup_name(filename: str) -> str: + """Extract setup name from filename and format legend appropriately.""" + basename = os.path.basename(filename) + + # Try new pattern: guidellm-benchmark-stack-s{stack_replicas}-sw{workers}-v{vllm_replicas}-{timestamp}.txt + match = re.search(r"guidellm-benchmark-stack-s(\d+)-sw(\d+)-v(\d+)-(\d{8})-(\d{6})\.txt", basename) + if match: + stack_replicas = match.group(1) + workers = match.group(2) + vllm_replicas = match.group(3) + date = match.group(4) + time = match.group(5) + return f"stack-s{stack_replicas}-sw{workers}-v{vllm_replicas}" + + # Try new vLLM pattern: guidellm-benchmark-vllm-v{vllm_replicas}-{timestamp}.txt + match = re.search(r"guidellm-benchmark-vllm-v(\d+)-(\d{8})-(\d{6})\.txt", basename) + if match: + vllm_replicas = match.group(1) + date = match.group(2) + time = match.group(3) + return f"vllm-v{vllm_replicas}" + + # Fall back to old pattern: guidellm-benchmark-{target}-{stack_replicas}-w{workers}-{vllm_replicas}-{timestamp}.txt + match = re.search(r"guidellm-benchmark-([^-]+)-(\d+)-w(\d+)-(\d+)-(\d+)-(\d+)\.txt", basename) + if match: + target = match.group(1) + stack_replicas = match.group(2) + workers = match.group(3) + vllm_replicas = match.group(4) + date = match.group(5) + time = match.group(6) + + if target == "vllm": + return f"vllm-{vllm_replicas}-w{workers}-{vllm_replicas}" + else: + return f"stack-replicas{stack_replicas}-w{workers}-vllm-replicas{vllm_replicas}-{date}-{time}" + + # Fall back to older pattern: guidellm-benchmark-{target}-{stack_replicas}-{vllm_replicas}-{timestamp}.txt + match = re.search(r"guidellm-benchmark-([^-]+)-(\d+)-(\d+)-(\d+)-(\d+)\.txt", basename) + if match: + target = match.group(1) + stack_replicas = match.group(2) + vllm_replicas = match.group(3) + date = match.group(4) + time = match.group(5) + + if target == "vllm": + return f"vllm-{vllm_replicas}-w1-{vllm_replicas}" + else: + return f"stack-replicas{stack_replicas}-vllm-replicas{vllm_replicas}-{date}-{time}" + + return basename.replace("guidellm-benchmark-", "").replace(".txt", "") + + +def parse_txt_file(filepath: str) -> list[tuple[float, float, float, float, float, str]]: + """ + Parse a text benchmark file and extract concurrent@x, RPS, TTFT, ITL, and request latency data. + Returns list of (concurrency, rps_mean, ttft_mean, itl_mean, req_latency_mean, setup_name) tuples. 
+    """
+    setup_name = extract_setup_name(filepath)
+    data_points = []
+
+    try:
+        with open(filepath) as f:
+            content = f.read()
+
+        # Find the benchmark stats table
+        lines = content.split("\n")
+        in_stats_table = False
+        header_lines_seen = 0
+
+        for line in lines:
+            line_stripped = line.strip()
+
+            # Look for the start of the stats table
+            if "Benchmarks Stats:" in line:
+                in_stats_table = True
+                continue
+
+            if in_stats_table:
+                # Skip the first few separator/header lines
+                if line_stripped.startswith("=") or line_stripped.startswith("-"):
+                    header_lines_seen += 1
+                    if header_lines_seen >= 3:  # After seeing multiple header lines, look for concurrent@ data
+                        if line_stripped.startswith("=") and "concurrent@" not in line_stripped:
+                            break
+                    continue
+
+            # Parse concurrent@ lines in the stats table (may have leading spaces)
+            if in_stats_table and "concurrent@" in line:
+                parts = [part.strip() for part in line.split("|")]
+
+                if len(parts) >= 12:  # Make sure we have enough columns for the stats table format
+                    try:
+                        # Extract concurrency from benchmark name (e.g., concurrent@1 -> 1)
+                        concurrent_match = re.search(r"concurrent@(\d+)", parts[0])
+                        if not concurrent_match:
+                            continue
+                        concurrency = float(concurrent_match.group(1))
+
+                        # Extract the mean columns. Split on "|", the stats table columns are:
+                        # Benchmark | Per Second | Concurrency | Out Tok/sec | Tot Tok/sec |
+                        # Req Latency mean/median/p99 | TTFT mean/median/p99 | ITL mean/median/p99 | TPOT mean/median/p99
+                        rps_mean = float(parts[1])  # Per Second (RPS)
+                        req_latency_mean = float(parts[5]) * 1000  # Request latency mean (convert from sec to ms)
+                        ttft_mean = float(parts[8])  # TTFT mean column
+                        itl_mean = float(parts[11])  # ITL mean column
+
+                        data_points.append((concurrency, rps_mean, ttft_mean, itl_mean, req_latency_mean, setup_name))
+
+                    except (ValueError, IndexError) as e:
+                        print(f"Warning: Could not parse line '{line}' in {filepath}: {e}")
+                        continue
+
+    except (OSError, FileNotFoundError) as e:
+        print(f"Error reading {filepath}: {e}")
+
+    return data_points
+
+
+def generate_charts(benchmark_dir: str = "results"):
+    """Generate 2x2 grid charts (RPS, Request Latency, TTFT, ITL) from benchmark text files."""
+    # Find all text result files instead of JSON
+    txt_pattern = os.path.join(benchmark_dir, "guidellm-benchmark-*.txt")
+    txt_files = glob.glob(txt_pattern)
+
+    if not txt_files:
+        print(f"No text files found matching pattern: {txt_pattern}")
+        return
+
+    print(f"Found {len(txt_files)} text files")
+
+    # Parse all files and collect data
+    all_data = {}  # setup_name -> [(concurrency, rps, ttft, itl, req_latency), ...]
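+    # Results are grouped by the setup name parsed from each filename; every setup
+    # becomes one line on the generated charts.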
+ + for txt_file in txt_files: + print(f"Processing {txt_file}") + data_points = parse_txt_file(txt_file) + + for concurrency, rps, ttft, itl, req_latency, setup_name in data_points: + if setup_name not in all_data: + all_data[setup_name] = [] + all_data[setup_name].append((concurrency, rps, ttft, itl, req_latency)) + + if not all_data: + print("No data found to plot") + return + + # Sort data points by concurrency for each setup + for setup_name in all_data: + all_data[setup_name].sort(key=lambda x: x[0]) # Sort by concurrency + + # Group setups by vLLM replica number (original approach) + replica_groups = {} # vllm_replica_count -> {setup_name: points} + + for setup_name, points in all_data.items(): + # Extract vLLM replica number from setup name + # Expected formats: + # - New stack format: "stack-s{X}-sw{W}-v{Y}" + # - New vLLM format: "vllm-v{Y}" + # - Old formats: "stack-replicas{X}-w{W}-vllm-replicas{Y}" or "vllm-{Y}-w{W}-{Y}" + + # Try new formats first + vllm_match = re.search(r"-v(\d+)$", setup_name) # Matches both "stack-s1-sw2-v3" and "vllm-v1" + if not vllm_match: + # Try old stack format + vllm_match = re.search(r"vllm-replicas(\d+)", setup_name) + if not vllm_match: + # Try old vLLM format: "vllm-{Y}-w{W}-{Y}" + vllm_match = re.search(r"vllm-(\d+)-w\d+-\d+", setup_name) + + if vllm_match: + vllm_replica_num = int(vllm_match.group(1)) + if vllm_replica_num not in replica_groups: + replica_groups[vllm_replica_num] = {} + replica_groups[vllm_replica_num][setup_name] = points + else: + print(f"Warning: Could not extract vLLM replica count from setup name: {setup_name}") + + def create_charts(data_dict, prefix, title_prefix): + """Create a 2x2 grid with RPS, Request Latency, TTFT, and ITL charts.""" + if not data_dict: + print(f"No data found for {prefix}") + return + + # Create 2x2 subplot grid + fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12)) + fig.suptitle(f"{title_prefix} Benchmark Results", fontsize=16, fontweight="bold") + + # Collect all unique concurrency values for tick setting + all_concurrency_values = set() + for points in data_dict.values(): + all_concurrency_values.update([p[0] for p in points]) + all_concurrency_values = sorted(all_concurrency_values) + + # Plot data for each setup in alphabetical order + for setup_name in sorted(data_dict.keys()): + points = data_dict[setup_name] + if not points: + continue + + concurrency_values = [p[0] for p in points] + rps_values = [p[1] for p in points] + ttft_values = [p[2] for p in points] + itl_values = [p[3] for p in points] + req_latency_values = [p[4] for p in points] + + # RPS chart (top-left) + ax1.plot(concurrency_values, rps_values, marker="o", label=setup_name, linewidth=2, markersize=6) + + # Request Latency chart (top-right) + ax2.plot(concurrency_values, req_latency_values, marker="o", label=setup_name, linewidth=2, markersize=6) + + # TTFT chart (bottom-left) + ax3.plot(concurrency_values, ttft_values, marker="o", label=setup_name, linewidth=2, markersize=6) + + # ITL chart (bottom-right) + ax4.plot(concurrency_values, itl_values, marker="o", label=setup_name, linewidth=2, markersize=6) + + # Configure all charts after plotting data + axes = [ax1, ax2, ax3, ax4] + titles = ["RPS", "Request Latency", "TTFT", "ITL"] + ylabels = [ + "Requests Per Second (RPS)", + "Request Latency (ms)", + "Time to First Token (ms)", + "Inter Token Latency (ms)", + ] + + for ax, title, ylabel in zip(axes, titles, ylabels, strict=False): + ax.set_xlabel("Concurrency", fontsize=12) + ax.set_ylabel(ylabel, 
fontsize=12)
+            ax.set_title(title, fontsize=14, fontweight="bold")
+            ax.set_xscale("log", base=2)
+            ax.set_xticks(all_concurrency_values)
+            ax.set_xticklabels([str(int(x)) for x in all_concurrency_values])
+            ax.grid(True, alpha=0.3)
+
+        # Add legend to the right-most subplot (top-right)
+        ax2.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
+
+        plt.tight_layout()
+
+        # Save the combined chart
+        combined_filename = os.path.join(benchmark_dir, f"{prefix}_benchmark_results.png")
+        plt.savefig(combined_filename, dpi=300, bbox_inches="tight")
+        plt.close()
+        print(f"Combined benchmark chart saved to {combined_filename}")
+
+    # Print grouping information
+    for replica_count, data_dict in replica_groups.items():
+        print(f"vLLM Replica {replica_count} setups: {list(data_dict.keys())}")
+
+    # Create separate charts for each replica group
+    for replica_count, data_dict in replica_groups.items():
+        prefix = f"vllm_replica{replica_count}"
+        title = f"vLLM Replicas={replica_count}"
+        create_charts(data_dict, prefix, title)
+
+    # Print summary
+    print("\nSummary:")
+    for setup_name, points in all_data.items():
+        print(f"{setup_name}: {len(points)} data points")
+
+
+if __name__ == "__main__":
+    generate_charts()
diff --git a/benchmarking/k8s-benchmark/scripts/run-all-benchmarks.sh b/benchmarking/k8s-benchmark/scripts/run-all-benchmarks.sh
new file mode 100755
index 000000000..0a4a774c7
--- /dev/null
+++ b/benchmarking/k8s-benchmark/scripts/run-all-benchmarks.sh
@@ -0,0 +1,103 @@
+#!/usr/bin/env bash
+
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the terms described in the LICENSE file in
+# the root directory of this source tree.
+
+# Define benchmark configurations: (target, stack_replicas, vllm_replicas, stack_workers)
+configs=(
+    "stack 1 1 1"
+    "stack 1 1 2"
+    "stack 1 1 4"
+    "vllm 1 1 -"
+)
+
+set -euo pipefail
+
+# Get the directory where this script is located
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+echo "Running comprehensive GuideLLM benchmark suite..."
+echo "Start time: $(date)"
+
+# Default deployment names
+STACK_DEPLOYMENT="llama-stack-benchmark-server"
+VLLM_DEPLOYMENT="vllm-server"
+
+# Scaling function
+scale_deployments() {
+    local stack_replicas=$1
+    local vllm_replicas=$2
+    local workers=$3
+
+    echo "Scaling deployments..."
+
+    if [[ "$vllm_replicas" != "-" ]]; then
+        echo "Scaling $VLLM_DEPLOYMENT to $vllm_replicas replicas..."
+        kubectl scale deployment $VLLM_DEPLOYMENT --replicas=$vllm_replicas
+        kubectl rollout status deployment $VLLM_DEPLOYMENT --timeout=600s
+    fi
+
+    if [[ "$target" == "stack" ]]; then
+        if [[ "$stack_replicas" != "-" ]]; then
+            echo "Scaling $STACK_DEPLOYMENT to $stack_replicas replicas..."
+            kubectl scale deployment $STACK_DEPLOYMENT --replicas=$stack_replicas
+            kubectl rollout status deployment $STACK_DEPLOYMENT --timeout=600s
+        fi
+
+        if [[ "$workers" != "-" ]]; then
+            echo "Updating $STACK_DEPLOYMENT to use $workers workers..."
+            kubectl set env deployment/$STACK_DEPLOYMENT LLAMA_STACK_WORKERS=$workers
+            kubectl rollout status deployment $STACK_DEPLOYMENT --timeout=600s
+        fi
+    fi
+
+    echo "All scaling operations completed. Waiting additional 30s for services to stabilize..."
+ sleep 30 +} + + +for config in "${configs[@]}"; do + read -r target stack_replicas vllm_replicas workers <<< "$config" + + echo "" + echo "==========================================" + if [[ "$workers" != "-" ]]; then + echo "Running benchmark: $target (stack=$stack_replicas, vllm=$vllm_replicas, workers=$workers)" + else + echo "Running benchmark: $target (stack=$stack_replicas, vllm=$vllm_replicas)" + fi + echo "Start: $(date)" + echo "==========================================" + + # Scale deployments before running benchmark + scale_deployments "$stack_replicas" "$vllm_replicas" "$workers" + + # Generate output filename with setup info + TIMESTAMP=$(date +%Y%m%d-%H%M%S) + if [[ "$target" == "stack" ]]; then + OUTPUT_FILE="results/guidellm-benchmark-${target}-s${stack_replicas}-sw${workers}-v${vllm_replicas}-${TIMESTAMP}.txt" + else + OUTPUT_FILE="results/guidellm-benchmark-${target}-v${vllm_replicas}-${TIMESTAMP}.txt" + fi + + # Run the benchmark with the cluster as configured + "$SCRIPT_DIR/run-guidellm-benchmark.sh" \ + --target "$target" \ + --output-file "$OUTPUT_FILE" + + echo "Completed: $(date)" + echo "Waiting 30 seconds before next benchmark..." + sleep 30 +done + +echo "" +echo "==========================================" +echo "All benchmarks completed!" +echo "End time: $(date)" +echo "==========================================" +echo "" +echo "Results files generated:" +ls -la results/guidellm-*.txt results/guidellm-*.json 2>/dev/null || echo "No result files found" diff --git a/benchmarking/k8s-benchmark/scripts/run-guidellm-benchmark.sh b/benchmarking/k8s-benchmark/scripts/run-guidellm-benchmark.sh new file mode 100755 index 000000000..746eff391 --- /dev/null +++ b/benchmarking/k8s-benchmark/scripts/run-guidellm-benchmark.sh @@ -0,0 +1,219 @@ +#!/usr/bin/env bash + +# Copyright (c) Meta Platforms, Inc. and affiliates. +# All rights reserved. +# +# This source code is licensed under the terms described in the LICENSE file in +# the root directory of this source tree. 
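+
+# Runs a single GuideLLM benchmark Job against the current cluster state (no scaling is
+# performed) and saves the pod logs to a results file.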
+ +set -euo pipefail + +# Default values +TARGET="stack" +MAX_SECONDS=60 +PROMPT_TOKENS=512 +OUTPUT_TOKENS=256 +RATE_TYPE="concurrent" +RATE="1,2,4,8,16,32,64,128" +STACK_DEPLOYMENT="llama-stack-benchmark-server" +STACK_URL="http://llama-stack-benchmark-service:8323/v1/openai" +VLLM_DEPLOYMENT="vllm-server" +OUTPUT_FILE="" + +# Parse command line arguments +usage() { + echo "Usage: $0 [options]" + echo "Options:" + echo " -t, --target Target to benchmark (default: stack)" + echo " -s, --max-seconds Maximum duration in seconds (default: 60)" + echo " -p, --prompt-tokens Number of prompt tokens (default: 512)" + echo " -o, --output-tokens Number of output tokens (default: 256)" + echo " -r, --rate-type Rate type (default: concurrent)" + echo " -c, --rate Rate (default: 1,2,4,8,16,32,64,128)" + echo " --output-file Output file path (default: auto-generated)" + echo " --stack-deployment Name of the stack deployment (default: llama-stack-benchmark-server)" + echo " --vllm-deployment Name of the vllm deployment (default: vllm-server)" + echo " --stack-url URL of the stack service (default: http://llama-stack-benchmark-service:8323/v1/openai)" + echo " -h, --help Show this help message" + echo "" + echo "Examples:" + echo " $0 --target vllm # Benchmark vLLM direct" + echo " $0 --target stack # Benchmark Llama Stack (default)" + echo " $0 -t vllm -s 60 -p 512 -o 256 # vLLM with custom parameters" + echo " $0 --output-file results/my-benchmark.txt # Specify custom output file" + echo " $0 --stack-deployment my-stack-server # Use custom stack deployment name" +} + +while [[ $# -gt 0 ]]; do + case $1 in + -t|--target) + TARGET="$2" + shift 2 + ;; + -s|--max-seconds) + MAX_SECONDS="$2" + shift 2 + ;; + -p|--prompt-tokens) + PROMPT_TOKENS="$2" + shift 2 + ;; + -o|--output-tokens) + OUTPUT_TOKENS="$2" + shift 2 + ;; + -r|--rate-type) + RATE_TYPE="$2" + shift 2 + ;; + -c|--rate) + RATE="$2" + shift 2 + ;; + --output-file) + OUTPUT_FILE="$2" + shift 2 + ;; + --stack-deployment) + STACK_DEPLOYMENT="$2" + shift 2 + ;; + --vllm-deployment) + VLLM_DEPLOYMENT="$2" + shift 2 + ;; + --stack-url) + STACK_URL="$2" + shift 2 + ;; + -h|--help) + usage + exit 0 + ;; + *) + echo "Unknown option: $1" + usage + exit 1 + ;; + esac +done + +# Validate target +if [[ "$TARGET" != "stack" && "$TARGET" != "vllm" ]]; then + echo "Error: Target must be 'stack' or 'vllm'" + usage + exit 1 +fi + +# Set configuration based on target +if [[ "$TARGET" == "vllm" ]]; then + BASE_URL="http://${VLLM_DEPLOYMENT}:8000" + JOB_NAME="guidellm-vllm-benchmark-job" + echo "Benchmarking vLLM direct with GuideLLM..." +else + BASE_URL="$STACK_URL" + JOB_NAME="guidellm-stack-benchmark-job" + echo "Benchmarking Llama Stack with GuideLLM..." 
+fi
+
+
+echo "Configuration:"
+echo "  Target: $TARGET"
+echo "  Base URL: $BASE_URL"
+echo "  Max seconds: ${MAX_SECONDS}s"
+echo "  Prompt tokens: $PROMPT_TOKENS"
+echo "  Output tokens: $OUTPUT_TOKENS"
+echo "  Rate type: $RATE_TYPE"
+if [[ "$TARGET" == "vllm" ]]; then
+    echo "  vLLM deployment: $VLLM_DEPLOYMENT"
+else
+    echo "  Stack deployment: $STACK_DEPLOYMENT"
+fi
+echo ""
+
+# Create temporary job yaml
+TEMP_YAML="/tmp/guidellm-benchmark-job-temp-$(date +%s).yaml"
+cat > "$TEMP_YAML" << EOF
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: $JOB_NAME
+  namespace: default
+spec:
+  template:
+    spec:
+      containers:
+      - name: guidellm-benchmark
+        image: python:3.11-slim
+        command: ["/bin/bash"]
+        args:
+        - "-c"
+        - |
+          # Install uv and guidellm
+          pip install uv &&
+          uv pip install --system guidellm &&
+
+          # Login to HuggingFace
+          uv pip install --system huggingface_hub &&
+          python -c "from huggingface_hub import login; login(token='\$HF_TOKEN')" &&
+
+          # Run GuideLLM benchmark and save output (INFERENCE_MODEL is injected via the pod env)
+          export COLUMNS=200
+          GUIDELLM__PREFERRED_ROUTE="chat_completions" uv run guidellm benchmark run \\
+            --target "$BASE_URL" \\
+            --rate-type "$RATE_TYPE" \\
+            --max-seconds $MAX_SECONDS \\
+            --data "prompt_tokens=$PROMPT_TOKENS,output_tokens=$OUTPUT_TOKENS" \\
+            --model "\$INFERENCE_MODEL" \\
+            --rate "$RATE" \\
+            --warmup-percent 0.05 \\
+            2>&1
+        env:
+        - name: INFERENCE_MODEL
+          value: "meta-llama/Llama-3.2-3B-Instruct"
+        - name: HF_TOKEN
+          valueFrom:
+            secretKeyRef:
+              name: hf-token-secret
+              key: token
+        resources:
+          requests:
+            memory: "4Gi"
+            cpu: "500m"
+          limits:
+            memory: "8Gi"
+            cpu: "2000m"
+      restartPolicy: Never
+  backoffLimit: 3
+EOF
+
+echo "Cleaning up any existing GuideLLM benchmark job..."
+kubectl delete job $JOB_NAME 2>/dev/null || true
+
+echo "Deploying GuideLLM benchmark Job..."
+kubectl apply -f "$TEMP_YAML"
+
+echo "Waiting for job to start..."
+kubectl wait --for=condition=Ready pod -l job-name=$JOB_NAME --timeout=120s
+
+# Prepare file names and create results directory
+mkdir -p results
+if [[ -z "$OUTPUT_FILE" ]]; then
+    TIMESTAMP=$(date +%Y%m%d-%H%M%S)
+    OUTPUT_FILE="results/guidellm-benchmark-${TARGET}-${TIMESTAMP}.txt"
+fi
+
+echo "Following GuideLLM benchmark logs..."
+kubectl logs -f job/$JOB_NAME
+
+echo "Job completed. Checking final status..."
+kubectl get job $JOB_NAME
+
+# Save benchmark results using kubectl logs
+echo "Saving benchmark results..."
+kubectl logs job/$JOB_NAME > "$OUTPUT_FILE" + +echo "Benchmark output saved to: $OUTPUT_FILE" + +# Clean up temporary file +rm -f "$TEMP_YAML" diff --git a/benchmarking/k8s-benchmark/stack-k8s.yaml.template b/benchmarking/k8s-benchmark/stack-k8s.yaml.template index 8842c0bea..54eeadcad 100644 --- a/benchmarking/k8s-benchmark/stack-k8s.yaml.template +++ b/benchmarking/k8s-benchmark/stack-k8s.yaml.template @@ -58,14 +58,14 @@ spec: value: "/etc/config/stack_run_config.yaml" - name: LLAMA_STACK_WORKERS value: "${LLAMA_STACK_WORKERS}" - command: ["uvicorn", "llama_stack.core.server.server:create_app", "--host", "0.0.0.0", "--port", "8323", "--workers", "$LLAMA_STACK_WORKERS", "--factory"] + command: ["uvicorn", "llama_stack.core.server.server:create_app", "--host", "0.0.0.0", "--port", "8323", "--workers", "$(LLAMA_STACK_WORKERS)", "--factory"] ports: - containerPort: 8323 resources: requests: - cpu: "${LLAMA_STACK_WORKERS}" + cpu: "4" limits: - cpu: "${LLAMA_STACK_WORKERS}" + cpu: "4" volumeMounts: - name: llama-storage mountPath: /root/.llama diff --git a/pyproject.toml b/pyproject.toml index ecbd8991a..86a32f978 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -177,6 +177,7 @@ exclude = [ ".pre-commit-config.yaml", "*.md", ".flake8", + "benchmarking/k8s-benchmark/results", ] [tool.ruff.lint]