Mirror of https://github.com/meta-llama/llama-stack.git (synced 2025-10-03 19:57:35 +00:00)
chore(perf): run guidellm benchmarks (#3421)
Some checks failed
Integration Tests (Replay) / Integration Tests (, , , client=, ) (push) Failing after 3s
Unit Tests / unit-tests (3.13) (push) Failing after 3s
Update ReadTheDocs / update-readthedocs (push) Failing after 3s
Test Llama Stack Build / build (push) Failing after 3s
Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 1s
SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 1s
SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 0s
Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped
Test Llama Stack Build / generate-matrix (push) Successful in 3s
Python Package Build Test / build (3.12) (push) Failing after 1s
Python Package Build Test / build (3.13) (push) Failing after 2s
Test Llama Stack Build / build-custom-container-distribution (push) Failing after 3s
Test Llama Stack Build / build-single-provider (push) Failing after 3s
Vector IO Integration Tests / test-matrix (push) Failing after 5s
Test Llama Stack Build / build-ubi9-container-distribution (push) Failing after 4s
API Conformance Tests / check-schema-compatibility (push) Successful in 8s
Test External API and Providers / test-external (venv) (push) Failing after 3s
Unit Tests / unit-tests (3.12) (push) Failing after 4s
UI Tests / ui-tests (22) (push) Successful in 40s
Pre-commit / pre-commit (push) Successful in 1m9s
# What does this PR do?
- Mostly AI-generated scripts to run guidellm (https://github.com/vllm-project/guidellm) benchmarks on k8s setup
- Stack is using image built from main on 9/11

## Test Plan
See updated README.md
This commit is contained in: parent 2f58d87c22, commit 48a551ecbc.
14 changed files with 1436 additions and 526 deletions
@@ -26,6 +26,7 @@ The benchmark suite measures critical performance indicators:
- **Throughput**: Requests per second under sustained load
- **Latency Distribution**: P50, P95, P99 response times
- **Time to First Token (TTFT)**: Critical for streaming applications
- **Inter-Token Latency (ITL)**: Token generation speed for streaming
- **Error Rates**: Request failures and timeout analysis

This data enables data-driven architectural decisions and performance optimization efforts.
@@ -49,49 +50,148 @@ kubectl get pods
# Should see: llama-stack-benchmark-server, vllm-server, etc.
```

## Benchmark Results

We use [GuideLLM](https://github.com/neuralmagic/guidellm) against our k8s deployment for comprehensive performance testing.

### Performance - 1 vLLM Replica

We vary the number of Llama Stack replicas with 1 vLLM replica and compare performance below.



For full results see the `benchmarking/k8s-benchmark/results/` directory.
## Quick Start

### Basic Benchmarks
Follow the instructions below to run benchmarks similar to the ones above.

**Benchmark Llama Stack (default):**
### Comprehensive Benchmark Suite

**Run all benchmarks with different cluster configurations:**
```bash
./run-benchmark.sh
./scripts/run-all-benchmarks.sh
```

**Benchmark vLLM direct:**
This script will automatically:
- Scale deployments to different configurations
- Run benchmarks for each setup
- Generate output files with meaningful names that include setup information

### Individual Benchmarks

**Benchmark Llama Stack (runs against current cluster setup):**
```bash
./run-benchmark.sh --target vllm
./scripts/run-guidellm-benchmark.sh --target stack
```

### Custom Configuration

**Extended benchmark with high concurrency:**
**Benchmark vLLM direct (runs against current cluster setup):**
```bash
./run-benchmark.sh --target vllm --duration 120 --concurrent 20
./scripts/run-guidellm-benchmark.sh --target vllm
```

**Short test run:**
**Benchmark with custom parameters:**
```bash
./run-benchmark.sh --target stack --duration 30 --concurrent 5
./scripts/run-guidellm-benchmark.sh --target stack --max-seconds 120 --prompt-tokens 1024 --output-tokens 512
```

**Benchmark with custom output file:**
```bash
./scripts/run-guidellm-benchmark.sh --target stack --output-file results/my-custom-benchmark.txt
```
### Generating Charts

Once the benchmarks are run, you can generate performance charts from benchmark results:

```bash
uv run ./scripts/generate_charts.py
```

This loads the runs in the `results/` directory and creates visualizations comparing different configurations and replica counts.
## Benchmark Workflow

The benchmark suite is organized into two main scripts with distinct responsibilities:

### 1. `run-all-benchmarks.sh` - Orchestration & Scaling
- **Purpose**: Manages different cluster configurations and orchestrates benchmark runs
- **Responsibilities**:
  - Scales Kubernetes deployments (vLLM replicas, Stack replicas, worker counts)
  - Runs benchmarks for each configuration
  - Generates meaningful output filenames with setup information
- **Use case**: Running comprehensive performance testing across multiple configurations

### 2. `run-guidellm-benchmark.sh` - Single Benchmark Execution
- **Purpose**: Executes a single benchmark against the current cluster state
- **Responsibilities**:
  - Runs GuideLLM benchmark with configurable parameters
  - Accepts custom output file paths
  - No cluster scaling - benchmarks current deployment state
- **Use case**: Testing specific configurations or custom scenarios

### Typical Workflow
1. **Comprehensive Testing**: Use `run-all-benchmarks.sh` to automatically test multiple configurations
2. **Custom Testing**: Use `run-guidellm-benchmark.sh` for specific parameter testing or manual cluster configurations
3. **Analysis**: Use `generate_charts.py` to visualize results from either approach
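Taken together, a full pass is just the two commands already shown above:

```bash
# Scale through the configured setups, benchmark each one, and write reports into results/
./scripts/run-all-benchmarks.sh
# Compare the collected runs across configurations and replica counts
uv run ./scripts/generate_charts.py
```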
## Command Reference

### run-benchmark.sh Options
### run-all-benchmarks.sh

Orchestrates multiple benchmark runs with different cluster configurations. This script:
- Automatically scales deployments before each benchmark
- Runs benchmarks against the configured cluster setup
- Generates meaningfully named output files

```bash
./run-benchmark.sh [options]
./scripts/run-all-benchmarks.sh
```

**Configuration**: Edit the `configs` array in the script to customize benchmark configurations:
```bash
# Each line: (target, stack_replicas, vllm_replicas, stack_workers)
configs=(
  "stack 1 1 1"
  "stack 1 1 2"
  "stack 1 1 4"
  "vllm 1 1 -"
)
```

**Output files**: Generated with setup information in the filename:
- Stack: `guidellm-benchmark-stack-s{replicas}-sw{workers}-v{vllm_replicas}-{timestamp}.txt`
- vLLM: `guidellm-benchmark-vllm-v{vllm_replicas}-{timestamp}.txt`
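The script's internals aren't reproduced in this README, but conceptually each `configs` entry drives a scale-then-benchmark cycle. A minimal sketch of that loop, assuming the default deployment names documented below and omitting the worker-count handling, could look like:

```bash
# Hypothetical sketch only; see scripts/run-all-benchmarks.sh for the real logic.
for config in "${configs[@]}"; do
  read -r target stack_replicas vllm_replicas stack_workers <<< "$config"

  # Scale the deployments to the requested replica counts and wait for rollout.
  kubectl scale deployment/vllm-server --replicas="$vllm_replicas"
  kubectl scale deployment/llama-stack-benchmark-server --replicas="$stack_replicas"
  kubectl rollout status deployment/vllm-server
  kubectl rollout status deployment/llama-stack-benchmark-server

  # Run a single benchmark against the freshly scaled cluster, with a descriptive report name.
  ./scripts/run-guidellm-benchmark.sh --target "$target" \
    --output-file "results/guidellm-benchmark-${target}-v${vllm_replicas}-$(date +%Y%m%d-%H%M%S).txt"
done
```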
### run-guidellm-benchmark.sh Options

Runs a single benchmark against the current cluster setup (no scaling).

```bash
./scripts/run-guidellm-benchmark.sh [options]

Options:
  -t, --target <stack|vllm>     Target to benchmark (default: stack)
  -d, --duration <seconds>      Duration in seconds (default: 60)
  -c, --concurrent <users>      Number of concurrent users (default: 10)
  -s, --max-seconds <seconds>   Maximum duration in seconds (default: 60)
  -p, --prompt-tokens <tokens>  Number of prompt tokens (default: 512)
  -o, --output-tokens <tokens>  Number of output tokens (default: 256)
  -r, --rate-type <type>        Rate type (default: concurrent)
  -c, --rate                    Rate (default: 1,2,4,8,16,32,64,128)
  --output-file <path>          Output file path (default: auto-generated)
  --stack-deployment <name>     Name of the stack deployment (default: llama-stack-benchmark-server)
  --vllm-deployment <name>      Name of the vllm deployment (default: vllm-server)
  --stack-url <url>             URL of the stack service (default: http://llama-stack-benchmark-service:8323/v1/openai)
  -h, --help                    Show help message

Examples:
  ./run-benchmark.sh --target vllm                                            # Benchmark vLLM direct
  ./run-benchmark.sh --target stack                                           # Benchmark Llama Stack
  ./run-benchmark.sh -t vllm -d 120 -c 20                                     # vLLM with 120s, 20 users
  ./scripts/run-guidellm-benchmark.sh --target vllm                           # Benchmark vLLM direct
  ./scripts/run-guidellm-benchmark.sh --target stack                          # Benchmark Llama Stack (default)
  ./scripts/run-guidellm-benchmark.sh -t vllm -s 60 -p 512 -o 256             # vLLM with custom parameters
  ./scripts/run-guidellm-benchmark.sh --output-file results/my-benchmark.txt  # Specify custom output file
  ./scripts/run-guidellm-benchmark.sh --stack-deployment my-stack-server      # Use custom stack deployment name
```
## Local Testing

@@ -100,55 +200,30 @@ Examples:

For local development without Kubernetes:

**1. Start OpenAI mock server:**
```bash
uv run python openai-mock-server.py --port 8080
```

**2. Run benchmark against mock server:**
```bash
uv run python benchmark.py \
  --base-url http://localhost:8080/v1 \
  --model mock-inference \
  --duration 30 \
  --concurrent 5
```

**3. Test against local vLLM server:**
```bash
# If you have vLLM running locally on port 8000
uv run python benchmark.py \
  --base-url http://localhost:8000/v1 \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --duration 30 \
  --concurrent 5
```

**4. Profile the running server:**
```bash
./profile_running_server.sh
```

### OpenAI Mock Server
**1. (Optional) Start Mock OpenAI server:**

There is a simple mock OpenAI server if you don't have an inference provider available.
The `openai-mock-server.py` provides:
- **OpenAI-compatible API** for testing without real models
- **Configurable streaming delay** via `STREAM_DELAY_SECONDS` env var
- **Consistent responses** for reproducible benchmarks
- **Lightweight testing** without GPU requirements

**Mock server usage:**
```bash
uv run python openai-mock-server.py --port 8080
```
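To slow down streamed tokens and make TTFT/ITL effects easier to observe, the delay can be overridden through the environment variable listed above (the value here is only illustrative):

```bash
# Illustrative value; controls the mock server's streaming delay.
STREAM_DELAY_SECONDS=0.05 uv run python openai-mock-server.py --port 8080
```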
The mock server is also deployed in k8s as `openai-mock-service:8080` and can be used by changing the Llama Stack configuration to use the `mock-vllm-inference` provider.

**2. Start Stack server:**
```bash
LLAMA_STACK_CONFIG=benchmarking/k8s-benchmark/stack_run_config.yaml uv run uvicorn llama_stack.core.server.server:create_app --port 8321 --workers 4 --factory
```
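Before pointing a benchmark at it, you can sanity-check that the server is up and serving the OpenAI-compatible route used elsewhere in this README (endpoint path assumed from the GuideLLM target below):

```bash
# Expect the configured model (e.g. meta-llama/Llama-3.2-3B-Instruct) in the listing.
curl http://localhost:8321/v1/openai/v1/models
```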
## Files in this Directory

- `benchmark.py` - Core benchmark script with async streaming support
- `run-benchmark.sh` - Main script with target selection and configuration
- `openai-mock-server.py` - Mock OpenAI API server for local testing
- `README.md` - This documentation file

**3. Run GuideLLM benchmark:**
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" uv run guidellm benchmark run \
  --target "http://localhost:8321/v1/openai/v1" \
  --model "meta-llama/Llama-3.2-3B-Instruct" \
  --rate-type sweep \
  --max-seconds 60 \
  --data "prompt_tokens=256,output_tokens=128" --output-path='output.html'
```
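The `--output-path` above writes the report to `output.html`; once the sweep finishes, open it in a browser to inspect the per-concurrency results:

```bash
open output.html   # macOS; use xdg-open on Linux
```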
|
@ -1,265 +0,0 @@
|
|||
# Copyright (c) Meta Platforms, Inc. and affiliates.
|
||||
# All rights reserved.
|
||||
#
|
||||
# This source code is licensed under the terms described in the LICENSE file in
|
||||
# the root directory of this source tree.
|
||||
|
||||
"""
|
||||
Simple benchmark script for Llama Stack with OpenAI API compatibility.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import asyncio
|
||||
import os
|
||||
import random
|
||||
import statistics
|
||||
import time
|
||||
|
||||
import aiohttp
|
||||
|
||||
|
||||
class BenchmarkStats:
|
||||
def __init__(self):
|
||||
self.response_times = []
|
||||
self.ttft_times = []
|
||||
self.chunks_received = []
|
||||
self.errors = []
|
||||
self.success_count = 0
|
||||
self.total_requests = 0
|
||||
self.concurrent_users = 0
|
||||
self.start_time = None
|
||||
self.end_time = None
|
||||
self._lock = asyncio.Lock()
|
||||
|
||||
async def add_result(self, response_time: float, chunks: int, ttft: float = None, error: str = None):
|
||||
async with self._lock:
|
||||
self.total_requests += 1
|
||||
if error:
|
||||
self.errors.append(error)
|
||||
else:
|
||||
self.success_count += 1
|
||||
self.response_times.append(response_time)
|
||||
self.chunks_received.append(chunks)
|
||||
if ttft is not None:
|
||||
self.ttft_times.append(ttft)
|
||||
|
||||
def print_summary(self):
|
||||
if not self.response_times:
|
||||
print("No successful requests to report")
|
||||
if self.errors:
|
||||
print(f"Total errors: {len(self.errors)}")
|
||||
print("First 5 errors:")
|
||||
for error in self.errors[:5]:
|
||||
print(f" {error}")
|
||||
return
|
||||
|
||||
total_time = self.end_time - self.start_time
|
||||
success_rate = (self.success_count / self.total_requests) * 100
|
||||
|
||||
print(f"\n{'=' * 60}")
|
||||
print("BENCHMARK RESULTS")
|
||||
|
||||
print("\nResponse Time Statistics:")
|
||||
print(f" Mean: {statistics.mean(self.response_times):.3f}s")
|
||||
print(f" Median: {statistics.median(self.response_times):.3f}s")
|
||||
print(f" Min: {min(self.response_times):.3f}s")
|
||||
print(f" Max: {max(self.response_times):.3f}s")
|
||||
|
||||
if len(self.response_times) > 1:
|
||||
print(f" Std Dev: {statistics.stdev(self.response_times):.3f}s")
|
||||
|
||||
percentiles = [50, 90, 95, 99]
|
||||
sorted_times = sorted(self.response_times)
|
||||
print("\nPercentiles:")
|
||||
for p in percentiles:
|
||||
idx = int(len(sorted_times) * p / 100) - 1
|
||||
idx = max(0, min(idx, len(sorted_times) - 1))
|
||||
print(f" P{p}: {sorted_times[idx]:.3f}s")
|
||||
|
||||
if self.ttft_times:
|
||||
print("\nTime to First Token (TTFT) Statistics:")
|
||||
print(f" Mean: {statistics.mean(self.ttft_times):.3f}s")
|
||||
print(f" Median: {statistics.median(self.ttft_times):.3f}s")
|
||||
print(f" Min: {min(self.ttft_times):.3f}s")
|
||||
print(f" Max: {max(self.ttft_times):.3f}s")
|
||||
|
||||
if len(self.ttft_times) > 1:
|
||||
print(f" Std Dev: {statistics.stdev(self.ttft_times):.3f}s")
|
||||
|
||||
sorted_ttft = sorted(self.ttft_times)
|
||||
print("\nTTFT Percentiles:")
|
||||
for p in percentiles:
|
||||
idx = int(len(sorted_ttft) * p / 100) - 1
|
||||
idx = max(0, min(idx, len(sorted_ttft) - 1))
|
||||
print(f" P{p}: {sorted_ttft[idx]:.3f}s")
|
||||
|
||||
if self.chunks_received:
|
||||
print("\nStreaming Statistics:")
|
||||
print(f" Mean chunks per response: {statistics.mean(self.chunks_received):.1f}")
|
||||
print(f" Total chunks received: {sum(self.chunks_received)}")
|
||||
|
||||
print(f"{'=' * 60}")
|
||||
print(f"Total time: {total_time:.2f}s")
|
||||
print(f"Concurrent users: {self.concurrent_users}")
|
||||
print(f"Total requests: {self.total_requests}")
|
||||
print(f"Successful requests: {self.success_count}")
|
||||
print(f"Failed requests: {len(self.errors)}")
|
||||
print(f"Success rate: {success_rate:.1f}%")
|
||||
print(f"Requests per second: {self.success_count / total_time:.2f}")
|
||||
|
||||
if self.errors:
|
||||
print("\nErrors (showing first 5):")
|
||||
for error in self.errors[:5]:
|
||||
print(f" {error}")
|
||||
|
||||
|
||||
class LlamaStackBenchmark:
|
||||
def __init__(self, base_url: str, model_id: str):
|
||||
self.base_url = base_url.rstrip("/")
|
||||
self.model_id = model_id
|
||||
self.headers = {"Content-Type": "application/json"}
|
||||
self.test_messages = [
|
||||
[{"role": "user", "content": "Hi"}],
|
||||
[{"role": "user", "content": "What is the capital of France?"}],
|
||||
[{"role": "user", "content": "Explain quantum physics in simple terms."}],
|
||||
[{"role": "user", "content": "Write a short story about a robot learning to paint."}],
|
||||
[
|
||||
{"role": "user", "content": "What is machine learning?"},
|
||||
{"role": "assistant", "content": "Machine learning is a subset of AI..."},
|
||||
{"role": "user", "content": "Can you give me a practical example?"},
|
||||
],
|
||||
]
|
||||
|
||||
async def make_async_streaming_request(self) -> tuple[float, int, float | None, str | None]:
|
||||
"""Make a single async streaming chat completion request."""
|
||||
messages = random.choice(self.test_messages)
|
||||
payload = {"model": self.model_id, "messages": messages, "stream": True, "max_tokens": 100}
|
||||
|
||||
start_time = time.time()
|
||||
chunks_received = 0
|
||||
ttft = None
|
||||
error = None
|
||||
|
||||
session = aiohttp.ClientSession()
|
||||
|
||||
try:
|
||||
async with session.post(
|
||||
f"{self.base_url}/chat/completions",
|
||||
headers=self.headers,
|
||||
json=payload,
|
||||
timeout=aiohttp.ClientTimeout(total=30),
|
||||
) as response:
|
||||
if response.status == 200:
|
||||
async for line in response.content:
|
||||
if line:
|
||||
line_str = line.decode("utf-8").strip()
|
||||
if line_str.startswith("data: "):
|
||||
chunks_received += 1
|
||||
if ttft is None:
|
||||
ttft = time.time() - start_time
|
||||
if line_str == "data: [DONE]":
|
||||
break
|
||||
|
||||
if chunks_received == 0:
|
||||
error = "No streaming chunks received"
|
||||
else:
|
||||
text = await response.text()
|
||||
error = f"HTTP {response.status}: {text[:100]}"
|
||||
|
||||
except Exception as e:
|
||||
error = f"Request error: {str(e)}"
|
||||
finally:
|
||||
await session.close()
|
||||
|
||||
response_time = time.time() - start_time
|
||||
return response_time, chunks_received, ttft, error
|
||||
|
||||
async def run_benchmark(self, duration: int, concurrent_users: int) -> BenchmarkStats:
|
||||
"""Run benchmark using async requests for specified duration."""
|
||||
stats = BenchmarkStats()
|
||||
stats.concurrent_users = concurrent_users
|
||||
stats.start_time = time.time()
|
||||
|
||||
print(f"Starting benchmark: {duration}s duration, {concurrent_users} concurrent users")
|
||||
print(f"Target URL: {self.base_url}/chat/completions")
|
||||
print(f"Model: {self.model_id}")
|
||||
|
||||
connector = aiohttp.TCPConnector(limit=concurrent_users)
|
||||
async with aiohttp.ClientSession(connector=connector):
|
||||
|
||||
async def worker(worker_id: int):
|
||||
"""Worker that sends requests sequentially until canceled."""
|
||||
request_count = 0
|
||||
while True:
|
||||
try:
|
||||
response_time, chunks, ttft, error = await self.make_async_streaming_request()
|
||||
await stats.add_result(response_time, chunks, ttft, error)
|
||||
request_count += 1
|
||||
|
||||
except asyncio.CancelledError:
|
||||
break
|
||||
except Exception as e:
|
||||
await stats.add_result(0, 0, None, f"Worker {worker_id} error: {str(e)}")
|
||||
|
||||
# Progress reporting task
|
||||
async def progress_reporter():
|
||||
last_report_time = time.time()
|
||||
while True:
|
||||
try:
|
||||
await asyncio.sleep(1) # Report every second
|
||||
if time.time() >= last_report_time + 10: # Report every 10 seconds
|
||||
elapsed = time.time() - stats.start_time
|
||||
print(
|
||||
f"Completed: {stats.total_requests} requests in {elapsed:.1f}s, RPS: {stats.total_requests / elapsed:.1f}"
|
||||
)
|
||||
last_report_time = time.time()
|
||||
except asyncio.CancelledError:
|
||||
break
|
||||
|
||||
# Spawn concurrent workers
|
||||
tasks = [asyncio.create_task(worker(i)) for i in range(concurrent_users)]
|
||||
progress_task = asyncio.create_task(progress_reporter())
|
||||
tasks.append(progress_task)
|
||||
|
||||
# Wait for duration then cancel all tasks
|
||||
await asyncio.sleep(duration)
|
||||
|
||||
for task in tasks:
|
||||
task.cancel()
|
||||
|
||||
# Wait for all tasks to complete
|
||||
await asyncio.gather(*tasks, return_exceptions=True)
|
||||
|
||||
stats.end_time = time.time()
|
||||
return stats
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Llama Stack Benchmark Tool")
|
||||
parser.add_argument(
|
||||
"--base-url",
|
||||
default=os.getenv("BENCHMARK_BASE_URL", "http://localhost:8000/v1/openai/v1"),
|
||||
help="Base URL for the API (default: http://localhost:8000/v1/openai/v1)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--model", default=os.getenv("INFERENCE_MODEL", "test-model"), help="Model ID to use for requests"
|
||||
)
|
||||
parser.add_argument("--duration", type=int, default=60, help="Duration in seconds to run benchmark (default: 60)")
|
||||
parser.add_argument("--concurrent", type=int, default=10, help="Number of concurrent users (default: 10)")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
benchmark = LlamaStackBenchmark(args.base_url, args.model)
|
||||
|
||||
try:
|
||||
stats = asyncio.run(benchmark.run_benchmark(args.duration, args.concurrent))
|
||||
stats.print_summary()
|
||||
|
||||
except KeyboardInterrupt:
|
||||
print("\nBenchmark interrupted by user")
|
||||
except Exception as e:
|
||||
print(f"Benchmark failed: {e}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
|
@@ -1,52 +0,0 @@
#!/bin/bash

# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

# Script to profile an already running Llama Stack server
# Usage: ./profile_running_server.sh [duration_seconds] [output_file]

DURATION=${1:-60}  # Default 60 seconds
OUTPUT_FILE=${2:-"llama_stack_profile"}  # Default output file

echo "Looking for running Llama Stack server..."

# Find the server PID
SERVER_PID=$(ps aux | grep "llama_stack.core.server.server" | grep -v grep | awk '{print $2}' | head -1)

if [ -z "$SERVER_PID" ]; then
    echo "Error: No running Llama Stack server found"
    echo "Please start your server first with:"
    echo "LLAMA_STACK_LOGGING=\"all=ERROR\" MOCK_INFERENCE_URL=http://localhost:8080 SAFETY_MODEL=llama-guard3:1b uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml"
    exit 1
fi

echo "Found Llama Stack server with PID: $SERVER_PID"

# Start py-spy profiling
echo "Starting py-spy profiling for ${DURATION} seconds..."
echo "Output will be saved to: ${OUTPUT_FILE}.svg"
echo ""
echo "You can now run your load test..."
echo ""

# Get the full path to py-spy
PYSPY_PATH=$(which py-spy)

# Check if running as root, if not, use sudo
if [ "$EUID" -ne 0 ]; then
    echo "py-spy requires root permissions on macOS. Running with sudo..."
    sudo "$PYSPY_PATH" record -o "${OUTPUT_FILE}.svg" -d ${DURATION} -p $SERVER_PID
else
    "$PYSPY_PATH" record -o "${OUTPUT_FILE}.svg" -d ${DURATION} -p $SERVER_PID
fi

echo ""
echo "Profiling completed! Results saved to: ${OUTPUT_FILE}.svg"
echo ""
echo "To view the flame graph:"
echo "open ${OUTPUT_FILE}.svg"
@ -0,0 +1,171 @@
|
|||
Collecting uv
|
||||
Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
|
||||
Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.9 MB)
|
||||
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 144.3 MB/s eta 0:00:00
|
||||
Installing collected packages: uv
|
||||
Successfully installed uv-0.8.19
|
||||
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
|
||||
|
||||
[notice] A new release of pip is available: 24.0 -> 25.2
|
||||
[notice] To update, run: pip install --upgrade pip
|
||||
Using Python 3.11.13 environment at: /usr/local
|
||||
Resolved 61 packages in 551ms
|
||||
Downloading pillow (6.3MiB)
|
||||
Downloading hf-xet (3.0MiB)
|
||||
Downloading tokenizers (3.1MiB)
|
||||
Downloading pygments (1.2MiB)
|
||||
Downloading pandas (11.8MiB)
|
||||
Downloading aiohttp (1.7MiB)
|
||||
Downloading pydantic-core (1.9MiB)
|
||||
Downloading numpy (16.2MiB)
|
||||
Downloading transformers (11.1MiB)
|
||||
Downloading pyarrow (40.8MiB)
|
||||
Downloading pydantic-core
|
||||
Downloading aiohttp
|
||||
Downloading tokenizers
|
||||
Downloading hf-xet
|
||||
Downloading pygments
|
||||
Downloading pillow
|
||||
Downloading numpy
|
||||
Downloading pandas
|
||||
Downloading transformers
|
||||
Downloading pyarrow
|
||||
Prepared 61 packages in 1.23s
|
||||
Installed 61 packages in 114ms
|
||||
+ aiohappyeyeballs==2.6.1
|
||||
+ aiohttp==3.12.15
|
||||
+ aiosignal==1.4.0
|
||||
+ annotated-types==0.7.0
|
||||
+ anyio==4.10.0
|
||||
+ attrs==25.3.0
|
||||
+ certifi==2025.8.3
|
||||
+ charset-normalizer==3.4.3
|
||||
+ click==8.1.8
|
||||
+ datasets==4.1.1
|
||||
+ dill==0.4.0
|
||||
+ filelock==3.19.1
|
||||
+ frozenlist==1.7.0
|
||||
+ fsspec==2025.9.0
|
||||
+ ftfy==6.3.1
|
||||
+ guidellm==0.3.0
|
||||
+ h11==0.16.0
|
||||
+ h2==4.3.0
|
||||
+ hf-xet==1.1.10
|
||||
+ hpack==4.1.0
|
||||
+ httpcore==1.0.9
|
||||
+ httpx==0.28.1
|
||||
+ huggingface-hub==0.35.0
|
||||
+ hyperframe==6.1.0
|
||||
+ idna==3.10
|
||||
+ loguru==0.7.3
|
||||
+ markdown-it-py==4.0.0
|
||||
+ mdurl==0.1.2
|
||||
+ multidict==6.6.4
|
||||
+ multiprocess==0.70.16
|
||||
+ numpy==2.3.3
|
||||
+ packaging==25.0
|
||||
+ pandas==2.3.2
|
||||
+ pillow==11.3.0
|
||||
+ propcache==0.3.2
|
||||
+ protobuf==6.32.1
|
||||
+ pyarrow==21.0.0
|
||||
+ pydantic==2.11.9
|
||||
+ pydantic-core==2.33.2
|
||||
+ pydantic-settings==2.10.1
|
||||
+ pygments==2.19.2
|
||||
+ python-dateutil==2.9.0.post0
|
||||
+ python-dotenv==1.1.1
|
||||
+ pytz==2025.2
|
||||
+ pyyaml==6.0.2
|
||||
+ regex==2025.9.18
|
||||
+ requests==2.32.5
|
||||
+ rich==14.1.0
|
||||
+ safetensors==0.6.2
|
||||
+ six==1.17.0
|
||||
+ sniffio==1.3.1
|
||||
+ tokenizers==0.22.1
|
||||
+ tqdm==4.67.1
|
||||
+ transformers==4.56.2
|
||||
+ typing-extensions==4.15.0
|
||||
+ typing-inspection==0.4.1
|
||||
+ tzdata==2025.2
|
||||
+ urllib3==2.5.0
|
||||
+ wcwidth==0.2.14
|
||||
+ xxhash==3.5.0
|
||||
+ yarl==1.20.1
|
||||
Using Python 3.11.13 environment at: /usr/local
|
||||
Audited 1 package in 3ms
|
||||
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
|
||||
Creating backend...
|
||||
Backend openai_http connected to http://llama-stack-benchmark-service:8323/v1/openai for model meta-llama/Llama-3.2-3B-Instruct.
|
||||
Creating request loader...
|
||||
Created loader with 1000 unique requests from prompt_tokens=512,output_tokens=256.
|
||||
|
||||
|
||||
╭─ Benchmarks ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
|
||||
│ [17:34:30] ⠋ 100% concurrent@1 (complete) Req: 0.3 req/s, 3.32s Lat, 1.0 Conc, 18 Comp, 1 Inc, 0 Err │
|
||||
│ Tok: 74.0 gen/s, 238.6 tot/s, 40.2ms TTFT, 13.4ms ITL, 546 Prompt, 246 Gen │
|
||||
│ [17:35:35] ⠋ 100% concurrent@2 (complete) Req: 0.6 req/s, 3.46s Lat, 2.0 Conc, 34 Comp, 2 Inc, 0 Err │
|
||||
│ Tok: 139.6 gen/s, 454.0 tot/s, 48.0ms TTFT, 14.1ms ITL, 546 Prompt, 243 Gen │
|
||||
│ [17:36:40] ⠋ 100% concurrent@4 (complete) Req: 1.1 req/s, 3.44s Lat, 3.9 Conc, 68 Comp, 4 Inc, 0 Err │
|
||||
│ Tok: 273.2 gen/s, 900.4 tot/s, 50.7ms TTFT, 14.3ms ITL, 546 Prompt, 238 Gen │
|
||||
│ [17:37:45] ⠋ 100% concurrent@8 (complete) Req: 2.2 req/s, 3.55s Lat, 7.7 Conc, 129 Comp, 8 Inc, 0 Err │
|
||||
│ Tok: 519.1 gen/s, 1699.8 tot/s, 66.0ms TTFT, 14.6ms ITL, 547 Prompt, 240 Gen │
|
||||
│ [17:38:50] ⠋ 100% concurrent@16 (complete) Req: 4.1 req/s, 3.76s Lat, 15.5 Conc, 247 Comp, 16 Inc, 0 Err │
|
||||
│ Tok: 1005.5 gen/s, 3256.7 tot/s, 101.0ms TTFT, 15.0ms ITL, 547 Prompt, 244 Gen │
|
||||
│ [17:39:56] ⠋ 100% concurrent@32 (complete) Req: 8.1 req/s, 3.84s Lat, 30.9 Conc, 483 Comp, 32 Inc, 0 Err │
|
||||
│ Tok: 1926.3 gen/s, 6327.2 tot/s, 295.7ms TTFT, 14.8ms ITL, 547 Prompt, 239 Gen │
|
||||
│ [17:41:03] ⠋ 100% concurrent@64 (complete) Req: 9.9 req/s, 6.05s Lat, 59.7 Conc, 576 Comp, 58 Inc, 0 Err │
|
||||
│ Tok: 2381.0 gen/s, 7774.5 tot/s, 1196.2ms TTFT, 20.2ms ITL, 547 Prompt, 241 Gen │
|
||||
│ [17:42:10] ⠋ 100% concurrent@128 (complete) Req: 9.2 req/s, 11.59s Lat, 107.2 Conc, 514 Comp, 117 Inc, 0 Err │
|
||||
│ Tok: 2233.4 gen/s, 7286.3 tot/s, 2403.9ms TTFT, 38.2ms ITL, 547 Prompt, 242 Gen │
|
||||
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
|
||||
Generating... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (8/8) [ 0:08:41 < 0:00:00 ]
|
||||
|
||||
Benchmarks Metadata:
|
||||
Run id:511a14fd-ba11-4ffa-92ef-7cc23db4dd38
|
||||
Duration:528.5 seconds
|
||||
Profile:type=concurrent, strategies=['concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent'], streams=[1, 2, 4, 8, 16, 32, 64, 128]
|
||||
Args:max_number=None, max_duration=60.0, warmup_number=None, warmup_duration=3.0, cooldown_number=None, cooldown_duration=None
|
||||
Worker:type_='generative_requests_worker' backend_type='openai_http' backend_target='http://llama-stack-benchmark-service:8323/v1/openai' backend_model='meta-llama/Llama-3.2-3B-Instruct'
|
||||
backend_info={'max_output_tokens': 16384, 'timeout': 300, 'http2': True, 'follow_redirects': True, 'headers': {}, 'text_completions_path': '/v1/completions', 'chat_completions_path':
|
||||
'/v1/chat/completions'}
|
||||
Request Loader:type_='generative_request_loader' data='prompt_tokens=512,output_tokens=256' data_args=None processor='meta-llama/Llama-3.2-3B-Instruct' processor_args=None
|
||||
Extras:None
|
||||
|
||||
|
||||
Benchmarks Info:
|
||||
===================================================================================================================================================
|
||||
Metadata |||| Requests Made ||| Prompt Tok/Req ||| Output Tok/Req ||| Prompt Tok Total||| Output Tok Total||
|
||||
Benchmark| Start Time| End Time| Duration (s)| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err
|
||||
--------------|-----------|---------|-------------|------|-----|-----|------|------|----|------|------|----|-------|------|----|-------|------|----
|
||||
concurrent@1| 17:34:35| 17:35:35| 60.0| 18| 1| 0| 546.4| 512.0| 0.0| 246.0| 14.0| 0.0| 9835| 512| 0| 4428| 14| 0
|
||||
concurrent@2| 17:35:40| 17:36:40| 60.0| 34| 2| 0| 546.4| 512.0| 0.0| 242.7| 80.0| 0.0| 18577| 1024| 0| 8253| 160| 0
|
||||
concurrent@4| 17:36:45| 17:37:45| 60.0| 68| 4| 0| 546.4| 512.0| 0.0| 238.1| 103.2| 0.0| 37156| 2048| 0| 16188| 413| 0
|
||||
concurrent@8| 17:37:50| 17:38:50| 60.0| 129| 8| 0| 546.7| 512.0| 0.0| 240.3| 180.0| 0.0| 70518| 4096| 0| 31001| 1440| 0
|
||||
concurrent@16| 17:38:55| 17:39:55| 60.0| 247| 16| 0| 546.6| 512.0| 0.0| 244.1| 142.6| 0.0| 135002| 8192| 0| 60300| 2281| 0
|
||||
concurrent@32| 17:40:01| 17:41:01| 60.0| 483| 32| 0| 546.5| 512.0| 0.0| 239.2| 123.2| 0.0| 263972| 16384| 0| 115540| 3944| 0
|
||||
concurrent@64| 17:41:08| 17:42:08| 60.0| 576| 58| 0| 546.6| 512.0| 0.0| 241.3| 13.9| 0.0| 314817| 29696| 0| 138976| 807| 0
|
||||
concurrent@128| 17:42:15| 17:43:15| 60.0| 514| 117| 0| 546.5| 512.0| 0.0| 241.6| 143.9| 0.0| 280911| 59904| 0| 124160| 16832| 0
|
||||
===================================================================================================================================================
|
||||
|
||||
|
||||
Benchmarks Stats:
|
||||
=======================================================================================================================================================
|
||||
Metadata | Request Stats || Out Tok/sec| Tot Tok/sec| Req Latency (sec) ||| TTFT (ms) ||| ITL (ms) ||| TPOT (ms) ||
|
||||
Benchmark| Per Second| Concurrency| mean| mean| mean| median| p99| mean| median| p99| mean| median| p99| mean| median| p99
|
||||
--------------|-----------|------------|------------|------------|------|-------|------|-------|-------|-------|-----|-------|-----|-----|-------|-----
|
||||
concurrent@1| 0.30| 1.00| 74.0| 238.6| 3.32| 3.43| 3.61| 40.2| 39.3| 51.2| 13.4| 13.3| 14.0| 13.3| 13.2| 13.9
|
||||
concurrent@2| 0.58| 1.99| 139.6| 454.0| 3.46| 3.64| 3.74| 48.0| 45.8| 72.0| 14.1| 14.1| 14.5| 14.0| 14.0| 14.4
|
||||
concurrent@4| 1.15| 3.95| 273.2| 900.4| 3.44| 3.69| 3.74| 50.7| 47.2| 118.6| 14.3| 14.3| 14.4| 14.2| 14.2| 14.4
|
||||
concurrent@8| 2.16| 7.67| 519.1| 1699.8| 3.55| 3.76| 3.87| 66.0| 48.8| 208.2| 14.6| 14.5| 14.8| 14.5| 14.5| 14.8
|
||||
concurrent@16| 4.12| 15.48| 1005.5| 3256.7| 3.76| 3.90| 4.18| 101.0| 65.6| 396.7| 15.0| 15.0| 15.9| 15.0| 15.0| 15.9
|
||||
concurrent@32| 8.05| 30.89| 1926.3| 6327.2| 3.84| 4.04| 4.39| 295.7| 265.6| 720.4| 14.8| 14.9| 15.5| 14.8| 14.8| 15.3
|
||||
concurrent@64| 9.87| 59.74| 2381.0| 7774.5| 6.05| 6.18| 9.94| 1196.2| 1122.5| 4295.3| 20.2| 20.0| 25.8| 20.1| 19.9| 25.8
|
||||
concurrent@128| 9.25| 107.16| 2233.4| 7286.3| 11.59| 12.04| 14.46| 2403.9| 2322.3| 4001.5| 38.2| 38.5| 53.0| 38.0| 38.3| 52.7
|
||||
=======================================================================================================================================================
|
||||
|
||||
Saving benchmarks report...
|
||||
Benchmarks report saved to /benchmarks.json
|
||||
|
||||
Benchmarking complete.
|
|
@ -0,0 +1,171 @@
|
|||
Collecting uv
|
||||
Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
|
||||
Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.9 MB)
|
||||
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 149.3 MB/s eta 0:00:00
|
||||
Installing collected packages: uv
|
||||
Successfully installed uv-0.8.19
|
||||
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
|
||||
|
||||
[notice] A new release of pip is available: 24.0 -> 25.2
|
||||
[notice] To update, run: pip install --upgrade pip
|
||||
Using Python 3.11.13 environment at: /usr/local
|
||||
Resolved 61 packages in 494ms
|
||||
Downloading pandas (11.8MiB)
|
||||
Downloading tokenizers (3.1MiB)
|
||||
Downloading pygments (1.2MiB)
|
||||
Downloading aiohttp (1.7MiB)
|
||||
Downloading transformers (11.1MiB)
|
||||
Downloading numpy (16.2MiB)
|
||||
Downloading pillow (6.3MiB)
|
||||
Downloading pydantic-core (1.9MiB)
|
||||
Downloading hf-xet (3.0MiB)
|
||||
Downloading pyarrow (40.8MiB)
|
||||
Downloading pydantic-core
|
||||
Downloading aiohttp
|
||||
Downloading tokenizers
|
||||
Downloading hf-xet
|
||||
Downloading pillow
|
||||
Downloading pygments
|
||||
Downloading numpy
|
||||
Downloading pandas
|
||||
Downloading pyarrow
|
||||
Downloading transformers
|
||||
Prepared 61 packages in 1.24s
|
||||
Installed 61 packages in 126ms
|
||||
+ aiohappyeyeballs==2.6.1
|
||||
+ aiohttp==3.12.15
|
||||
+ aiosignal==1.4.0
|
||||
+ annotated-types==0.7.0
|
||||
+ anyio==4.10.0
|
||||
+ attrs==25.3.0
|
||||
+ certifi==2025.8.3
|
||||
+ charset-normalizer==3.4.3
|
||||
+ click==8.1.8
|
||||
+ datasets==4.1.1
|
||||
+ dill==0.4.0
|
||||
+ filelock==3.19.1
|
||||
+ frozenlist==1.7.0
|
||||
+ fsspec==2025.9.0
|
||||
+ ftfy==6.3.1
|
||||
+ guidellm==0.3.0
|
||||
+ h11==0.16.0
|
||||
+ h2==4.3.0
|
||||
+ hf-xet==1.1.10
|
||||
+ hpack==4.1.0
|
||||
+ httpcore==1.0.9
|
||||
+ httpx==0.28.1
|
||||
+ huggingface-hub==0.35.0
|
||||
+ hyperframe==6.1.0
|
||||
+ idna==3.10
|
||||
+ loguru==0.7.3
|
||||
+ markdown-it-py==4.0.0
|
||||
+ mdurl==0.1.2
|
||||
+ multidict==6.6.4
|
||||
+ multiprocess==0.70.16
|
||||
+ numpy==2.3.3
|
||||
+ packaging==25.0
|
||||
+ pandas==2.3.2
|
||||
+ pillow==11.3.0
|
||||
+ propcache==0.3.2
|
||||
+ protobuf==6.32.1
|
||||
+ pyarrow==21.0.0
|
||||
+ pydantic==2.11.9
|
||||
+ pydantic-core==2.33.2
|
||||
+ pydantic-settings==2.10.1
|
||||
+ pygments==2.19.2
|
||||
+ python-dateutil==2.9.0.post0
|
||||
+ python-dotenv==1.1.1
|
||||
+ pytz==2025.2
|
||||
+ pyyaml==6.0.2
|
||||
+ regex==2025.9.18
|
||||
+ requests==2.32.5
|
||||
+ rich==14.1.0
|
||||
+ safetensors==0.6.2
|
||||
+ six==1.17.0
|
||||
+ sniffio==1.3.1
|
||||
+ tokenizers==0.22.1
|
||||
+ tqdm==4.67.1
|
||||
+ transformers==4.56.2
|
||||
+ typing-extensions==4.15.0
|
||||
+ typing-inspection==0.4.1
|
||||
+ tzdata==2025.2
|
||||
+ urllib3==2.5.0
|
||||
+ wcwidth==0.2.14
|
||||
+ xxhash==3.5.0
|
||||
+ yarl==1.20.1
|
||||
Using Python 3.11.13 environment at: /usr/local
|
||||
Audited 1 package in 3ms
|
||||
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
|
||||
Creating backend...
|
||||
Backend openai_http connected to http://llama-stack-benchmark-service:8323/v1/openai for model meta-llama/Llama-3.2-3B-Instruct.
|
||||
Creating request loader...
|
||||
Created loader with 1000 unique requests from prompt_tokens=512,output_tokens=256.
|
||||
|
||||
|
||||
╭─ Benchmarks ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
|
||||
│ [17:45:18] ⠋ 100% concurrent@1 (complete) Req: 0.3 req/s, 3.42s Lat, 1.0 Conc, 17 Comp, 1 Inc, 0 Err │
|
||||
│ Tok: 73.9 gen/s, 233.7 tot/s, 50.2ms TTFT, 13.4ms ITL, 547 Prompt, 253 Gen │
|
||||
│ [17:46:23] ⠋ 100% concurrent@2 (complete) Req: 0.6 req/s, 3.42s Lat, 2.0 Conc, 34 Comp, 2 Inc, 0 Err │
|
||||
│ Tok: 134.7 gen/s, 447.4 tot/s, 50.8ms TTFT, 14.3ms ITL, 546 Prompt, 235 Gen │
|
||||
│ [17:47:28] ⠋ 100% concurrent@4 (complete) Req: 1.1 req/s, 3.55s Lat, 3.9 Conc, 66 Comp, 4 Inc, 0 Err │
|
||||
│ Tok: 268.7 gen/s, 873.1 tot/s, 54.9ms TTFT, 14.4ms ITL, 547 Prompt, 243 Gen │
|
||||
│ [17:48:33] ⠋ 100% concurrent@8 (complete) Req: 2.2 req/s, 3.56s Lat, 7.8 Conc, 130 Comp, 8 Inc, 0 Err │
|
||||
│ Tok: 526.1 gen/s, 1728.4 tot/s, 60.6ms TTFT, 14.7ms ITL, 547 Prompt, 239 Gen │
|
||||
│ [17:49:38] ⠋ 100% concurrent@16 (complete) Req: 4.1 req/s, 3.79s Lat, 15.7 Conc, 246 Comp, 16 Inc, 0 Err │
|
||||
│ Tok: 1006.9 gen/s, 3268.6 tot/s, 74.8ms TTFT, 15.3ms ITL, 547 Prompt, 243 Gen │
|
||||
│ [17:50:44] ⠋ 100% concurrent@32 (complete) Req: 7.8 req/s, 3.95s Lat, 30.9 Conc, 467 Comp, 32 Inc, 0 Err │
|
||||
│ Tok: 1912.0 gen/s, 6191.6 tot/s, 119.1ms TTFT, 15.7ms ITL, 547 Prompt, 244 Gen │
|
||||
│ [17:51:50] ⠋ 100% concurrent@64 (complete) Req: 13.0 req/s, 4.75s Lat, 61.8 Conc, 776 Comp, 64 Inc, 0 Err │
|
||||
│ Tok: 3154.3 gen/s, 10273.3 tot/s, 339.1ms TTFT, 18.3ms ITL, 547 Prompt, 242 Gen │
|
||||
│ [17:52:58] ⠋ 100% concurrent@128 (complete) Req: 15.1 req/s, 7.82s Lat, 117.7 Conc, 898 Comp, 127 Inc, 0 Err │
|
||||
│ Tok: 3617.4 gen/s, 11843.9 tot/s, 1393.8ms TTFT, 26.8ms ITL, 547 Prompt, 240 Gen │
|
||||
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
|
||||
Generating... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (8/8) [ 0:08:41 < 0:00:00 ]
|
||||
|
||||
Benchmarks Metadata:
|
||||
Run id:f73d408e-256a-4c32-aa40-05e8d7098b66
|
||||
Duration:529.2 seconds
|
||||
Profile:type=concurrent, strategies=['concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent'], streams=[1, 2, 4, 8, 16, 32, 64, 128]
|
||||
Args:max_number=None, max_duration=60.0, warmup_number=None, warmup_duration=3.0, cooldown_number=None, cooldown_duration=None
|
||||
Worker:type_='generative_requests_worker' backend_type='openai_http' backend_target='http://llama-stack-benchmark-service:8323/v1/openai' backend_model='meta-llama/Llama-3.2-3B-Instruct'
|
||||
backend_info={'max_output_tokens': 16384, 'timeout': 300, 'http2': True, 'follow_redirects': True, 'headers': {}, 'text_completions_path': '/v1/completions', 'chat_completions_path':
|
||||
'/v1/chat/completions'}
|
||||
Request Loader:type_='generative_request_loader' data='prompt_tokens=512,output_tokens=256' data_args=None processor='meta-llama/Llama-3.2-3B-Instruct' processor_args=None
|
||||
Extras:None
|
||||
|
||||
|
||||
Benchmarks Info:
|
||||
=====================================================================================================================================================
|
||||
Metadata |||| Requests Made ||| Prompt Tok/Req ||| Output Tok/Req ||| Prompt Tok Total||| Output Tok Total ||
|
||||
Benchmark| Start Time| End Time| Duration (s)| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err
|
||||
--------------|-----------|---------|-------------|------|-----|-----|------|------|----|------|------|----|-------|------|----|--------|------|-----
|
||||
concurrent@1| 17:45:23| 17:46:23| 60.0| 17| 1| 0| 546.6| 512.0| 0.0| 252.8| 136.0| 0.0| 9292| 512| 0| 4298| 136| 0
|
||||
concurrent@2| 17:46:28| 17:47:28| 60.0| 34| 2| 0| 546.4| 512.0| 0.0| 235.4| 130.0| 0.0| 18577| 1024| 0| 8003| 260| 0
|
||||
concurrent@4| 17:47:33| 17:48:33| 60.0| 66| 4| 0| 546.5| 512.0| 0.0| 243.0| 97.5| 0.0| 36072| 2048| 0| 16035| 390| 0
|
||||
concurrent@8| 17:48:38| 17:49:38| 60.0| 130| 8| 0| 546.6| 512.0| 0.0| 239.2| 146.0| 0.0| 71052| 4096| 0| 31090| 1168| 0
|
||||
concurrent@16| 17:49:43| 17:50:43| 60.0| 246| 16| 0| 546.6| 512.0| 0.0| 243.3| 112.3| 0.0| 134456| 8192| 0| 59862| 1797| 0
|
||||
concurrent@32| 17:50:49| 17:51:49| 60.0| 467| 32| 0| 546.6| 512.0| 0.0| 244.2| 147.3| 0.0| 255242| 16384| 0| 114038| 4714| 0
|
||||
concurrent@64| 17:51:55| 17:52:55| 60.0| 776| 64| 0| 546.5| 512.0| 0.0| 242.2| 106.1| 0.0| 424115| 32768| 0| 187916| 6788| 0
|
||||
concurrent@128| 17:53:03| 17:54:03| 60.0| 898| 127| 0| 546.5| 512.0| 0.0| 240.3| 69.8| 0.0| 490789| 65024| 0| 215810| 8864| 0
|
||||
=====================================================================================================================================================
|
||||
|
||||
|
||||
Benchmarks Stats:
|
||||
======================================================================================================================================================
|
||||
Metadata | Request Stats || Out Tok/sec| Tot Tok/sec| Req Latency (sec)||| TTFT (ms) ||| ITL (ms) ||| TPOT (ms) ||
|
||||
Benchmark| Per Second| Concurrency| mean| mean| mean| median| p99| mean| median| p99| mean| median| p99| mean| median| p99
|
||||
--------------|-----------|------------|------------|------------|-----|-------|------|-------|-------|-------|-----|-------|-----|-----|-------|-----
|
||||
concurrent@1| 0.29| 1.00| 73.9| 233.7| 3.42| 3.45| 3.50| 50.2| 50.9| 62.5| 13.4| 13.4| 13.5| 13.3| 13.3| 13.5
|
||||
concurrent@2| 0.57| 1.96| 134.7| 447.4| 3.42| 3.67| 4.12| 50.8| 49.2| 79.8| 14.3| 14.2| 15.9| 14.3| 14.2| 15.9
|
||||
concurrent@4| 1.11| 3.92| 268.7| 873.1| 3.55| 3.72| 3.80| 54.9| 51.7| 101.3| 14.4| 14.4| 14.5| 14.4| 14.4| 14.5
|
||||
concurrent@8| 2.20| 7.82| 526.1| 1728.4| 3.56| 3.78| 3.93| 60.6| 49.8| 189.5| 14.7| 14.7| 14.8| 14.6| 14.6| 14.8
|
||||
concurrent@16| 4.14| 15.66| 1006.9| 3268.6| 3.79| 3.94| 4.25| 74.8| 54.3| 328.4| 15.3| 15.3| 16.1| 15.2| 15.2| 16.0
|
||||
concurrent@32| 7.83| 30.91| 1912.0| 6191.6| 3.95| 4.07| 4.53| 119.1| 80.5| 674.0| 15.7| 15.6| 17.4| 15.7| 15.6| 17.3
|
||||
concurrent@64| 13.03| 61.85| 3154.3| 10273.3| 4.75| 4.93| 5.43| 339.1| 321.1| 1146.6| 18.3| 18.4| 19.3| 18.2| 18.3| 19.2
|
||||
concurrent@128| 15.05| 117.71| 3617.4| 11843.9| 7.82| 8.58| 13.35| 1393.8| 1453.0| 5232.2| 26.8| 26.7| 36.0| 26.7| 26.6| 35.9
|
||||
======================================================================================================================================================
|
||||
|
||||
Saving benchmarks report...
|
||||
Benchmarks report saved to /benchmarks.json
|
||||
|
||||
Benchmarking complete.
|
|
@ -0,0 +1,171 @@
|
|||
Collecting uv
|
||||
Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
|
||||
Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.9 MB)
|
||||
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 156.8 MB/s eta 0:00:00
|
||||
Installing collected packages: uv
|
||||
Successfully installed uv-0.8.19
|
||||
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
|
||||
|
||||
[notice] A new release of pip is available: 24.0 -> 25.2
|
||||
[notice] To update, run: pip install --upgrade pip
|
||||
Using Python 3.11.13 environment at: /usr/local
|
||||
Resolved 61 packages in 480ms
|
||||
Downloading pillow (6.3MiB)
|
||||
Downloading pydantic-core (1.9MiB)
|
||||
Downloading pyarrow (40.8MiB)
|
||||
Downloading aiohttp (1.7MiB)
|
||||
Downloading numpy (16.2MiB)
|
||||
Downloading pygments (1.2MiB)
|
||||
Downloading transformers (11.1MiB)
|
||||
Downloading pandas (11.8MiB)
|
||||
Downloading tokenizers (3.1MiB)
|
||||
Downloading hf-xet (3.0MiB)
|
||||
Downloading pydantic-core
|
||||
Downloading aiohttp
|
||||
Downloading tokenizers
|
||||
Downloading hf-xet
|
||||
Downloading pygments
|
||||
Downloading pillow
|
||||
Downloading numpy
|
||||
Downloading pandas
|
||||
Downloading pyarrow
|
||||
Downloading transformers
|
||||
Prepared 61 packages in 1.25s
|
||||
Installed 61 packages in 126ms
|
||||
+ aiohappyeyeballs==2.6.1
|
||||
+ aiohttp==3.12.15
|
||||
+ aiosignal==1.4.0
|
||||
+ annotated-types==0.7.0
|
||||
+ anyio==4.10.0
|
||||
+ attrs==25.3.0
|
||||
+ certifi==2025.8.3
|
||||
+ charset-normalizer==3.4.3
|
||||
+ click==8.1.8
|
||||
+ datasets==4.1.1
|
||||
+ dill==0.4.0
|
||||
+ filelock==3.19.1
|
||||
+ frozenlist==1.7.0
|
||||
+ fsspec==2025.9.0
|
||||
+ ftfy==6.3.1
|
||||
+ guidellm==0.3.0
|
||||
+ h11==0.16.0
|
||||
+ h2==4.3.0
|
||||
+ hf-xet==1.1.10
|
||||
+ hpack==4.1.0
|
||||
+ httpcore==1.0.9
|
||||
+ httpx==0.28.1
|
||||
+ huggingface-hub==0.35.0
|
||||
+ hyperframe==6.1.0
|
||||
+ idna==3.10
|
||||
+ loguru==0.7.3
|
||||
+ markdown-it-py==4.0.0
|
||||
+ mdurl==0.1.2
|
||||
+ multidict==6.6.4
|
||||
+ multiprocess==0.70.16
|
||||
+ numpy==2.3.3
|
||||
+ packaging==25.0
|
||||
+ pandas==2.3.2
|
||||
+ pillow==11.3.0
|
||||
+ propcache==0.3.2
|
||||
+ protobuf==6.32.1
|
||||
+ pyarrow==21.0.0
|
||||
+ pydantic==2.11.9
|
||||
+ pydantic-core==2.33.2
|
||||
+ pydantic-settings==2.10.1
|
||||
+ pygments==2.19.2
|
||||
+ python-dateutil==2.9.0.post0
|
||||
+ python-dotenv==1.1.1
|
||||
+ pytz==2025.2
|
||||
+ pyyaml==6.0.2
|
||||
+ regex==2025.9.18
|
||||
+ requests==2.32.5
|
||||
+ rich==14.1.0
|
||||
+ safetensors==0.6.2
|
||||
+ six==1.17.0
|
||||
+ sniffio==1.3.1
|
||||
+ tokenizers==0.22.1
|
||||
+ tqdm==4.67.1
|
||||
+ transformers==4.56.2
|
||||
+ typing-extensions==4.15.0
|
||||
+ typing-inspection==0.4.1
|
||||
+ tzdata==2025.2
|
||||
+ urllib3==2.5.0
|
||||
+ wcwidth==0.2.14
|
||||
+ xxhash==3.5.0
|
||||
+ yarl==1.20.1
|
||||
Using Python 3.11.13 environment at: /usr/local
|
||||
Audited 1 package in 4ms
|
||||
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
|
||||
Creating backend...
|
||||
Backend openai_http connected to http://llama-stack-benchmark-service:8323/v1/openai for model meta-llama/Llama-3.2-3B-Instruct.
|
||||
Creating request loader...
|
||||
Created loader with 1000 unique requests from prompt_tokens=512,output_tokens=256.
|
||||
|
||||
|
||||
╭─ Benchmarks ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
|
||||
│ [17:55:59] ⠋ 100% concurrent@1 (complete) Req: 0.3 req/s, 3.33s Lat, 1.0 Conc, 18 Comp, 1 Inc, 0 Err │
|
||||
│ Tok: 74.0 gen/s, 238.0 tot/s, 49.6ms TTFT, 13.4ms ITL, 546 Prompt, 246 Gen │
|
||||
│ [17:57:04] ⠋ 100% concurrent@2 (complete) Req: 0.6 req/s, 3.32s Lat, 1.9 Conc, 35 Comp, 2 Inc, 0 Err │
|
||||
│ Tok: 137.1 gen/s, 457.5 tot/s, 50.6ms TTFT, 14.0ms ITL, 546 Prompt, 234 Gen │
|
||||
│ [17:58:09] ⠋ 100% concurrent@4 (complete) Req: 1.2 req/s, 3.42s Lat, 4.0 Conc, 69 Comp, 4 Inc, 0 Err │
|
||||
│ Tok: 276.7 gen/s, 907.2 tot/s, 52.7ms TTFT, 14.1ms ITL, 547 Prompt, 240 Gen │
|
||||
│ [17:59:14] ⠋ 100% concurrent@8 (complete) Req: 2.3 req/s, 3.47s Lat, 7.8 Conc, 134 Comp, 8 Inc, 0 Err │
|
||||
│ Tok: 541.4 gen/s, 1775.4 tot/s, 57.3ms TTFT, 14.3ms ITL, 547 Prompt, 240 Gen │
|
||||
│ [18:00:19] ⠋ 100% concurrent@16 (complete) Req: 4.3 req/s, 3.60s Lat, 15.6 Conc, 259 Comp, 16 Inc, 0 Err │
|
||||
│ Tok: 1034.8 gen/s, 3401.7 tot/s, 72.3ms TTFT, 14.8ms ITL, 547 Prompt, 239 Gen │
|
||||
│ [18:01:25] ⠋ 100% concurrent@32 (complete) Req: 8.4 req/s, 3.69s Lat, 31.1 Conc, 505 Comp, 32 Inc, 0 Err │
|
||||
│ Tok: 2029.7 gen/s, 6641.5 tot/s, 91.6ms TTFT, 15.0ms ITL, 547 Prompt, 241 Gen │
|
||||
│ [18:02:31] ⠋ 100% concurrent@64 (complete) Req: 13.6 req/s, 4.50s Lat, 61.4 Conc, 818 Comp, 64 Inc, 0 Err │
|
||||
│ Tok: 3333.9 gen/s, 10787.0 tot/s, 171.3ms TTFT, 17.8ms ITL, 547 Prompt, 244 Gen │
|
||||
│ [18:03:40] ⠋ 100% concurrent@128 (complete) Req: 16.1 req/s, 7.43s Lat, 119.5 Conc, 964 Comp, 122 Inc, 0 Err │
|
||||
│ Tok: 3897.0 gen/s, 12679.4 tot/s, 446.4ms TTFT, 28.9ms ITL, 547 Prompt, 243 Gen │
|
||||
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
|
||||
Generating... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (8/8) [ 0:08:41 < 0:00:00 ]
|
||||
|
||||
Benchmarks Metadata:
|
||||
Run id:5393e64f-d9f8-4548-95d8-da320bba1c24
|
||||
Duration:530.1 seconds
|
||||
Profile:type=concurrent, strategies=['concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent'], streams=[1, 2, 4, 8, 16, 32, 64, 128]
|
||||
Args:max_number=None, max_duration=60.0, warmup_number=None, warmup_duration=3.0, cooldown_number=None, cooldown_duration=None
|
||||
Worker:type_='generative_requests_worker' backend_type='openai_http' backend_target='http://llama-stack-benchmark-service:8323/v1/openai' backend_model='meta-llama/Llama-3.2-3B-Instruct' backend_info={'max_output_tokens': 16384, 'timeout': 300, 'http2': True, 'follow_redirects': True, 'headers': {}, 'text_completions_path': '/v1/completions', 'chat_completions_path': '/v1/chat/completions'}
Request Loader:type_='generative_request_loader' data='prompt_tokens=512,output_tokens=256' data_args=None processor='meta-llama/Llama-3.2-3B-Instruct' processor_args=None
Extras:None


Benchmarks Info:
===================================================================================================================================================
Metadata |||| Requests Made ||| Prompt Tok/Req ||| Output Tok/Req ||| Prompt Tok Total||| Output Tok Total||
Benchmark| Start Time| End Time| Duration (s)| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err
--------------|-----------|---------|-------------|------|-----|-----|------|------|----|------|------|----|-------|------|----|-------|------|----
concurrent@1| 17:56:04| 17:57:04| 60.0| 18| 1| 0| 546.4| 512.0| 0.0| 246.4| 256.0| 0.0| 9836| 512| 0| 4436| 256| 0
concurrent@2| 17:57:09| 17:58:09| 60.0| 35| 2| 0| 546.4| 512.0| 0.0| 233.9| 132.0| 0.0| 19124| 1024| 0| 8188| 264| 0
concurrent@4| 17:58:14| 17:59:14| 60.0| 69| 4| 0| 546.6| 512.0| 0.0| 239.9| 60.5| 0.0| 37715| 2048| 0| 16553| 242| 0
concurrent@8| 17:59:19| 18:00:19| 60.0| 134| 8| 0| 546.6| 512.0| 0.0| 239.8| 126.6| 0.0| 73243| 4096| 0| 32135| 1013| 0
concurrent@16| 18:00:24| 18:01:24| 60.0| 259| 16| 0| 546.6| 512.0| 0.0| 239.0| 115.7| 0.0| 141561| 8192| 0| 61889| 1851| 0
concurrent@32| 18:01:30| 18:02:30| 60.0| 505| 32| 0| 546.5| 512.0| 0.0| 240.5| 113.2| 0.0| 275988| 16384| 0| 121466| 3623| 0
concurrent@64| 18:02:37| 18:03:37| 60.0| 818| 64| 0| 546.6| 512.0| 0.0| 244.5| 132.4| 0.0| 447087| 32768| 0| 199988| 8475| 0
concurrent@128| 18:03:45| 18:04:45| 60.0| 964| 122| 0| 546.5| 512.0| 0.0| 242.5| 133.1| 0.0| 526866| 62464| 0| 233789| 16241| 0
===================================================================================================================================================


Benchmarks Stats:
=======================================================================================================================================================
Metadata | Request Stats || Out Tok/sec| Tot Tok/sec| Req Latency (sec) ||| TTFT (ms) ||| ITL (ms) ||| TPOT (ms) ||
Benchmark| Per Second| Concurrency| mean| mean| mean| median| p99| mean| median| p99| mean| median| p99| mean| median| p99
--------------|-----------|------------|------------|------------|------|--------|------|------|-------|-------|-----|-------|-----|-----|-------|-----
concurrent@1| 0.30| 1.00| 74.0| 238.0| 3.33| 3.44| 3.63| 49.6| 47.2| 66.1| 13.4| 13.3| 14.0| 13.3| 13.3| 14.0
concurrent@2| 0.59| 1.95| 137.1| 457.5| 3.32| 3.61| 3.67| 50.6| 48.6| 80.4| 14.0| 14.0| 14.2| 13.9| 13.9| 14.1
concurrent@4| 1.15| 3.95| 276.7| 907.2| 3.42| 3.61| 3.77| 52.7| 49.7| 106.9| 14.1| 14.0| 14.6| 14.0| 13.9| 14.5
concurrent@8| 2.26| 7.83| 541.4| 1775.4| 3.47| 3.70| 3.79| 57.3| 50.9| 171.3| 14.3| 14.3| 14.4| 14.2| 14.2| 14.4
concurrent@16| 4.33| 15.57| 1034.8| 3401.7| 3.60| 3.81| 4.22| 72.3| 52.0| 292.9| 14.8| 14.7| 16.3| 14.7| 14.7| 16.3
concurrent@32| 8.44| 31.12| 2029.7| 6641.5| 3.69| 3.89| 4.24| 91.6| 62.6| 504.6| 15.0| 15.0| 15.4| 14.9| 14.9| 15.4
concurrent@64| 13.64| 61.40| 3333.9| 10787.0| 4.50| 4.61| 5.67| 171.3| 101.2| 1165.6| 17.8| 17.7| 19.2| 17.7| 17.6| 19.1
concurrent@128| 16.07| 119.45| 3897.0| 12679.4| 7.43| 7.63| 9.74| 446.4| 195.8| 2533.1| 28.9| 28.9| 31.0| 28.8| 28.8| 30.9
=======================================================================================================================================================

Saving benchmarks report...
Benchmarks report saved to /benchmarks.json

Benchmarking complete.
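As a quick consistency check on the stats table above (an illustrative calculation, not part of the guidellm output): mean request latency should roughly equal the mean TTFT plus (output tokens - 1) times the mean inter-token latency. A minimal Python sketch using the concurrent@1 row:

# Illustrative sanity check; values copied from the concurrent@1 rows above.
ttft_ms = 49.6          # mean time to first token
itl_ms = 13.4           # mean inter-token latency
output_tokens = 246.4   # mean completed output tokens per request (from Benchmarks Info)

est_latency_s = (ttft_ms + (output_tokens - 1) * itl_ms) / 1000
print(f"estimated mean request latency: {est_latency_s:.2f}s")  # ~3.34s vs. 3.33s reported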
@@ -0,0 +1,170 @@
Collecting uv
Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 126.9 MB/s eta 0:00:00
Installing collected packages: uv
Successfully installed uv-0.8.19
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 24.0 -> 25.2
[notice] To update, run: pip install --upgrade pip
Using Python 3.11.13 environment at: /usr/local
Resolved 61 packages in 561ms
Downloading hf-xet (3.0MiB)
Downloading pillow (6.3MiB)
Downloading transformers (11.1MiB)
Downloading pyarrow (40.8MiB)
Downloading numpy (16.2MiB)
Downloading pandas (11.8MiB)
Downloading tokenizers (3.1MiB)
Downloading pydantic-core (1.9MiB)
Downloading pygments (1.2MiB)
Downloading aiohttp (1.7MiB)
Downloading pydantic-core
Downloading aiohttp
Downloading tokenizers
Downloading hf-xet
Downloading pygments
Downloading pillow
Downloading numpy
Downloading pandas
Downloading transformers
Downloading pyarrow
Prepared 61 packages in 1.25s
Installed 61 packages in 114ms
+ aiohappyeyeballs==2.6.1
+ aiohttp==3.12.15
+ aiosignal==1.4.0
+ annotated-types==0.7.0
+ anyio==4.10.0
+ attrs==25.3.0
+ certifi==2025.8.3
+ charset-normalizer==3.4.3
+ click==8.1.8
+ datasets==4.1.1
+ dill==0.4.0
+ filelock==3.19.1
+ frozenlist==1.7.0
+ fsspec==2025.9.0
+ ftfy==6.3.1
+ guidellm==0.3.0
+ h11==0.16.0
+ h2==4.3.0
+ hf-xet==1.1.10
+ hpack==4.1.0
+ httpcore==1.0.9
+ httpx==0.28.1
+ huggingface-hub==0.35.0
+ hyperframe==6.1.0
+ idna==3.10
+ loguru==0.7.3
+ markdown-it-py==4.0.0
+ mdurl==0.1.2
+ multidict==6.6.4
+ multiprocess==0.70.16
+ numpy==2.3.3
+ packaging==25.0
+ pandas==2.3.2
+ pillow==11.3.0
+ propcache==0.3.2
+ protobuf==6.32.1
+ pyarrow==21.0.0
+ pydantic==2.11.9
+ pydantic-core==2.33.2
+ pydantic-settings==2.10.1
+ pygments==2.19.2
+ python-dateutil==2.9.0.post0
+ python-dotenv==1.1.1
+ pytz==2025.2
+ pyyaml==6.0.2
+ regex==2025.9.18
+ requests==2.32.5
+ rich==14.1.0
+ safetensors==0.6.2
+ six==1.17.0
+ sniffio==1.3.1
+ tokenizers==0.22.1
+ tqdm==4.67.1
+ transformers==4.56.2
+ typing-extensions==4.15.0
+ typing-inspection==0.4.1
+ tzdata==2025.2
+ urllib3==2.5.0
+ wcwidth==0.2.14
+ xxhash==3.5.0
+ yarl==1.20.1
Using Python 3.11.13 environment at: /usr/local
Audited 1 package in 3ms
Note: Environment variable `HF_TOKEN` is set and is the current active token independently from the token you've just configured.
Creating backend...
Backend openai_http connected to http://vllm-server:8000 for model meta-llama/Llama-3.2-3B-Instruct.
Creating request loader...
Created loader with 1000 unique requests from prompt_tokens=512,output_tokens=256.


╭─ Benchmarks ──────────────────────────────────────────────────────────────────────────────────────────────╮
│ [18:11:47] ⠋ 100% concurrent@1 (complete) Req: 0.3 req/s, 3.35s Lat, 1.0 Conc, 17 Comp, 1 Inc, 0 Err │
│ Tok: 76.4 gen/s, 239.4 tot/s, 29.6ms TTFT, 13.0ms ITL, 547 Prompt, 256 Gen │
│ [18:12:52] ⠋ 100% concurrent@2 (complete) Req: 0.6 req/s, 3.53s Lat, 2.0 Conc, 32 Comp, 2 Inc, 0 Err │
│ Tok: 145.0 gen/s, 454.5 tot/s, 36.9ms TTFT, 13.7ms ITL, 546 Prompt, 256 Gen │
│ [18:13:57] ⠋ 100% concurrent@4 (complete) Req: 1.1 req/s, 3.59s Lat, 4.0 Conc, 64 Comp, 4 Inc, 0 Err │
│ Tok: 284.8 gen/s, 892.7 tot/s, 59.0ms TTFT, 13.9ms ITL, 546 Prompt, 256 Gen │
│ [18:15:02] ⠋ 100% concurrent@8 (complete) Req: 2.2 req/s, 3.70s Lat, 8.0 Conc, 128 Comp, 7 Inc, 0 Err │
│ Tok: 553.5 gen/s, 1735.2 tot/s, 79.8ms TTFT, 14.2ms ITL, 547 Prompt, 256 Gen │
│ [18:16:08] ⠋ 100% concurrent@16 (complete) Req: 4.2 req/s, 3.83s Lat, 16.0 Conc, 240 Comp, 16 Inc, 0 Err │
│ Tok: 1066.9 gen/s, 3344.6 tot/s, 97.5ms TTFT, 14.6ms ITL, 547 Prompt, 256 Gen │
│ [18:17:13] ⠋ 100% concurrent@32 (complete) Req: 8.1 req/s, 3.94s Lat, 31.8 Conc, 480 Comp, 31 Inc, 0 Err │
│ Tok: 2069.7 gen/s, 6488.4 tot/s, 120.8ms TTFT, 15.0ms ITL, 547 Prompt, 256 Gen │
│ [18:18:20] ⠋ 100% concurrent@64 (complete) Req: 13.6 req/s, 4.60s Lat, 62.3 Conc, 813 Comp, 57 Inc, 0 Err │
│ Tok: 3472.1 gen/s, 10884.9 tot/s, 190.9ms TTFT, 17.3ms ITL, 547 Prompt, 256 Gen │
│ [18:19:28] ⠋ 100% concurrent@128 (complete) Req: 16.8 req/s, 7.37s Lat, 123.5 Conc, 1005 Comp, 126 Inc, 0 Err │
│ Tok: 4289.1 gen/s, 13445.8 tot/s, 356.4ms TTFT, 27.5ms ITL, 547 Prompt, 256 Gen │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Generating... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (8/8) [ 0:08:43 < 0:00:00 ]

Benchmarks Metadata:
Run id:8ccb6da1-83f4-4624-8d84-07c723b0b2a5
Duration:530.4 seconds
Profile:type=concurrent, strategies=['concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent'], streams=[1, 2, 4, 8, 16, 32, 64, 128]
Args:max_number=None, max_duration=60.0, warmup_number=None, warmup_duration=3.0, cooldown_number=None, cooldown_duration=None
Worker:type_='generative_requests_worker' backend_type='openai_http' backend_target='http://vllm-server:8000' backend_model='meta-llama/Llama-3.2-3B-Instruct' backend_info={'max_output_tokens': 16384, 'timeout': 300, 'http2': True, 'follow_redirects': True, 'headers': {}, 'text_completions_path': '/v1/completions', 'chat_completions_path': '/v1/chat/completions'}
Request Loader:type_='generative_request_loader' data='prompt_tokens=512,output_tokens=256' data_args=None processor='meta-llama/Llama-3.2-3B-Instruct' processor_args=None
Extras:None


Benchmarks Info:
=====================================================================================================================================================
Metadata |||| Requests Made ||| Prompt Tok/Req ||| Output Tok/Req ||| Prompt Tok Total||| Output Tok Total ||
Benchmark| Start Time| End Time| Duration (s)| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err
--------------|-----------|---------|-------------|------|-----|-----|------|------|----|------|------|----|-------|------|----|--------|------|-----
concurrent@1| 18:11:52| 18:12:52| 60.0| 17| 1| 0| 546.5| 512.0| 0.0| 256.0| 231.0| 0.0| 9291| 512| 0| 4352| 231| 0
concurrent@2| 18:12:57| 18:13:57| 60.0| 32| 2| 0| 546.5| 512.0| 0.0| 256.0| 251.0| 0.0| 17488| 1024| 0| 8192| 502| 0
concurrent@4| 18:14:02| 18:15:02| 60.0| 64| 4| 0| 546.4| 512.0| 0.0| 256.0| 175.2| 0.0| 34972| 2048| 0| 16384| 701| 0
concurrent@8| 18:15:07| 18:16:07| 60.0| 128| 7| 0| 546.6| 512.0| 0.0| 256.0| 50.7| 0.0| 69966| 3584| 0| 32768| 355| 0
concurrent@16| 18:16:13| 18:17:13| 60.0| 240| 16| 0| 546.5| 512.0| 0.0| 256.0| 166.0| 0.0| 131170| 8192| 0| 61440| 2656| 0
concurrent@32| 18:17:18| 18:18:18| 60.0| 480| 31| 0| 546.5| 512.0| 0.0| 256.0| 47.4| 0.0| 262339| 15872| 0| 122880| 1468| 0
concurrent@64| 18:18:25| 18:19:25| 60.0| 813| 57| 0| 546.5| 512.0| 0.0| 256.0| 110.7| 0.0| 444341| 29184| 0| 208128| 6311| 0
concurrent@128| 18:19:33| 18:20:33| 60.0| 1005| 126| 0| 546.5| 512.0| 0.0| 256.0| 65.8| 0.0| 549264| 64512| 0| 257280| 8296| 0
=====================================================================================================================================================


Benchmarks Stats:
=======================================================================================================================================================
Metadata | Request Stats || Out Tok/sec| Tot Tok/sec| Req Latency (sec) ||| TTFT (ms) ||| ITL (ms) ||| TPOT (ms) ||
Benchmark| Per Second| Concurrency| mean| mean| mean| median| p99| mean| median| p99| mean| median| p99| mean| median| p99
--------------|-----------|------------|------------|------------|------|--------|------|------|-------|-------|-----|-------|-----|-----|-------|-----
concurrent@1| 0.30| 1.00| 76.4| 239.4| 3.35| 3.35| 3.38| 29.6| 29.0| 38.9| 13.0| 13.0| 13.1| 13.0| 13.0| 13.0
concurrent@2| 0.57| 2.00| 145.0| 454.5| 3.53| 3.53| 3.55| 36.9| 39.0| 59.6| 13.7| 13.7| 13.8| 13.6| 13.7| 13.7
concurrent@4| 1.11| 4.00| 284.8| 892.7| 3.59| 3.59| 3.65| 59.0| 65.7| 88.2| 13.9| 13.8| 14.1| 13.8| 13.8| 14.0
concurrent@8| 2.16| 7.99| 553.5| 1735.2| 3.70| 3.69| 3.76| 79.8| 80.7| 152.6| 14.2| 14.2| 14.5| 14.1| 14.1| 14.4
concurrent@16| 4.17| 15.97| 1066.9| 3344.6| 3.83| 3.82| 3.99| 97.5| 96.3| 283.9| 14.6| 14.6| 14.9| 14.6| 14.6| 14.8
concurrent@32| 8.08| 31.84| 2069.7| 6488.4| 3.94| 3.90| 4.31| 120.8| 101.7| 564.3| 15.0| 14.9| 15.9| 14.9| 14.8| 15.9
concurrent@64| 13.56| 62.34| 3472.1| 10884.9| 4.60| 4.54| 5.43| 190.9| 133.9| 1113.2| 17.3| 17.2| 18.2| 17.2| 17.2| 18.2
concurrent@128| 16.75| 123.45| 4289.1| 13445.8| 7.37| 7.21| 9.21| 356.4| 161.9| 2319.9| 27.5| 27.5| 28.8| 27.4| 27.4| 28.7
=======================================================================================================================================================

Saving benchmarks report...
Benchmarks report saved to /benchmarks.json

Benchmarking complete.
Binary file not shown (new image, 562 KiB).
@@ -1,148 +0,0 @@
#!/usr/bin/env bash

# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

set -euo pipefail

# Default values
TARGET="stack"
DURATION=60
CONCURRENT=10

# Parse command line arguments
usage() {
    echo "Usage: $0 [options]"
    echo "Options:"
    echo "  -t, --target <stack|vllm>     Target to benchmark (default: stack)"
    echo "  -d, --duration <seconds>      Duration in seconds (default: 60)"
    echo "  -c, --concurrent <users>      Number of concurrent users (default: 10)"
    echo "  -h, --help                    Show this help message"
    echo ""
    echo "Examples:"
    echo "  $0 --target vllm              # Benchmark vLLM direct"
    echo "  $0 --target stack             # Benchmark Llama Stack (default)"
    echo "  $0 -t vllm -d 120 -c 20       # vLLM with 120s duration, 20 users"
}

while [[ $# -gt 0 ]]; do
    case $1 in
        -t|--target)
            TARGET="$2"
            shift 2
            ;;
        -d|--duration)
            DURATION="$2"
            shift 2
            ;;
        -c|--concurrent)
            CONCURRENT="$2"
            shift 2
            ;;
        -h|--help)
            usage
            exit 0
            ;;
        *)
            echo "Unknown option: $1"
            usage
            exit 1
            ;;
    esac
done

# Validate target
if [[ "$TARGET" != "stack" && "$TARGET" != "vllm" ]]; then
    echo "Error: Target must be 'stack' or 'vllm'"
    usage
    exit 1
fi

# Set configuration based on target
if [[ "$TARGET" == "vllm" ]]; then
    BASE_URL="http://vllm-server:8000/v1"
    JOB_NAME="vllm-benchmark-job"
    echo "Benchmarking vLLM direct..."
else
    BASE_URL="http://llama-stack-benchmark-service:8323/v1/openai/v1"
    JOB_NAME="stack-benchmark-job"
    echo "Benchmarking Llama Stack..."
fi

echo "Configuration:"
echo "  Target: $TARGET"
echo "  Base URL: $BASE_URL"
echo "  Duration: ${DURATION}s"
echo "  Concurrent users: $CONCURRENT"
echo ""

# Create temporary job yaml
TEMP_YAML="/tmp/benchmark-job-temp-$(date +%s).yaml"
cat > "$TEMP_YAML" << EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: $JOB_NAME
  namespace: default
spec:
  template:
    spec:
      containers:
      - name: benchmark
        image: python:3.11-slim
        command: ["/bin/bash"]
        args:
        - "-c"
        - |
          pip install aiohttp &&
          python3 /benchmark/benchmark.py \\
            --base-url $BASE_URL \\
            --model \${INFERENCE_MODEL} \\
            --duration $DURATION \\
            --concurrent $CONCURRENT
        env:
        - name: INFERENCE_MODEL
          value: "meta-llama/Llama-3.2-3B-Instruct"
        volumeMounts:
        - name: benchmark-script
          mountPath: /benchmark
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
      volumes:
      - name: benchmark-script
        configMap:
          name: benchmark-script
      restartPolicy: Never
  backoffLimit: 3
EOF

echo "Creating benchmark ConfigMap..."
kubectl create configmap benchmark-script \
  --from-file=benchmark.py=benchmark.py \
  --dry-run=client -o yaml | kubectl apply -f -

echo "Cleaning up any existing benchmark job..."
kubectl delete job $JOB_NAME 2>/dev/null || true

echo "Deploying benchmark Job..."
kubectl apply -f "$TEMP_YAML"

echo "Waiting for job to start..."
kubectl wait --for=condition=Ready pod -l job-name=$JOB_NAME --timeout=60s

echo "Following benchmark logs..."
kubectl logs -f job/$JOB_NAME

echo "Job completed. Checking final status..."
kubectl get job $JOB_NAME

# Clean up temporary file
rm -f "$TEMP_YAML"
benchmarking/k8s-benchmark/scripts/generate_charts.py (new executable file, 294 lines)
@@ -0,0 +1,294 @@
#!/usr/bin/env python3
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

# /// script
# dependencies = [
#   "matplotlib",
# ]
# ///
"""
Script to generate benchmark charts from guidellm text results.
Creates 2x2 grid charts with RPS, Request Latency, TTFT, and ITL metrics against concurrent@x values.
Outputs one chart file per vLLM replica group, with each line representing one benchmark run.
"""

import glob
import os
import re

import matplotlib.pyplot as plt


def extract_setup_name(filename: str) -> str:
    """Extract setup name from filename and format legend appropriately."""
    basename = os.path.basename(filename)

    # Try new pattern: guidellm-benchmark-stack-s{stack_replicas}-sw{workers}-v{vllm_replicas}-{timestamp}.txt
    match = re.search(r"guidellm-benchmark-stack-s(\d+)-sw(\d+)-v(\d+)-(\d{8})-(\d{6})\.txt", basename)
    if match:
        stack_replicas = match.group(1)
        workers = match.group(2)
        vllm_replicas = match.group(3)
        date = match.group(4)
        time = match.group(5)
        return f"stack-s{stack_replicas}-sw{workers}-v{vllm_replicas}"

    # Try new vLLM pattern: guidellm-benchmark-vllm-v{vllm_replicas}-{timestamp}.txt
    match = re.search(r"guidellm-benchmark-vllm-v(\d+)-(\d{8})-(\d{6})\.txt", basename)
    if match:
        vllm_replicas = match.group(1)
        date = match.group(2)
        time = match.group(3)
        return f"vllm-v{vllm_replicas}"

    # Fall back to old pattern: guidellm-benchmark-{target}-{stack_replicas}-w{workers}-{vllm_replicas}-{timestamp}.txt
    match = re.search(r"guidellm-benchmark-([^-]+)-(\d+)-w(\d+)-(\d+)-(\d+)-(\d+)\.txt", basename)
    if match:
        target = match.group(1)
        stack_replicas = match.group(2)
        workers = match.group(3)
        vllm_replicas = match.group(4)
        date = match.group(5)
        time = match.group(6)

        if target == "vllm":
            return f"vllm-{vllm_replicas}-w{workers}-{vllm_replicas}"
        else:
            return f"stack-replicas{stack_replicas}-w{workers}-vllm-replicas{vllm_replicas}-{date}-{time}"

    # Fall back to older pattern: guidellm-benchmark-{target}-{stack_replicas}-{vllm_replicas}-{timestamp}.txt
    match = re.search(r"guidellm-benchmark-([^-]+)-(\d+)-(\d+)-(\d+)-(\d+)\.txt", basename)
    if match:
        target = match.group(1)
        stack_replicas = match.group(2)
        vllm_replicas = match.group(3)
        date = match.group(4)
        time = match.group(5)

        if target == "vllm":
            return f"vllm-{vllm_replicas}-w1-{vllm_replicas}"
        else:
            return f"stack-replicas{stack_replicas}-vllm-replicas{vllm_replicas}-{date}-{time}"

    return basename.replace("guidellm-benchmark-", "").replace(".txt", "")


def parse_txt_file(filepath: str) -> list[tuple[float, float, float, float, float, str]]:
    """
    Parse a text benchmark file and extract concurrent@x, RPS, TTFT, ITL, and request latency data.
    Returns list of (concurrency, rps_mean, ttft_mean, itl_mean, req_latency_mean, setup_name) tuples.
    """
    setup_name = extract_setup_name(filepath)
    data_points = []

    try:
        with open(filepath) as f:
            content = f.read()

        # Find the benchmark stats table
        lines = content.split("\n")
        in_stats_table = False
        header_lines_seen = 0

        for line in lines:
            line_stripped = line.strip()

            # Look for the start of the stats table
            if "Benchmarks Stats:" in line:
                in_stats_table = True
                continue

            if in_stats_table:
                # Skip the first few separator/header lines
                if line_stripped.startswith("=") or line_stripped.startswith("-"):
                    header_lines_seen += 1
                    if header_lines_seen >= 3:  # After seeing multiple header lines, look for concurrent@ data
                        if line_stripped.startswith("=") and "concurrent@" not in line_stripped:
                            break
                    continue

            # Parse concurrent@ lines in the stats table (may have leading spaces)
            if in_stats_table and "concurrent@" in line:
                parts = [part.strip() for part in line.split("|")]

                if len(parts) >= 12:  # Make sure we have enough columns for new format
                    try:
                        # Extract concurrency from benchmark name (e.g., concurrent@1 -> 1)
                        concurrent_match = re.search(r"concurrent@(\d+)", parts[0])
                        if not concurrent_match:
                            continue
                        concurrency = float(concurrent_match.group(1))

                        # Extract metrics from the new table format.
                        # The stats table has these columns with | separators:
                        # Benchmark | Per Second | Concurrency | Out Tok/sec | Tot Tok/sec | Req Latency (sec) | TTFT (ms) | ITL (ms) | TPOT (ms)
                        # Each latency metric is reported as mean | median | p99.
                        rps_mean = float(parts[1])  # Per Second (RPS)
                        req_latency_mean = float(parts[6]) * 1000  # Request latency mean (convert from sec to ms)
                        ttft_mean = float(parts[9])  # TTFT mean column
                        itl_mean = float(parts[12])  # ITL mean column

                        data_points.append((concurrency, rps_mean, ttft_mean, itl_mean, req_latency_mean, setup_name))

                    except (ValueError, IndexError) as e:
                        print(f"Warning: Could not parse line '{line}' in {filepath}: {e}")
                        continue

    except (OSError, FileNotFoundError) as e:
        print(f"Error reading {filepath}: {e}")

    return data_points


def generate_charts(benchmark_dir: str = "results"):
    """Generate 2x2 grid charts (RPS, Request Latency, TTFT, ITL) from benchmark text files."""
    # Find all text result files instead of JSON
    txt_pattern = os.path.join(benchmark_dir, "guidellm-benchmark-*.txt")
    txt_files = glob.glob(txt_pattern)

    if not txt_files:
        print(f"No text files found matching pattern: {txt_pattern}")
        return

    print(f"Found {len(txt_files)} text files")

    # Parse all files and collect data
    all_data = {}  # setup_name -> [(concurrency, rps, ttft, itl, req_latency), ...]

    for txt_file in txt_files:
        print(f"Processing {txt_file}")
        data_points = parse_txt_file(txt_file)

        for concurrency, rps, ttft, itl, req_latency, setup_name in data_points:
            if setup_name not in all_data:
                all_data[setup_name] = []
            all_data[setup_name].append((concurrency, rps, ttft, itl, req_latency))

    if not all_data:
        print("No data found to plot")
        return

    # Sort data points by concurrency for each setup
    for setup_name in all_data:
        all_data[setup_name].sort(key=lambda x: x[0])  # Sort by concurrency

    # Group setups by vLLM replica number (original approach)
    replica_groups = {}  # vllm_replica_count -> {setup_name: points}

    for setup_name, points in all_data.items():
        # Extract vLLM replica number from setup name
        # Expected formats:
        # - New stack format: "stack-s{X}-sw{W}-v{Y}"
        # - New vLLM format: "vllm-v{Y}"
        # - Old formats: "stack-replicas{X}-w{W}-vllm-replicas{Y}" or "vllm-{Y}-w{W}-{Y}"

        # Try new formats first
        vllm_match = re.search(r"-v(\d+)$", setup_name)  # Matches both "stack-s1-sw2-v3" and "vllm-v1"
        if not vllm_match:
            # Try old stack format
            vllm_match = re.search(r"vllm-replicas(\d+)", setup_name)
        if not vllm_match:
            # Try old vLLM format: "vllm-{Y}-w{W}-{Y}"
            vllm_match = re.search(r"vllm-(\d+)-w\d+-\d+", setup_name)

        if vllm_match:
            vllm_replica_num = int(vllm_match.group(1))
            if vllm_replica_num not in replica_groups:
                replica_groups[vllm_replica_num] = {}
            replica_groups[vllm_replica_num][setup_name] = points
        else:
            print(f"Warning: Could not extract vLLM replica count from setup name: {setup_name}")

    def create_charts(data_dict, prefix, title_prefix):
        """Create a 2x2 grid with RPS, Request Latency, TTFT, and ITL charts."""
        if not data_dict:
            print(f"No data found for {prefix}")
            return

        # Create 2x2 subplot grid
        fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
        fig.suptitle(f"{title_prefix} Benchmark Results", fontsize=16, fontweight="bold")

        # Collect all unique concurrency values for tick setting
        all_concurrency_values = set()
        for points in data_dict.values():
            all_concurrency_values.update([p[0] for p in points])
        all_concurrency_values = sorted(all_concurrency_values)

        # Plot data for each setup in alphabetical order
        for setup_name in sorted(data_dict.keys()):
            points = data_dict[setup_name]
            if not points:
                continue

            concurrency_values = [p[0] for p in points]
            rps_values = [p[1] for p in points]
            ttft_values = [p[2] for p in points]
            itl_values = [p[3] for p in points]
            req_latency_values = [p[4] for p in points]

            # RPS chart (top-left)
            ax1.plot(concurrency_values, rps_values, marker="o", label=setup_name, linewidth=2, markersize=6)

            # Request Latency chart (top-right)
            ax2.plot(concurrency_values, req_latency_values, marker="o", label=setup_name, linewidth=2, markersize=6)

            # TTFT chart (bottom-left)
            ax3.plot(concurrency_values, ttft_values, marker="o", label=setup_name, linewidth=2, markersize=6)

            # ITL chart (bottom-right)
            ax4.plot(concurrency_values, itl_values, marker="o", label=setup_name, linewidth=2, markersize=6)

        # Configure all charts after plotting data
        axes = [ax1, ax2, ax3, ax4]
        titles = ["RPS", "Request Latency", "TTFT", "ITL"]
        ylabels = [
            "Requests Per Second (RPS)",
            "Request Latency (ms)",
            "Time to First Token (ms)",
            "Inter Token Latency (ms)",
        ]

        for ax, title, ylabel in zip(axes, titles, ylabels, strict=False):
            ax.set_xlabel("Concurrency", fontsize=12)
            ax.set_ylabel(ylabel, fontsize=12)
            ax.set_title(title, fontsize=14, fontweight="bold")
            ax.set_xscale("log", base=2)
            ax.set_xticks(all_concurrency_values)
            ax.set_xticklabels([str(int(x)) for x in all_concurrency_values])
            ax.grid(True, alpha=0.3)

        # Add legend to the right-most subplot (top-right)
        ax2.legend(bbox_to_anchor=(1.05, 1), loc="upper left")

        plt.tight_layout()

        # Save the combined chart
        combined_filename = os.path.join(benchmark_dir, f"{prefix}_benchmark_results.png")
        plt.savefig(combined_filename, dpi=300, bbox_inches="tight")
        plt.close()
        print(f"Combined benchmark chart saved to {combined_filename}")

    # Print grouping information
    for replica_count, data_dict in replica_groups.items():
        print(f"vLLM Replica {replica_count} setups: {list(data_dict.keys())}")

    # Create separate charts for each replica group
    for replica_count, data_dict in replica_groups.items():
        prefix = f"vllm_replica{replica_count}"
        title = f"vLLM Replicas={replica_count}"
        create_charts(data_dict, prefix, title)

    # Print summary
    print("\nSummary:")
    for setup_name, points in all_data.items():
        print(f"{setup_name}: {len(points)} data points")


if __name__ == "__main__":
    generate_charts()
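The filename patterns handled above can be hard to read out of context. A small, self-contained illustration (the file names below are made up) of how extract_setup_name() maps new-format result files to chart legend labels:

# Hypothetical file names, used only to illustrate the regexes in extract_setup_name() above.
import re

examples = [
    "guidellm-benchmark-stack-s1-sw2-v1-20250915-181147.txt",
    "guidellm-benchmark-vllm-v1-20250915-190210.txt",
]
for name in examples:
    m = re.search(r"guidellm-benchmark-stack-s(\d+)-sw(\d+)-v(\d+)-(\d{8})-(\d{6})\.txt", name)
    if m:
        print(f"stack-s{m.group(1)}-sw{m.group(2)}-v{m.group(3)}")  # -> stack-s1-sw2-v1
        continue
    m = re.search(r"guidellm-benchmark-vllm-v(\d+)-(\d{8})-(\d{6})\.txt", name)
    if m:
        print(f"vllm-v{m.group(1)}")  # -> vllm-v1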
benchmarking/k8s-benchmark/scripts/run-all-benchmarks.sh (new executable file, 103 lines)
@@ -0,0 +1,103 @@
#!/usr/bin/env bash

# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

# Define benchmark configurations: (target, stack_replicas, vllm_replicas, stack_workers)
configs=(
    "stack 1 1 1"
    "stack 1 1 2"
    "stack 1 1 4"
    "vllm 1 1 -"
)

set -euo pipefail

# Get the directory where this script is located
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

echo "Running comprehensive GuideLLM benchmark suite..."
echo "Start time: $(date)"

# Default deployment names
STACK_DEPLOYMENT="llama-stack-benchmark-server"
VLLM_DEPLOYMENT="vllm-server"

# Scaling function
scale_deployments() {
    local stack_replicas=$1
    local vllm_replicas=$2
    local workers=$3

    echo "Scaling deployments..."

    if [[ "$vllm_replicas" != "-" ]]; then
        echo "Scaling $VLLM_DEPLOYMENT to $vllm_replicas replicas..."
        kubectl scale deployment $VLLM_DEPLOYMENT --replicas=$vllm_replicas
        kubectl rollout status deployment $VLLM_DEPLOYMENT --timeout=600s
    fi

    if [[ "$target" == "stack" ]]; then
        if [[ "$stack_replicas" != "-" ]]; then
            echo "Scaling $STACK_DEPLOYMENT to $stack_replicas replicas..."
            kubectl scale deployment $STACK_DEPLOYMENT --replicas=$stack_replicas
            kubectl rollout status deployment $STACK_DEPLOYMENT --timeout=600s
        fi

        if [[ "$workers" != "-" ]]; then
            echo "Updating $STACK_DEPLOYMENT to use $workers workers..."
            kubectl set env deployment/$STACK_DEPLOYMENT LLAMA_STACK_WORKERS=$workers
            kubectl rollout status deployment $STACK_DEPLOYMENT --timeout=600s
        fi
    fi

    echo "All scaling operations completed. Waiting additional 30s for services to stabilize..."
    sleep 30
}


for config in "${configs[@]}"; do
    read -r target stack_replicas vllm_replicas workers <<< "$config"

    echo ""
    echo "=========================================="
    if [[ "$workers" != "-" ]]; then
        echo "Running benchmark: $target (stack=$stack_replicas, vllm=$vllm_replicas, workers=$workers)"
    else
        echo "Running benchmark: $target (stack=$stack_replicas, vllm=$vllm_replicas)"
    fi
    echo "Start: $(date)"
    echo "=========================================="

    # Scale deployments before running benchmark
    scale_deployments "$stack_replicas" "$vllm_replicas" "$workers"

    # Generate output filename with setup info
    TIMESTAMP=$(date +%Y%m%d-%H%M%S)
    if [[ "$target" == "stack" ]]; then
        OUTPUT_FILE="results/guidellm-benchmark-${target}-s${stack_replicas}-sw${workers}-v${vllm_replicas}-${TIMESTAMP}.txt"
    else
        OUTPUT_FILE="results/guidellm-benchmark-${target}-v${vllm_replicas}-${TIMESTAMP}.txt"
    fi

    # Run the benchmark with the cluster as configured
    "$SCRIPT_DIR/run-guidellm-benchmark.sh" \
        --target "$target" \
        --output-file "$OUTPUT_FILE"

    echo "Completed: $(date)"
    echo "Waiting 30 seconds before next benchmark..."
    sleep 30
done

echo ""
echo "=========================================="
echo "All benchmarks completed!"
echo "End time: $(date)"
echo "=========================================="
echo ""
echo "Results files generated:"
ls -la results/guidellm-*.txt results/guidellm-*.json 2>/dev/null || echo "No result files found"
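For reference, a minimal sketch (not part of the committed scripts; the timestamp is a placeholder for date +%Y%m%d-%H%M%S) of the result file names the configs above produce. These are the new-format names that generate_charts.py groups by vLLM replica count:

# Illustration only: map each config entry to its output file name.
configs = ["stack 1 1 1", "stack 1 1 2", "stack 1 1 4", "vllm 1 1 -"]
timestamp = "YYYYMMDD-HHMMSS"
for config in configs:
    target, stack_replicas, vllm_replicas, workers = config.split()
    if target == "stack":
        print(f"results/guidellm-benchmark-stack-s{stack_replicas}-sw{workers}-v{vllm_replicas}-{timestamp}.txt")
    else:
        print(f"results/guidellm-benchmark-vllm-v{vllm_replicas}-{timestamp}.txt")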
benchmarking/k8s-benchmark/scripts/run-guidellm-benchmark.sh (new executable file, 219 lines)
@@ -0,0 +1,219 @@
#!/usr/bin/env bash

# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

set -euo pipefail

# Default values
TARGET="stack"
MAX_SECONDS=60
PROMPT_TOKENS=512
OUTPUT_TOKENS=256
RATE_TYPE="concurrent"
RATE="1,2,4,8,16,32,64,128"
STACK_DEPLOYMENT="llama-stack-benchmark-server"
STACK_URL="http://llama-stack-benchmark-service:8323/v1/openai"
VLLM_DEPLOYMENT="vllm-server"
OUTPUT_FILE=""

# Parse command line arguments
usage() {
    echo "Usage: $0 [options]"
    echo "Options:"
    echo "  -t, --target <stack|vllm>      Target to benchmark (default: stack)"
    echo "  -s, --max-seconds <seconds>    Maximum duration in seconds (default: 60)"
    echo "  -p, --prompt-tokens <tokens>   Number of prompt tokens (default: 512)"
    echo "  -o, --output-tokens <tokens>   Number of output tokens (default: 256)"
    echo "  -r, --rate-type <type>         Rate type (default: concurrent)"
    echo "  -c, --rate                     Rate (default: 1,2,4,8,16,32,64,128)"
    echo "  --output-file <path>           Output file path (default: auto-generated)"
    echo "  --stack-deployment <name>      Name of the stack deployment (default: llama-stack-benchmark-server)"
    echo "  --vllm-deployment <name>       Name of the vllm deployment (default: vllm-server)"
    echo "  --stack-url <url>              URL of the stack service (default: http://llama-stack-benchmark-service:8323/v1/openai)"
    echo "  -h, --help                     Show this help message"
    echo ""
    echo "Examples:"
    echo "  $0 --target vllm                          # Benchmark vLLM direct"
    echo "  $0 --target stack                         # Benchmark Llama Stack (default)"
    echo "  $0 -t vllm -s 60 -p 512 -o 256            # vLLM with custom parameters"
    echo "  $0 --output-file results/my-benchmark.txt # Specify custom output file"
    echo "  $0 --stack-deployment my-stack-server     # Use custom stack deployment name"
}

while [[ $# -gt 0 ]]; do
    case $1 in
        -t|--target)
            TARGET="$2"
            shift 2
            ;;
        -s|--max-seconds)
            MAX_SECONDS="$2"
            shift 2
            ;;
        -p|--prompt-tokens)
            PROMPT_TOKENS="$2"
            shift 2
            ;;
        -o|--output-tokens)
            OUTPUT_TOKENS="$2"
            shift 2
            ;;
        -r|--rate-type)
            RATE_TYPE="$2"
            shift 2
            ;;
        -c|--rate)
            RATE="$2"
            shift 2
            ;;
        --output-file)
            OUTPUT_FILE="$2"
            shift 2
            ;;
        --stack-deployment)
            STACK_DEPLOYMENT="$2"
            shift 2
            ;;
        --vllm-deployment)
            VLLM_DEPLOYMENT="$2"
            shift 2
            ;;
        --stack-url)
            STACK_URL="$2"
            shift 2
            ;;
        -h|--help)
            usage
            exit 0
            ;;
        *)
            echo "Unknown option: $1"
            usage
            exit 1
            ;;
    esac
done

# Validate target
if [[ "$TARGET" != "stack" && "$TARGET" != "vllm" ]]; then
    echo "Error: Target must be 'stack' or 'vllm'"
    usage
    exit 1
fi

# Set configuration based on target
if [[ "$TARGET" == "vllm" ]]; then
    BASE_URL="http://${VLLM_DEPLOYMENT}:8000"
    JOB_NAME="guidellm-vllm-benchmark-job"
    echo "Benchmarking vLLM direct with GuideLLM..."
else
    BASE_URL="$STACK_URL"
    JOB_NAME="guidellm-stack-benchmark-job"
    echo "Benchmarking Llama Stack with GuideLLM..."
fi


echo "Configuration:"
echo "  Target: $TARGET"
echo "  Base URL: $BASE_URL"
echo "  Max seconds: ${MAX_SECONDS}s"
echo "  Prompt tokens: $PROMPT_TOKENS"
echo "  Output tokens: $OUTPUT_TOKENS"
echo "  Rate type: $RATE_TYPE"
if [[ "$TARGET" == "vllm" ]]; then
    echo "  vLLM deployment: $VLLM_DEPLOYMENT"
else
    echo "  Stack deployment: $STACK_DEPLOYMENT"
fi
echo ""

# Create temporary job yaml
TEMP_YAML="/tmp/guidellm-benchmark-job-temp-$(date +%s).yaml"
cat > "$TEMP_YAML" << EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: $JOB_NAME
  namespace: default
spec:
  template:
    spec:
      containers:
      - name: guidellm-benchmark
        image: python:3.11-slim
        command: ["/bin/bash"]
        args:
        - "-c"
        - |
          # Install uv and guidellm
          pip install uv &&
          uv pip install --system guidellm &&

          # Login to HuggingFace
          uv pip install --system huggingface_hub &&
          python -c "from huggingface_hub import login; login(token='\$HF_TOKEN')" &&

          # Run GuideLLM benchmark and save output
          export COLUMNS=200
          GUIDELLM__PREFERRED_ROUTE="chat_completions" uv run guidellm benchmark run \\
            --target "$BASE_URL" \\
            --rate-type "$RATE_TYPE" \\
            --max-seconds $MAX_SECONDS \\
            --data "prompt_tokens=$PROMPT_TOKENS,output_tokens=$OUTPUT_TOKENS" \\
            --model "$INFERENCE_MODEL" \\
            --rate "$RATE" \\
            --warmup-percent 0.05 \\
            2>&1
        env:
        - name: INFERENCE_MODEL
          value: "meta-llama/Llama-3.2-3B-Instruct"
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        resources:
          requests:
            memory: "4Gi"
            cpu: "500m"
          limits:
            memory: "8Gi"
            cpu: "2000m"
      restartPolicy: Never
  backoffLimit: 3
EOF

echo "Cleaning up any existing GuideLLM benchmark job..."
kubectl delete job $JOB_NAME 2>/dev/null || true

echo "Deploying GuideLLM benchmark Job..."
kubectl apply -f "$TEMP_YAML"

echo "Waiting for job to start..."
kubectl wait --for=condition=Ready pod -l job-name=$JOB_NAME --timeout=120s

# Prepare file names and create results directory
mkdir -p results
if [[ -z "$OUTPUT_FILE" ]]; then
    TIMESTAMP=$(date +%Y%m%d-%H%M%S)
    OUTPUT_FILE="results/guidellm-benchmark-${TARGET}-${TIMESTAMP}.txt"
fi

echo "Following GuideLLM benchmark logs..."
kubectl logs -f job/$JOB_NAME

echo "Job completed. Checking final status..."
kubectl get job $JOB_NAME

# Save benchmark results using kubectl logs
echo "Saving benchmark results..."
kubectl logs job/$JOB_NAME > "$OUTPUT_FILE"

echo "Benchmark output saved to: $OUTPUT_FILE"

# Clean up temporary file
rm -f "$TEMP_YAML"
@@ -58,14 +58,14 @@ spec:
           value: "/etc/config/stack_run_config.yaml"
         - name: LLAMA_STACK_WORKERS
           value: "${LLAMA_STACK_WORKERS}"
-        command: ["uvicorn", "llama_stack.core.server.server:create_app", "--host", "0.0.0.0", "--port", "8323", "--workers", "$LLAMA_STACK_WORKERS", "--factory"]
+        command: ["uvicorn", "llama_stack.core.server.server:create_app", "--host", "0.0.0.0", "--port", "8323", "--workers", "$(LLAMA_STACK_WORKERS)", "--factory"]
         ports:
           - containerPort: 8323
         resources:
           requests:
-            cpu: "${LLAMA_STACK_WORKERS}"
+            cpu: "4"
           limits:
-            cpu: "${LLAMA_STACK_WORKERS}"
+            cpu: "4"
         volumeMounts:
         - name: llama-storage
           mountPath: /root/.llama
@@ -177,6 +177,7 @@ exclude = [
     ".pre-commit-config.yaml",
     "*.md",
     ".flake8",
+    "benchmarking/k8s-benchmark/results",
 ]

 [tool.ruff.lint]