chore(perf): run guidellm benchmarks (#3421)

# What does this PR do?
- Mostly AI-generated scripts to run guidellm
(https://github.com/vllm-project/guidellm) benchmarks on a k8s setup
- The Stack uses an image built from main on 9/11


## Test Plan
See updated README.md
ehhuang 2025-09-24 10:18:33 -07:00 committed by GitHub
parent 2f58d87c22
commit 48a551ecbc
14 changed files with 1436 additions and 526 deletions


@@ -26,6 +26,7 @@ The benchmark suite measures critical performance indicators:
- **Throughput**: Requests per second under sustained load
- **Latency Distribution**: P50, P95, P99 response times
- **Time to First Token (TTFT)**: Critical for streaming applications
- **Inter-Token Latency (ITL)**: Token generation speed for streaming
- **Error Rates**: Request failures and timeout analysis
These metrics enable informed architectural decisions and targeted performance optimization.
@@ -49,49 +50,148 @@ kubectl get pods
# Should see: llama-stack-benchmark-server, vllm-server, etc.
```
## Benchmark Results
We use [GuideLLM](https://github.com/neuralmagic/guidellm) against our k8s deployment for comprehensive performance testing.
### Performance - 1 vLLM Replica
We vary the number of Llama Stack replicas with 1 vLLM replica and compare performance below.
![Performance - 1 vLLM Replica](results/vllm_replica1_benchmark_results.png)
For full results see the `benchmarking/k8s-benchmark/results/` directory.
## Quick Start
### Basic Benchmarks
Follow the instructions below to run benchmarks similar to the ones above.
**Benchmark Llama Stack (default):**
### Comprehensive Benchmark Suite
**Run all benchmarks with different cluster configurations:**
```bash
./run-benchmark.sh
./scripts/run-all-benchmarks.sh
```
**Benchmark vLLM direct:**
This script will automatically:
- Scale deployments to different configurations
- Run benchmarks for each setup
- Generate output files with meaningful names that include setup information
### Individual Benchmarks
**Benchmark Llama Stack (runs against current cluster setup):**
```bash
./run-benchmark.sh --target vllm
./scripts/run-guidellm-benchmark.sh --target stack
```
### Custom Configuration
**Extended benchmark with high concurrency:**
**Benchmark vLLM direct (runs against current cluster setup):**
```bash
./run-benchmark.sh --target vllm --duration 120 --concurrent 20
./scripts/run-guidellm-benchmark.sh --target vllm
```
**Short test run:**
**Benchmark with custom parameters:**
```bash
./run-benchmark.sh --target stack --duration 30 --concurrent 5
./scripts/run-guidellm-benchmark.sh --target stack --max-seconds 120 --prompt-tokens 1024 --output-tokens 512
```
**Benchmark with custom output file:**
```bash
./scripts/run-guidellm-benchmark.sh --target stack --output-file results/my-custom-benchmark.txt
```
### Generating Charts
Once the benchmarks are run, you can generate performance charts from benchmark results:
```bash
uv run ./scripts/generate_charts.py
```
This loads the benchmark runs found in the `results/` directory and creates visualizations comparing different configurations and replica counts.
## Benchmark Workflow
The benchmark suite is organized into two main scripts with distinct responsibilities:
### 1. `run-all-benchmarks.sh` - Orchestration & Scaling
- **Purpose**: Manages different cluster configurations and orchestrates benchmark runs
- **Responsibilities**:
- Scales Kubernetes deployments (vLLM replicas, Stack replicas, worker counts)
- Runs benchmarks for each configuration
- Generates meaningful output filenames with setup information
- **Use case**: Running comprehensive performance testing across multiple configurations
### 2. `run-guidellm-benchmark.sh` - Single Benchmark Execution
- **Purpose**: Executes a single benchmark against the current cluster state
- **Responsibilities**:
- Runs GuideLLM benchmark with configurable parameters
- Accepts custom output file paths
- No cluster scaling - benchmarks current deployment state
- **Use case**: Testing specific configurations or custom scenarios
### Typical Workflow
1. **Comprehensive Testing**: Use `run-all-benchmarks.sh` to automatically test multiple configurations
2. **Custom Testing**: Use `run-guidellm-benchmark.sh` for specific parameter testing or manual cluster configurations
3. **Analysis**: Use `generate_charts.py` to visualize results from either approach
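Putting these steps together, a typical end-to-end session looks roughly like this (a minimal sketch; it assumes the Quick Start deployment above is already running):
```bash
# Step 1: run the full benchmark matrix defined in the script's configs array
./scripts/run-all-benchmarks.sh

# Step 3: turn the generated files in results/ into comparison charts
uv run ./scripts/generate_charts.py
```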
## Command Reference
### run-benchmark.sh Options
### run-all-benchmarks.sh
Orchestrates multiple benchmark runs with different cluster configurations. This script:
- Automatically scales deployments before each benchmark
- Runs benchmarks against the configured cluster setup
- Generates meaningfully named output files
```bash
./run-benchmark.sh [options]
./scripts/run-all-benchmarks.sh
```
**Configuration**: Edit the `configs` array in the script to customize benchmark configurations:
```bash
# Each line: (target, stack_replicas, vllm_replicas, stack_workers)
configs=(
"stack 1 1 1"
"stack 1 1 2"
"stack 1 1 4"
"vllm 1 1 -"
)
```
**Output files**: Generated with setup information in filename:
- Stack: `guidellm-benchmark-stack-s{replicas}-sw{workers}-v{vllm_replicas}-{timestamp}.txt`
- vLLM: `guidellm-benchmark-vllm-v{vllm_replicas}-{timestamp}.txt`
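For illustration, one `configs` entry roughly corresponds to the following steps. This is a hypothetical sketch, not the literal script contents: the deployment names come from this repo's k8s setup, while the exact scaling commands, worker-count handling, and timestamp format are assumptions.
```bash
# Hypothetical expansion of the entry "stack 1 1 2"
kubectl scale deployment llama-stack-benchmark-server --replicas=1   # stack replicas
kubectl scale deployment vllm-server --replicas=1                    # vllm replicas
# (how the stack worker count is applied is script-specific and omitted here)
./scripts/run-guidellm-benchmark.sh --target stack \
  --output-file "results/guidellm-benchmark-stack-s1-sw2-v1-$(date +%Y%m%d-%H%M%S).txt"
```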
### run-guidellm-benchmark.sh Options
Runs a single benchmark against the current cluster setup (no scaling).
```bash
./scripts/run-guidellm-benchmark.sh [options]
Options:
-t, --target <stack|vllm> Target to benchmark (default: stack)
-d, --duration <seconds> Duration in seconds (default: 60)
-c, --concurrent <users> Number of concurrent users (default: 10)
-s, --max-seconds <seconds> Maximum duration in seconds (default: 60)
-p, --prompt-tokens <tokens> Number of prompt tokens (default: 512)
-o, --output-tokens <tokens> Number of output tokens (default: 256)
-r, --rate-type <type> Rate type (default: concurrent)
-c, --rate Rate (default: 1,2,4,8,16,32,64,128)
--output-file <path> Output file path (default: auto-generated)
--stack-deployment <name> Name of the stack deployment (default: llama-stack-benchmark-server)
--vllm-deployment <name> Name of the vllm deployment (default: vllm-server)
--stack-url <url> URL of the stack service (default: http://llama-stack-benchmark-service:8323/v1/openai)
-h, --help Show help message
Examples:
./run-benchmark.sh --target vllm # Benchmark vLLM direct
./run-benchmark.sh --target stack # Benchmark Llama Stack
./run-benchmark.sh -t vllm -d 120 -c 20 # vLLM with 120s, 20 users
./scripts/run-guidellm-benchmark.sh --target vllm # Benchmark vLLM direct
./scripts/run-guidellm-benchmark.sh --target stack # Benchmark Llama Stack (default)
./scripts/run-guidellm-benchmark.sh -t vllm -s 60 -p 512 -o 256 # vLLM with custom parameters
./scripts/run-guidellm-benchmark.sh --output-file results/my-benchmark.txt # Specify custom output file
./scripts/run-guidellm-benchmark.sh --stack-deployment my-stack-server # Use custom stack deployment name
```
## Local Testing
@@ -100,55 +200,30 @@ Examples:
For local development without Kubernetes:
**1. Start OpenAI mock server:**
```bash
uv run python openai-mock-server.py --port 8080
```
**2. Run benchmark against mock server:**
```bash
uv run python benchmark.py \
--base-url http://localhost:8080/v1 \
--model mock-inference \
--duration 30 \
--concurrent 5
```
**3. Test against local vLLM server:**
```bash
# If you have vLLM running locally on port 8000
uv run python benchmark.py \
--base-url http://localhost:8000/v1 \
--model meta-llama/Llama-3.2-3B-Instruct \
--duration 30 \
--concurrent 5
```
**4. Profile the running server:**
```bash
./profile_running_server.sh
```
### OpenAI Mock Server
**1. (Optional) Start Mock OpenAI server:**
A simple mock OpenAI server is included for use when you don't have an inference provider available.
The `openai-mock-server.py` provides:
- **OpenAI-compatible API** for testing without real models
- **Configurable streaming delay** via `STREAM_DELAY_SECONDS` env var
- **Consistent responses** for reproducible benchmarks
- **Lightweight testing** without GPU requirements
**Mock server usage:**
```bash
uv run python openai-mock-server.py --port 8080
```
The mock server is also deployed in k8s as `openai-mock-service:8080` and can be used by changing the Llama Stack configuration to use the `mock-vllm-inference` provider.
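For example, to exercise the configurable streaming delay, set `STREAM_DELAY_SECONDS` when launching the mock server (the value below is only an illustration):
```bash
# Add a small artificial delay to streamed responses (illustrative value)
STREAM_DELAY_SECONDS=0.05 uv run python openai-mock-server.py --port 8080
```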
**2. Start Stack server:**
```bash
LLAMA_STACK_CONFIG=benchmarking/k8s-benchmark/stack_run_config.yaml uv run uvicorn llama_stack.core.server.server:create_app --port 8321 --workers 4 --factory
```
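If you want the local server to mirror one of the worker counts benchmarked above (assuming the stack worker count corresponds to uvicorn `--workers`), adjust that flag, e.g.:
```bash
# Same command with a single worker, roughly matching the "stack 1 1 1" configuration
LLAMA_STACK_CONFIG=benchmarking/k8s-benchmark/stack_run_config.yaml uv run uvicorn llama_stack.core.server.server:create_app --port 8321 --workers 1 --factory
```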
## Files in this Directory
- `benchmark.py` - Core benchmark script with async streaming support
- `run-benchmark.sh` - Main script with target selection and configuration
- `openai-mock-server.py` - Mock OpenAI API server for local testing
- `README.md` - This documentation file
**3. Run GuideLLM benchmark:**
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" uv run guidellm benchmark run \
--target "http://localhost:8321/v1/openai/v1" \
--model "meta-llama/Llama-3.2-3B-Instruct" \
--rate-type sweep \
--max-seconds 60 \
--data "prompt_tokens=256,output_tokens=128" --output-path='output.html'
```


@@ -1,265 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
"""
Simple benchmark script for Llama Stack with OpenAI API compatibility.
"""
import argparse
import asyncio
import os
import random
import statistics
import time
import aiohttp
class BenchmarkStats:
    def __init__(self):
        self.response_times = []
        self.ttft_times = []
        self.chunks_received = []
        self.errors = []
        self.success_count = 0
        self.total_requests = 0
        self.concurrent_users = 0
        self.start_time = None
        self.end_time = None
        self._lock = asyncio.Lock()

    async def add_result(self, response_time: float, chunks: int, ttft: float = None, error: str = None):
        async with self._lock:
            self.total_requests += 1
            if error:
                self.errors.append(error)
            else:
                self.success_count += 1
                self.response_times.append(response_time)
                self.chunks_received.append(chunks)
                if ttft is not None:
                    self.ttft_times.append(ttft)

    def print_summary(self):
        if not self.response_times:
            print("No successful requests to report")
            if self.errors:
                print(f"Total errors: {len(self.errors)}")
                print("First 5 errors:")
                for error in self.errors[:5]:
                    print(f" {error}")
            return

        total_time = self.end_time - self.start_time
        success_rate = (self.success_count / self.total_requests) * 100

        print(f"\n{'=' * 60}")
        print("BENCHMARK RESULTS")

        print("\nResponse Time Statistics:")
        print(f" Mean: {statistics.mean(self.response_times):.3f}s")
        print(f" Median: {statistics.median(self.response_times):.3f}s")
        print(f" Min: {min(self.response_times):.3f}s")
        print(f" Max: {max(self.response_times):.3f}s")
        if len(self.response_times) > 1:
            print(f" Std Dev: {statistics.stdev(self.response_times):.3f}s")

        percentiles = [50, 90, 95, 99]
        sorted_times = sorted(self.response_times)
        print("\nPercentiles:")
        for p in percentiles:
            idx = int(len(sorted_times) * p / 100) - 1
            idx = max(0, min(idx, len(sorted_times) - 1))
            print(f" P{p}: {sorted_times[idx]:.3f}s")

        if self.ttft_times:
            print("\nTime to First Token (TTFT) Statistics:")
            print(f" Mean: {statistics.mean(self.ttft_times):.3f}s")
            print(f" Median: {statistics.median(self.ttft_times):.3f}s")
            print(f" Min: {min(self.ttft_times):.3f}s")
            print(f" Max: {max(self.ttft_times):.3f}s")
            if len(self.ttft_times) > 1:
                print(f" Std Dev: {statistics.stdev(self.ttft_times):.3f}s")

            sorted_ttft = sorted(self.ttft_times)
            print("\nTTFT Percentiles:")
            for p in percentiles:
                idx = int(len(sorted_ttft) * p / 100) - 1
                idx = max(0, min(idx, len(sorted_ttft) - 1))
                print(f" P{p}: {sorted_ttft[idx]:.3f}s")

        if self.chunks_received:
            print("\nStreaming Statistics:")
            print(f" Mean chunks per response: {statistics.mean(self.chunks_received):.1f}")
            print(f" Total chunks received: {sum(self.chunks_received)}")

        print(f"{'=' * 60}")
        print(f"Total time: {total_time:.2f}s")
        print(f"Concurrent users: {self.concurrent_users}")
        print(f"Total requests: {self.total_requests}")
        print(f"Successful requests: {self.success_count}")
        print(f"Failed requests: {len(self.errors)}")
        print(f"Success rate: {success_rate:.1f}%")
        print(f"Requests per second: {self.success_count / total_time:.2f}")

        if self.errors:
            print("\nErrors (showing first 5):")
            for error in self.errors[:5]:
                print(f" {error}")
class LlamaStackBenchmark:
    def __init__(self, base_url: str, model_id: str):
        self.base_url = base_url.rstrip("/")
        self.model_id = model_id
        self.headers = {"Content-Type": "application/json"}
        self.test_messages = [
            [{"role": "user", "content": "Hi"}],
            [{"role": "user", "content": "What is the capital of France?"}],
            [{"role": "user", "content": "Explain quantum physics in simple terms."}],
            [{"role": "user", "content": "Write a short story about a robot learning to paint."}],
            [
                {"role": "user", "content": "What is machine learning?"},
                {"role": "assistant", "content": "Machine learning is a subset of AI..."},
                {"role": "user", "content": "Can you give me a practical example?"},
            ],
        ]

    async def make_async_streaming_request(self) -> tuple[float, int, float | None, str | None]:
        """Make a single async streaming chat completion request."""
        messages = random.choice(self.test_messages)
        payload = {"model": self.model_id, "messages": messages, "stream": True, "max_tokens": 100}

        start_time = time.time()
        chunks_received = 0
        ttft = None
        error = None
        session = aiohttp.ClientSession()

        try:
            async with session.post(
                f"{self.base_url}/chat/completions",
                headers=self.headers,
                json=payload,
                timeout=aiohttp.ClientTimeout(total=30),
            ) as response:
                if response.status == 200:
                    async for line in response.content:
                        if line:
                            line_str = line.decode("utf-8").strip()
                            if line_str.startswith("data: "):
                                chunks_received += 1
                                if ttft is None:
                                    ttft = time.time() - start_time
                                if line_str == "data: [DONE]":
                                    break

                    if chunks_received == 0:
                        error = "No streaming chunks received"
                else:
                    text = await response.text()
                    error = f"HTTP {response.status}: {text[:100]}"
        except Exception as e:
            error = f"Request error: {str(e)}"
        finally:
            await session.close()

        response_time = time.time() - start_time
        return response_time, chunks_received, ttft, error

    async def run_benchmark(self, duration: int, concurrent_users: int) -> BenchmarkStats:
        """Run benchmark using async requests for specified duration."""
        stats = BenchmarkStats()
        stats.concurrent_users = concurrent_users
        stats.start_time = time.time()

        print(f"Starting benchmark: {duration}s duration, {concurrent_users} concurrent users")
        print(f"Target URL: {self.base_url}/chat/completions")
        print(f"Model: {self.model_id}")

        connector = aiohttp.TCPConnector(limit=concurrent_users)
        async with aiohttp.ClientSession(connector=connector):

            async def worker(worker_id: int):
                """Worker that sends requests sequentially until canceled."""
                request_count = 0
                while True:
                    try:
                        response_time, chunks, ttft, error = await self.make_async_streaming_request()
                        await stats.add_result(response_time, chunks, ttft, error)
                        request_count += 1
                    except asyncio.CancelledError:
                        break
                    except Exception as e:
                        await stats.add_result(0, 0, None, f"Worker {worker_id} error: {str(e)}")

            # Progress reporting task
            async def progress_reporter():
                last_report_time = time.time()
                while True:
                    try:
                        await asyncio.sleep(1)  # Report every second
                        if time.time() >= last_report_time + 10:  # Report every 10 seconds
                            elapsed = time.time() - stats.start_time
                            print(
                                f"Completed: {stats.total_requests} requests in {elapsed:.1f}s, RPS: {stats.total_requests / elapsed:.1f}"
                            )
                            last_report_time = time.time()
                    except asyncio.CancelledError:
                        break

            # Spawn concurrent workers
            tasks = [asyncio.create_task(worker(i)) for i in range(concurrent_users)]
            progress_task = asyncio.create_task(progress_reporter())
            tasks.append(progress_task)

            # Wait for duration then cancel all tasks
            await asyncio.sleep(duration)
            for task in tasks:
                task.cancel()

            # Wait for all tasks to complete
            await asyncio.gather(*tasks, return_exceptions=True)

        stats.end_time = time.time()
        return stats
def main():
    parser = argparse.ArgumentParser(description="Llama Stack Benchmark Tool")
    parser.add_argument(
        "--base-url",
        default=os.getenv("BENCHMARK_BASE_URL", "http://localhost:8000/v1/openai/v1"),
        help="Base URL for the API (default: http://localhost:8000/v1/openai/v1)",
    )
    parser.add_argument(
        "--model", default=os.getenv("INFERENCE_MODEL", "test-model"), help="Model ID to use for requests"
    )
    parser.add_argument("--duration", type=int, default=60, help="Duration in seconds to run benchmark (default: 60)")
    parser.add_argument("--concurrent", type=int, default=10, help="Number of concurrent users (default: 10)")

    args = parser.parse_args()

    benchmark = LlamaStackBenchmark(args.base_url, args.model)

    try:
        stats = asyncio.run(benchmark.run_benchmark(args.duration, args.concurrent))
        stats.print_summary()
    except KeyboardInterrupt:
        print("\nBenchmark interrupted by user")
    except Exception as e:
        print(f"Benchmark failed: {e}")


if __name__ == "__main__":
    main()


@@ -1,52 +0,0 @@
#!/bin/bash
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
# Script to profile an already running Llama Stack server
# Usage: ./profile_running_server.sh [duration_seconds] [output_file]
DURATION=${1:-60} # Default 60 seconds
OUTPUT_FILE=${2:-"llama_stack_profile"} # Default output file
echo "Looking for running Llama Stack server..."
# Find the server PID
SERVER_PID=$(ps aux | grep "llama_stack.core.server.server" | grep -v grep | awk '{print $2}' | head -1)
if [ -z "$SERVER_PID" ]; then
    echo "Error: No running Llama Stack server found"
    echo "Please start your server first with:"
    echo "LLAMA_STACK_LOGGING=\"all=ERROR\" MOCK_INFERENCE_URL=http://localhost:8080 SAFETY_MODEL=llama-guard3:1b uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml"
    exit 1
fi
echo "Found Llama Stack server with PID: $SERVER_PID"
# Start py-spy profiling
echo "Starting py-spy profiling for ${DURATION} seconds..."
echo "Output will be saved to: ${OUTPUT_FILE}.svg"
echo ""
echo "You can now run your load test..."
echo ""
# Get the full path to py-spy
PYSPY_PATH=$(which py-spy)
# Check if running as root, if not, use sudo
if [ "$EUID" -ne 0 ]; then
    echo "py-spy requires root permissions on macOS. Running with sudo..."
    sudo "$PYSPY_PATH" record -o "${OUTPUT_FILE}.svg" -d ${DURATION} -p $SERVER_PID
else
    "$PYSPY_PATH" record -o "${OUTPUT_FILE}.svg" -d ${DURATION} -p $SERVER_PID
fi
echo ""
echo "Profiling completed! Results saved to: ${OUTPUT_FILE}.svg"
echo ""
echo "To view the flame graph:"
echo "open ${OUTPUT_FILE}.svg"
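Example invocation, following the usage line at the top of the script:
```bash
# Profile the running server for 120 seconds and write the flame graph to my_profile.svg
./profile_running_server.sh 120 my_profile
```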


@@ -0,0 +1,171 @@
Collecting uv
Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 144.3 MB/s eta 0:00:00
Installing collected packages: uv
Successfully installed uv-0.8.19
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip is available: 24.0 -> 25.2
[notice] To update, run: pip install --upgrade pip
Using Python 3.11.13 environment at: /usr/local
Resolved 61 packages in 551ms
Downloading pillow (6.3MiB)
Downloading hf-xet (3.0MiB)
Downloading tokenizers (3.1MiB)
Downloading pygments (1.2MiB)
Downloading pandas (11.8MiB)
Downloading aiohttp (1.7MiB)
Downloading pydantic-core (1.9MiB)
Downloading numpy (16.2MiB)
Downloading transformers (11.1MiB)
Downloading pyarrow (40.8MiB)
Downloading pydantic-core
Downloading aiohttp
Downloading tokenizers
Downloading hf-xet
Downloading pygments
Downloading pillow
Downloading numpy
Downloading pandas
Downloading transformers
Downloading pyarrow
Prepared 61 packages in 1.23s
Installed 61 packages in 114ms
+ aiohappyeyeballs==2.6.1
+ aiohttp==3.12.15
+ aiosignal==1.4.0
+ annotated-types==0.7.0
+ anyio==4.10.0
+ attrs==25.3.0
+ certifi==2025.8.3
+ charset-normalizer==3.4.3
+ click==8.1.8
+ datasets==4.1.1
+ dill==0.4.0
+ filelock==3.19.1
+ frozenlist==1.7.0
+ fsspec==2025.9.0
+ ftfy==6.3.1
+ guidellm==0.3.0
+ h11==0.16.0
+ h2==4.3.0
+ hf-xet==1.1.10
+ hpack==4.1.0
+ httpcore==1.0.9
+ httpx==0.28.1
+ huggingface-hub==0.35.0
+ hyperframe==6.1.0
+ idna==3.10
+ loguru==0.7.3
+ markdown-it-py==4.0.0
+ mdurl==0.1.2
+ multidict==6.6.4
+ multiprocess==0.70.16
+ numpy==2.3.3
+ packaging==25.0
+ pandas==2.3.2
+ pillow==11.3.0
+ propcache==0.3.2
+ protobuf==6.32.1
+ pyarrow==21.0.0
+ pydantic==2.11.9
+ pydantic-core==2.33.2
+ pydantic-settings==2.10.1
+ pygments==2.19.2
+ python-dateutil==2.9.0.post0
+ python-dotenv==1.1.1
+ pytz==2025.2
+ pyyaml==6.0.2
+ regex==2025.9.18
+ requests==2.32.5
+ rich==14.1.0
+ safetensors==0.6.2
+ six==1.17.0
+ sniffio==1.3.1
+ tokenizers==0.22.1
+ tqdm==4.67.1
+ transformers==4.56.2
+ typing-extensions==4.15.0
+ typing-inspection==0.4.1
+ tzdata==2025.2
+ urllib3==2.5.0
+ wcwidth==0.2.14
+ xxhash==3.5.0
+ yarl==1.20.1
Using Python 3.11.13 environment at: /usr/local
Audited 1 package in 3ms
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
Creating backend...
Backend openai_http connected to http://llama-stack-benchmark-service:8323/v1/openai for model meta-llama/Llama-3.2-3B-Instruct.
Creating request loader...
Created loader with 1000 unique requests from prompt_tokens=512,output_tokens=256.
╭─ Benchmarks ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ [17:34:30] ⠋ 100% concurrent@1 (complete) Req: 0.3 req/s, 3.32s Lat, 1.0 Conc, 18 Comp, 1 Inc, 0 Err │
│ Tok: 74.0 gen/s, 238.6 tot/s, 40.2ms TTFT, 13.4ms ITL, 546 Prompt, 246 Gen │
│ [17:35:35] ⠋ 100% concurrent@2 (complete) Req: 0.6 req/s, 3.46s Lat, 2.0 Conc, 34 Comp, 2 Inc, 0 Err │
│ Tok: 139.6 gen/s, 454.0 tot/s, 48.0ms TTFT, 14.1ms ITL, 546 Prompt, 243 Gen │
│ [17:36:40] ⠋ 100% concurrent@4 (complete) Req: 1.1 req/s, 3.44s Lat, 3.9 Conc, 68 Comp, 4 Inc, 0 Err │
│ Tok: 273.2 gen/s, 900.4 tot/s, 50.7ms TTFT, 14.3ms ITL, 546 Prompt, 238 Gen │
│ [17:37:45] ⠋ 100% concurrent@8 (complete) Req: 2.2 req/s, 3.55s Lat, 7.7 Conc, 129 Comp, 8 Inc, 0 Err │
│ Tok: 519.1 gen/s, 1699.8 tot/s, 66.0ms TTFT, 14.6ms ITL, 547 Prompt, 240 Gen │
│ [17:38:50] ⠋ 100% concurrent@16 (complete) Req: 4.1 req/s, 3.76s Lat, 15.5 Conc, 247 Comp, 16 Inc, 0 Err │
│ Tok: 1005.5 gen/s, 3256.7 tot/s, 101.0ms TTFT, 15.0ms ITL, 547 Prompt, 244 Gen │
│ [17:39:56] ⠋ 100% concurrent@32 (complete) Req: 8.1 req/s, 3.84s Lat, 30.9 Conc, 483 Comp, 32 Inc, 0 Err │
│ Tok: 1926.3 gen/s, 6327.2 tot/s, 295.7ms TTFT, 14.8ms ITL, 547 Prompt, 239 Gen │
│ [17:41:03] ⠋ 100% concurrent@64 (complete) Req: 9.9 req/s, 6.05s Lat, 59.7 Conc, 576 Comp, 58 Inc, 0 Err │
│ Tok: 2381.0 gen/s, 7774.5 tot/s, 1196.2ms TTFT, 20.2ms ITL, 547 Prompt, 241 Gen │
│ [17:42:10] ⠋ 100% concurrent@128 (complete) Req: 9.2 req/s, 11.59s Lat, 107.2 Conc, 514 Comp, 117 Inc, 0 Err │
│ Tok: 2233.4 gen/s, 7286.3 tot/s, 2403.9ms TTFT, 38.2ms ITL, 547 Prompt, 242 Gen │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Generating... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (8/8) [ 0:08:41 < 0:00:00 ]
Benchmarks Metadata:
Run id:511a14fd-ba11-4ffa-92ef-7cc23db4dd38
Duration:528.5 seconds
Profile:type=concurrent, strategies=['concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent'], streams=[1, 2, 4, 8, 16, 32, 64, 128]
Args:max_number=None, max_duration=60.0, warmup_number=None, warmup_duration=3.0, cooldown_number=None, cooldown_duration=None
Worker:type_='generative_requests_worker' backend_type='openai_http' backend_target='http://llama-stack-benchmark-service:8323/v1/openai' backend_model='meta-llama/Llama-3.2-3B-Instruct'
backend_info={'max_output_tokens': 16384, 'timeout': 300, 'http2': True, 'follow_redirects': True, 'headers': {}, 'text_completions_path': '/v1/completions', 'chat_completions_path':
'/v1/chat/completions'}
Request Loader:type_='generative_request_loader' data='prompt_tokens=512,output_tokens=256' data_args=None processor='meta-llama/Llama-3.2-3B-Instruct' processor_args=None
Extras:None
Benchmarks Info:
===================================================================================================================================================
Metadata |||| Requests Made ||| Prompt Tok/Req ||| Output Tok/Req ||| Prompt Tok Total||| Output Tok Total||
Benchmark| Start Time| End Time| Duration (s)| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err
--------------|-----------|---------|-------------|------|-----|-----|------|------|----|------|------|----|-------|------|----|-------|------|----
concurrent@1| 17:34:35| 17:35:35| 60.0| 18| 1| 0| 546.4| 512.0| 0.0| 246.0| 14.0| 0.0| 9835| 512| 0| 4428| 14| 0
concurrent@2| 17:35:40| 17:36:40| 60.0| 34| 2| 0| 546.4| 512.0| 0.0| 242.7| 80.0| 0.0| 18577| 1024| 0| 8253| 160| 0
concurrent@4| 17:36:45| 17:37:45| 60.0| 68| 4| 0| 546.4| 512.0| 0.0| 238.1| 103.2| 0.0| 37156| 2048| 0| 16188| 413| 0
concurrent@8| 17:37:50| 17:38:50| 60.0| 129| 8| 0| 546.7| 512.0| 0.0| 240.3| 180.0| 0.0| 70518| 4096| 0| 31001| 1440| 0
concurrent@16| 17:38:55| 17:39:55| 60.0| 247| 16| 0| 546.6| 512.0| 0.0| 244.1| 142.6| 0.0| 135002| 8192| 0| 60300| 2281| 0
concurrent@32| 17:40:01| 17:41:01| 60.0| 483| 32| 0| 546.5| 512.0| 0.0| 239.2| 123.2| 0.0| 263972| 16384| 0| 115540| 3944| 0
concurrent@64| 17:41:08| 17:42:08| 60.0| 576| 58| 0| 546.6| 512.0| 0.0| 241.3| 13.9| 0.0| 314817| 29696| 0| 138976| 807| 0
concurrent@128| 17:42:15| 17:43:15| 60.0| 514| 117| 0| 546.5| 512.0| 0.0| 241.6| 143.9| 0.0| 280911| 59904| 0| 124160| 16832| 0
===================================================================================================================================================
Benchmarks Stats:
=======================================================================================================================================================
Metadata | Request Stats || Out Tok/sec| Tot Tok/sec| Req Latency (sec) ||| TTFT (ms) ||| ITL (ms) ||| TPOT (ms) ||
Benchmark| Per Second| Concurrency| mean| mean| mean| median| p99| mean| median| p99| mean| median| p99| mean| median| p99
--------------|-----------|------------|------------|------------|------|-------|------|-------|-------|-------|-----|-------|-----|-----|-------|-----
concurrent@1| 0.30| 1.00| 74.0| 238.6| 3.32| 3.43| 3.61| 40.2| 39.3| 51.2| 13.4| 13.3| 14.0| 13.3| 13.2| 13.9
concurrent@2| 0.58| 1.99| 139.6| 454.0| 3.46| 3.64| 3.74| 48.0| 45.8| 72.0| 14.1| 14.1| 14.5| 14.0| 14.0| 14.4
concurrent@4| 1.15| 3.95| 273.2| 900.4| 3.44| 3.69| 3.74| 50.7| 47.2| 118.6| 14.3| 14.3| 14.4| 14.2| 14.2| 14.4
concurrent@8| 2.16| 7.67| 519.1| 1699.8| 3.55| 3.76| 3.87| 66.0| 48.8| 208.2| 14.6| 14.5| 14.8| 14.5| 14.5| 14.8
concurrent@16| 4.12| 15.48| 1005.5| 3256.7| 3.76| 3.90| 4.18| 101.0| 65.6| 396.7| 15.0| 15.0| 15.9| 15.0| 15.0| 15.9
concurrent@32| 8.05| 30.89| 1926.3| 6327.2| 3.84| 4.04| 4.39| 295.7| 265.6| 720.4| 14.8| 14.9| 15.5| 14.8| 14.8| 15.3
concurrent@64| 9.87| 59.74| 2381.0| 7774.5| 6.05| 6.18| 9.94| 1196.2| 1122.5| 4295.3| 20.2| 20.0| 25.8| 20.1| 19.9| 25.8
concurrent@128| 9.25| 107.16| 2233.4| 7286.3| 11.59| 12.04| 14.46| 2403.9| 2322.3| 4001.5| 38.2| 38.5| 53.0| 38.0| 38.3| 52.7
=======================================================================================================================================================
Saving benchmarks report...
Benchmarks report saved to /benchmarks.json
Benchmarking complete.


@@ -0,0 +1,171 @@
Collecting uv
Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 149.3 MB/s eta 0:00:00
Installing collected packages: uv
Successfully installed uv-0.8.19
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip is available: 24.0 -> 25.2
[notice] To update, run: pip install --upgrade pip
Using Python 3.11.13 environment at: /usr/local
Resolved 61 packages in 494ms
Downloading pandas (11.8MiB)
Downloading tokenizers (3.1MiB)
Downloading pygments (1.2MiB)
Downloading aiohttp (1.7MiB)
Downloading transformers (11.1MiB)
Downloading numpy (16.2MiB)
Downloading pillow (6.3MiB)
Downloading pydantic-core (1.9MiB)
Downloading hf-xet (3.0MiB)
Downloading pyarrow (40.8MiB)
Downloading pydantic-core
Downloading aiohttp
Downloading tokenizers
Downloading hf-xet
Downloading pillow
Downloading pygments
Downloading numpy
Downloading pandas
Downloading pyarrow
Downloading transformers
Prepared 61 packages in 1.24s
Installed 61 packages in 126ms
+ aiohappyeyeballs==2.6.1
+ aiohttp==3.12.15
+ aiosignal==1.4.0
+ annotated-types==0.7.0
+ anyio==4.10.0
+ attrs==25.3.0
+ certifi==2025.8.3
+ charset-normalizer==3.4.3
+ click==8.1.8
+ datasets==4.1.1
+ dill==0.4.0
+ filelock==3.19.1
+ frozenlist==1.7.0
+ fsspec==2025.9.0
+ ftfy==6.3.1
+ guidellm==0.3.0
+ h11==0.16.0
+ h2==4.3.0
+ hf-xet==1.1.10
+ hpack==4.1.0
+ httpcore==1.0.9
+ httpx==0.28.1
+ huggingface-hub==0.35.0
+ hyperframe==6.1.0
+ idna==3.10
+ loguru==0.7.3
+ markdown-it-py==4.0.0
+ mdurl==0.1.2
+ multidict==6.6.4
+ multiprocess==0.70.16
+ numpy==2.3.3
+ packaging==25.0
+ pandas==2.3.2
+ pillow==11.3.0
+ propcache==0.3.2
+ protobuf==6.32.1
+ pyarrow==21.0.0
+ pydantic==2.11.9
+ pydantic-core==2.33.2
+ pydantic-settings==2.10.1
+ pygments==2.19.2
+ python-dateutil==2.9.0.post0
+ python-dotenv==1.1.1
+ pytz==2025.2
+ pyyaml==6.0.2
+ regex==2025.9.18
+ requests==2.32.5
+ rich==14.1.0
+ safetensors==0.6.2
+ six==1.17.0
+ sniffio==1.3.1
+ tokenizers==0.22.1
+ tqdm==4.67.1
+ transformers==4.56.2
+ typing-extensions==4.15.0
+ typing-inspection==0.4.1
+ tzdata==2025.2
+ urllib3==2.5.0
+ wcwidth==0.2.14
+ xxhash==3.5.0
+ yarl==1.20.1
Using Python 3.11.13 environment at: /usr/local
Audited 1 package in 3ms
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
Creating backend...
Backend openai_http connected to http://llama-stack-benchmark-service:8323/v1/openai for model meta-llama/Llama-3.2-3B-Instruct.
Creating request loader...
Created loader with 1000 unique requests from prompt_tokens=512,output_tokens=256.
╭─ Benchmarks ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ [17:45:18] ⠋ 100% concurrent@1 (complete) Req: 0.3 req/s, 3.42s Lat, 1.0 Conc, 17 Comp, 1 Inc, 0 Err │
│ Tok: 73.9 gen/s, 233.7 tot/s, 50.2ms TTFT, 13.4ms ITL, 547 Prompt, 253 Gen │
│ [17:46:23] ⠋ 100% concurrent@2 (complete) Req: 0.6 req/s, 3.42s Lat, 2.0 Conc, 34 Comp, 2 Inc, 0 Err │
│ Tok: 134.7 gen/s, 447.4 tot/s, 50.8ms TTFT, 14.3ms ITL, 546 Prompt, 235 Gen │
│ [17:47:28] ⠋ 100% concurrent@4 (complete) Req: 1.1 req/s, 3.55s Lat, 3.9 Conc, 66 Comp, 4 Inc, 0 Err │
│ Tok: 268.7 gen/s, 873.1 tot/s, 54.9ms TTFT, 14.4ms ITL, 547 Prompt, 243 Gen │
│ [17:48:33] ⠋ 100% concurrent@8 (complete) Req: 2.2 req/s, 3.56s Lat, 7.8 Conc, 130 Comp, 8 Inc, 0 Err │
│ Tok: 526.1 gen/s, 1728.4 tot/s, 60.6ms TTFT, 14.7ms ITL, 547 Prompt, 239 Gen │
│ [17:49:38] ⠋ 100% concurrent@16 (complete) Req: 4.1 req/s, 3.79s Lat, 15.7 Conc, 246 Comp, 16 Inc, 0 Err │
│ Tok: 1006.9 gen/s, 3268.6 tot/s, 74.8ms TTFT, 15.3ms ITL, 547 Prompt, 243 Gen │
│ [17:50:44] ⠋ 100% concurrent@32 (complete) Req: 7.8 req/s, 3.95s Lat, 30.9 Conc, 467 Comp, 32 Inc, 0 Err │
│ Tok: 1912.0 gen/s, 6191.6 tot/s, 119.1ms TTFT, 15.7ms ITL, 547 Prompt, 244 Gen │
│ [17:51:50] ⠋ 100% concurrent@64 (complete) Req: 13.0 req/s, 4.75s Lat, 61.8 Conc, 776 Comp, 64 Inc, 0 Err │
│ Tok: 3154.3 gen/s, 10273.3 tot/s, 339.1ms TTFT, 18.3ms ITL, 547 Prompt, 242 Gen │
│ [17:52:58] ⠋ 100% concurrent@128 (complete) Req: 15.1 req/s, 7.82s Lat, 117.7 Conc, 898 Comp, 127 Inc, 0 Err │
│ Tok: 3617.4 gen/s, 11843.9 tot/s, 1393.8ms TTFT, 26.8ms ITL, 547 Prompt, 240 Gen │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Generating... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (8/8) [ 0:08:41 < 0:00:00 ]
Benchmarks Metadata:
Run id:f73d408e-256a-4c32-aa40-05e8d7098b66
Duration:529.2 seconds
Profile:type=concurrent, strategies=['concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent'], streams=[1, 2, 4, 8, 16, 32, 64, 128]
Args:max_number=None, max_duration=60.0, warmup_number=None, warmup_duration=3.0, cooldown_number=None, cooldown_duration=None
Worker:type_='generative_requests_worker' backend_type='openai_http' backend_target='http://llama-stack-benchmark-service:8323/v1/openai' backend_model='meta-llama/Llama-3.2-3B-Instruct'
backend_info={'max_output_tokens': 16384, 'timeout': 300, 'http2': True, 'follow_redirects': True, 'headers': {}, 'text_completions_path': '/v1/completions', 'chat_completions_path':
'/v1/chat/completions'}
Request Loader:type_='generative_request_loader' data='prompt_tokens=512,output_tokens=256' data_args=None processor='meta-llama/Llama-3.2-3B-Instruct' processor_args=None
Extras:None
Benchmarks Info:
=====================================================================================================================================================
Metadata |||| Requests Made ||| Prompt Tok/Req ||| Output Tok/Req ||| Prompt Tok Total||| Output Tok Total ||
Benchmark| Start Time| End Time| Duration (s)| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err
--------------|-----------|---------|-------------|------|-----|-----|------|------|----|------|------|----|-------|------|----|--------|------|-----
concurrent@1| 17:45:23| 17:46:23| 60.0| 17| 1| 0| 546.6| 512.0| 0.0| 252.8| 136.0| 0.0| 9292| 512| 0| 4298| 136| 0
concurrent@2| 17:46:28| 17:47:28| 60.0| 34| 2| 0| 546.4| 512.0| 0.0| 235.4| 130.0| 0.0| 18577| 1024| 0| 8003| 260| 0
concurrent@4| 17:47:33| 17:48:33| 60.0| 66| 4| 0| 546.5| 512.0| 0.0| 243.0| 97.5| 0.0| 36072| 2048| 0| 16035| 390| 0
concurrent@8| 17:48:38| 17:49:38| 60.0| 130| 8| 0| 546.6| 512.0| 0.0| 239.2| 146.0| 0.0| 71052| 4096| 0| 31090| 1168| 0
concurrent@16| 17:49:43| 17:50:43| 60.0| 246| 16| 0| 546.6| 512.0| 0.0| 243.3| 112.3| 0.0| 134456| 8192| 0| 59862| 1797| 0
concurrent@32| 17:50:49| 17:51:49| 60.0| 467| 32| 0| 546.6| 512.0| 0.0| 244.2| 147.3| 0.0| 255242| 16384| 0| 114038| 4714| 0
concurrent@64| 17:51:55| 17:52:55| 60.0| 776| 64| 0| 546.5| 512.0| 0.0| 242.2| 106.1| 0.0| 424115| 32768| 0| 187916| 6788| 0
concurrent@128| 17:53:03| 17:54:03| 60.0| 898| 127| 0| 546.5| 512.0| 0.0| 240.3| 69.8| 0.0| 490789| 65024| 0| 215810| 8864| 0
=====================================================================================================================================================
Benchmarks Stats:
======================================================================================================================================================
Metadata | Request Stats || Out Tok/sec| Tot Tok/sec| Req Latency (sec)||| TTFT (ms) ||| ITL (ms) ||| TPOT (ms) ||
Benchmark| Per Second| Concurrency| mean| mean| mean| median| p99| mean| median| p99| mean| median| p99| mean| median| p99
--------------|-----------|------------|------------|------------|-----|-------|------|-------|-------|-------|-----|-------|-----|-----|-------|-----
concurrent@1| 0.29| 1.00| 73.9| 233.7| 3.42| 3.45| 3.50| 50.2| 50.9| 62.5| 13.4| 13.4| 13.5| 13.3| 13.3| 13.5
concurrent@2| 0.57| 1.96| 134.7| 447.4| 3.42| 3.67| 4.12| 50.8| 49.2| 79.8| 14.3| 14.2| 15.9| 14.3| 14.2| 15.9
concurrent@4| 1.11| 3.92| 268.7| 873.1| 3.55| 3.72| 3.80| 54.9| 51.7| 101.3| 14.4| 14.4| 14.5| 14.4| 14.4| 14.5
concurrent@8| 2.20| 7.82| 526.1| 1728.4| 3.56| 3.78| 3.93| 60.6| 49.8| 189.5| 14.7| 14.7| 14.8| 14.6| 14.6| 14.8
concurrent@16| 4.14| 15.66| 1006.9| 3268.6| 3.79| 3.94| 4.25| 74.8| 54.3| 328.4| 15.3| 15.3| 16.1| 15.2| 15.2| 16.0
concurrent@32| 7.83| 30.91| 1912.0| 6191.6| 3.95| 4.07| 4.53| 119.1| 80.5| 674.0| 15.7| 15.6| 17.4| 15.7| 15.6| 17.3
concurrent@64| 13.03| 61.85| 3154.3| 10273.3| 4.75| 4.93| 5.43| 339.1| 321.1| 1146.6| 18.3| 18.4| 19.3| 18.2| 18.3| 19.2
concurrent@128| 15.05| 117.71| 3617.4| 11843.9| 7.82| 8.58| 13.35| 1393.8| 1453.0| 5232.2| 26.8| 26.7| 36.0| 26.7| 26.6| 35.9
======================================================================================================================================================
Saving benchmarks report...
Benchmarks report saved to /benchmarks.json
Benchmarking complete.


@@ -0,0 +1,171 @@
Collecting uv
Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 156.8 MB/s eta 0:00:00
Installing collected packages: uv
Successfully installed uv-0.8.19
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip is available: 24.0 -> 25.2
[notice] To update, run: pip install --upgrade pip
Using Python 3.11.13 environment at: /usr/local
Resolved 61 packages in 480ms
Downloading pillow (6.3MiB)
Downloading pydantic-core (1.9MiB)
Downloading pyarrow (40.8MiB)
Downloading aiohttp (1.7MiB)
Downloading numpy (16.2MiB)
Downloading pygments (1.2MiB)
Downloading transformers (11.1MiB)
Downloading pandas (11.8MiB)
Downloading tokenizers (3.1MiB)
Downloading hf-xet (3.0MiB)
Downloading pydantic-core
Downloading aiohttp
Downloading tokenizers
Downloading hf-xet
Downloading pygments
Downloading pillow
Downloading numpy
Downloading pandas
Downloading pyarrow
Downloading transformers
Prepared 61 packages in 1.25s
Installed 61 packages in 126ms
+ aiohappyeyeballs==2.6.1
+ aiohttp==3.12.15
+ aiosignal==1.4.0
+ annotated-types==0.7.0
+ anyio==4.10.0
+ attrs==25.3.0
+ certifi==2025.8.3
+ charset-normalizer==3.4.3
+ click==8.1.8
+ datasets==4.1.1
+ dill==0.4.0
+ filelock==3.19.1
+ frozenlist==1.7.0
+ fsspec==2025.9.0
+ ftfy==6.3.1
+ guidellm==0.3.0
+ h11==0.16.0
+ h2==4.3.0
+ hf-xet==1.1.10
+ hpack==4.1.0
+ httpcore==1.0.9
+ httpx==0.28.1
+ huggingface-hub==0.35.0
+ hyperframe==6.1.0
+ idna==3.10
+ loguru==0.7.3
+ markdown-it-py==4.0.0
+ mdurl==0.1.2
+ multidict==6.6.4
+ multiprocess==0.70.16
+ numpy==2.3.3
+ packaging==25.0
+ pandas==2.3.2
+ pillow==11.3.0
+ propcache==0.3.2
+ protobuf==6.32.1
+ pyarrow==21.0.0
+ pydantic==2.11.9
+ pydantic-core==2.33.2
+ pydantic-settings==2.10.1
+ pygments==2.19.2
+ python-dateutil==2.9.0.post0
+ python-dotenv==1.1.1
+ pytz==2025.2
+ pyyaml==6.0.2
+ regex==2025.9.18
+ requests==2.32.5
+ rich==14.1.0
+ safetensors==0.6.2
+ six==1.17.0
+ sniffio==1.3.1
+ tokenizers==0.22.1
+ tqdm==4.67.1
+ transformers==4.56.2
+ typing-extensions==4.15.0
+ typing-inspection==0.4.1
+ tzdata==2025.2
+ urllib3==2.5.0
+ wcwidth==0.2.14
+ xxhash==3.5.0
+ yarl==1.20.1
Using Python 3.11.13 environment at: /usr/local
Audited 1 package in 4ms
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
Creating backend...
Backend openai_http connected to http://llama-stack-benchmark-service:8323/v1/openai for model meta-llama/Llama-3.2-3B-Instruct.
Creating request loader...
Created loader with 1000 unique requests from prompt_tokens=512,output_tokens=256.
╭─ Benchmarks ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ [17:55:59] ⠋ 100% concurrent@1 (complete) Req: 0.3 req/s, 3.33s Lat, 1.0 Conc, 18 Comp, 1 Inc, 0 Err │
│ Tok: 74.0 gen/s, 238.0 tot/s, 49.6ms TTFT, 13.4ms ITL, 546 Prompt, 246 Gen │
│ [17:57:04] ⠋ 100% concurrent@2 (complete) Req: 0.6 req/s, 3.32s Lat, 1.9 Conc, 35 Comp, 2 Inc, 0 Err │
│ Tok: 137.1 gen/s, 457.5 tot/s, 50.6ms TTFT, 14.0ms ITL, 546 Prompt, 234 Gen │
│ [17:58:09] ⠋ 100% concurrent@4 (complete) Req: 1.2 req/s, 3.42s Lat, 4.0 Conc, 69 Comp, 4 Inc, 0 Err │
│ Tok: 276.7 gen/s, 907.2 tot/s, 52.7ms TTFT, 14.1ms ITL, 547 Prompt, 240 Gen │
│ [17:59:14] ⠋ 100% concurrent@8 (complete) Req: 2.3 req/s, 3.47s Lat, 7.8 Conc, 134 Comp, 8 Inc, 0 Err │
│ Tok: 541.4 gen/s, 1775.4 tot/s, 57.3ms TTFT, 14.3ms ITL, 547 Prompt, 240 Gen │
│ [18:00:19] ⠋ 100% concurrent@16 (complete) Req: 4.3 req/s, 3.60s Lat, 15.6 Conc, 259 Comp, 16 Inc, 0 Err │
│ Tok: 1034.8 gen/s, 3401.7 tot/s, 72.3ms TTFT, 14.8ms ITL, 547 Prompt, 239 Gen │
│ [18:01:25] ⠋ 100% concurrent@32 (complete) Req: 8.4 req/s, 3.69s Lat, 31.1 Conc, 505 Comp, 32 Inc, 0 Err │
│ Tok: 2029.7 gen/s, 6641.5 tot/s, 91.6ms TTFT, 15.0ms ITL, 547 Prompt, 241 Gen │
│ [18:02:31] ⠋ 100% concurrent@64 (complete) Req: 13.6 req/s, 4.50s Lat, 61.4 Conc, 818 Comp, 64 Inc, 0 Err │
│ Tok: 3333.9 gen/s, 10787.0 tot/s, 171.3ms TTFT, 17.8ms ITL, 547 Prompt, 244 Gen │
│ [18:03:40] ⠋ 100% concurrent@128 (complete) Req: 16.1 req/s, 7.43s Lat, 119.5 Conc, 964 Comp, 122 Inc, 0 Err │
│ Tok: 3897.0 gen/s, 12679.4 tot/s, 446.4ms TTFT, 28.9ms ITL, 547 Prompt, 243 Gen │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Generating... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (8/8) [ 0:08:41 < 0:00:00 ]
Benchmarks Metadata:
Run id:5393e64f-d9f8-4548-95d8-da320bba1c24
Duration:530.1 seconds
Profile:type=concurrent, strategies=['concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent'], streams=[1, 2, 4, 8, 16, 32, 64, 128]
Args:max_number=None, max_duration=60.0, warmup_number=None, warmup_duration=3.0, cooldown_number=None, cooldown_duration=None
Worker:type_='generative_requests_worker' backend_type='openai_http' backend_target='http://llama-stack-benchmark-service:8323/v1/openai' backend_model='meta-llama/Llama-3.2-3B-Instruct'
backend_info={'max_output_tokens': 16384, 'timeout': 300, 'http2': True, 'follow_redirects': True, 'headers': {}, 'text_completions_path': '/v1/completions', 'chat_completions_path':
'/v1/chat/completions'}
Request Loader:type_='generative_request_loader' data='prompt_tokens=512,output_tokens=256' data_args=None processor='meta-llama/Llama-3.2-3B-Instruct' processor_args=None
Extras:None
Benchmarks Info:
===================================================================================================================================================
Metadata |||| Requests Made ||| Prompt Tok/Req ||| Output Tok/Req ||| Prompt Tok Total||| Output Tok Total||
Benchmark| Start Time| End Time| Duration (s)| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err
--------------|-----------|---------|-------------|------|-----|-----|------|------|----|------|------|----|-------|------|----|-------|------|----
concurrent@1| 17:56:04| 17:57:04| 60.0| 18| 1| 0| 546.4| 512.0| 0.0| 246.4| 256.0| 0.0| 9836| 512| 0| 4436| 256| 0
concurrent@2| 17:57:09| 17:58:09| 60.0| 35| 2| 0| 546.4| 512.0| 0.0| 233.9| 132.0| 0.0| 19124| 1024| 0| 8188| 264| 0
concurrent@4| 17:58:14| 17:59:14| 60.0| 69| 4| 0| 546.6| 512.0| 0.0| 239.9| 60.5| 0.0| 37715| 2048| 0| 16553| 242| 0
concurrent@8| 17:59:19| 18:00:19| 60.0| 134| 8| 0| 546.6| 512.0| 0.0| 239.8| 126.6| 0.0| 73243| 4096| 0| 32135| 1013| 0
concurrent@16| 18:00:24| 18:01:24| 60.0| 259| 16| 0| 546.6| 512.0| 0.0| 239.0| 115.7| 0.0| 141561| 8192| 0| 61889| 1851| 0
concurrent@32| 18:01:30| 18:02:30| 60.0| 505| 32| 0| 546.5| 512.0| 0.0| 240.5| 113.2| 0.0| 275988| 16384| 0| 121466| 3623| 0
concurrent@64| 18:02:37| 18:03:37| 60.0| 818| 64| 0| 546.6| 512.0| 0.0| 244.5| 132.4| 0.0| 447087| 32768| 0| 199988| 8475| 0
concurrent@128| 18:03:45| 18:04:45| 60.0| 964| 122| 0| 546.5| 512.0| 0.0| 242.5| 133.1| 0.0| 526866| 62464| 0| 233789| 16241| 0
===================================================================================================================================================
Benchmarks Stats:
=======================================================================================================================================================
Metadata | Request Stats || Out Tok/sec| Tot Tok/sec| Req Latency (sec) ||| TTFT (ms) ||| ITL (ms) ||| TPOT (ms) ||
Benchmark| Per Second| Concurrency| mean| mean| mean| median| p99| mean| median| p99| mean| median| p99| mean| median| p99
--------------|-----------|------------|------------|------------|------|--------|------|------|-------|-------|-----|-------|-----|-----|-------|-----
concurrent@1| 0.30| 1.00| 74.0| 238.0| 3.33| 3.44| 3.63| 49.6| 47.2| 66.1| 13.4| 13.3| 14.0| 13.3| 13.3| 14.0
concurrent@2| 0.59| 1.95| 137.1| 457.5| 3.32| 3.61| 3.67| 50.6| 48.6| 80.4| 14.0| 14.0| 14.2| 13.9| 13.9| 14.1
concurrent@4| 1.15| 3.95| 276.7| 907.2| 3.42| 3.61| 3.77| 52.7| 49.7| 106.9| 14.1| 14.0| 14.6| 14.0| 13.9| 14.5
concurrent@8| 2.26| 7.83| 541.4| 1775.4| 3.47| 3.70| 3.79| 57.3| 50.9| 171.3| 14.3| 14.3| 14.4| 14.2| 14.2| 14.4
concurrent@16| 4.33| 15.57| 1034.8| 3401.7| 3.60| 3.81| 4.22| 72.3| 52.0| 292.9| 14.8| 14.7| 16.3| 14.7| 14.7| 16.3
concurrent@32| 8.44| 31.12| 2029.7| 6641.5| 3.69| 3.89| 4.24| 91.6| 62.6| 504.6| 15.0| 15.0| 15.4| 14.9| 14.9| 15.4
concurrent@64| 13.64| 61.40| 3333.9| 10787.0| 4.50| 4.61| 5.67| 171.3| 101.2| 1165.6| 17.8| 17.7| 19.2| 17.7| 17.6| 19.1
concurrent@128| 16.07| 119.45| 3897.0| 12679.4| 7.43| 7.63| 9.74| 446.4| 195.8| 2533.1| 28.9| 28.9| 31.0| 28.8| 28.8| 30.9
=======================================================================================================================================================
Saving benchmarks report...
Benchmarks report saved to /benchmarks.json
Benchmarking complete.


@@ -0,0 +1,170 @@
Collecting uv
Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 126.9 MB/s eta 0:00:00
Installing collected packages: uv
Successfully installed uv-0.8.19
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip is available: 24.0 -> 25.2
[notice] To update, run: pip install --upgrade pip
Using Python 3.11.13 environment at: /usr/local
Resolved 61 packages in 561ms
Downloading hf-xet (3.0MiB)
Downloading pillow (6.3MiB)
Downloading transformers (11.1MiB)
Downloading pyarrow (40.8MiB)
Downloading numpy (16.2MiB)
Downloading pandas (11.8MiB)
Downloading tokenizers (3.1MiB)
Downloading pydantic-core (1.9MiB)
Downloading pygments (1.2MiB)
Downloading aiohttp (1.7MiB)
Downloading pydantic-core
Downloading aiohttp
Downloading tokenizers
Downloading hf-xet
Downloading pygments
Downloading pillow
Downloading numpy
Downloading pandas
Downloading transformers
Downloading pyarrow
Prepared 61 packages in 1.25s
Installed 61 packages in 114ms
+ aiohappyeyeballs==2.6.1
+ aiohttp==3.12.15
+ aiosignal==1.4.0
+ annotated-types==0.7.0
+ anyio==4.10.0
+ attrs==25.3.0
+ certifi==2025.8.3
+ charset-normalizer==3.4.3
+ click==8.1.8
+ datasets==4.1.1
+ dill==0.4.0
+ filelock==3.19.1
+ frozenlist==1.7.0
+ fsspec==2025.9.0
+ ftfy==6.3.1
+ guidellm==0.3.0
+ h11==0.16.0
+ h2==4.3.0
+ hf-xet==1.1.10
+ hpack==4.1.0
+ httpcore==1.0.9
+ httpx==0.28.1
+ huggingface-hub==0.35.0
+ hyperframe==6.1.0
+ idna==3.10
+ loguru==0.7.3
+ markdown-it-py==4.0.0
+ mdurl==0.1.2
+ multidict==6.6.4
+ multiprocess==0.70.16
+ numpy==2.3.3
+ packaging==25.0
+ pandas==2.3.2
+ pillow==11.3.0
+ propcache==0.3.2
+ protobuf==6.32.1
+ pyarrow==21.0.0
+ pydantic==2.11.9
+ pydantic-core==2.33.2
+ pydantic-settings==2.10.1
+ pygments==2.19.2
+ python-dateutil==2.9.0.post0
+ python-dotenv==1.1.1
+ pytz==2025.2
+ pyyaml==6.0.2
+ regex==2025.9.18
+ requests==2.32.5
+ rich==14.1.0
+ safetensors==0.6.2
+ six==1.17.0
+ sniffio==1.3.1
+ tokenizers==0.22.1
+ tqdm==4.67.1
+ transformers==4.56.2
+ typing-extensions==4.15.0
+ typing-inspection==0.4.1
+ tzdata==2025.2
+ urllib3==2.5.0
+ wcwidth==0.2.14
+ xxhash==3.5.0
+ yarl==1.20.1
Using Python 3.11.13 environment at: /usr/local
Audited 1 package in 3ms
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
Creating backend...
Backend openai_http connected to http://vllm-server:8000 for model meta-llama/Llama-3.2-3B-Instruct.
Creating request loader...
Created loader with 1000 unique requests from prompt_tokens=512,output_tokens=256.
╭─ Benchmarks ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ [18:11:47] ⠋ 100% concurrent@1 (complete) Req: 0.3 req/s, 3.35s Lat, 1.0 Conc, 17 Comp, 1 Inc, 0 Err │
│ Tok: 76.4 gen/s, 239.4 tot/s, 29.6ms TTFT, 13.0ms ITL, 547 Prompt, 256 Gen │
│ [18:12:52] ⠋ 100% concurrent@2 (complete) Req: 0.6 req/s, 3.53s Lat, 2.0 Conc, 32 Comp, 2 Inc, 0 Err │
│ Tok: 145.0 gen/s, 454.5 tot/s, 36.9ms TTFT, 13.7ms ITL, 546 Prompt, 256 Gen │
│ [18:13:57] ⠋ 100% concurrent@4 (complete) Req: 1.1 req/s, 3.59s Lat, 4.0 Conc, 64 Comp, 4 Inc, 0 Err │
│ Tok: 284.8 gen/s, 892.7 tot/s, 59.0ms TTFT, 13.9ms ITL, 546 Prompt, 256 Gen │
│ [18:15:02] ⠋ 100% concurrent@8 (complete) Req: 2.2 req/s, 3.70s Lat, 8.0 Conc, 128 Comp, 7 Inc, 0 Err │
│ Tok: 553.5 gen/s, 1735.2 tot/s, 79.8ms TTFT, 14.2ms ITL, 547 Prompt, 256 Gen │
│ [18:16:08] ⠋ 100% concurrent@16 (complete) Req: 4.2 req/s, 3.83s Lat, 16.0 Conc, 240 Comp, 16 Inc, 0 Err │
│ Tok: 1066.9 gen/s, 3344.6 tot/s, 97.5ms TTFT, 14.6ms ITL, 547 Prompt, 256 Gen │
│ [18:17:13] ⠋ 100% concurrent@32 (complete) Req: 8.1 req/s, 3.94s Lat, 31.8 Conc, 480 Comp, 31 Inc, 0 Err │
│ Tok: 2069.7 gen/s, 6488.4 tot/s, 120.8ms TTFT, 15.0ms ITL, 547 Prompt, 256 Gen │
│ [18:18:20] ⠋ 100% concurrent@64 (complete) Req: 13.6 req/s, 4.60s Lat, 62.3 Conc, 813 Comp, 57 Inc, 0 Err │
│ Tok: 3472.1 gen/s, 10884.9 tot/s, 190.9ms TTFT, 17.3ms ITL, 547 Prompt, 256 Gen │
│ [18:19:28] ⠋ 100% concurrent@128 (complete) Req: 16.8 req/s, 7.37s Lat, 123.5 Conc, 1005 Comp, 126 Inc, 0 Err │
│ Tok: 4289.1 gen/s, 13445.8 tot/s, 356.4ms TTFT, 27.5ms ITL, 547 Prompt, 256 Gen │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Generating... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (8/8) [ 0:08:43 < 0:00:00 ]
Benchmarks Metadata:
Run id:8ccb6da1-83f4-4624-8d84-07c723b0b2a5
Duration:530.4 seconds
Profile:type=concurrent, strategies=['concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent', 'concurrent'], streams=[1, 2, 4, 8, 16, 32, 64, 128]
Args:max_number=None, max_duration=60.0, warmup_number=None, warmup_duration=3.0, cooldown_number=None, cooldown_duration=None
Worker:type_='generative_requests_worker' backend_type='openai_http' backend_target='http://vllm-server:8000' backend_model='meta-llama/Llama-3.2-3B-Instruct' backend_info={'max_output_tokens':
16384, 'timeout': 300, 'http2': True, 'follow_redirects': True, 'headers': {}, 'text_completions_path': '/v1/completions', 'chat_completions_path': '/v1/chat/completions'}
Request Loader:type_='generative_request_loader' data='prompt_tokens=512,output_tokens=256' data_args=None processor='meta-llama/Llama-3.2-3B-Instruct' processor_args=None
Extras:None
Benchmarks Info:
=====================================================================================================================================================
Metadata |||| Requests Made ||| Prompt Tok/Req ||| Output Tok/Req ||| Prompt Tok Total||| Output Tok Total ||
Benchmark| Start Time| End Time| Duration (s)| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err
--------------|-----------|---------|-------------|------|-----|-----|------|------|----|------|------|----|-------|------|----|--------|------|-----
concurrent@1| 18:11:52| 18:12:52| 60.0| 17| 1| 0| 546.5| 512.0| 0.0| 256.0| 231.0| 0.0| 9291| 512| 0| 4352| 231| 0
concurrent@2| 18:12:57| 18:13:57| 60.0| 32| 2| 0| 546.5| 512.0| 0.0| 256.0| 251.0| 0.0| 17488| 1024| 0| 8192| 502| 0
concurrent@4| 18:14:02| 18:15:02| 60.0| 64| 4| 0| 546.4| 512.0| 0.0| 256.0| 175.2| 0.0| 34972| 2048| 0| 16384| 701| 0
concurrent@8| 18:15:07| 18:16:07| 60.0| 128| 7| 0| 546.6| 512.0| 0.0| 256.0| 50.7| 0.0| 69966| 3584| 0| 32768| 355| 0
concurrent@16| 18:16:13| 18:17:13| 60.0| 240| 16| 0| 546.5| 512.0| 0.0| 256.0| 166.0| 0.0| 131170| 8192| 0| 61440| 2656| 0
concurrent@32| 18:17:18| 18:18:18| 60.0| 480| 31| 0| 546.5| 512.0| 0.0| 256.0| 47.4| 0.0| 262339| 15872| 0| 122880| 1468| 0
concurrent@64| 18:18:25| 18:19:25| 60.0| 813| 57| 0| 546.5| 512.0| 0.0| 256.0| 110.7| 0.0| 444341| 29184| 0| 208128| 6311| 0
concurrent@128| 18:19:33| 18:20:33| 60.0| 1005| 126| 0| 546.5| 512.0| 0.0| 256.0| 65.8| 0.0| 549264| 64512| 0| 257280| 8296| 0
=====================================================================================================================================================
Benchmarks Stats:
=======================================================================================================================================================
Metadata | Request Stats || Out Tok/sec| Tot Tok/sec| Req Latency (sec) ||| TTFT (ms) ||| ITL (ms) ||| TPOT (ms) ||
Benchmark| Per Second| Concurrency| mean| mean| mean| median| p99| mean| median| p99| mean| median| p99| mean| median| p99
--------------|-----------|------------|------------|------------|------|--------|------|------|-------|-------|-----|-------|-----|-----|-------|-----
concurrent@1| 0.30| 1.00| 76.4| 239.4| 3.35| 3.35| 3.38| 29.6| 29.0| 38.9| 13.0| 13.0| 13.1| 13.0| 13.0| 13.0
concurrent@2| 0.57| 2.00| 145.0| 454.5| 3.53| 3.53| 3.55| 36.9| 39.0| 59.6| 13.7| 13.7| 13.8| 13.6| 13.7| 13.7
concurrent@4| 1.11| 4.00| 284.8| 892.7| 3.59| 3.59| 3.65| 59.0| 65.7| 88.2| 13.9| 13.8| 14.1| 13.8| 13.8| 14.0
concurrent@8| 2.16| 7.99| 553.5| 1735.2| 3.70| 3.69| 3.76| 79.8| 80.7| 152.6| 14.2| 14.2| 14.5| 14.1| 14.1| 14.4
concurrent@16| 4.17| 15.97| 1066.9| 3344.6| 3.83| 3.82| 3.99| 97.5| 96.3| 283.9| 14.6| 14.6| 14.9| 14.6| 14.6| 14.8
concurrent@32| 8.08| 31.84| 2069.7| 6488.4| 3.94| 3.90| 4.31| 120.8| 101.7| 564.3| 15.0| 14.9| 15.9| 14.9| 14.8| 15.9
concurrent@64| 13.56| 62.34| 3472.1| 10884.9| 4.60| 4.54| 5.43| 190.9| 133.9| 1113.2| 17.3| 17.2| 18.2| 17.2| 17.2| 18.2
concurrent@128| 16.75| 123.45| 4289.1| 13445.8| 7.37| 7.21| 9.21| 356.4| 161.9| 2319.9| 27.5| 27.5| 28.8| 27.4| 27.4| 28.7
=======================================================================================================================================================
Saving benchmarks report...
Benchmarks report saved to /benchmarks.json
Benchmarking complete.
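As a rough cross-check of the stats table, total token throughput should track requests per second times the (prompt + output) tokens per completed request reported above. A minimal sketch in Python, with the numbers transcribed from a few rows (small gaps come from incomplete requests and rounding):

```python
# Sanity check: Tot Tok/sec ≈ RPS * (prompt + output tokens per completed request).
rows = {
    # benchmark: (rps, prompt_tok_per_req, output_tok_per_req, reported_tot_tok_per_sec)
    "concurrent@1": (0.30, 546.5, 256.0, 239.4),
    "concurrent@32": (8.08, 546.5, 256.0, 6488.4),
    "concurrent@128": (16.75, 546.5, 256.0, 13445.8),
}

for name, (rps, prompt_tok, output_tok, reported) in rows.items():
    estimate = rps * (prompt_tok + output_tok)
    print(f"{name}: estimated {estimate:.0f} tok/s vs reported {reported} tok/s")
```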

Binary file not shown (new image, 562 KiB)

View file

@@ -1,148 +0,0 @@
#!/usr/bin/env bash
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
set -euo pipefail
# Default values
TARGET="stack"
DURATION=60
CONCURRENT=10
# Parse command line arguments
usage() {
echo "Usage: $0 [options]"
echo "Options:"
echo " -t, --target <stack|vllm> Target to benchmark (default: stack)"
echo " -d, --duration <seconds> Duration in seconds (default: 60)"
echo " -c, --concurrent <users> Number of concurrent users (default: 10)"
echo " -h, --help Show this help message"
echo ""
echo "Examples:"
echo " $0 --target vllm # Benchmark vLLM direct"
echo " $0 --target stack # Benchmark Llama Stack (default)"
echo " $0 -t vllm -d 120 -c 20 # vLLM with 120s duration, 20 users"
}
while [[ $# -gt 0 ]]; do
case $1 in
-t|--target)
TARGET="$2"
shift 2
;;
-d|--duration)
DURATION="$2"
shift 2
;;
-c|--concurrent)
CONCURRENT="$2"
shift 2
;;
-h|--help)
usage
exit 0
;;
*)
echo "Unknown option: $1"
usage
exit 1
;;
esac
done
# Validate target
if [[ "$TARGET" != "stack" && "$TARGET" != "vllm" ]]; then
echo "Error: Target must be 'stack' or 'vllm'"
usage
exit 1
fi
# Set configuration based on target
if [[ "$TARGET" == "vllm" ]]; then
BASE_URL="http://vllm-server:8000/v1"
JOB_NAME="vllm-benchmark-job"
echo "Benchmarking vLLM direct..."
else
BASE_URL="http://llama-stack-benchmark-service:8323/v1/openai/v1"
JOB_NAME="stack-benchmark-job"
echo "Benchmarking Llama Stack..."
fi
echo "Configuration:"
echo " Target: $TARGET"
echo " Base URL: $BASE_URL"
echo " Duration: ${DURATION}s"
echo " Concurrent users: $CONCURRENT"
echo ""
# Create temporary job yaml
TEMP_YAML="/tmp/benchmark-job-temp-$(date +%s).yaml"
cat > "$TEMP_YAML" << EOF
apiVersion: batch/v1
kind: Job
metadata:
name: $JOB_NAME
namespace: default
spec:
template:
spec:
containers:
- name: benchmark
image: python:3.11-slim
command: ["/bin/bash"]
args:
- "-c"
- |
pip install aiohttp &&
python3 /benchmark/benchmark.py \\
--base-url $BASE_URL \\
--model \${INFERENCE_MODEL} \\
--duration $DURATION \\
--concurrent $CONCURRENT
env:
- name: INFERENCE_MODEL
value: "meta-llama/Llama-3.2-3B-Instruct"
volumeMounts:
- name: benchmark-script
mountPath: /benchmark
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
volumes:
- name: benchmark-script
configMap:
name: benchmark-script
restartPolicy: Never
backoffLimit: 3
EOF
echo "Creating benchmark ConfigMap..."
kubectl create configmap benchmark-script \
--from-file=benchmark.py=benchmark.py \
--dry-run=client -o yaml | kubectl apply -f -
echo "Cleaning up any existing benchmark job..."
kubectl delete job $JOB_NAME 2>/dev/null || true
echo "Deploying benchmark Job..."
kubectl apply -f "$TEMP_YAML"
echo "Waiting for job to start..."
kubectl wait --for=condition=Ready pod -l job-name=$JOB_NAME --timeout=60s
echo "Following benchmark logs..."
kubectl logs -f job/$JOB_NAME
echo "Job completed. Checking final status..."
kubectl get job $JOB_NAME
# Clean up temporary file
rm -f "$TEMP_YAML"

View file

@@ -0,0 +1,294 @@
#!/usr/bin/env python3
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
# /// script
# dependencies = [
# "matplotlib",
# ]
# ///
"""
Script to generate benchmark charts from guidellm text results.
Creates 2x2 grid charts with RPS, Request Latency, TTFT, and ITL metrics against concurrent@x values.
Outputs one chart file per vLLM replica group, with each line representing one benchmark run.
"""
import glob
import os
import re
import matplotlib.pyplot as plt
def extract_setup_name(filename: str) -> str:
"""Extract setup name from filename and format legend appropriately."""
basename = os.path.basename(filename)
# Try new pattern: guidellm-benchmark-stack-s{stack_replicas}-sw{workers}-v{vllm_replicas}-{timestamp}.txt
match = re.search(r"guidellm-benchmark-stack-s(\d+)-sw(\d+)-v(\d+)-(\d{8})-(\d{6})\.txt", basename)
if match:
stack_replicas = match.group(1)
workers = match.group(2)
vllm_replicas = match.group(3)
date = match.group(4)
time = match.group(5)
return f"stack-s{stack_replicas}-sw{workers}-v{vllm_replicas}"
# Try new vLLM pattern: guidellm-benchmark-vllm-v{vllm_replicas}-{timestamp}.txt
match = re.search(r"guidellm-benchmark-vllm-v(\d+)-(\d{8})-(\d{6})\.txt", basename)
if match:
vllm_replicas = match.group(1)
date = match.group(2)
time = match.group(3)
return f"vllm-v{vllm_replicas}"
# Fall back to old pattern: guidellm-benchmark-{target}-{stack_replicas}-w{workers}-{vllm_replicas}-{timestamp}.txt
match = re.search(r"guidellm-benchmark-([^-]+)-(\d+)-w(\d+)-(\d+)-(\d+)-(\d+)\.txt", basename)
if match:
target = match.group(1)
stack_replicas = match.group(2)
workers = match.group(3)
vllm_replicas = match.group(4)
date = match.group(5)
time = match.group(6)
if target == "vllm":
return f"vllm-{vllm_replicas}-w{workers}-{vllm_replicas}"
else:
return f"stack-replicas{stack_replicas}-w{workers}-vllm-replicas{vllm_replicas}-{date}-{time}"
# Fall back to older pattern: guidellm-benchmark-{target}-{stack_replicas}-{vllm_replicas}-{timestamp}.txt
match = re.search(r"guidellm-benchmark-([^-]+)-(\d+)-(\d+)-(\d+)-(\d+)\.txt", basename)
if match:
target = match.group(1)
stack_replicas = match.group(2)
vllm_replicas = match.group(3)
date = match.group(4)
time = match.group(5)
if target == "vllm":
return f"vllm-{vllm_replicas}-w1-{vllm_replicas}"
else:
return f"stack-replicas{stack_replicas}-vllm-replicas{vllm_replicas}-{date}-{time}"
return basename.replace("guidellm-benchmark-", "").replace(".txt", "")
def parse_txt_file(filepath: str) -> list[tuple[float, float, float, float, float, str]]:
"""
Parse a text benchmark file and extract concurrent@x, RPS, TTFT, ITL, and request latency data.
Returns list of (concurrency, rps_mean, ttft_mean, itl_mean, req_latency_mean, setup_name) tuples.
"""
setup_name = extract_setup_name(filepath)
data_points = []
try:
with open(filepath) as f:
content = f.read()
# Find the benchmark stats table
lines = content.split("\n")
in_stats_table = False
header_lines_seen = 0
for line in lines:
line_stripped = line.strip()
# Look for the start of the stats table
if "Benchmarks Stats:" in line:
in_stats_table = True
continue
if in_stats_table:
# Skip the first few separator/header lines
if line_stripped.startswith("=") or line_stripped.startswith("-"):
header_lines_seen += 1
if header_lines_seen >= 3: # After seeing multiple header lines, look for concurrent@ data
if line_stripped.startswith("=") and "concurrent@" not in line_stripped:
break
continue
# Parse concurrent@ lines in the stats table (may have leading spaces)
if in_stats_table and "concurrent@" in line:
parts = [part.strip() for part in line.split("|")]
if len(parts) >= 12: # Make sure we have enough columns for new format
try:
# Extract concurrency from benchmark name (e.g., concurrent@1 -> 1)
concurrent_match = re.search(r"concurrent@(\d+)", parts[0])
if not concurrent_match:
continue
concurrency = float(concurrent_match.group(1))
# Extract metrics from the new table format
# The stats table columns are '|' separated:
# Benchmark | Per Second | Concurrency | Out Tok/sec | Tot Tok/sec | Req Latency (sec) | TTFT (ms) | ITL (ms) | TPOT (ms)
# Req Latency, TTFT, ITL and TPOT each expand to mean | median | p99 sub-columns,
# so the mean values sit at indices 5, 8 and 11.
rps_mean = float(parts[1])  # Per Second (RPS)
req_latency_mean = float(parts[5]) * 1000  # Req Latency mean (convert from sec to ms)
ttft_mean = float(parts[8])  # TTFT mean column (ms)
itl_mean = float(parts[11])  # ITL mean column (ms)
data_points.append((concurrency, rps_mean, ttft_mean, itl_mean, req_latency_mean, setup_name))
except (ValueError, IndexError) as e:
print(f"Warning: Could not parse line '{line}' in {filepath}: {e}")
continue
except (OSError, FileNotFoundError) as e:
print(f"Error reading {filepath}: {e}")
return data_points
def generate_charts(benchmark_dir: str = "results"):
"""Generate 2x2 grid charts (RPS, Request Latency, TTFT, ITL) from benchmark text files."""
# Find all text result files instead of JSON
txt_pattern = os.path.join(benchmark_dir, "guidellm-benchmark-*.txt")
txt_files = glob.glob(txt_pattern)
if not txt_files:
print(f"No text files found matching pattern: {txt_pattern}")
return
print(f"Found {len(txt_files)} text files")
# Parse all files and collect data
all_data = {} # setup_name -> [(concurrency, rps, ttft, itl, req_latency), ...]
for txt_file in txt_files:
print(f"Processing {txt_file}")
data_points = parse_txt_file(txt_file)
for concurrency, rps, ttft, itl, req_latency, setup_name in data_points:
if setup_name not in all_data:
all_data[setup_name] = []
all_data[setup_name].append((concurrency, rps, ttft, itl, req_latency))
if not all_data:
print("No data found to plot")
return
# Sort data points by concurrency for each setup
for setup_name in all_data:
all_data[setup_name].sort(key=lambda x: x[0]) # Sort by concurrency
# Group setups by vLLM replica number (original approach)
replica_groups = {} # vllm_replica_count -> {setup_name: points}
for setup_name, points in all_data.items():
# Extract vLLM replica number from setup name
# Expected formats:
# - New stack format: "stack-s{X}-sw{W}-v{Y}"
# - New vLLM format: "vllm-v{Y}"
# - Old formats: "stack-replicas{X}-w{W}-vllm-replicas{Y}" or "vllm-{Y}-w{W}-{Y}"
# Try new formats first
vllm_match = re.search(r"-v(\d+)$", setup_name) # Matches both "stack-s1-sw2-v3" and "vllm-v1"
if not vllm_match:
# Try old stack format
vllm_match = re.search(r"vllm-replicas(\d+)", setup_name)
if not vllm_match:
# Try old vLLM format: "vllm-{Y}-w{W}-{Y}"
vllm_match = re.search(r"vllm-(\d+)-w\d+-\d+", setup_name)
if vllm_match:
vllm_replica_num = int(vllm_match.group(1))
if vllm_replica_num not in replica_groups:
replica_groups[vllm_replica_num] = {}
replica_groups[vllm_replica_num][setup_name] = points
else:
print(f"Warning: Could not extract vLLM replica count from setup name: {setup_name}")
def create_charts(data_dict, prefix, title_prefix):
"""Create a 2x2 grid with RPS, Request Latency, TTFT, and ITL charts."""
if not data_dict:
print(f"No data found for {prefix}")
return
# Create 2x2 subplot grid
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle(f"{title_prefix} Benchmark Results", fontsize=16, fontweight="bold")
# Collect all unique concurrency values for tick setting
all_concurrency_values = set()
for points in data_dict.values():
all_concurrency_values.update([p[0] for p in points])
all_concurrency_values = sorted(all_concurrency_values)
# Plot data for each setup in alphabetical order
for setup_name in sorted(data_dict.keys()):
points = data_dict[setup_name]
if not points:
continue
concurrency_values = [p[0] for p in points]
rps_values = [p[1] for p in points]
ttft_values = [p[2] for p in points]
itl_values = [p[3] for p in points]
req_latency_values = [p[4] for p in points]
# RPS chart (top-left)
ax1.plot(concurrency_values, rps_values, marker="o", label=setup_name, linewidth=2, markersize=6)
# Request Latency chart (top-right)
ax2.plot(concurrency_values, req_latency_values, marker="o", label=setup_name, linewidth=2, markersize=6)
# TTFT chart (bottom-left)
ax3.plot(concurrency_values, ttft_values, marker="o", label=setup_name, linewidth=2, markersize=6)
# ITL chart (bottom-right)
ax4.plot(concurrency_values, itl_values, marker="o", label=setup_name, linewidth=2, markersize=6)
# Configure all charts after plotting data
axes = [ax1, ax2, ax3, ax4]
titles = ["RPS", "Request Latency", "TTFT", "ITL"]
ylabels = [
"Requests Per Second (RPS)",
"Request Latency (ms)",
"Time to First Token (ms)",
"Inter Token Latency (ms)",
]
for ax, title, ylabel in zip(axes, titles, ylabels, strict=False):
ax.set_xlabel("Concurrency", fontsize=12)
ax.set_ylabel(ylabel, fontsize=12)
ax.set_title(title, fontsize=14, fontweight="bold")
ax.set_xscale("log", base=2)
ax.set_xticks(all_concurrency_values)
ax.set_xticklabels([str(int(x)) for x in all_concurrency_values])
ax.grid(True, alpha=0.3)
# Add legend to the right-most subplot (top-right)
ax2.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
plt.tight_layout()
# Save the combined chart
combined_filename = os.path.join(benchmark_dir, f"{prefix}_benchmark_results.png")
plt.savefig(combined_filename, dpi=300, bbox_inches="tight")
plt.close()
print(f"Combined benchmark chart saved to {combined_filename}")
# Print grouping information
for replica_count, data_dict in replica_groups.items():
print(f"vLLM Replica {replica_count} setups: {list(data_dict.keys())}")
# Create separate charts for each replica group
for replica_count, data_dict in replica_groups.items():
prefix = f"vllm_replica{replica_count}"
title = f"vLLM Replicas={replica_count}"
create_charts(data_dict, prefix, title)
# Print summary
print("\nSummary:")
for setup_name, points in all_data.items():
print(f"{setup_name}: {len(points)} data points")
if __name__ == "__main__":
generate_charts()
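Chart legends are derived entirely from the result filenames, so they need to follow the naming scheme used by the run scripts. A minimal sketch of the mapping, assuming `extract_setup_name` from this script is in scope and using hypothetical timestamps:

```python
# Hypothetical result filenames and the legend labels extract_setup_name() derives from them.
examples = {
    "guidellm-benchmark-stack-s2-sw4-v1-20250911-181152.txt": "stack-s2-sw4-v1",
    "guidellm-benchmark-vllm-v1-20250911-181152.txt": "vllm-v1",
}

for filename, expected_label in examples.items():
    assert extract_setup_name(filename) == expected_label, filename
print("filename patterns map to the expected legend labels")
```

Files whose names match none of the patterns fall back to the raw basename and are skipped during vLLM-replica grouping (with a warning), so they do not appear in the per-replica charts.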

View file

@@ -0,0 +1,103 @@
#!/usr/bin/env bash
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
# Define benchmark configurations: (target, stack_replicas, vllm_replicas, stack_workers)
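# A value of '-' means that dimension is left untouched: scale_deployments skips any
# replica or worker count set to '-', and workers only apply to the stack target.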
configs=(
"stack 1 1 1"
"stack 1 1 2"
"stack 1 1 4"
"vllm 1 1 -"
)
set -euo pipefail
# Get the directory where this script is located
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
echo "Running comprehensive GuideLL benchmark suite..."
echo "Start time: $(date)"
# Default deployment names
STACK_DEPLOYMENT="llama-stack-benchmark-server"
VLLM_DEPLOYMENT="vllm-server"
# Scaling function
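# Usage: scale_deployments <stack_replicas> <vllm_replicas> <workers>
# Relies on $target (set by the loop below) to decide whether the stack deployment is touched.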
scale_deployments() {
local stack_replicas=$1
local vllm_replicas=$2
local workers=$3
echo "Scaling deployments..."
if [[ "$vllm_replicas" != "-" ]]; then
echo "Scaling $VLLM_DEPLOYMENT to $vllm_replicas replicas..."
kubectl scale deployment $VLLM_DEPLOYMENT --replicas=$vllm_replicas
kubectl rollout status deployment $VLLM_DEPLOYMENT --timeout=600s
fi
if [[ "$target" == "stack" ]]; then
if [[ "$stack_replicas" != "-" ]]; then
echo "Scaling $STACK_DEPLOYMENT to $stack_replicas replicas..."
kubectl scale deployment $STACK_DEPLOYMENT --replicas=$stack_replicas
kubectl rollout status deployment $STACK_DEPLOYMENT --timeout=600s
fi
if [[ "$workers" != "-" ]]; then
echo "Updating $STACK_DEPLOYMENT to use $workers workers..."
kubectl set env deployment/$STACK_DEPLOYMENT LLAMA_STACK_WORKERS=$workers
kubectl rollout status deployment $STACK_DEPLOYMENT --timeout=600s
fi
fi
echo "All scaling operations completed. Waiting additional 30s for services to stabilize..."
sleep 30
}
for config in "${configs[@]}"; do
read -r target stack_replicas vllm_replicas workers <<< "$config"
echo ""
echo "=========================================="
if [[ "$workers" != "-" ]]; then
echo "Running benchmark: $target (stack=$stack_replicas, vllm=$vllm_replicas, workers=$workers)"
else
echo "Running benchmark: $target (stack=$stack_replicas, vllm=$vllm_replicas)"
fi
echo "Start: $(date)"
echo "=========================================="
# Scale deployments before running benchmark
scale_deployments "$stack_replicas" "$vllm_replicas" "$workers"
# Generate output filename with setup info
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
if [[ "$target" == "stack" ]]; then
OUTPUT_FILE="results/guidellm-benchmark-${target}-s${stack_replicas}-sw${workers}-v${vllm_replicas}-${TIMESTAMP}.txt"
else
OUTPUT_FILE="results/guidellm-benchmark-${target}-v${vllm_replicas}-${TIMESTAMP}.txt"
fi
# Run the benchmark with the cluster as configured
"$SCRIPT_DIR/run-guidellm-benchmark.sh" \
--target "$target" \
--output-file "$OUTPUT_FILE"
echo "Completed: $(date)"
echo "Waiting 30 seconds before next benchmark..."
sleep 30
done
echo ""
echo "=========================================="
echo "All benchmarks completed!"
echo "End time: $(date)"
echo "=========================================="
echo ""
echo "Results files generated:"
ls -la results/guidellm-*.txt results/guidellm-*.json 2>/dev/null || echo "No result files found"

View file

@@ -0,0 +1,219 @@
#!/usr/bin/env bash
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
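# Runs a GuideLLM benchmark Job inside the cluster against either the Llama Stack
# service or vLLM directly, streams the Job logs, and saves them under results/.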
set -euo pipefail
# Default values
TARGET="stack"
MAX_SECONDS=60
PROMPT_TOKENS=512
OUTPUT_TOKENS=256
RATE_TYPE="concurrent"
RATE="1,2,4,8,16,32,64,128"
STACK_DEPLOYMENT="llama-stack-benchmark-server"
STACK_URL="http://llama-stack-benchmark-service:8323/v1/openai"
VLLM_DEPLOYMENT="vllm-server"
OUTPUT_FILE=""
# Parse command line arguments
usage() {
echo "Usage: $0 [options]"
echo "Options:"
echo " -t, --target <stack|vllm> Target to benchmark (default: stack)"
echo " -s, --max-seconds <seconds> Maximum duration in seconds (default: 60)"
echo " -p, --prompt-tokens <tokens> Number of prompt tokens (default: 512)"
echo " -o, --output-tokens <tokens> Number of output tokens (default: 256)"
echo " -r, --rate-type <type> Rate type (default: concurrent)"
echo " -c, --rate Rate (default: 1,2,4,8,16,32,64,128)"
echo " --output-file <path> Output file path (default: auto-generated)"
echo " --stack-deployment <name> Name of the stack deployment (default: llama-stack-benchmark-server)"
echo " --vllm-deployment <name> Name of the vllm deployment (default: vllm-server)"
echo " --stack-url <url> URL of the stack service (default: http://llama-stack-benchmark-service:8323/v1/openai)"
echo " -h, --help Show this help message"
echo ""
echo "Examples:"
echo " $0 --target vllm # Benchmark vLLM direct"
echo " $0 --target stack # Benchmark Llama Stack (default)"
echo " $0 -t vllm -s 60 -p 512 -o 256 # vLLM with custom parameters"
echo " $0 --output-file results/my-benchmark.txt # Specify custom output file"
echo " $0 --stack-deployment my-stack-server # Use custom stack deployment name"
}
while [[ $# -gt 0 ]]; do
case $1 in
-t|--target)
TARGET="$2"
shift 2
;;
-s|--max-seconds)
MAX_SECONDS="$2"
shift 2
;;
-p|--prompt-tokens)
PROMPT_TOKENS="$2"
shift 2
;;
-o|--output-tokens)
OUTPUT_TOKENS="$2"
shift 2
;;
-r|--rate-type)
RATE_TYPE="$2"
shift 2
;;
-c|--rate)
RATE="$2"
shift 2
;;
--output-file)
OUTPUT_FILE="$2"
shift 2
;;
--stack-deployment)
STACK_DEPLOYMENT="$2"
shift 2
;;
--vllm-deployment)
VLLM_DEPLOYMENT="$2"
shift 2
;;
--stack-url)
STACK_URL="$2"
shift 2
;;
-h|--help)
usage
exit 0
;;
*)
echo "Unknown option: $1"
usage
exit 1
;;
esac
done
# Validate target
if [[ "$TARGET" != "stack" && "$TARGET" != "vllm" ]]; then
echo "Error: Target must be 'stack' or 'vllm'"
usage
exit 1
fi
# Set configuration based on target
if [[ "$TARGET" == "vllm" ]]; then
BASE_URL="http://${VLLM_DEPLOYMENT}:8000"
JOB_NAME="guidellm-vllm-benchmark-job"
echo "Benchmarking vLLM direct with GuideLLM..."
else
BASE_URL="$STACK_URL"
JOB_NAME="guidellm-stack-benchmark-job"
echo "Benchmarking Llama Stack with GuideLLM..."
fi
echo "Configuration:"
echo " Target: $TARGET"
echo " Base URL: $BASE_URL"
echo " Max seconds: ${MAX_SECONDS}s"
echo " Prompt tokens: $PROMPT_TOKENS"
echo " Output tokens: $OUTPUT_TOKENS"
echo " Rate type: $RATE_TYPE"
if [[ "$TARGET" == "vllm" ]]; then
echo " vLLM deployment: $VLLM_DEPLOYMENT"
else
echo " Stack deployment: $STACK_DEPLOYMENT"
fi
echo ""
# Create temporary job yaml
TEMP_YAML="/tmp/guidellm-benchmark-job-temp-$(date +%s).yaml"
cat > "$TEMP_YAML" << EOF
apiVersion: batch/v1
kind: Job
metadata:
name: $JOB_NAME
namespace: default
spec:
template:
spec:
containers:
- name: guidellm-benchmark
image: python:3.11-slim
command: ["/bin/bash"]
args:
- "-c"
- |
# Install uv and guidellm
pip install uv &&
uv pip install --system guidellm &&
# Login to HuggingFace
uv pip install --system huggingface_hub &&
python -c "from huggingface_hub import login; login(token='\$HF_TOKEN')" &&
# Run GuideLLM benchmark and save output
export COLUMNS=200
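# COLUMNS widens the console so the wide GuideLLM tables are not truncated in the captured logs.
# With --rate-type concurrent, the comma-separated --rate list runs one pass per concurrency level.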
GUIDELLM__PREFERRED_ROUTE="chat_completions" uv run guidellm benchmark run \\
--target "$BASE_URL" \\
--rate-type "$RATE_TYPE" \\
--max-seconds $MAX_SECONDS \\
--data "prompt_tokens=$PROMPT_TOKENS,output_tokens=$OUTPUT_TOKENS" \\
--model "\$INFERENCE_MODEL" \\
--rate "$RATE" \\
--warmup-percent 0.05 \\
2>&1
env:
- name: INFERENCE_MODEL
value: "meta-llama/Llama-3.2-3B-Instruct"
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
resources:
requests:
memory: "4Gi"
cpu: "500m"
limits:
memory: "8Gi"
cpu: "2000m"
restartPolicy: Never
backoffLimit: 3
EOF
echo "Cleaning up any existing GuideLLM benchmark job..."
kubectl delete job $JOB_NAME 2>/dev/null || true
echo "Deploying GuideLLM benchmark Job..."
kubectl apply -f "$TEMP_YAML"
echo "Waiting for job to start..."
kubectl wait --for=condition=Ready pod -l job-name=$JOB_NAME --timeout=120s
# Prepare file names and create results directory
mkdir -p results
if [[ -z "$OUTPUT_FILE" ]]; then
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
OUTPUT_FILE="results/guidellm-benchmark-${TARGET}-${TIMESTAMP}.txt"
fi
echo "Following GuideLLM benchmark logs..."
kubectl logs -f job/$JOB_NAME
echo "Job completed. Checking final status..."
kubectl get job $JOB_NAME
# Save benchmark results using kubectl logs
echo "Saving benchmark results..."
kubectl logs job/$JOB_NAME > "$OUTPUT_FILE"
echo "Benchmark output saved to: $OUTPUT_FILE"
# Clean up temporary file
rm -f "$TEMP_YAML"

View file

@@ -58,14 +58,14 @@ spec:
value: "/etc/config/stack_run_config.yaml"
- name: LLAMA_STACK_WORKERS
value: "${LLAMA_STACK_WORKERS}"
command: ["uvicorn", "llama_stack.core.server.server:create_app", "--host", "0.0.0.0", "--port", "8323", "--workers", "$LLAMA_STACK_WORKERS", "--factory"]
command: ["uvicorn", "llama_stack.core.server.server:create_app", "--host", "0.0.0.0", "--port", "8323", "--workers", "$(LLAMA_STACK_WORKERS)", "--factory"]
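# Kubernetes substitutes $(LLAMA_STACK_WORKERS) from the env entries above; plain $VAR references are not expanded here because no shell runs this command.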
ports:
- containerPort: 8323
resources:
requests:
cpu: "${LLAMA_STACK_WORKERS}"
cpu: "4"
limits:
cpu: "${LLAMA_STACK_WORKERS}"
cpu: "4"
volumeMounts:
- name: llama-storage
mountPath: /root/.llama

View file

@@ -177,6 +177,7 @@ exclude = [
".pre-commit-config.yaml",
"*.md",
".flake8",
"benchmarking/k8s-benchmark/results",
]
[tool.ruff.lint]