
Llama Stack Benchmark Suite on Kubernetes

Motivation

Performance benchmarking shows how much overhead the Llama Stack abstraction layer adds, and how it behaves under load, compared with calling an inference engine such as vLLM directly.

Why This Benchmark Suite Exists

Performance Validation: Llama Stack provides a unified API layer across multiple inference providers, and that abstraction can add overhead. This benchmark suite quantifies the performance impact by comparing:

  • Llama Stack inference (with vLLM backend)
  • Direct vLLM inference calls
  • Both under identical Kubernetes deployment conditions
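As a rough illustration of the comparison above, the overhead can be expressed as the relative change of a Llama Stack measurement against the direct vLLM baseline. This is a minimal sketch: `relative_overhead` is a hypothetical helper, not part of benchmark.py, and the numbers are made up for the example rather than measured.

```python
def relative_overhead(stack_value, vllm_value):
    """Fractional change of the Llama Stack measurement relative to direct vLLM."""
    if vllm_value == 0:
        raise ValueError("baseline must be non-zero")
    return (stack_value - vllm_value) / vllm_value


# Illustrative numbers only; these are NOT measured results.
stack = {"p50_latency_s": 0.55, "throughput_rps": 18.0}
vllm = {"p50_latency_s": 0.50, "throughput_rps": 20.0}

for name in stack:
    # Positive means the Llama Stack value is higher than the vLLM baseline.
    print(f"{name}: {relative_overhead(stack[name], vllm[name]):+.1%}")
```

A positive value is worse for latency and better for throughput, so each metric needs to be read with its own direction in mind.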

Production Readiness Assessment: Real-world deployments require understanding performance characteristics under load. This suite simulates concurrent user scenarios with configurable parameters (duration, concurrency, request patterns) to validate production readiness.

Regression Detection (TODO): As Llama Stack evolves, this benchmark is intended to provide automated regression detection, so that CI/CD pipelines can catch performance degradations before production deployments.

Resource Planning: By measuring throughput, latency percentiles, and resource utilization patterns, teams can make informed decisions about:

  • Kubernetes resource allocation (CPU, memory, GPU)
  • Auto-scaling configurations
  • Cost optimization strategies

Key Metrics Captured

The benchmark suite measures critical performance indicators:

  • Throughput: Requests per second under sustained load
  • Latency Distribution: P50, P95, P99 response times
  • Time to First Token (TTFT): Critical for streaming applications
  • Error Rates: Request failures and timeout analysis

These measurements inform architectural decisions and performance optimization work.
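The metrics listed above can be derived from per-request records along these lines. This is a sketch under stated assumptions: the `Sample` record shape and the `summarize` helper are hypothetical, throughput is taken to count successful requests only, and percentiles use the nearest-rank method; benchmark.py may compute these differently.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """One request outcome (hypothetical record shape, for illustration)."""
    latency_s: float          # total response time in seconds
    ttft_s: Optional[float]   # time to first token; None if no token arrived
    ok: bool                  # False for failures and timeouts


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples, duration_s):
    """Aggregate samples into throughput, latency percentiles, mean TTFT, and error rate."""
    if not samples or duration_s <= 0:
        raise ValueError("need at least one sample and a positive duration")
    ok = [s for s in samples if s.ok]
    latencies = [s.latency_s for s in ok]
    ttfts = [s.ttft_s for s in ok if s.ttft_s is not None]
    return {
        "throughput_rps": len(ok) / duration_s,
        "p50_s": percentile(latencies, 50) if latencies else None,
        "p95_s": percentile(latencies, 95) if latencies else None,
        "p99_s": percentile(latencies, 99) if latencies else None,
        "mean_ttft_s": sum(ttfts) / len(ttfts) if ttfts else None,
        "error_rate": (len(samples) - len(ok)) / len(samples),
    }
```

Nearest-rank always returns an observed latency, which keeps tail percentiles easy to interpret when the sample count is small.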

Setup

1. Deploy base k8s infrastructure:

cd ../../docs/source/distributions/k8s
./apply.sh

2. Deploy benchmark components (run from this directory, benchmarking/k8s-benchmark, not the directory entered in step 1):

./apply.sh

3. Verify deployment:

kubectl get pods
# Should see: llama-stack-benchmark-server, vllm-server, etc.

Quick Start

Basic Benchmarks

Benchmark Llama Stack (default):

./run-benchmark.sh

Benchmark vLLM direct:

./run-benchmark.sh --target vllm

Custom Configuration

Extended benchmark with high concurrency:

./run-benchmark.sh --target vllm --duration 120 --concurrent 20

Short test run:

./run-benchmark.sh --target stack --duration 30 --concurrent 5
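To make the --duration and --concurrent parameters concrete, the sketch below shows one common way to simulate concurrent users with asyncio: each simulated user sends requests back to back until a shared deadline passes. It is an illustration only, not the actual logic of benchmark.py; `fake_request`, `user_loop`, and `run_load` are hypothetical names, and the request is a short sleep rather than a real inference call.

```python
import asyncio
import time


async def fake_request():
    """Stand-in for one inference call (hypothetical); sleeps instead of hitting a server."""
    await asyncio.sleep(0.01)


async def user_loop(deadline, send, latencies, errors):
    """One simulated user: issue requests back to back until the deadline passes."""
    while time.monotonic() < deadline:
        start = time.monotonic()
        try:
            await send()
            latencies.append(time.monotonic() - start)
        except Exception:
            # Record the elapsed time of failed requests separately.
            errors.append(time.monotonic() - start)


async def run_load(send, duration_s, concurrent):
    """Run `concurrent` user loops in parallel for roughly `duration_s` seconds."""
    latencies, errors = [], []
    deadline = time.monotonic() + duration_s
    await asyncio.gather(
        *(user_loop(deadline, send, latencies, errors) for _ in range(concurrent))
    )
    return latencies, errors


if __name__ == "__main__":
    ok, failed = asyncio.run(run_load(fake_request, duration_s=0.5, concurrent=5))
    print(f"{len(ok)} ok, {len(failed)} failed")
```

With this closed-loop design, the offered load depends on response time: a slower target receives fewer requests in the same duration, which is worth keeping in mind when comparing throughput across targets.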

Command Reference

run-benchmark.sh Options

./run-benchmark.sh [options]

Options:
  -t, --target <stack|vllm>     Target to benchmark (default: stack)
  -d, --duration <seconds>      Duration in seconds (default: 60)
  -c, --concurrent <users>      Number of concurrent users (default: 10)
  -h, --help                    Show help message

Examples:
  ./run-benchmark.sh --target vllm              # Benchmark vLLM direct
  ./run-benchmark.sh --target stack             # Benchmark Llama Stack
  ./run-benchmark.sh -t vllm -d 120 -c 20       # vLLM with 120s, 20 users

Local Testing

Running Benchmark Locally

For local development without Kubernetes:

1. Start OpenAI mock server:

uv run python openai-mock-server.py --port 8080

2. Run benchmark against mock server:

uv run python benchmark.py \
  --base-url http://localhost:8080/v1 \
  --model mock-inference \
  --duration 30 \
  --concurrent 5

3. Test against local vLLM server:

# If you have vLLM running locally on port 8000
uv run python benchmark.py \
  --base-url http://localhost:8000/v1 \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --duration 30 \
  --concurrent 5

4. Profile the running server:

./profile_running_server.sh

OpenAI Mock Server

openai-mock-server.py provides:

  • OpenAI-compatible API for testing without real models
  • Configurable streaming delay via STREAM_DELAY_SECONDS env var
  • Consistent responses for reproducible benchmarks
  • Lightweight testing without GPU requirements

Mock server usage:

uv run python openai-mock-server.py --port 8080

The mock server is also deployed in k8s as openai-mock-service:8080 and can be used by changing the Llama Stack configuration to use the mock-vllm-inference provider.
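For readers unfamiliar with what an OpenAI-compatible streaming mock emits, the sketch below generates server-sent-event lines shaped like chat-completion chunks, pausing between chunks according to STREAM_DELAY_SECONDS. It is not the code of openai-mock-server.py: `stream_chunks` is a hypothetical function, the one-chunk-per-word split is an assumption, and the fallback delay of 0.005 seconds is chosen here for illustration only.

```python
import json
import os
import time


def stream_chunks(text, model="mock-inference"):
    """Yield SSE lines shaped like OpenAI chat-completion chunks, one per word."""
    # The fallback value is an assumption for this sketch, not the real server's default.
    delay = float(os.environ.get("STREAM_DELAY_SECONDS", "0.005"))
    for word in text.split():
        chunk = {
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [
                {"index": 0, "delta": {"content": word + " "}, "finish_reason": None}
            ],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
        time.sleep(delay)
    # OpenAI-style streams end with a literal [DONE] sentinel.
    yield "data: [DONE]\n\n"


if __name__ == "__main__":
    for line in stream_chunks("a consistent mock response"):
        print(line, end="")
```

Because the content and pacing are fixed, a client measuring time to first token against such a stream sees mostly its own overhead plus the configured delay, which is what makes a mock useful for isolating non-model costs.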

Files in this Directory

  • benchmark.py - Core benchmark script with async streaming support
  • run-benchmark.sh - Main script with target selection and configuration
  • openai-mock-server.py - Mock OpenAI API server for local testing
  • README.md - This documentation file