# What does this PR do?

Adds a write worker queue for writes to the inference store. This avoids overwhelming request processing with slow inference writes.

## Test Plan

Benchmark:

```
cd docs/source/distributions/k8s-benchmark

# start mock server
python openai-mock-server.py --port 8000

# start stack server
uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml

# run benchmark script
uv run python3 benchmark.py --duration 120 --concurrent 50 --base-url=http://localhost:8321/v1/openai/v1 --model=vllm-inference/meta-llama/Llama-3.2-3B-Instruct
```

Before:

```
============================================================
BENCHMARK RESULTS

Response Time Statistics:
  Mean: 1.111s
  Median: 0.982s
  Min: 0.466s
  Max: 15.190s
  Std Dev: 1.091s

Percentiles:
  P50: 0.982s
  P90: 1.281s
  P95: 1.439s
  P99: 5.476s

Time to First Token (TTFT) Statistics:
  Mean: 0.474s
  Median: 0.347s
  Min: 0.175s
  Max: 15.129s
  Std Dev: 0.819s

TTFT Percentiles:
  P50: 0.347s
  P90: 0.661s
  P95: 0.762s
  P99: 2.788s

Streaming Statistics:
  Mean chunks per response: 67.2
  Total chunks received: 122154
============================================================
Total time: 120.00s
Concurrent users: 50
Total requests: 1919
Successful requests: 1819
Failed requests: 100
Success rate: 94.8%
Requests per second: 15.16

Errors (showing first 5):
  Request error:
  Request error:
  Request error:
  Request error:
  Request error:

Benchmark completed. Stopping server (PID: 679)...
Server stopped.
```

After:

```
============================================================
BENCHMARK RESULTS

Response Time Statistics:
  Mean: 1.085s
  Median: 1.089s
  Min: 0.451s
  Max: 2.002s
  Std Dev: 0.212s

Percentiles:
  P50: 1.089s
  P90: 1.343s
  P95: 1.409s
  P99: 1.617s

Time to First Token (TTFT) Statistics:
  Mean: 0.407s
  Median: 0.361s
  Min: 0.182s
  Max: 1.178s
  Std Dev: 0.175s

TTFT Percentiles:
  P50: 0.361s
  P90: 0.644s
  P95: 0.744s
  P99: 0.932s

Streaming Statistics:
  Mean chunks per response: 66.8
  Total chunks received: 367240
============================================================
Total time: 120.00s
Concurrent users: 50
Total requests: 5495
Successful requests: 5495
Failed requests: 0
Success rate: 100.0%
Requests per second: 45.79

Benchmark completed. Stopping server (PID: 97169)...
Server stopped.
```
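For illustration, here is a minimal sketch of the write worker queue pattern this PR describes, built on `asyncio`. The class and method names, the bounded-queue size, and the `store.write()` interface are all assumptions made for the sketch, not the actual llama-stack code:

```
import asyncio

class InferenceWriteQueue:
    """Hypothetical sketch: decouple inference-store writes from request handling."""

    def __init__(self, store, max_pending: int = 10_000):
        self._store = store
        # Bounded queue applies backpressure if writes fall too far behind.
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=max_pending)
        self._worker_task: asyncio.Task | None = None

    async def start(self) -> None:
        self._worker_task = asyncio.create_task(self._worker())

    async def enqueue(self, record: dict) -> None:
        # Called from the request path; returns as soon as the record is
        # queued, so slow store writes never block response streaming.
        await self._queue.put(record)

    async def _worker(self) -> None:
        # Background consumer drains the queue and performs the actual
        # (slow) store writes off the request path.
        while True:
            record = await self._queue.get()
            try:
                await self._store.write(record)
            except Exception:
                pass  # real code would log and possibly retry
            finally:
                self._queue.task_done()

    async def stop(self) -> None:
        # Flush remaining writes, then cancel the worker.
        await self._queue.join()
        if self._worker_task:
            self._worker_task.cancel()
```

The benefit shown in the benchmark (no more multi-second P99 outliers) comes from this decoupling: request handlers enqueue and return immediately instead of awaiting each store write inline.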
# Llama Stack Benchmark Suite on Kubernetes

## Motivation
Performance benchmarking is critical for understanding the overhead and characteristics of the Llama Stack abstraction layer compared to direct inference engines like vLLM.
### Why This Benchmark Suite Exists

**Performance Validation**: The Llama Stack provides a unified API layer across multiple inference providers, but this abstraction introduces potential overhead. This benchmark suite quantifies the performance impact by comparing:
- Llama Stack inference (with vLLM backend)
- Direct vLLM inference calls
- Both under identical Kubernetes deployment conditions
**Production Readiness Assessment**: Real-world deployments require understanding performance characteristics under load. This suite simulates concurrent user scenarios with configurable parameters (duration, concurrency, request patterns) to validate production readiness.

**Regression Detection (TODO)**: As the Llama Stack evolves, this benchmark provides automated regression detection for performance changes. CI/CD pipelines can leverage these benchmarks to catch performance degradations before production deployments.

**Resource Planning**: By measuring throughput, latency percentiles, and resource utilization patterns, teams can make informed decisions about:
- Kubernetes resource allocation (CPU, memory, GPU)
- Auto-scaling configurations
- Cost optimization strategies
### Key Metrics Captured
The benchmark suite measures critical performance indicators:
- **Throughput**: Requests per second under sustained load
- **Latency Distribution**: P50, P95, P99 response times
- **Time to First Token (TTFT)**: Critical for streaming applications
- **Error Rates**: Request failures and timeout analysis
This data enables data-driven architectural decisions and performance optimization efforts.
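To make the latency metrics concrete, here is a small illustrative sketch of nearest-rank percentile computation over raw latency samples; it shows the general idea only and is not the actual code in `benchmark.py`:

```
# Illustrative only: nearest-rank percentiles over latency samples.
def percentile(samples: list[float], pct: float) -> float:
    ordered = sorted(samples)
    # Map pct in [0, 100] to an index into the sorted samples.
    k = round(pct / 100 * (len(ordered) - 1))
    return ordered[k]

latencies = [0.982, 1.281, 0.466, 1.439, 1.091]  # response times in seconds
for p in (50, 90, 95, 99):
    print(f"P{p}: {percentile(latencies, p):.3f}s")
```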
## Setup

1. Deploy base k8s infrastructure:
   ```
   cd ../k8s
   ./apply.sh
   ```

2. Deploy benchmark components:
   ```
   cd ../k8s-benchmark
   ./apply.sh
   ```

3. Verify deployment:
   ```
   kubectl get pods
   # Should see: llama-stack-benchmark-server, vllm-server, etc.
   ```
## Quick Start

### Basic Benchmarks

Benchmark Llama Stack (default):

```
cd docs/source/distributions/k8s-benchmark/
./run-benchmark.sh
```

Benchmark vLLM direct:

```
./run-benchmark.sh --target vllm
```

### Custom Configuration

Extended benchmark with high concurrency:

```
./run-benchmark.sh --target vllm --duration 120 --concurrent 20
```

Short test run:

```
./run-benchmark.sh --target stack --duration 30 --concurrent 5
```
## Command Reference

### run-benchmark.sh Options

```
./run-benchmark.sh [options]

Options:
  -t, --target <stack|vllm>    Target to benchmark (default: stack)
  -d, --duration <seconds>     Duration in seconds (default: 60)
  -c, --concurrent <users>     Number of concurrent users (default: 10)
  -h, --help                   Show help message

Examples:
  ./run-benchmark.sh --target vllm          # Benchmark vLLM direct
  ./run-benchmark.sh --target stack         # Benchmark Llama Stack
  ./run-benchmark.sh -t vllm -d 120 -c 20   # vLLM with 120s, 20 users
```
## Local Testing

### Running Benchmark Locally

For local development without Kubernetes:

1. Start OpenAI mock server:
   ```
   uv run python openai-mock-server.py --port 8080
   ```

2. Run benchmark against mock server:
   ```
   uv run python benchmark.py \
     --base-url http://localhost:8080/v1 \
     --model mock-inference \
     --duration 30 \
     --concurrent 5
   ```

3. Test against local vLLM server:
   ```
   # If you have vLLM running locally on port 8000
   uv run python benchmark.py \
     --base-url http://localhost:8000/v1 \
     --model meta-llama/Llama-3.2-3B-Instruct \
     --duration 30 \
     --concurrent 5
   ```

4. Profile the running server:
   ```
   ./profile_running_server.sh
   ```
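The heart of the benchmark is per-request timing of a streaming completion. The sketch below shows the general shape of such a measurement using `aiohttp`; the endpoint path, payload, and chunk parsing are assumptions for illustration and may differ from the actual `benchmark.py`:

```
import asyncio
import time

import aiohttp

async def timed_request(session: aiohttp.ClientSession, base_url: str, model: str):
    """Issue one streaming chat completion; record TTFT and total latency."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True,
    }
    start = time.perf_counter()
    ttft = None
    chunks = 0
    async with session.post(f"{base_url}/chat/completions", json=payload) as resp:
        async for _line in resp.content:
            if ttft is None:
                # Time to first token: first streamed line received.
                ttft = time.perf_counter() - start
            chunks += 1
    return ttft, time.perf_counter() - start, chunks
```

In the real suite, many such coroutines run concurrently (one per simulated user) for the configured duration, and the collected samples produce the throughput, percentile, and TTFT statistics described above.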
## OpenAI Mock Server

The `openai-mock-server.py` provides:

- OpenAI-compatible API for testing without real models
- Configurable streaming delay via the `STREAM_DELAY_SECONDS` env var
- Consistent responses for reproducible benchmarks
- Lightweight testing without GPU requirements

Mock server usage:

```
uv run python openai-mock-server.py --port 8080
```

The mock server is also deployed in k8s as `openai-mock-service:8080` and can be used by changing the Llama Stack configuration to use the `mock-vllm-inference` provider.
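As a rough idea of what such a mock involves, the sketch below shows an OpenAI-compatible streaming endpoint whose inter-chunk delay is gated by `STREAM_DELAY_SECONDS`. It uses Flask and is purely illustrative; it is not the contents of `openai-mock-server.py`:

```
import json
import os
import time

from flask import Flask, Response, request

app = Flask(__name__)
# Inter-chunk delay, configurable via env var (as the real mock supports).
STREAM_DELAY_SECONDS = float(os.environ.get("STREAM_DELAY_SECONDS", "0.01"))

@app.route("/v1/chat/completions", methods=["POST"])
def chat_completions():
    model = request.get_json().get("model", "mock-inference")

    def stream():
        # Fixed token sequence keeps responses reproducible across runs.
        for token in ["Hello", " from", " the", " mock", " server", "."]:
            time.sleep(STREAM_DELAY_SECONDS)
            chunk = {
                "object": "chat.completion.chunk",
                "model": model,
                "choices": [{"index": 0, "delta": {"content": token}}],
            }
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return Response(stream(), mimetype="text/event-stream")

if __name__ == "__main__":
    app.run(port=8080)
```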
## Files in this Directory

- `benchmark.py` - Core benchmark script with async streaming support
- `run-benchmark.sh` - Main script with target selection and configuration
- `openai-mock-server.py` - Mock OpenAI API server for local testing
- `README.md` - This documentation file