llama-stack-mirror/docs/source/distributions/k8s-benchmark
Francisco Javier Arceo 6620b625f1 adding logo and favicon
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>

chore: Enable keyword search for Milvus inline (#3073)

With https://github.com/milvus-io/milvus-lite/pull/294 - Milvus Lite
supports keyword search using BM25. While introducing keyword search we
had explicitly disabled it for inline milvus. This PR removes the need
for the check, and enables `inline::milvus` for tests.

<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->

Run llama stack with `inline::milvus` enabled:

```
pytest tests/integration/vector_io/test_openai_vector_stores.py::test_openai_vector_store_search_modes --stack-config=http://localhost:8321 --embedding-model=all-MiniLM-L6-v2 -v
```

```
INFO     2025-08-07 17:06:20,932 tests.integration.conftest:64 tests: Setting DISABLE_CODE_SANDBOX=1 for macOS
=========================================================================================== test session starts ============================================================================================
platform darwin -- Python 3.12.11, pytest-7.4.4, pluggy-1.5.0 -- /Users/vnarsing/miniconda3/envs/stack-client/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.12.11', 'Platform': 'macOS-14.7.6-arm64-arm-64bit', 'Packages': {'pytest': '7.4.4', 'pluggy': '1.5.0'}, 'Plugins': {'asyncio': '0.23.8', 'cov': '6.0.0', 'timeout': '2.2.0', 'socket': '0.7.0', 'html': '3.1.1', 'langsmith': '0.3.39', 'anyio': '4.8.0', 'metadata': '3.0.0'}}
rootdir: /Users/vnarsing/go/src/github/meta-llama/llama-stack
configfile: pyproject.toml
plugins: asyncio-0.23.8, cov-6.0.0, timeout-2.2.0, socket-0.7.0, html-3.1.1, langsmith-0.3.39, anyio-4.8.0, metadata-3.0.0
asyncio: mode=Mode.AUTO
collected 3 items

tests/integration/vector_io/test_openai_vector_stores.py::test_openai_vector_store_search_modes[None-None-all-MiniLM-L6-v2-None-384-vector] PASSED                                                   [ 33%]
tests/integration/vector_io/test_openai_vector_stores.py::test_openai_vector_store_search_modes[None-None-all-MiniLM-L6-v2-None-384-keyword] PASSED                                                  [ 66%]
tests/integration/vector_io/test_openai_vector_stores.py::test_openai_vector_store_search_modes[None-None-all-MiniLM-L6-v2-None-384-hybrid] PASSED                                                   [100%]

============================================================================================ 3 passed in 4.75s =============================================================================================
```

Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>

chore: Fixup main pre commit (#3204)

build: Bump version to 0.2.18

chore: Faster npm pre-commit (#3206)

Adds npm to pre-commit.yml installation and caches ui
Removes node installation during pre-commit.

<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->

<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>

chiecking in for tonight, wip moving to agents api

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>

remove log

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>

updated

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>

fix: disable ui-prettier & ui-eslint (#3207)

chore(pre-commit): add pre-commit hook to enforce llama_stack logger usage (#3061)

This PR adds a step in pre-commit to enforce using `llama_stack` logger.

Currently, various parts of the code base uses different loggers. As a
custom `llama_stack` logger exist and used in the codebase, it is better
to standardize its utilization.

Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>

fix: fix ```openai_embeddings``` for asymmetric embedding NIMs (#3205)

NVIDIA asymmetric embedding models (e.g.,
`nvidia/llama-3.2-nv-embedqa-1b-v2`) require an `input_type` parameter
not present in the standard OpenAI embeddings API. This PR adds the
`input_type="query"` as default and updates the documentation to suggest
using the `embedding` API for passage embeddings.

<!-- If resolving an issue, uncomment and update the line below -->
Resolves #2892

```
pytest -s -v tests/integration/inference/test_openai_embeddings.py   --stack-config="inference=nvidia"   --embedding-model="nvidia/llama-3.2-nv-embedqa-1b-v2"   --env NVIDIA_API_KEY={nvidia_api_key}   --env NVIDIA_BASE_URL="https://integrate.api.nvidia.com"
```

cleaning up

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>

updating session manager to cache messages locally

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>

fix linter

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>

more cleanup

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-08-21 16:06:30 -04:00
..
apply.sh test: benchmark scripts (#3160) 2025-08-15 11:24:29 -07:00
benchmark.py adding logo and favicon 2025-08-21 16:06:30 -04:00
openai-mock-server.py test: benchmark scripts (#3160) 2025-08-15 11:24:29 -07:00
profile_running_server.sh test: benchmark scripts (#3160) 2025-08-15 11:24:29 -07:00
README.md test: benchmark scripts (#3160) 2025-08-15 11:24:29 -07:00
run-benchmark.sh test: benchmark scripts (#3160) 2025-08-15 11:24:29 -07:00
stack-configmap.yaml test: benchmark scripts (#3160) 2025-08-15 11:24:29 -07:00
stack-k8s.yaml.template test: benchmark scripts (#3160) 2025-08-15 11:24:29 -07:00
stack_run_config.yaml test: benchmark scripts (#3160) 2025-08-15 11:24:29 -07:00

Llama Stack Benchmark Suite on Kubernetes

Motivation

Performance benchmarking is critical for understanding the overhead and characteristics of the Llama Stack abstraction layer compared to direct inference engines like vLLM.

Why This Benchmark Suite Exists

Performance Validation: The Llama Stack provides a unified API layer across multiple inference providers, but this abstraction introduces potential overhead. This benchmark suite quantifies the performance impact by comparing:

  • Llama Stack inference (with vLLM backend)
  • Direct vLLM inference calls
  • Both under identical Kubernetes deployment conditions

Production Readiness Assessment: Real-world deployments require understanding performance characteristics under load. This suite simulates concurrent user scenarios with configurable parameters (duration, concurrency, request patterns) to validate production readiness.

Regression Detection (TODO): As the Llama Stack evolves, this benchmark provides automated regression detection for performance changes. CI/CD pipelines can leverage these benchmarks to catch performance degradations before production deployments.

Resource Planning: By measuring throughput, latency percentiles, and resource utilization patterns, teams can make informed decisions about:

  • Kubernetes resource allocation (CPU, memory, GPU)
  • Auto-scaling configurations
  • Cost optimization strategies

Key Metrics Captured

The benchmark suite measures critical performance indicators:

  • Throughput: Requests per second under sustained load
  • Latency Distribution: P50, P95, P99 response times
  • Time to First Token (TTFT): Critical for streaming applications
  • Error Rates: Request failures and timeout analysis

This data enables data-driven architectural decisions and performance optimization efforts.

Setup

1. Deploy base k8s infrastructure:

cd ../k8s
./apply.sh

2. Deploy benchmark components:

cd ../k8s-benchmark
./apply.sh

3. Verify deployment:

kubectl get pods
# Should see: llama-stack-benchmark-server, vllm-server, etc.

Quick Start

Basic Benchmarks

Benchmark Llama Stack (default):

cd docs/source/distributions/k8s-benchmark/
./run-benchmark.sh

Benchmark vLLM direct:

./run-benchmark.sh --target vllm

Custom Configuration

Extended benchmark with high concurrency:

./run-benchmark.sh --target vllm --duration 120 --concurrent 20

Short test run:

./run-benchmark.sh --target stack --duration 30 --concurrent 5

Command Reference

run-benchmark.sh Options

./run-benchmark.sh [options]

Options:
  -t, --target <stack|vllm>     Target to benchmark (default: stack)
  -d, --duration <seconds>      Duration in seconds (default: 60)
  -c, --concurrent <users>      Number of concurrent users (default: 10)
  -h, --help                    Show help message

Examples:
  ./run-benchmark.sh --target vllm              # Benchmark vLLM direct
  ./run-benchmark.sh --target stack             # Benchmark Llama Stack
  ./run-benchmark.sh -t vllm -d 120 -c 20       # vLLM with 120s, 20 users

Local Testing

Running Benchmark Locally

For local development without Kubernetes:

1. Start OpenAI mock server:

uv run python openai-mock-server.py --port 8080

2. Run benchmark against mock server:

uv run python benchmark.py \
  --base-url http://localhost:8080/v1 \
  --model mock-inference \
  --duration 30 \
  --concurrent 5

3. Test against local vLLM server:

# If you have vLLM running locally on port 8000
uv run python benchmark.py \
  --base-url http://localhost:8000/v1 \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --duration 30 \
  --concurrent 5

4. Profile the running server:

./profile_running_server.sh

OpenAI Mock Server

The openai-mock-server.py provides:

  • OpenAI-compatible API for testing without real models
  • Configurable streaming delay via STREAM_DELAY_SECONDS env var
  • Consistent responses for reproducible benchmarks
  • Lightweight testing without GPU requirements

Mock server usage:

uv run python openai-mock-server.py --port 8080

The mock server is also deployed in k8s as openai-mock-service:8080 and can be used by changing the Llama Stack configuration to use the mock-vllm-inference provider.

Files in this Directory

  • benchmark.py - Core benchmark script with async streaming support
  • run-benchmark.sh - Main script with target selection and configuration
  • openai-mock-server.py - Mock OpenAI API server for local testing
  • README.md - This documentation file