Llama Stack Benchmark Suite on Kubernetes
Motivation
Performance benchmarking is critical for understanding the overhead and characteristics of the Llama Stack abstraction layer compared to direct inference engines like vLLM.
Why This Benchmark Suite Exists
Performance Validation: The Llama Stack provides a unified API layer across multiple inference providers, but this abstraction introduces potential overhead. This benchmark suite quantifies the performance impact by comparing:
- Llama Stack inference (with vLLM backend)
- Direct vLLM inference calls
- Both under identical Kubernetes deployment conditions
Production Readiness Assessment: Real-world deployments require understanding performance characteristics under load. This suite simulates concurrent user scenarios with configurable parameters (duration, concurrency, request patterns) to validate production readiness.
Regression Detection (TODO): As the Llama Stack evolves, this benchmark provides automated regression detection for performance changes. CI/CD pipelines can leverage these benchmarks to catch performance degradations before production deployments.
Resource Planning: By measuring throughput, latency percentiles, and resource utilization patterns, teams can make informed decisions about:
- Kubernetes resource allocation (CPU, memory, GPU)
- Auto-scaling configurations
- Cost optimization strategies
Key Metrics Captured
The benchmark suite measures critical performance indicators:
- Throughput: Requests per second under sustained load
- Latency Distribution: P50, P95, P99 response times
- Time to First Token (TTFT): Critical for streaming applications
- Error Rates: Request failures and timeout analysis
This data enables data-driven architectural decisions and performance optimization efforts.
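For orientation, the metrics above can be derived from per-request timing samples along these lines. This is an illustrative Python sketch, not the code in benchmark.py, and the RequestSample fields are assumptions:
```python
# Illustrative sketch: deriving throughput, latency percentiles, and TTFT
# from per-request samples. Field names are assumptions, not benchmark.py's.
from dataclasses import dataclass


@dataclass
class RequestSample:
    latency_s: float  # request start to last streamed token
    ttft_s: float     # request start to first streamed token
    ok: bool          # False for failures / timeouts


def percentile(sorted_values: list[float], p: float) -> float:
    # Nearest-rank percentile on a pre-sorted list.
    idx = min(len(sorted_values) - 1, int(p / 100 * len(sorted_values)))
    return sorted_values[idx]


def summarize(samples: list[RequestSample], duration_s: float) -> dict:
    ok = [s for s in samples if s.ok]
    lat = sorted(s.latency_s for s in ok)
    ttft = sorted(s.ttft_s for s in ok)
    return {
        "throughput_rps": len(ok) / duration_s,
        "error_rate": 1 - len(ok) / len(samples),
        "latency_p50_s": percentile(lat, 50),
        "latency_p95_s": percentile(lat, 95),
        "latency_p99_s": percentile(lat, 99),
        "ttft_p50_s": percentile(ttft, 50),
    }


if __name__ == "__main__":
    demo = [RequestSample(0.5 + i * 0.01, 0.08 + i * 0.001, True) for i in range(100)]
    print(summarize(demo, duration_s=60.0))
```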
Setup
1. Deploy base k8s infrastructure:
cd ../k8s
./apply.sh
2. Deploy benchmark components:
cd ../k8s-benchmark
./apply.sh
3. Verify deployment:
kubectl get pods
# Should see: llama-stack-benchmark-server, vllm-server, etc.
Quick Start
Basic Benchmarks
Benchmark Llama Stack (default):
cd docs/source/distributions/k8s-benchmark/
./run-benchmark.sh
Benchmark vLLM direct:
./run-benchmark.sh --target vllm
Custom Configuration
Extended benchmark with high concurrency:
./run-benchmark.sh --target vllm --duration 120 --concurrent 20
Short test run:
./run-benchmark.sh --target stack --duration 30 --concurrent 5
Command Reference
run-benchmark.sh Options
./run-benchmark.sh [options]
Options:
-t, --target <stack|vllm> Target to benchmark (default: stack)
-d, --duration <seconds> Duration in seconds (default: 60)
-c, --concurrent <users> Number of concurrent users (default: 10)
-h, --help Show help message
Examples:
./run-benchmark.sh --target vllm # Benchmark vLLM direct
./run-benchmark.sh --target stack # Benchmark Llama Stack
./run-benchmark.sh -t vllm -d 120 -c 20 # vLLM with 120s, 20 users
Local Testing
Running Benchmark Locally
For local development without Kubernetes:
1. Start OpenAI mock server:
uv run python openai-mock-server.py --port 8080
2. Run benchmark against mock server:
uv run python benchmark.py \
--base-url http://localhost:8080/v1 \
--model mock-inference \
--duration 30 \
--concurrent 5
3. Test against local vLLM server:
# If you have vLLM running locally on port 8000
uv run python benchmark.py \
--base-url http://localhost:8000/v1 \
--model meta-llama/Llama-3.2-3B-Instruct \
--duration 30 \
--concurrent 5
4. Profile the running server:
./profile_running_server.sh
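For orientation, the core of such a benchmark is an async, streaming client loop against an OpenAI-compatible endpoint. The sketch below is illustrative only (it is not benchmark.py); it assumes the openai Python package is installed and points at the mock server started in step 1:
```python
# Minimal sketch of an async, streaming benchmark loop against an
# OpenAI-compatible endpoint (e.g. the local mock server). Illustrative
# only; benchmark.py has its own implementation.
import asyncio
import time

from openai import AsyncOpenAI


async def one_request(client: AsyncOpenAI, model: str) -> tuple[float, float]:
    start = time.perf_counter()
    ttft = None
    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Hello"}],
        stream=True,
    )
    async for chunk in stream:
        if ttft is None and chunk.choices and chunk.choices[0].delta.content:
            ttft = time.perf_counter() - start  # time to first token
    return time.perf_counter() - start, ttft or 0.0


async def main(base_url: str, model: str, duration: int, concurrent: int) -> None:
    client = AsyncOpenAI(base_url=base_url, api_key="not-needed-for-mock")
    deadline = time.perf_counter() + duration
    results: list[tuple[float, float]] = []

    async def worker() -> None:
        while time.perf_counter() < deadline:
            results.append(await one_request(client, model))

    await asyncio.gather(*(worker() for _ in range(concurrent)))
    latencies = sorted(lat for lat, _ in results)
    print(f"{len(results)} requests, p50 latency {latencies[len(latencies) // 2]:.3f}s")


if __name__ == "__main__":
    asyncio.run(main("http://localhost:8080/v1", "mock-inference", duration=10, concurrent=5))
```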
OpenAI Mock Server
The openai-mock-server.py provides:
- OpenAI-compatible API for testing without real models
- Configurable streaming delay via STREAM_DELAY_SECONDS env var
- Consistent responses for reproducible benchmarks
- Lightweight testing without GPU requirements
Mock server usage:
uv run python openai-mock-server.py --port 8080
The mock server is also deployed in k8s as openai-mock-service:8080 and can be used by changing the Llama Stack configuration to use the mock-vllm-inference provider.
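For readers curious how such a mock works, the sketch below shows one way to build an OpenAI-compatible streaming endpoint with a configurable per-chunk delay. It is illustrative only (not the actual openai-mock-server.py) and assumes FastAPI and uvicorn are installed:
```python
# Sketch of an OpenAI-compatible chat completions endpoint that streams
# canned chunks with a configurable delay. Illustrative only; see
# openai-mock-server.py for the real implementation.
import asyncio
import json
import os

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
STREAM_DELAY = float(os.getenv("STREAM_DELAY_SECONDS", "0.005"))


@app.post("/v1/chat/completions")
async def chat_completions(request: dict):
    async def generate():
        for word in ["This", "is", "a", "mock", "response."]:
            chunk = {
                "object": "chat.completion.chunk",
                "choices": [{"index": 0, "delta": {"content": word + " "}}],
            }
            yield f"data: {json.dumps(chunk)}\n\n"
            await asyncio.sleep(STREAM_DELAY)  # configurable streaming delay
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
```
Saved as, say, mock_server.py, a sketch like this could be served with uv run uvicorn mock_server:app --port 8080.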
Files in this Directory
- benchmark.py - Core benchmark script with async streaming support
- run-benchmark.sh - Main script with target selection and configuration
- openai-mock-server.py - Mock OpenAI API server for local testing
- README.md - This documentation file