llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-12-04 18:13:44 +00:00

Author	SHA1	Message	Date
ehhuang	e980436a2e	chore: introduce write queue for inference_store (#3383 ) # What does this PR do? Adds a write worker queue for writes to inference store. This avoids overwhelming request processing with slow inference writes. ## Test Plan Benchmark: ``` cd /docs/source/distributions/k8s-benchmark # start mock server python openai-mock-server.py --port 8000 # start stack server LLAMA_STACK_LOGGING="all=WARNING" uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml # run benchmark script uv run python3 benchmark.py --duration 120 --concurrent 50 --base-url=http://localhost:8321/v1/openai/v1 --model=vllm-inference/meta-llama/Llama-3.2-3B-Instruct ``` ## RPS from 21 -> 57	2025-09-10 11:57:42 -07:00
Ashwin Bharambe	81ad240faa	fix(k8s): unwedge run.yaml to add files Some checks failed SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 1s Details Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 1s Details SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 3s Details Python Package Build Test / build (3.12) (push) Failing after 1s Details Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped Details Python Package Build Test / build (3.13) (push) Failing after 1s Details Vector IO Integration Tests / test-matrix (push) Failing after 4s Details Integration Tests (Replay) / Integration Tests (, , , client=, ) (push) Failing after 5s Details API Conformance Tests / check-schema-compatibility (push) Successful in 7s Details Unit Tests / unit-tests (3.12) (push) Failing after 4s Details Test External API and Providers / test-external (venv) (push) Failing after 5s Details Update ReadTheDocs / update-readthedocs (push) Failing after 5s Details Unit Tests / unit-tests (3.13) (push) Failing after 6s Details UI Tests / ui-tests (22) (push) Successful in 38s Details Pre-commit / pre-commit (push) Successful in 1m28s Details	2025-09-09 23:02:26 -07:00
ehhuang	bcc7f2c7d0	chore: async inference store write (#3318 ) # What does this PR do? ## Test Plan ``` cd /docs/source/distributions/k8s-benchmark # start mock server python openai-mock-server.py --port 8000 # start stack server uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml # run benchmark script uv run python3 benchmark.py --duration 30 --concurrent 50 --base-url=http://localhost:8321/v1/openai/v1 --model=vllm-inference/meta-llama/Llama-3.2-3B-Instruct ``` Before: ============================================================ BENCHMARK RESULTS ============================================================ Total time: 30.00s Concurrent users: 50 Total requests: 1267 Successful requests: 1267 Failed requests: 0 Success rate: 100.0% Requests per second: 42.23 After: ============================================================ BENCHMARK RESULTS ============================================================ Total time: 30.00s Concurrent users: 50 Total requests: 1449 Successful requests: 1449 Failed requests: 0 Success rate: 100.0% Requests per second: 48.30	2025-09-04 11:37:46 -07:00
Matthew Farrellee	e7a812f5de	chore: Fixup main pre commit (#3204 )	2025-08-19 14:52:38 -04:00
ehhuang	2c06b24c77	test: benchmark scripts (#3160 ) # What does this PR do? 1. Add our own benchmark script instead of locust (doesn't support measuring streaming latency well) 2. Simplify k8s deployment 3. Add a simple profile script for locally running server ## Test Plan ❮ ./run-benchmark.sh --target stack --duration 180 --concurrent 10 ============================================================ BENCHMARK RESULTS ============================================================ Total time: 180.00s Concurrent users: 10 Total requests: 1636 Successful requests: 1636 Failed requests: 0 Success rate: 100.0% Requests per second: 9.09 Response Time Statistics: Mean: 1.095s Median: 1.721s Min: 0.136s Max: 3.218s Std Dev: 0.762s Percentiles: P50: 1.721s P90: 1.751s P95: 1.756s P99: 1.796s Time to First Token (TTFT) Statistics: Mean: 0.037s Median: 0.037s Min: 0.023s Max: 0.211s Std Dev: 0.011s TTFT Percentiles: P50: 0.037s P90: 0.040s P95: 0.044s P99: 0.055s Streaming Statistics: Mean chunks per response: 64.0 Total chunks received: 104775	2025-08-15 11:24:29 -07:00
ashwinb	47d5af703c	chore(responses): Refactor Responses Impl to be civilized (#3138 ) # What does this PR do? Refactors the OpenAI responses implementation by extracting streaming and tool execution logic into separate modules. This improves code organization by: 1. Creating a new `StreamingResponseOrchestrator` class in `streaming.py` to handle the streaming response generation logic 2. Moving tool execution functionality to a dedicated `ToolExecutor` class in `tool_executor.py` ## Test Plan Existing tests	2025-08-15 00:05:35 +00:00
ehhuang	d6ae54723d	chore: setup for performance benchmarking (#3096 ) # What does this PR do? 1. Added a simple mock openai-compat server that serves chat/completion 2. Add a benchmark server in EKS that includes mock inference server 3. Add locust (https://locust.io/) file for load testing ## Test Plan bash apply.sh kubectl port-forward service/locust-web-ui 8089:8089 Go to localhost:8089 to start a load test <img width="1392" height="334" alt="image" src="https://github.com/user-attachments/assets/d6aa3deb-583a-42ed-889b-751262b8e91c" /> <img width="1362" height="881" alt="image" src="https://github.com/user-attachments/assets/6a28b9b4-05e6-44e2-b504-07e60c12d35e" />	2025-08-13 10:58:22 -07:00

7 commits