Francisco Javier Arceo
ff0bd414b1
chore: update documentation, add exception handling for Vector Stores in the RAG Tool, switch inference to OpenAI, and update the memory implementation to use existing libraries
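A minimal sketch of the exception-handling pattern this commit describes for Vector Store queries in the RAG Tool, assuming a hypothetical `vector_store.query` coroutine (names are illustrative, not the actual llama-stack API):
```
import logging

logger = logging.getLogger(__name__)

async def query_vector_store(vector_store, query: str, top_k: int = 5):
    """Query a vector store, degrading gracefully instead of failing the whole RAG call."""
    try:
        return await vector_store.query(query, top_k=top_k)
    except Exception as exc:
        # Log and fall back to an empty result so the tool can still answer without retrieved context.
        logger.warning("Vector store query failed: %s", exc)
        return []
```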
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-09-10 15:55:29 -04:00
ehhuang
bcc7f2c7d0
chore: async inference store write (#3318)
# What does this PR do?
## Test Plan
```
cd docs/source/distributions/k8s-benchmark
# start mock server
python openai-mock-server.py --port 8000
# start stack server
uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml
# run benchmark script
uv run python3 benchmark.py --duration 30 --concurrent 50 --base-url=http://localhost:8321/v1/openai/v1 --model=vllm-inference/meta-llama/Llama-3.2-3B-Instruct
```
Before:
```
============================================================
BENCHMARK RESULTS
============================================================
Total time: 30.00s
Concurrent users: 50
Total requests: 1267
Successful requests: 1267
Failed requests: 0
Success rate: 100.0%
Requests per second: 42.23
```
After:
```
============================================================
BENCHMARK RESULTS
============================================================
Total time: 30.00s
Concurrent users: 50
Total requests: 1449
Successful requests: 1449
Failed requests: 0
Success rate: 100.0%
Requests per second: 48.30
```
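The throughput gain above is consistent with moving the inference-store write off the request path. A minimal sketch of that pattern, assuming a hypothetical `store.write_chat_completion` coroutine rather than the actual llama-stack interface:
```
import asyncio
import logging

logger = logging.getLogger(__name__)

def _log_write_failure(task: asyncio.Task) -> None:
    # Retrieve the task result so background failures are logged instead of silently dropped.
    if not task.cancelled() and task.exception() is not None:
        logger.warning("async inference-store write failed: %s", task.exception())

async def handle_chat_completion(request, inference, store):
    """Return the completion immediately; persist it in the background."""
    response = await inference.chat_completion(request)
    # Fire-and-forget write: the client no longer waits on storage latency.
    task = asyncio.create_task(store.write_chat_completion(request, response))
    task.add_done_callback(_log_write_failure)
    return response
```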
2025-09-04 11:37:46 -07:00
Matthew Farrellee
e7a812f5de
chore: Fixup main pre commit (#3204)
2025-08-19 14:52:38 -04:00
ehhuang
2c06b24c77
test: benchmark scripts (#3160)
# What does this PR do?
1. Add our own benchmark script instead of locust (locust doesn't measure streaming latency well)
2. Simplify k8s deployment
3. Add a simple profile script for locally running server
## Test Plan
```
❮ ./run-benchmark.sh --target stack --duration 180 --concurrent 10
============================================================
BENCHMARK RESULTS
============================================================
Total time: 180.00s
Concurrent users: 10
Total requests: 1636
Successful requests: 1636
Failed requests: 0
Success rate: 100.0%
Requests per second: 9.09

Response Time Statistics:
Mean: 1.095s
Median: 1.721s
Min: 0.136s
Max: 3.218s
Std Dev: 0.762s

Percentiles:
P50: 1.721s
P90: 1.751s
P95: 1.756s
P99: 1.796s

Time to First Token (TTFT) Statistics:
Mean: 0.037s
Median: 0.037s
Min: 0.023s
Max: 0.211s
Std Dev: 0.011s

TTFT Percentiles:
P50: 0.037s
P90: 0.040s
P95: 0.044s
P99: 0.055s

Streaming Statistics:
Mean chunks per response: 64.0
Total chunks received: 104775
```
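The TTFT and chunk statistics above come from timing a streaming response chunk by chunk. A minimal sketch of that measurement against an OpenAI-compatible endpoint, using the `openai` Python client with illustrative parameters:
```
import time
from openai import AsyncOpenAI

async def measure_one_request(base_url: str, model: str) -> dict:
    """Time a single streaming chat completion: total latency, TTFT, and chunk count."""
    client = AsyncOpenAI(base_url=base_url, api_key="dummy")
    start = time.perf_counter()
    ttft = None
    chunks = 0
    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Hello"}],
        stream=True,
    )
    async for _chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first streamed chunk
        chunks += 1
    return {"total": time.perf_counter() - start, "ttft": ttft, "chunks": chunks}
```
Running many of these coroutines concurrently (e.g. with `asyncio.gather`) and aggregating the per-request dicts yields percentile tables like the ones shown above.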
2025-08-15 11:24:29 -07:00
ashwinb
47d5af703c
chore(responses): Refactor Responses Impl to be civilized (#3138)
# What does this PR do?
Refactors the OpenAI responses implementation by extracting streaming and tool execution logic into separate modules. This improves code organization by:
1. Creating a new `StreamingResponseOrchestrator` class in `streaming.py` to handle the streaming response generation logic
2. Moving tool execution functionality to a dedicated `ToolExecutor` class in `tool_executor.py`
## Test Plan
Existing tests
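A rough structural sketch of the split described above; everything beyond the two class names mentioned in the PR is an assumption, not the actual llama-stack code:
```
from typing import Any, AsyncIterator

class ToolExecutor:
    """tool_executor.py: runs the tool calls requested by the model (sketch)."""

    async def execute(self, tool_call: Any) -> Any:
        # Dispatch to the registered tool implementation and return its output.
        ...

class StreamingResponseOrchestrator:
    """streaming.py: drives streaming response generation, delegating tool calls (sketch)."""

    def __init__(self, inference: Any, tool_executor: ToolExecutor) -> None:
        self.inference = inference
        self.tool_executor = tool_executor

    async def stream(self, request: Any) -> AsyncIterator[Any]:
        async for event in self.inference.stream(request):
            if getattr(event, "tool_call", None):
                yield await self.tool_executor.execute(event.tool_call)
            else:
                yield event
```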
2025-08-15 00:05:35 +00:00
ehhuang
d6ae54723d
chore: setup for performance benchmarking (#3096)
# What does this PR do?
1. Add a simple mock openai-compat server that serves chat completions
2. Add a benchmark server in EKS that includes the mock inference server
3. Add a locust (https://locust.io/) file for load testing
## Test Plan
```
bash apply.sh
kubectl port-forward service/locust-web-ui 8089:8089
```
Go to localhost:8089 to start a load test.
<img width="1392" height="334" alt="image"
src="https://github.com/user-attachments/assets/d6aa3deb-583a-42ed-889b-751262b8e91c "
/>
<img width="1362" height="881" alt="image"
src="https://github.com/user-attachments/assets/6a28b9b4-05e6-44e2-b504-07e60c12d35e "
/>
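Item 1's mock server reduces to an HTTP endpoint that returns a canned, OpenAI-compatible chat completion. A minimal sketch using FastAPI (the actual openai-mock-server.py referenced earlier may differ; this only shows the shape of the endpoint):
```
import time
import uuid

from fastapi import FastAPI

app = FastAPI()

@app.post("/v1/chat/completions")
async def chat_completions(body: dict):
    """Return a canned, OpenAI-compatible chat completion for load testing."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": body.get("model", "mock-model"),
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": "This is a mock response."},
            "finish_reason": "stop",
        }],
        "usage": {"prompt_tokens": 1, "completion_tokens": 7, "total_tokens": 8},
    }
```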
2025-08-13 10:58:22 -07:00