Matthew Farrellee
e7a812f5de
chore: Fixup main pre commit (#3204)
2025-08-19 14:52:38 -04:00

ehhuang
2c06b24c77
test: benchmark scripts (#3160)

# What does this PR do?
1. Add our own benchmark script to replace locust, which doesn't support
measuring streaming latency well
2. Simplify the k8s deployment
3. Add a simple profile script for a locally running server
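
Since item 1 hinges on measuring streaming latency, here is a minimal sketch of how time to first token could be timed. It assumes the streamed response is exposed as a plain Python iterator of chunks; `measure_stream`, `StreamTiming`, and `fake_stream` are hypothetical names for illustration, not taken from the benchmark script in this PR.

```python
import time
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class StreamTiming:
    ttft: Optional[float]  # seconds until the first chunk; None if no chunk arrived
    total: float           # seconds until the stream was exhausted
    chunks: int            # number of chunks received


def measure_stream(stream: Iterable[object]) -> StreamTiming:
    """Consume a stream, recording time to first chunk and total time."""
    start = time.perf_counter()
    ttft: Optional[float] = None
    chunks = 0
    for _ in stream:
        if ttft is None:
            # The first chunk marks the time to first token (TTFT).
            ttft = time.perf_counter() - start
        chunks += 1
    return StreamTiming(ttft=ttft, total=time.perf_counter() - start, chunks=chunks)


def fake_stream(n: int, delay: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming HTTP response."""
    for i in range(n):
        time.sleep(delay)
        yield f"chunk-{i}"
```

Calling `measure_stream(fake_stream(5, 0.01))` returns one `StreamTiming` per request; a real client would pass the chunk iterator of its HTTP library instead of `fake_stream`.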
## Test Plan
❮ ./run-benchmark.sh --target stack --duration 180 --concurrent 10
============================================================
BENCHMARK RESULTS
============================================================
Total time: 180.00s
Concurrent users: 10
Total requests: 1636
Successful requests: 1636
Failed requests: 0
Success rate: 100.0%
Requests per second: 9.09
Response Time Statistics:
  Mean: 1.095s
  Median: 1.721s
  Min: 0.136s
  Max: 3.218s
  Std Dev: 0.762s
Percentiles:
  P50: 1.721s
  P90: 1.751s
  P95: 1.756s
  P99: 1.796s
Time to First Token (TTFT) Statistics:
  Mean: 0.037s
  Median: 0.037s
  Min: 0.023s
  Max: 0.211s
  Std Dev: 0.011s
TTFT Percentiles:
  P50: 0.037s
  P90: 0.040s
  P95: 0.044s
  P99: 0.055s
Streaming Statistics:
  Mean chunks per response: 64.0
  Total chunks received: 104775 
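
A summary like the one above could be derived from a list of per-request latencies using only the standard library. This is a sketch, not the script's actual code: `summarize` and `percentile` are hypothetical helpers, and the nearest-rank percentile definition is an assumption (the script may interpolate instead).

```python
import math
import statistics
from typing import Dict, Sequence


def percentile(sorted_values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of an ascending, non-empty sequence (0 < p <= 100)."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize(latencies: Sequence[float]) -> Dict[str, float]:
    """Return mean, median, min, max, standard deviation, and P50/P90/P95/P99."""
    values = sorted(latencies)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": values[0],
        "max": values[-1],
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "p50": percentile(values, 50),
        "p90": percentile(values, 90),
        "p95": percentile(values, 95),
        "p99": percentile(values, 99),
    }
```

The same helper can be applied to the TTFT samples. Requests per second is total requests divided by wall time (1636 / 180 s ≈ 9.09 in the run above).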
2025-08-15 11:24:29 -07:00

ashwinb
47d5af703c
chore(responses): Refactor Responses Impl to be civilized (#3138)

# What does this PR do?
Refactors the OpenAI responses implementation by extracting streaming and tool execution logic into separate modules. This improves code organization by:
1. Creating a new `StreamingResponseOrchestrator` class in `streaming.py` to handle the streaming response generation logic
2. Moving tool execution functionality to a dedicated `ToolExecutor` class in `tool_executor.py`
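
The separation described above might look roughly like the sketch below. Only the two class names come from this PR; every method name, signature, and event shape here is a hypothetical illustration of the split between streaming and tool execution, not the actual implementation.

```python
import asyncio
from typing import Any, AsyncIterator, Awaitable, Callable, Dict, List


class ToolExecutor:
    """Owns tool lookup and invocation (interface is hypothetical)."""

    def __init__(self, tools: Dict[str, Callable[..., Awaitable[Any]]]) -> None:
        self._tools = tools

    async def execute(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return await self._tools[name](**kwargs)


class StreamingResponseOrchestrator:
    """Owns the streaming loop and delegates tool calls (interface is hypothetical)."""

    def __init__(self, tool_executor: ToolExecutor) -> None:
        self._tool_executor = tool_executor

    async def stream(self, steps: List[Dict[str, Any]]) -> AsyncIterator[Dict[str, Any]]:
        for step in steps:
            if step["type"] == "tool_call":
                # Tool handling lives in ToolExecutor, not in the streaming loop.
                result = await self._tool_executor.execute(step["name"], **step["args"])
                yield {"type": "tool_result", "output": result}
            else:
                yield {"type": "text", "output": step["text"]}
```

Keeping the executor behind a small interface means the streaming loop can be tested with stub tools, and tool execution can be tested without a stream.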
## Test Plan
Existing tests
2025-08-15 00:05:35 +00:00

ehhuang
d6ae54723d
chore: setup for performance benchmarking (#3096)

# What does this PR do?
1. Add a simple mock OpenAI-compatible server that serves chat/completion
2. Add a benchmark server in EKS that includes the mock inference server
3. Add a locust (https://locust.io/) file for load testing
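
For readers unfamiliar with item 1, a mock OpenAI-compatible streaming endpoint can be sketched with the standard library alone. This is not the server added in this PR: the `/v1/chat/completions` path, the port, the canned tokens, and the names `sse_chunks`, `MockChatHandler`, and `serve` are all assumptions for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Iterator, List


def sse_chunks(tokens: List[str], model: str = "mock-model") -> Iterator[bytes]:
    """Yield server-sent events shaped like OpenAI chat completion chunks."""
    for token in tokens:
        payload = {
            "id": "chatcmpl-mock",
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [{"index": 0, "delta": {"content": token}, "finish_reason": None}],
        }
        yield f"data: {json.dumps(payload)}\n\n".encode()
    # OpenAI-style streams end with a literal [DONE] sentinel.
    yield b"data: [DONE]\n\n"


class MockChatHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/v1/chat/completions":  # assumed path
            self.send_error(404)
            return
        # Drain the request body; this mock ignores its contents.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self.send_response(200)
        self.send_header("Content-Type", "text/event-stream")
        self.end_headers()
        for chunk in sse_chunks(["Hello", ", ", "world"]):
            self.wfile.write(chunk)
            self.wfile.flush()


def serve(port: int = 8080) -> None:
    """Run the mock server until interrupted (port is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), MockChatHandler).serve_forever()
```

Because the mock returns canned tokens with no model behind it, latency measured against it reflects the serving stack rather than inference.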
## Test Plan
bash apply.sh
kubectl port-forward service/locust-web-ui 8089:8089
Go to localhost:8089 to start a load test
<img width="1392" height="334" alt="image"
src="https://github.com/user-attachments/assets/d6aa3deb-583a-42ed-889b-751262b8e91c"
/>
<img width="1362" height="881" alt="image"
src="https://github.com/user-attachments/assets/6a28b9b4-05e6-44e2-b504-07e60c12d35e"
/>
2025-08-13 10:58:22 -07:00