feat: api level request metrics via middleware

add RequestMetricsMiddleware, which tracks key metrics for each request the LLS server receives:

1. llama_stack_requests_total: tracks the total number of requests the server has processed
2. llama_stack_request_duration_seconds: tracks the duration of each request
3. llama_stack_concurrent_requests: tracks the number of requests the server is processing concurrently

Using a middleware allows this to be done at the server level without adding custom handling to each router, as the inference router does today for its API-specific metrics.

Also, add some unit tests for this functionality.

resolves #2597

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-10-04 04:04:14 +00:00 · 2025-07-11 20:52:32 -04:00 · 2025-07-11 20:52:32 -04:00 · 49b729b30a
commit 49b729b30a
parent dbfc15123e
4 changed files with 433 additions and 0 deletions
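
The middleware approach described in the commit message can be illustrated with a minimal sketch. This is not the code added by this commit: the class name `RequestMetricsSketch`, the in-memory dictionaries standing in for a real telemetry backend, the `/v1/<api>/...` path convention, and the use of the numeric HTTP status code as the `status` label are all assumptions made for illustration. It only shows how a single ASGI middleware can maintain a request counter, a per-request duration, and an in-flight count without touching individual routers.

```python
import asyncio
import time
from collections import defaultdict


class RequestMetricsSketch:
    """Hypothetical pure-ASGI middleware that records request metrics in memory."""

    def __init__(self, app):
        self.app = app
        self.requests_total = defaultdict(int)  # (api, status) -> request count
        self.durations = defaultdict(list)      # (api, status) -> list of seconds
        self.concurrent = defaultdict(int)      # api -> requests currently in flight

    @staticmethod
    def api_from_path(path):
        # Assumption: routes look like /v1/<api>/...; anything else is "unknown".
        parts = [p for p in path.split("/") if p]
        return parts[1] if len(parts) > 1 else "unknown"

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass non-HTTP traffic (lifespan, websockets) through untouched.
            await self.app(scope, receive, send)
            return

        api = self.api_from_path(scope.get("path", ""))
        status = {"code": 500}  # reported if the app fails before responding

        async def send_wrapper(message):
            # Capture the status code as the response starts.
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        self.concurrent[api] += 1
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            # Record metrics even when the wrapped app raises.
            elapsed = time.perf_counter() - start
            self.concurrent[api] -= 1
            key = (api, status["code"])
            self.requests_total[key] += 1
            self.durations[key].append(elapsed)


async def demo_app(scope, receive, send):
    # Stand-in for the real server: always answers 200 with a tiny body.
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def _receive():
    return {"type": "http.request", "body": b"", "more_body": False}


async def _send(message):
    pass


metrics = RequestMetricsSketch(demo_app)
demo_scope = {"type": "http", "path": "/v1/inference/chat-completion"}
asyncio.run(metrics(demo_scope, _receive, _send))
```

Because the counters are updated in a `finally` block, a request that raises is still counted and the in-flight count still returns to its previous value. A real implementation would presumably export these values through the telemetry API rather than keeping them in dictionaries.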
--- a/docs/source/building_applications/telemetry.md
+++ b/docs/source/building_applications/telemetry.md
@@ -37,6 +37,9 @@ The following metrics are automatically generated for each inference request:
 | `llama_stack_prompt_tokens_total` | Counter | `tokens` | Number of tokens in the input prompt | `model_id`, `provider_id` |
 | `llama_stack_completion_tokens_total` | Counter | `tokens` | Number of tokens in the generated response | `model_id`, `provider_id` |
 | `llama_stack_tokens_total` | Counter | `tokens` | Total tokens used (prompt + completion) | `model_id`, `provider_id` |
+| `llama_stack_requests_total` | Counter | `requests` | Total number of requests | `api`, `status` |
+| `llama_stack_request_duration_seconds` | Gauge | `seconds` | Request duration | `api`, `status` |
+| `llama_stack_concurrent_requests` | Gauge | `requests` | Number of concurrent requests | `api` |

 #### Metric Generation Flow