## Telemetry

The Llama Stack telemetry system provides comprehensive tracing, metrics, and logging capabilities. It supports multiple sink types including OpenTelemetry, SQLite, and Console output.

### Events

The telemetry system supports three main types of events:

- **Unstructured Log Events**: Free-form log messages with severity levels
```python
unstructured_log_event = UnstructuredLogEvent(
    message="This is a log message", severity=LogSeverity.INFO
)
```

- **Metric Events**: Numerical measurements with units
```python
metric_event = MetricEvent(metric="my_metric", value=10, unit="count")
```

- **Structured Log Events**: System events like span start/end. Extensible to add more structured log types.
```python
structured_log_event = SpanStartPayload(name="my_span", parent_span_id="parent_span_id")
```

### Spans and Traces

- **Spans**: Represent operations with timing and hierarchical relationships
- **Traces**: Collection of related spans forming a complete request flow (see the sketch below)
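
To make the relationship concrete, here is a minimal, illustrative sketch of a trace containing a parent span and a child span, expressed as structured log events. It assumes `StructuredLogEvent`, `SpanEndPayload`, and `SpanStatus` types alongside the `SpanStartPayload` shown above; field names and the float timestamp format mirror the examples in this document and may differ from the exact API.

```python
# Illustrative only: one trace ("1234567890abcdef") containing a parent span
# and a child span. The child points at its parent via parent_span_id.
parent_start = StructuredLogEvent(
    trace_id="1234567890abcdef",
    span_id="aaaa000000000001",
    timestamp=1703123456.000,
    payload=SpanStartPayload(name="handle_request", parent_span_id=None),
)
child_start = StructuredLogEvent(
    trace_id="1234567890abcdef",  # same trace as the parent
    span_id="aaaa000000000002",
    timestamp=1703123456.100,
    payload=SpanStartPayload(name="run_inference", parent_span_id="aaaa000000000001"),
)
child_end = StructuredLogEvent(
    trace_id="1234567890abcdef",
    span_id="aaaa000000000002",
    timestamp=1703123456.700,
    payload=SpanEndPayload(status=SpanStatus.OK),  # SpanEndPayload/SpanStatus assumed
)
parent_end = StructuredLogEvent(
    trace_id="1234567890abcdef",
    span_id="aaaa000000000001",
    timestamp=1703123456.789,
    payload=SpanEndPayload(status=SpanStatus.OK),
)
```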
### Metrics

Llama Stack automatically generates metrics during inference operations. These metrics are aggregated at the **inference request level** and provide insights into token usage and model performance.

#### Available Metrics

The following metrics are automatically generated for each inference request:

| Metric Name | Type | Unit | Description | Labels |
|-------------|------|------|-------------|--------|
| `llama_stack_prompt_tokens_total` | Counter | `tokens` | Number of tokens in the input prompt | `model_id`, `provider_id` |
| `llama_stack_completion_tokens_total` | Counter | `tokens` | Number of tokens in the generated response | `model_id`, `provider_id` |
| `llama_stack_tokens_total` | Counter | `tokens` | Total tokens used (prompt + completion) | `model_id`, `provider_id` |
| `llama_stack_requests_total` | Counter | `requests` | Total number of requests | `api`, `status` |
| `llama_stack_request_duration_seconds` | Gauge | `seconds` | Request duration | `api`, `status` |
| `llama_stack_concurrent_requests` | Gauge | `requests` | Number of concurrent requests | `api` |

#### Metric Generation Flow

1. **Token Counting**: During inference operations (chat completion, completion, etc.), the system counts tokens in both input prompts and generated responses
2. **Metric Construction**: For each request, `MetricEvent` objects are created with the token counts
3. **Telemetry Logging**: Metrics are sent to the configured telemetry sinks (see the sketch after this list)
4. **OpenTelemetry Export**: When OpenTelemetry is enabled, metrics are exposed as standard OpenTelemetry counters
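
As a rough illustration of steps 2 and 3, the sketch below builds a `MetricEvent` and hands it to the telemetry API. It assumes a `telemetry` object implementing the `Telemetry` protocol with an async `log_event` method; the exact signature may differ across Llama Stack versions.

```python
import time

# Hypothetical sketch: record the total token count for one inference request.
# `telemetry` is assumed to implement the Telemetry API; adapt the log_event()
# call to the actual signature in your Llama Stack version.
async def record_token_usage(telemetry, trace_id: str, span_id: str, total_tokens: int) -> None:
    event = MetricEvent(
        trace_id=trace_id,
        span_id=span_id,
        metric="total_tokens",
        value=total_tokens,
        timestamp=time.time(),
        unit="tokens",
        attributes={"model_id": "meta-llama/Llama-3.2-3B-Instruct", "provider_id": "tgi"},
    )
    await telemetry.log_event(event)
```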
#### Metric Aggregation Level

All metrics are generated and aggregated at the **inference request level**. This means:

- Each individual inference request generates its own set of metrics
- Metrics are not pre-aggregated across multiple requests
- Aggregation (sums, averages, etc.) can be performed by your observability tools (Prometheus, Grafana, etc.)
- Each metric includes labels for `model_id` and `provider_id` to enable filtering and grouping

#### Example Metric Event

```python
MetricEvent(
    trace_id="1234567890abcdef",
    span_id="abcdef1234567890",
    metric="total_tokens",
    value=150,
    timestamp=1703123456.789,
    unit="tokens",
    attributes={"model_id": "meta-llama/Llama-3.2-3B-Instruct", "provider_id": "tgi"},
)
```

#### Querying Metrics

When using the OpenTelemetry sink, metrics are exposed in standard OpenTelemetry format and can be queried through:

- **Prometheus**: Scrape metrics from the OpenTelemetry Collector's metrics endpoint
- **Grafana**: Create dashboards using Prometheus as a data source
- **OpenTelemetry Collector**: Forward metrics to other observability systems

Example Prometheus queries:

```promql
# Total tokens used across all models
sum(llama_stack_tokens_total)

# Tokens per model
sum by (model_id) (llama_stack_tokens_total)

# Token usage rate (tokens per second, averaged over the last 5 minutes)
rate(llama_stack_tokens_total[5m])
```
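
Because `llama_stack_requests_total` is exported alongside the token counters (see the table above), the average token count per request can be derived by combining the two series. One possible query, assuming both metrics are scraped into Prometheus:

```promql
# Average tokens per request over the last 5 minutes
sum(rate(llama_stack_tokens_total[5m])) / sum(rate(llama_stack_requests_total[5m]))
```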
### Sinks

- **OpenTelemetry**: Send events to an OpenTelemetry Collector. This is useful for visualizing traces in a tool like Jaeger and collecting metrics for Prometheus.
- **SQLite**: Store events in a local SQLite database. This is needed if you want to query the events later through the Llama Stack API.
- **Console**: Print events to the console.

### Providers

#### Meta-Reference Provider

Currently, only the meta-reference provider is implemented. It can be configured to send events to multiple sink types:

1) OpenTelemetry Collector (traces and metrics)
2) SQLite (traces only)
3) Console (all events)

#### Configuration

Here's an example that sends telemetry signals to all sink types. Your configuration might use only one or a subset.

```yaml
telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      service_name: "llama-stack-service"
      sinks: ['console', 'sqlite', 'otel_trace', 'otel_metric']
      otel_exporter_otlp_endpoint: "http://localhost:4318"
      sqlite_db_path: "/path/to/telemetry.db"
```

**Environment Variables:**

- `OTEL_EXPORTER_OTLP_ENDPOINT`: OpenTelemetry Collector endpoint (default: `http://localhost:4318`)
- `OTEL_SERVICE_NAME`: Service name for telemetry (default: empty string)
- `TELEMETRY_SINKS`: Comma-separated list of sinks (default: `console,sqlite`); see the example below
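
For example, the variables can be set in the shell before launching the server (shown here with `llama stack run`; substitute the path to your own run configuration):

```bash
# Override the telemetry defaults for this server process
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_SERVICE_NAME="llama-stack-service"
export TELEMETRY_SINKS="console,sqlite,otel_trace,otel_metric"

llama stack run /path/to/run.yaml
```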
### Jaeger to visualize traces

The `otel_trace` sink works with any service compatible with the OpenTelemetry collector. Traces and metrics use separate endpoints but can share the same collector.

Start a Jaeger instance with the OTLP HTTP endpoint at 4318 and the Jaeger UI at 16686 using the following command:

```bash
$ docker run --pull always --rm --name jaeger \
  -p 16686:16686 -p 4318:4318 \
  jaegertracing/jaeger:2.1.0
```

Once the Jaeger instance is running, you can visualize traces by navigating to http://localhost:16686/.
### Querying Traces Stored in SQLite

The `sqlite` sink allows you to query traces without an external system. Here are some example queries. Refer to the notebook at [Llama Stack Building AI Applications](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) for more examples on how to query traces and spans.
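
As a hedged sketch, traces stored by the `sqlite` sink could be listed through the Llama Stack API using the Python client. This assumes the `llama-stack-client` package and that it exposes a `client.telemetry.query_traces` method; check your client version's reference for the exact call.

```python
from llama_stack_client import LlamaStackClient

# Point the client at your running Llama Stack server (8321 is the default port;
# adjust base_url if your deployment differs).
client = LlamaStackClient(base_url="http://localhost:8321")

# List recent traces recorded by the sqlite sink. The method name and parameters
# are assumptions; consult your client version's API reference.
traces = client.telemetry.query_traces(limit=10)
print(traces)
```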