feat: improve telemetry (#2590)

# What does this PR do?

* Use a single environment variable to set up the OTEL endpoint
* Update the telemetry provider doc
* Update the general telemetry doc with the metrics we generate
* Leave a script to set up telemetry for testing

Closes: https://github.com/meta-llama/llama-stack/issues/783

Note to reviewer: the `setup_telemetry.sh` script was useful to me and was nicely generated by AI. If we don't want it in the repo, I can delete it, and I would understand.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-07-06 14:00:42 +00:00 · 2025-07-04 17:29:09 +02:00 · 2025-07-04 17:29:09 +02:00 · ea966565f6
commit ea966565f6
parent 4eae0cbfa4
11 changed files with 237 additions and 38 deletions
--- a/docs/source/building_applications/telemetry.md
+++ b/docs/source/building_applications/telemetry.md
@@ -24,37 +24,106 @@ structured_log_event = SpanStartPayload(name="my_span", parent_span_id="parent_s
 - **Spans**: Represent operations with timing and hierarchical relationships
 - **Traces**: Collection of related spans forming a complete request flow

+### Metrics
+
+Llama Stack automatically generates metrics during inference operations. These metrics are aggregated at the **inference request level** and provide insights into token usage and model performance.
+
+#### Available Metrics
+
+The following metrics are automatically generated for each inference request:
+
+| Metric Name | Type | Unit | Description | Labels |
+|-------------|------|------|-------------|--------|
+| `llama_stack_prompt_tokens_total` | Counter | `tokens` | Number of tokens in the input prompt | `model_id`, `provider_id` |
+| `llama_stack_completion_tokens_total` | Counter | `tokens` | Number of tokens in the generated response | `model_id`, `provider_id` |
+| `llama_stack_tokens_total` | Counter | `tokens` | Total tokens used (prompt + completion) | `model_id`, `provider_id` |
+
+#### Metric Generation Flow
+
+1. **Token Counting**: During inference operations (chat completion, completion, etc.), the system counts tokens in both input prompts and generated responses
+2. **Metric Construction**: For each request, `MetricEvent` objects are created with the token counts
+3. **Telemetry Logging**: Metrics are sent to the configured telemetry sinks
+4. **OpenTelemetry Export**: When OpenTelemetry is enabled, metrics are exposed as standard OpenTelemetry counters
+
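Note (illustrative, not part of this change): the flow above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the Llama Stack implementation. `TokenMetric`, `build_token_metrics`, and `log_metrics` are hypothetical names, and the metric names `prompt_tokens` and `completion_tokens` are assumed by analogy with the `total_tokens` name used in this doc's `MetricEvent` example.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TokenMetric:
    """Hypothetical stand-in for a per-request metric event."""

    metric: str
    value: int
    unit: str = "tokens"
    timestamp: float = field(default_factory=time.time)
    attributes: dict = field(default_factory=dict)


def build_token_metrics(prompt_tokens, completion_tokens, model_id, provider_id):
    """Steps 1-2: turn the token counts of one request into three events."""
    attrs = {"model_id": model_id, "provider_id": provider_id}
    counts = {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
    return [
        TokenMetric(metric=name, value=value, attributes=dict(attrs))
        for name, value in counts.items()
    ]


def log_metrics(events, sinks):
    """Step 3: hand every event to every configured sink (any callable)."""
    for event in events:
        for sink in sinks:
            sink(event)


# Usage: a plain list stands in for a real sink.
collected = []
events = build_token_metrics(100, 50, "meta-llama/Llama-3.2-3B-Instruct", "tgi")
log_metrics(events, [collected.append])
```

Step 4 (export as OpenTelemetry counters) is left out here because it depends on the OpenTelemetry SDK setup, which this sketch does not assume.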
+#### Metric Aggregation Level
+
+All metrics are generated and aggregated at the **inference request level**. This means:
+
+- Each individual inference request generates its own set of metrics
+- Metrics are not pre-aggregated across multiple requests
+- Aggregation (sums, averages, etc.) can be performed by your observability tools (Prometheus, Grafana, etc.)
+- Each metric includes labels for `model_id` and `provider_id` to enable filtering and grouping
+
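Note (illustrative, not part of this change): because events are emitted per request, any grouping happens downstream. The sketch below shows that kind of aggregation in plain Python, roughly what a `sum by (model_id)` query does in an observability tool. The event dictionaries and the `sum_by_label` helper are hypothetical; only the field names (`metric`, `value`, `attributes`) follow the `MetricEvent` example in this doc.

```python
from collections import defaultdict


def sum_by_label(events, metric, label):
    """Sum the values of one metric, grouped by one attribute label."""
    totals = defaultdict(int)
    for event in events:
        if event["metric"] == metric:
            totals[event["attributes"][label]] += event["value"]
    return dict(totals)


# Hypothetical per-request events from three separate inference requests.
request_events = [
    {"metric": "total_tokens", "value": 150,
     "attributes": {"model_id": "model-a", "provider_id": "tgi"}},
    {"metric": "total_tokens", "value": 90,
     "attributes": {"model_id": "model-a", "provider_id": "tgi"}},
    {"metric": "total_tokens", "value": 40,
     "attributes": {"model_id": "model-b", "provider_id": "tgi"}},
    {"metric": "prompt_tokens", "value": 100,
     "attributes": {"model_id": "model-a", "provider_id": "tgi"}},
]

per_model = sum_by_label(request_events, "total_tokens", "model_id")
per_provider = sum_by_label(request_events, "total_tokens", "provider_id")
```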
+#### Example Metric Event
+
+```python
+MetricEvent(
+    trace_id="1234567890abcdef",
+    span_id="abcdef1234567890",
+    metric="total_tokens",
+    value=150,
+    timestamp=1703123456.789,
+    unit="tokens",
+    attributes={"model_id": "meta-llama/Llama-3.2-3B-Instruct", "provider_id": "tgi"},
+)
+```
+
+#### Querying Metrics
+
+When using the OpenTelemetry sink, metrics are exposed in standard OpenTelemetry format and can be queried through:
+
+- **Prometheus**: Scrape metrics from the OpenTelemetry Collector's metrics endpoint
+- **Grafana**: Create dashboards using Prometheus as a data source
+- **OpenTelemetry Collector**: Forward metrics to other observability systems
+
+Example Prometheus queries:
+```promql
+# Total tokens used across all models
+sum(llama_stack_tokens_total)
+
+# Tokens per model
+sum by (model_id) (llama_stack_tokens_total)
+
+# Token throughput (tokens per second, averaged over the last 5 minutes)
+rate(llama_stack_tokens_total[5m])
+```
+
 ### Sinks
- **OpenTelemetry**: Send events to an OpenTelemetry Collector. This is useful for visualizing traces in a tool like Jaeger.
+- **OpenTelemetry**: Send events to an OpenTelemetry Collector. This is useful for visualizing traces in a tool like Jaeger and collecting metrics for Prometheus.
 - **SQLite**: Store events in a local SQLite database. This is needed if you want to query the events later through the Llama Stack API.
 - **Console**: Print events to the console.

 ### Providers

 #### Meta-Reference Provider
-Currently, only the meta-reference provider is implemented. It can be configured to send events to three sink types:
-1) OpenTelemetry Collector
-2) SQLite
-3) Console
+Currently, only the meta-reference provider is implemented. It can be configured to send events to multiple sink types:
+1) OpenTelemetry Collector (traces and metrics)
+2) SQLite (traces only)
+3) Console (all events)

 #### Configuration

-Here's an example that sends telemetry signals to all three sink types. Your configuration might use only one.
+Here's an example that sends telemetry signals to all sink types. Your configuration might use only one or a subset.
+
 ```yaml
  telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
+      service_name: "llama-stack-service"
      sinks: ['console', 'sqlite', 'otel_trace', 'otel_metric']
-      otel_trace_endpoint: "http://localhost:4318/v1/traces"
-      otel_metric_endpoint: "http://localhost:4318/v1/metrics"
+      otel_exporter_otlp_endpoint: "http://localhost:4318"
      sqlite_db_path: "/path/to/telemetry.db"
 ```

+**Environment Variables:**
+- `OTEL_EXPORTER_OTLP_ENDPOINT`: OpenTelemetry Collector endpoint (default: `http://localhost:4318`)
+- `OTEL_SERVICE_NAME`: Service name for telemetry (default: empty string)
+- `TELEMETRY_SINKS`: Comma-separated list of sinks (default: `console,sqlite`)
+
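Note (illustrative, not part of this change): the sketch below shows how the three environment variables could be resolved into concrete settings. It assumes the usual OTLP/HTTP convention of appending `/v1/traces` and `/v1/metrics` to a base endpoint, and it uses the defaults listed above. The helper names are hypothetical, and whether Llama Stack derives the per-signal URLs itself or leaves that to the OpenTelemetry SDK is not shown in this diff.

```python
import os


def signal_endpoints(base_url):
    """Derive per-signal OTLP/HTTP URLs from a single base endpoint."""
    base = base_url.rstrip("/")
    return {"traces": f"{base}/v1/traces", "metrics": f"{base}/v1/metrics"}


def parse_sinks(raw):
    """Split a comma-separated sink list, ignoring blanks and whitespace."""
    return [sink.strip() for sink in raw.split(",") if sink.strip()]


def resolve_telemetry_settings(env=os.environ):
    """Read the three variables, falling back to the documented defaults."""
    return {
        "endpoints": signal_endpoints(
            env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")
        ),
        "service_name": env.get("OTEL_SERVICE_NAME", ""),
        "sinks": parse_sinks(env.get("TELEMETRY_SINKS", "console,sqlite")),
    }


# Usage with an explicit mapping instead of the real process environment.
settings = resolve_telemetry_settings(
    {
        "OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector:4318/",
        "TELEMETRY_SINKS": "console, otel_trace,otel_metric",
    }
)
defaults = resolve_telemetry_settings({})
```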
 ### Jaeger to visualize traces

-The `otel` sink works with any service compatible with the OpenTelemetry collector, traces and metrics has two separate endpoints.
-Let's use Jaeger to visualize this data.
+The `otel_trace` sink works with any service compatible with the OpenTelemetry Collector. Traces and metrics are sent to separate paths (`/v1/traces` and `/v1/metrics`) but can share the same Collector.

 Start a Jaeger instance with the OTLP HTTP endpoint at 4318 and the Jaeger UI at 16686 using the following command:

@@ -68,4 +137,7 @@ Once the Jaeger instance is running, you can visualize traces by navigating to h

 ### Querying Traces Stored in SQLite

-The `sqlite` sink allows you to query traces without an external system. Here are some example queries. Refer to the notebook at [Llama Stack Building AI Applications](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) for more examples on how to query traces and spaces.
+The `sqlite` sink allows you to query traces without an external system. Here are some example
+queries. Refer to the notebook at [Llama Stack Building AI
+Applications](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) for
+more examples on how to query traces and spans.
--- a/docs/source/providers/telemetry/inline_meta-reference.md
+++ b/docs/source/providers/telemetry/inline_meta-reference.md
@@ -8,10 +8,9 @@ Meta's reference implementation of telemetry and observability using OpenTelemet

 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
-| `otel_trace_endpoint` | `str \| None` | No |  | The OpenTelemetry collector endpoint URL for traces |
-| `otel_metric_endpoint` | `str \| None` | No |  | The OpenTelemetry collector endpoint URL for metrics |
+| `otel_exporter_otlp_endpoint` | `str \| None` | No |  | The OpenTelemetry Collector endpoint URL (base URL for traces, metrics, and logs). If not set, the SDK uses the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable. |
 | `service_name` | `<class 'str'>` | No |  | The service name to use for telemetry |
-| `sinks` | `list[inline.telemetry.meta_reference.config.TelemetrySink` | No | [<TelemetrySink.CONSOLE: 'console'>, <TelemetrySink.SQLITE: 'sqlite'>] | List of telemetry sinks to enable (possible values: otel, sqlite, console) |
+| `sinks` | `list[inline.telemetry.meta_reference.config.TelemetrySink]` | No | [<TelemetrySink.CONSOLE: 'console'>, <TelemetrySink.SQLITE: 'sqlite'>] | List of telemetry sinks to enable (possible values: otel_trace, otel_metric, sqlite, console) |
 | `sqlite_db_path` | `<class 'str'>` | No | ~/.llama/runtime/trace_store.db | The path to the SQLite database to use for storing traces |

 ## Sample Configuration
@@ -20,6 +19,7 @@ Meta's reference implementation of telemetry and observability using OpenTelemet
 service_name: "${env.OTEL_SERVICE_NAME:=\u200B}"
 sinks: ${env.TELEMETRY_SINKS:=console,sqlite}
 sqlite_db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/trace_store.db
+otel_exporter_otlp_endpoint: ${env.OTEL_EXPORTER_OTLP_ENDPOINT:=}

 ```