## Telemetry

The Llama Stack telemetry system provides comprehensive tracing, metrics, and logging capabilities. It supports multiple sink types including OpenTelemetry, SQLite, and Console output.

### Events

The telemetry system supports three main types of events:

- **Unstructured Log Events**: Free-form log messages with severity levels
```python
unstructured_log_event = UnstructuredLogEvent(
    message="This is a log message", severity=LogSeverity.INFO
)
```
- **Metric Events**: Numerical measurements with units
```python
metric_event = MetricEvent(metric="my_metric", value=10, unit="count")
```
- **Structured Log Events**: System events like span start/end. Extensible to add more structured log types.
```python
structured_log_event = SpanStartPayload(name="my_span", parent_span_id="parent_span_id")
```

### Spans and Traces

- **Spans**: Represent operations with timing and hierarchical relationships
- **Traces**: Collection of related spans forming a complete request flow (see the sketch below)

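A minimal sketch of that hierarchy, using the `SpanStartPayload` shown above: a root span opens a trace, and child spans attach to it via `parent_span_id`. The import path, the placeholder IDs, and the exact field requirements are assumptions for illustration, not taken from this page.

```python
from llama_stack.apis.telemetry import SpanStartPayload  # assumed import path

# The root span has no parent; it opens a new trace for the incoming request.
root_span_start = SpanStartPayload(name="chat_completion")

# Nested operations (e.g. the model call) open child spans that point at their
# parent's span ID, which the tracing system assigns when the root span starts.
child_span_start = SpanStartPayload(
    name="model_inference", parent_span_id="<root-span-id>"
)

# All spans sharing the same trace ID, closed by matching span-end events,
# together form the trace for that request.
```
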
### Metrics

Llama Stack automatically generates metrics during inference operations. These metrics are aggregated at the **inference request level** and provide insights into token usage and model performance.

#### Available Metrics

The following metrics are automatically generated for each inference request:

| Metric Name | Type | Unit | Description | Labels |
|-------------|------|------|-------------|--------|
| `llama_stack_prompt_tokens_total` | Counter | `tokens` | Number of tokens in the input prompt | `model_id`, `provider_id` |
| `llama_stack_completion_tokens_total` | Counter | `tokens` | Number of tokens in the generated response | `model_id`, `provider_id` |
| `llama_stack_tokens_total` | Counter | `tokens` | Total tokens used (prompt + completion) | `model_id`, `provider_id` |

#### Metric Generation Flow

1. **Token Counting**: During inference operations (chat completion, completion, etc.), the system counts tokens in both input prompts and generated responses
2. **Metric Construction**: For each request, `MetricEvent` objects are created with the token counts (sketched after this list)
3. **Telemetry Logging**: Metrics are sent to the configured telemetry sinks
4. **OpenTelemetry Export**: When OpenTelemetry is enabled, metrics are exposed as standard OpenTelemetry counters

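As a rough sketch of steps 2 and 3, the per-request token counts might be turned into `MetricEvent` objects along these lines. The field values mirror the example event shown further down; the import path and the `log_event` call are assumptions about the telemetry API, not taken from this page.

```python
from llama_stack.apis.telemetry import MetricEvent  # assumed import path

prompt_tokens, completion_tokens = 117, 33  # counted during one inference request

token_metrics = [
    MetricEvent(
        trace_id="1234567890abcdef",
        span_id="abcdef1234567890",
        metric=name,
        value=value,
        timestamp=1703123456.789,
        unit="tokens",
        attributes={"model_id": "meta-llama/Llama-3.2-3B-Instruct", "provider_id": "tgi"},
    )
    for name, value in [
        ("prompt_tokens", prompt_tokens),
        ("completion_tokens", completion_tokens),
        ("total_tokens", prompt_tokens + completion_tokens),
    ]
]

# Each event is then handed to the telemetry provider, which fans it out to the
# configured sinks -- e.g. `await telemetry.log_event(event)` (assumed API).
```
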
#### Metric Aggregation Level

All metrics are generated and aggregated at the **inference request level**. This means:

- Each individual inference request generates its own set of metrics
- Metrics are not pre-aggregated across multiple requests
- Aggregation (sums, averages, etc.) can be performed by your observability tools (Prometheus, Grafana, etc.)
- Each metric includes labels for `model_id` and `provider_id` to enable filtering and grouping

#### Example Metric Event

```python
MetricEvent(
    trace_id="1234567890abcdef",
    span_id="abcdef1234567890",
    metric="total_tokens",
    value=150,
    timestamp=1703123456.789,
    unit="tokens",
    attributes={"model_id": "meta-llama/Llama-3.2-3B-Instruct", "provider_id": "tgi"},
)
```

#### Querying Metrics

When using the OpenTelemetry sink, metrics are exposed in standard OpenTelemetry format and can be queried through:

- **Prometheus**: Scrape metrics from the OpenTelemetry Collector's metrics endpoint
- **Grafana**: Create dashboards using Prometheus as a data source
- **OpenTelemetry Collector**: Forward metrics to other observability systems

Example Prometheus queries:

```promql
# Total tokens used across all models
sum(llama_stack_tokens_total)

# Tokens per model
sum by (model_id) (llama_stack_tokens_total)

# Token consumption rate over the last 5 minutes (tokens per second)
rate(llama_stack_tokens_total[5m])
```

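The same numbers can also be pulled programmatically through Prometheus's HTTP API. A minimal sketch, assuming Prometheus is reachable on its default port 9090 and the `requests` package is installed:

```python
import requests

# Run one of the PromQL queries above via the Prometheus HTTP API.
resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": "sum by (model_id) (llama_stack_tokens_total)"},
    timeout=10,
)
resp.raise_for_status()

# Each result carries its label set and the latest sample as [timestamp, value].
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("model_id", "unknown"), result["value"][1])
```
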
### Sinks

- **OpenTelemetry**: Send events to an OpenTelemetry Collector. This is useful for visualizing traces in a tool like Jaeger and collecting metrics for Prometheus.
- **SQLite**: Store events in a local SQLite database. This is needed if you want to query the events later through the Llama Stack API.
- **Console**: Print events to the console.

### Providers

#### Meta-Reference Provider

Currently, only the meta-reference provider is implemented. It can be configured to send events to multiple sink types:

1) OpenTelemetry Collector (traces and metrics)
2) SQLite (traces only)
3) Console (all events)

#### Configuration

Here's an example that sends telemetry signals to all sink types. Your configuration might use only one or a subset.

```yaml
telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      service_name: "llama-stack-service"
      sinks: ['console', 'sqlite', 'otel_trace', 'otel_metric']
      otel_exporter_otlp_endpoint: "http://localhost:4318"
      sqlite_db_path: "/path/to/telemetry.db"
```

**Environment Variables:**

- `OTEL_EXPORTER_OTLP_ENDPOINT`: OpenTelemetry Collector endpoint (default: `http://localhost:4318`)
- `OTEL_SERVICE_NAME`: Service name for telemetry (default: empty string)
- `TELEMETRY_SINKS`: Comma-separated list of sinks (default: `console,sqlite`)

### Jaeger to visualize traces

The `otel_trace` sink works with any service that is compatible with the OpenTelemetry Collector, such as Jaeger. Traces and metrics use separate endpoints but can share the same collector.

Start a Jaeger instance with the OTLP HTTP endpoint at 4318 and the Jaeger UI at 16686 using the following command:

```bash
$ docker run --pull always --rm --name jaeger \
  -p 16686:16686 -p 4318:4318 \
  jaegertracing/jaeger:2.1.0
```

Once the Jaeger instance is running, you can visualize traces by navigating to http://localhost:16686/.

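To produce a trace worth inspecting, run any request against your stack while the `otel_trace` sink is enabled. A minimal sketch, assuming a Llama Stack server on the default port 8321 and the `llama-stack-client` Python package:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Any instrumented operation works; an inference call is the simplest way to
# generate a trace. Use a model that is actually registered with your stack.
client.inference.chat_completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
```

The resulting trace should then appear in the Jaeger UI under the configured service name (e.g. `llama-stack-service`).
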
### Querying Traces Stored in SQLite

The `sqlite` sink allows you to query traces without an external system. Here are some example queries. Refer to the notebook at [Llama Stack Building AI Applications](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) for more examples on how to query traces and spans.

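Beyond raw SQL, the stored traces can also be retrieved through the Llama Stack telemetry API. A hedged sketch, assuming the `llama-stack-client` package and that a `query_traces` method is available in your client version:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Fetch a handful of recent traces; filter, paging, and ordering options may
# differ between client versions, so treat this as a starting point.
traces = client.telemetry.query_traces(limit=5)
for trace in traces:
    print(trace.trace_id, trace.start_time)
```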