---
title: Telemetry
description: Monitor and observe Llama Stack applications with comprehensive telemetry capabilities
sidebar_label: Telemetry
sidebar_position: 8
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Telemetry

The Llama Stack telemetry system provides comprehensive tracing, metrics, and logging capabilities. It supports multiple sink types, including OpenTelemetry, SQLite, and console output, for complete observability of your AI applications.
## Event Types

The telemetry system supports three main types of events:

<Tabs>
<TabItem value="unstructured" label="Unstructured Logs">

Free-form log messages with severity levels for general application logging:

```python
unstructured_log_event = UnstructuredLogEvent(
    message="This is a log message",
    severity=LogSeverity.INFO
)
```

</TabItem>
<TabItem value="metrics" label="Metric Events">

Numerical measurements with units for tracking performance and usage:

```python
metric_event = MetricEvent(
    metric="my_metric",
    value=10,
    unit="count"
)
```

</TabItem>
<TabItem value="structured" label="Structured Logs">

System events like span start/end that provide structured operation tracking:

```python
structured_log_event = SpanStartPayload(
    name="my_span",
    parent_span_id="parent_span_id"
)
```

</TabItem>
</Tabs>

## Spans and Traces

- **Spans**: Represent individual operations with timing information and hierarchical relationships
- **Traces**: Collections of related spans that form a complete request flow across your application

This hierarchical structure allows you to understand the complete execution path of requests through your Llama Stack application.
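The following minimal sketch illustrates this nesting using the OpenTelemetry Python SDK with a console exporter; it is an illustration of the span/trace relationship with hypothetical span names, not Llama Stack's internal span classes:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to stdout so the parent/child relationship is visible.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-ai-app")

# One trace: a root span for the request, with a child span for each operation.
with tracer.start_as_current_span("handle_chat_request"):
    with tracer.start_as_current_span("retrieve_context"):
        pass  # e.g. fetch documents to include in the prompt
    with tracer.start_as_current_span("chat_completion"):
        pass  # e.g. call the inference provider
```

Each child span records its own timing plus a reference to its parent, so the three spans above are reported as a single trace for the request.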
## Automatic Metrics Generation

Llama Stack automatically generates metrics during inference operations. These metrics are aggregated at the **inference request level** and provide insights into token usage and model performance.

### Available Metrics

The following metrics are automatically generated for each inference request:

| Metric Name | Type | Unit | Description | Labels |
|-------------|------|------|-------------|--------|
| `llama_stack_prompt_tokens_total` | Counter | `tokens` | Number of tokens in the input prompt | `model_id`, `provider_id` |
| `llama_stack_completion_tokens_total` | Counter | `tokens` | Number of tokens in the generated response | `model_id`, `provider_id` |
| `llama_stack_tokens_total` | Counter | `tokens` | Total tokens used (prompt + completion) | `model_id`, `provider_id` |

### Metric Generation Flow

1. **Token Counting**: During inference operations (chat completion, completion, etc.), the system counts tokens in both input prompts and generated responses
2. **Metric Construction**: For each request, `MetricEvent` objects are created with the token counts
3. **Telemetry Logging**: Metrics are sent to the configured telemetry sinks
4. **OpenTelemetry Export**: When OpenTelemetry is enabled, metrics are exposed as standard OpenTelemetry counters

The sketch below walks through steps 1–3 conceptually.
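It uses plain dictionaries and `print` in place of the real `MetricEvent` class and sink implementations; the helper function and the token counts are hypothetical:

```python
import time


def token_metrics_for_request(prompt_tokens, completion_tokens, model_id, provider_id):
    """Steps 1-2: take the token counts for one request and build one event per counter."""
    labels = {"model_id": model_id, "provider_id": provider_id}
    now = time.time()
    return [
        {"metric": "prompt_tokens", "value": prompt_tokens, "unit": "tokens", "timestamp": now, "attributes": labels},
        {"metric": "completion_tokens", "value": completion_tokens, "unit": "tokens", "timestamp": now, "attributes": labels},
        {"metric": "total_tokens", "value": prompt_tokens + completion_tokens, "unit": "tokens", "timestamp": now, "attributes": labels},
    ]


# Step 3: hand the events to a sink -- printing stands in for the console sink here.
for event in token_metrics_for_request(120, 30, "meta-llama/Llama-3.2-3B-Instruct", "tgi"):
    print(event)
```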
### Metric Aggregation Level

All metrics are generated and aggregated at the **inference request level**. This means:

- Each individual inference request generates its own set of metrics
- Metrics are not pre-aggregated across multiple requests
- Aggregation (sums, averages, etc.) can be performed by your observability tools (Prometheus, Grafana, etc.)
- Each metric includes labels for `model_id` and `provider_id` to enable filtering and grouping

### Example Metric Event

```python
MetricEvent(
    trace_id="1234567890abcdef",
    span_id="abcdef1234567890",
    metric="total_tokens",
    value=150,
    timestamp=1703123456.789,
    unit="tokens",
    attributes={
        "model_id": "meta-llama/Llama-3.2-3B-Instruct",
        "provider_id": "tgi"
    },
)
```
## Telemetry Sinks

Choose from multiple sink types based on your observability needs:

<Tabs>
<TabItem value="opentelemetry" label="OpenTelemetry">

Send events to an OpenTelemetry Collector for integration with observability platforms:

**Use Cases:**
- Visualizing traces in tools like Jaeger
- Collecting metrics for Prometheus
- Integration with enterprise observability stacks

**Features:**
- Standard OpenTelemetry format
- Compatible with all OpenTelemetry collectors
- Supports both traces and metrics

</TabItem>
<TabItem value="sqlite" label="SQLite">

Store events in a local SQLite database for direct querying:

**Use Cases:**
- Local development and debugging
- Custom analytics and reporting
- Offline analysis of application behavior

**Features:**
- Direct SQL querying capabilities
- Persistent local storage
- No external dependencies

</TabItem>
<TabItem value="console" label="Console">

Print events to the console for immediate debugging:

**Use Cases:**
- Development and testing
- Quick debugging sessions
- Simple logging without external tools

**Features:**
- Immediate output visibility
- No setup required
- Human-readable format

</TabItem>
</Tabs>

## Configuration

### Meta-Reference Provider

Currently, only the meta-reference provider is implemented. It can be configured to send events to multiple sink types:

```yaml
telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      service_name: "llama-stack-service"
      sinks: ['console', 'sqlite', 'otel_trace', 'otel_metric']
      otel_exporter_otlp_endpoint: "http://localhost:4318"
      sqlite_db_path: "/path/to/telemetry.db"
```
### Environment Variables

Configure telemetry behavior using environment variables:

- **`OTEL_EXPORTER_OTLP_ENDPOINT`**: OpenTelemetry Collector endpoint (default: `http://localhost:4318`)
- **`OTEL_SERVICE_NAME`**: Service name for telemetry (default: empty string)
- **`TELEMETRY_SINKS`**: Comma-separated list of sinks (default: `console,sqlite`)
## Visualization with Jaeger

The `otel_trace` sink works with any service compatible with the OpenTelemetry Collector. Traces and metrics use separate endpoints but can share the same collector.

### Starting Jaeger

Start a Jaeger instance with the OTLP HTTP endpoint on port 4318 and the Jaeger UI on port 16686:

```bash
docker run --pull always --rm --name jaeger \
  -p 16686:16686 -p 4318:4318 \
  jaegertracing/jaeger:2.1.0
```

Once running, you can visualize traces by navigating to [http://localhost:16686/](http://localhost:16686/).
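To verify that the collector endpoint is reachable, you can send a test span directly from Python. This sketch assumes the `opentelemetry-sdk` and `opentelemetry-exporter-otlp-proto-http` packages are installed; it is a standalone connectivity check, not part of Llama Stack itself:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Export to the OTLP HTTP endpoint exposed by the Jaeger container above.
provider = TracerProvider(resource=Resource.create({"service.name": "telemetry-smoke-test"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)

with trace.get_tracer(__name__).start_as_current_span("hello-jaeger"):
    pass  # any work done here is recorded in the span

provider.force_flush()  # make sure the span is exported before the script exits
```

The span should then appear in the Jaeger UI under the `telemetry-smoke-test` service.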
## Querying Metrics

When using the OpenTelemetry sink, metrics are exposed in standard OpenTelemetry format and can be queried through various tools:

<Tabs>
<TabItem value="prometheus" label="Prometheus Queries">

Example Prometheus queries for analyzing token usage:

```promql
# Total tokens used across all models
sum(llama_stack_tokens_total)

# Tokens per model
sum by (model_id) (llama_stack_tokens_total)

# Per-second token rate, averaged over the last 5 minutes
rate(llama_stack_tokens_total[5m])

# Token usage by provider
sum by (provider_id) (llama_stack_tokens_total)
```

</TabItem>
<TabItem value="grafana" label="Grafana Dashboards">

Create dashboards using Prometheus as a data source:

- **Token Usage Over Time**: Line charts showing token consumption trends
- **Model Performance**: Comparison of different models by token efficiency
- **Provider Analysis**: Breakdown of usage across different providers
- **Request Patterns**: Understanding peak usage times and patterns

</TabItem>
<TabItem value="otlp" label="OpenTelemetry Collector">

Forward metrics to other observability systems:

- Export to multiple backends simultaneously
- Apply transformations and filtering
- Integrate with existing monitoring infrastructure

</TabItem>
</Tabs>

## SQLite Querying

The `sqlite` sink allows you to query traces without an external system. This is particularly useful for development and custom analytics.

### Example Queries

```sql
-- Query recent traces
SELECT * FROM traces WHERE timestamp > datetime('now', '-1 hour');

-- Analyze span durations
SELECT name, AVG(duration_ms) as avg_duration
FROM spans
GROUP BY name
ORDER BY avg_duration DESC;

-- Find slow operations
SELECT * FROM spans
WHERE duration_ms > 1000
ORDER BY duration_ms DESC;
```
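For programmatic access, the same database can be queried with Python's built-in `sqlite3` module. This short sketch assumes the database lives at the `sqlite_db_path` configured earlier and uses the `traces` table from the SQL examples above:

```python
import sqlite3

# Path from the sqlite_db_path config option (adjust to your setup).
conn = sqlite3.connect("/path/to/telemetry.db")
conn.row_factory = sqlite3.Row

# Same query as the first SQL example: traces recorded in the last hour.
rows = conn.execute(
    "SELECT * FROM traces WHERE timestamp > datetime('now', '-1 hour')"
).fetchall()

for row in rows:
    print(dict(row))

conn.close()
```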
:::tip[Advanced Analytics]
Refer to the [Getting Started notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) for more examples of querying traces and spans programmatically.
:::
## Best Practices

### 🔍 **Monitoring Strategy**
- Use OpenTelemetry for production environments
- Combine multiple sinks for development (console + SQLite)
- Set up alerts on key metrics like token usage and error rates

### 📊 **Metrics Analysis**
- Track token usage trends to optimize costs
- Monitor response times across different models
- Analyze usage patterns to improve resource allocation

### 🚨 **Alerting & Debugging**
- Set up alerts for unusual token consumption spikes
- Use trace data to debug performance issues
- Monitor error rates and failure patterns

### 🔧 **Configuration Management**
- Use environment variables for flexible deployment
- Configure appropriate retention policies for SQLite
- Ensure proper network access to OpenTelemetry collectors
## Integration Examples

### Basic Telemetry Setup

```python
from llama_stack_client import LlamaStackClient

# Client with telemetry headers
client = LlamaStackClient(
    base_url="http://localhost:8000",
    extra_headers={
        "X-Telemetry-Service": "my-ai-app",
        "X-Telemetry-Version": "1.0.0"
    }
)

# All API calls will be automatically traced
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
### Custom Telemetry Context

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Add custom span attributes for better tracking
with tracer.start_as_current_span("custom_operation") as span:
    span.set_attribute("user_id", "user123")
    span.set_attribute("operation_type", "chat_completion")

    response = client.chat.completions.create(
        model="meta-llama/Llama-3.2-3B-Instruct",
        messages=[{"role": "user", "content": "Hello!"}]
    )
```
## Related Resources

- **[Agents](./agent)** - Monitoring agent execution with telemetry
- **[Evaluations](./evals)** - Using telemetry data for performance evaluation
- **[Getting Started Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)** - Telemetry examples and queries
- **[OpenTelemetry Documentation](https://opentelemetry.io/)** - Comprehensive observability framework
- **[Jaeger Documentation](https://www.jaegertracing.io/)** - Distributed tracing visualization