mirror of
https://github.com/meta-llama/llama-stack.git
synced 2025-12-30 22:00:00 +00:00
docs: concepts and building_applications migration (#3534)
# What does this PR do?
- Migrates the remaining documentation sections to the new documentation format

## Test Plan
- Partial migration
This commit is contained in:
parent
05ff4c4420
commit
c71ce8df61
82 changed files with 2535 additions and 1237 deletions
342
docs/docs/building_applications/telemetry.mdx
Normal file
@@ -0,0 +1,342 @@
---
title: Telemetry
description: Monitor and observe Llama Stack applications with comprehensive telemetry capabilities
sidebar_label: Telemetry
sidebar_position: 8
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Telemetry

The Llama Stack telemetry system provides comprehensive tracing, metrics, and logging capabilities. It supports multiple sink types, including OpenTelemetry, SQLite, and Console output, for complete observability of your AI applications.

## Event Types

The telemetry system supports three main types of events:

<Tabs>
<TabItem value="unstructured" label="Unstructured Logs">

Free-form log messages with severity levels for general application logging:

```python
unstructured_log_event = UnstructuredLogEvent(
    message="This is a log message",
    severity=LogSeverity.INFO
)
```

</TabItem>
<TabItem value="metrics" label="Metric Events">

Numerical measurements with units for tracking performance and usage:

```python
metric_event = MetricEvent(
    metric="my_metric",
    value=10,
    unit="count"
)
```

</TabItem>
<TabItem value="structured" label="Structured Logs">

System events like span start/end that provide structured operation tracking:

```python
structured_log_event = SpanStartPayload(
    name="my_span",
    parent_span_id="parent_span_id"
)
```

</TabItem>
</Tabs>
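
These event classes are part of Llama Stack's telemetry API. In Python they can typically be imported along these lines (a sketch; verify the exact import path against your installed llama-stack version):

```python
# Assumed import path; confirm it matches your installed llama-stack version.
from llama_stack.apis.telemetry import (
    LogSeverity,
    MetricEvent,
    SpanStartPayload,
    UnstructuredLogEvent,
)
```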

## Spans and Traces

- **Spans**: Represent individual operations with timing information and hierarchical relationships
- **Traces**: Collections of related spans that form a complete request flow across your application

This hierarchical structure allows you to understand the complete execution path of requests through your Llama Stack application.
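
As a rough illustration of that hierarchy (using the standard OpenTelemetry Python API rather than a Llama Stack-specific one), a single request can be modeled as a parent span with child spans for each step, all sharing one trace:

```python
from opentelemetry import trace

# Illustration only: with no OpenTelemetry SDK configured this is a no-op tracer.
# The point is the structure: one parent span, child spans inside it, one trace.
tracer = trace.get_tracer("example-app")

with tracer.start_as_current_span("handle_request"):        # parent span
    with tracer.start_as_current_span("retrieve_context"):  # child span
        pass  # e.g. vector store lookup
    with tracer.start_as_current_span("chat_completion"):   # child span
        pass  # e.g. model inference
```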

## Automatic Metrics Generation

Llama Stack automatically generates metrics during inference operations. These metrics are aggregated at the **inference request level** and provide insights into token usage and model performance.

### Available Metrics

The following metrics are automatically generated for each inference request:

| Metric Name | Type | Unit | Description | Labels |
|-------------|------|------|-------------|--------|
| `llama_stack_prompt_tokens_total` | Counter | `tokens` | Number of tokens in the input prompt | `model_id`, `provider_id` |
| `llama_stack_completion_tokens_total` | Counter | `tokens` | Number of tokens in the generated response | `model_id`, `provider_id` |
| `llama_stack_tokens_total` | Counter | `tokens` | Total tokens used (prompt + completion) | `model_id`, `provider_id` |

### Metric Generation Flow

1. **Token Counting**: During inference operations (chat completion, completion, etc.), the system counts tokens in both input prompts and generated responses
2. **Metric Construction**: For each request, `MetricEvent` objects are created with the token counts
3. **Telemetry Logging**: Metrics are sent to the configured telemetry sinks
4. **OpenTelemetry Export**: When OpenTelemetry is enabled, metrics are exposed as standard OpenTelemetry counters

### Metric Aggregation Level

All metrics are generated and aggregated at the **inference request level**. This means:

- Each individual inference request generates its own set of metrics
- Metrics are not pre-aggregated across multiple requests
- Aggregation (sums, averages, etc.) can be performed by your observability tools (Prometheus, Grafana, etc.)
- Each metric includes labels for `model_id` and `provider_id` to enable filtering and grouping

### Example Metric Event

```python
MetricEvent(
    trace_id="1234567890abcdef",
    span_id="abcdef1234567890",
    metric="total_tokens",
    value=150,
    timestamp=1703123456.789,
    unit="tokens",
    attributes={
        "model_id": "meta-llama/Llama-3.2-3B-Instruct",
        "provider_id": "tgi"
    },
)
```

## Telemetry Sinks

Choose from multiple sink types based on your observability needs:

<Tabs>
<TabItem value="opentelemetry" label="OpenTelemetry">

Send events to an OpenTelemetry Collector for integration with observability platforms:

**Use Cases:**
- Visualizing traces in tools like Jaeger
- Collecting metrics for Prometheus
- Integration with enterprise observability stacks

**Features:**
- Standard OpenTelemetry format
- Compatible with all OpenTelemetry collectors
- Supports both traces and metrics

</TabItem>
<TabItem value="sqlite" label="SQLite">

Store events in a local SQLite database for direct querying:

**Use Cases:**
- Local development and debugging
- Custom analytics and reporting
- Offline analysis of application behavior

**Features:**
- Direct SQL querying capabilities
- Persistent local storage
- No external dependencies

</TabItem>
<TabItem value="console" label="Console">

Print events to the console for immediate debugging:

**Use Cases:**
- Development and testing
- Quick debugging sessions
- Simple logging without external tools

**Features:**
- Immediate output visibility
- No setup required
- Human-readable format

</TabItem>
</Tabs>

## Configuration

### Meta-Reference Provider

Currently, only the meta-reference provider is implemented. It can be configured to send events to multiple sink types:

```yaml
telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      service_name: "llama-stack-service"
      sinks: ['console', 'sqlite', 'otel_trace', 'otel_metric']
      otel_exporter_otlp_endpoint: "http://localhost:4318"
      sqlite_db_path: "/path/to/telemetry.db"
```

### Environment Variables

Configure telemetry behavior using environment variables:

- **`OTEL_EXPORTER_OTLP_ENDPOINT`**: OpenTelemetry Collector endpoint (default: `http://localhost:4318`)
- **`OTEL_SERVICE_NAME`**: Service name for telemetry (default: empty string)
- **`TELEMETRY_SINKS`**: Comma-separated list of sinks (default: `console,sqlite`)
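
For example, one way to set these before starting the server (a sketch; the run config path is a placeholder, and the sink list should match your needs):

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_SERVICE_NAME="llama-stack-service"
export TELEMETRY_SINKS="console,sqlite,otel_trace,otel_metric"

# <path-to-run.yaml> is a placeholder for your distribution's run config
llama stack run <path-to-run.yaml>
```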

## Visualization with Jaeger

The `otel_trace` sink works with any service compatible with the OpenTelemetry collector. Traces and metrics use separate endpoints but can share the same collector.

### Starting Jaeger

Start a Jaeger instance with the OTLP HTTP endpoint at 4318 and the Jaeger UI at 16686:

```bash
docker run --pull always --rm --name jaeger \
  -p 16686:16686 -p 4318:4318 \
  jaegertracing/jaeger:2.1.0
```

Once running, you can visualize traces by navigating to [http://localhost:16686/](http://localhost:16686/).

## Querying Metrics

When using the OpenTelemetry sink, metrics are exposed in standard format and can be queried through various tools:

<Tabs>
<TabItem value="prometheus" label="Prometheus Queries">

Example Prometheus queries for analyzing token usage:

```promql
# Total tokens used across all models
sum(llama_stack_tokens_total)

# Tokens per model
sum by (model_id) (llama_stack_tokens_total)

# Per-second token throughput, averaged over the last 5 minutes
rate(llama_stack_tokens_total[5m])

# Token usage by provider
sum by (provider_id) (llama_stack_tokens_total)
```

</TabItem>
<TabItem value="grafana" label="Grafana Dashboards">

Create dashboards using Prometheus as a data source:

- **Token Usage Over Time**: Line charts showing token consumption trends
- **Model Performance**: Comparison of different models by token efficiency
- **Provider Analysis**: Breakdown of usage across different providers
- **Request Patterns**: Understanding peak usage times and patterns
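
For instance, a "Token Usage Over Time" panel could be driven by a query along these lines (a sketch based on the metric and labels documented above; adjust the rate window to your dashboard):

```promql
# Per-model token throughput for a time-series panel
sum by (model_id) (rate(llama_stack_tokens_total[5m]))
```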

</TabItem>
<TabItem value="otlp" label="OpenTelemetry Collector">

Forward metrics to other observability systems:

- Export to multiple backends simultaneously
- Apply transformations and filtering
- Integrate with existing monitoring infrastructure
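
A minimal Collector configuration for this setup might look like the following. This is a sketch: it assumes the OTLP HTTP receiver on port 4318 (as used elsewhere on this page) and a Collector distribution that bundles the `prometheus` and `debug` exporters; swap in the exporters your backend actually needs.

```yaml
# Sketch only: exporter choice depends on your observability backend.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  # Expose received metrics for Prometheus to scrape
  prometheus:
    endpoint: 0.0.0.0:9464
  # Print received data to the Collector's own output (useful while testing)
  debug: {}

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      exporters: [debug]
```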

</TabItem>
</Tabs>

## SQLite Querying

The `sqlite` sink allows you to query traces without an external system. This is particularly useful for development and custom analytics.

### Example Queries

```sql
-- Query recent traces
SELECT * FROM traces WHERE timestamp > datetime('now', '-1 hour');

-- Analyze span durations
SELECT name, AVG(duration_ms) as avg_duration
FROM spans
GROUP BY name
ORDER BY avg_duration DESC;

-- Find slow operations
SELECT * FROM spans
WHERE duration_ms > 1000
ORDER BY duration_ms DESC;
```
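
The same database can also be queried programmatically. Here is a minimal sketch using Python's built-in `sqlite3` module, assuming the table and column names used in the queries above and the `sqlite_db_path` from your telemetry config:

```python
import sqlite3

# Use the sqlite_db_path configured for the telemetry provider.
# Table/column names below follow the example SQL queries on this page.
conn = sqlite3.connect("/path/to/telemetry.db")
conn.row_factory = sqlite3.Row

# Same idea as the "find slow operations" query above
rows = conn.execute(
    "SELECT * FROM spans WHERE duration_ms > 1000 ORDER BY duration_ms DESC"
).fetchall()

for row in rows:
    print(dict(row))

conn.close()
```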

:::tip[Advanced Analytics]
Refer to the [Getting Started notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) for more examples on querying traces and spans programmatically.
:::

## Best Practices

### 🔍 **Monitoring Strategy**
- Use OpenTelemetry for production environments
- Combine multiple sinks for development (console + SQLite)
- Set up alerts on key metrics like token usage and error rates

### 📊 **Metrics Analysis**
- Track token usage trends to optimize costs
- Monitor response times across different models
- Analyze usage patterns to improve resource allocation

### 🚨 **Alerting & Debugging**
- Set up alerts for unusual token consumption spikes
- Use trace data to debug performance issues
- Monitor error rates and failure patterns

### 🔧 **Configuration Management**
- Use environment variables for flexible deployment
- Configure appropriate retention policies for SQLite
- Ensure proper network access to OpenTelemetry collectors

## Integration Examples

### Basic Telemetry Setup

```python
from llama_stack_client import LlamaStackClient

# Client with telemetry headers
client = LlamaStackClient(
    base_url="http://localhost:8000",
    extra_headers={
        "X-Telemetry-Service": "my-ai-app",
        "X-Telemetry-Version": "1.0.0"
    }
)

# All API calls will be automatically traced
response = client.inference.chat_completion(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

### Custom Telemetry Context

```python
from opentelemetry import trace

# Obtain a tracer via the standard OpenTelemetry API
# (an OpenTelemetry SDK/tracer provider must be configured for spans to be exported)
tracer = trace.get_tracer(__name__)

# Add custom span attributes for better tracking
with tracer.start_as_current_span("custom_operation") as span:
    span.set_attribute("user_id", "user123")
    span.set_attribute("operation_type", "chat_completion")

    # `client` is the LlamaStackClient from the previous example
    response = client.inference.chat_completion(
        model="meta-llama/Llama-3.2-3B-Instruct",
        messages=[{"role": "user", "content": "Hello!"}]
    )
```

## Related Resources

- **[Agents](./agent)** - Monitoring agent execution with telemetry
- **[Evaluations](./evals)** - Using telemetry data for performance evaluation
- **[Getting Started Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)** - Telemetry examples and queries
- **[OpenTelemetry Documentation](https://opentelemetry.io/)** - Comprehensive observability framework
- **[Jaeger Documentation](https://www.jaegertracing.io/)** - Distributed tracing visualization