Instrument AI Applications with OpenTelemetry
Introduction
3 minDistributed AI pipelines span multiple services โ API gateway, embedding service, vector search, LLM orchestrator. When a user reports slow responses, which service is the bottleneck? OpenTelemetry answers this with distributed traces showing the full request journey across all services, linked by a single trace ID. Azure Monitor Application Insights stores, queries, and visualizes this telemetry.
Observability Concepts
7 minDistributed Trace: One trace ID links all spans across services โ find the bottleneck instantly
1. The Three Pillars of Observability
- Distributed Traces โ full path of a request through all services with timing. Answers: "Where is the latency?" Primary AI-200 focus.
- Metrics โ aggregate numbers over time (request rate, error rate, p95 latency). Answers: "Is something trending wrong?"
- Logs โ timestamped discrete events. Answers: "Why did this specific operation fail?"
Together: Metrics tell you something changed, Traces tell you where, Logs tell you why.
2. Traces and Spans Anatomy
A trace = the complete record of one request across all services. A span = one named, timed unit of work within that trace.
- Trace ID โ 128-bit hex, shared by ALL spans in one request journey
- Span ID โ 64-bit hex, unique to this specific operation
- Parent Span ID โ links this span to its caller (forms a tree/waterfall)
- Name โ operation name:
HTTP GET /api/chat,vector_search,llm_inference - Start/End timestamps โ exact duration in milliseconds
- Attributes โ key-value metadata: HTTP method, model name, token count
- Status โ OK or ERROR
3. W3C traceparent โ How Spans Connect Across Services
When Service A calls Service B via HTTP, it passes the trace context in a header so B's spans link to A's span in the same trace:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
^^ version ^^ 32-char trace ID (shared!) ^^ 16-char span ID ^^ flags Without this header: each service creates a disconnected trace. With it: all services' spans form one unified waterfall.
4. OpenTelemetry โ Application Insights Term Mapping
| # | OpenTelemetry Concept | App Insights Term | KQL Table |
|---|---|---|---|
| 1 | Trace ID | Operation ID (operation_Id) | All tables |
| 2 | Server span (SpanKind.SERVER) | Request | requests table |
| 3 | Client/Internal span | Dependency | dependencies table |
| 4 | span.set_attribute("key", val) | customDimensions | customDimensions column |
| 5 | Span events | Traces | traces table |
requests table. Outgoing calls (OpenAI API, Cosmos DB, Key Vault) = dependencies table. KQL query "find slow OpenAI calls" โ query the dependencies table where name contains "openai". Configure the Azure Monitor OpenTelemetry Distro
8 min1. Install and Initialize
pip install azure-monitor-opentelemetry from azure.monitor.opentelemetry import configure_azure_monitor
# One call โ sets up traces, metrics, and logs
# Reads APPLICATIONINSIGHTS_CONNECTION_STRING from environment variable
configure_azure_monitor() This single call initializes trace, metric, and log providers โ all exporting to Application Insights. No separate configuration needed for each.
2. Connection String
# Recommended: environment variable
export APPLICATIONINSIGHTS_CONNECTION_STRING="InstrumentationKey=...;IngestionEndpoint=..."
# Code-based (avoid in production)
configure_azure_monitor(connection_string="InstrumentationKey=...") 3. What the Distro Auto-Instruments (No Code Changes)
- requests / urllib3 โ outgoing HTTP calls to OpenAI, vector DBs โ appear as dependency spans
- Flask / Django / FastAPI โ incoming HTTP requests โ appear as server spans (requests table)
- psycopg2 โ PostgreSQL queries โ dependencies table
- Azure SDK โ Key Vault, Storage, Service Bus calls โ dependencies table
- Python logging module โ standard log calls flow to traces table automatically
4. Cloud Role Name โ Multi-Service Identification
Without unique cloud role names, all services appear as a single node on the Application Map. You lose visibility into inter-service latency.
from opentelemetry.sdk.resources import Resource
configure_azure_monitor(
resource=Resource.create({
"service.name": "embedding-service",
"service.namespace": "rag-pipeline"
})
)
# App Map shows node: "rag-pipeline.embedding-service" # Or via env var (no code change):
export OTEL_SERVICE_NAME=embedding-service
export OTEL_RESOURCE_ATTRIBUTES=service.namespace=rag-pipeline Create Custom Spans for AI Operations
10 min1. Custom Span for Model Inference
from opentelemetry import trace
tracer = trace.get_tracer("rag-pipeline")
def run_inference(prompt: str) -> dict:
with tracer.start_as_current_span("llm_inference") as span:
span.set_attribute("model", "gpt-4o")
span.set_attribute("prompt_tokens", len(prompt.split()))
result = call_openai_api(prompt)
span.set_attribute("completion_tokens", result.usage.completion_tokens)
span.set_attribute("total_tokens", result.usage.total_tokens)
return result 2. Record Errors in Spans
from opentelemetry.trace import StatusCode
with tracer.start_as_current_span("vector_search") as span:
try:
results = vector_db.search(embedding, top_k=5)
span.set_attribute("results_count", len(results))
except Exception as e:
span.set_status(StatusCode.ERROR, str(e))
span.record_exception(e) # Adds stack trace to span
raise 3. Span Kinds and Their Mappings
- SpanKind.SERVER โ incoming request handler โ
requeststable in App Insights - SpanKind.CLIENT โ outgoing call to external service โ
dependenciestable - SpanKind.INTERNAL โ internal processing (chunking, ranking) โ
dependenciestable
Analyze Telemetry in Application Insights
7 min1. Application Map
Visualizes the topology of your AI pipeline. Shows each service as a node, with call volume and failure rates on edges. Requires each service to have a unique cloud role name. Immediately reveals which service is the bottleneck or source of errors.
2. KQL Queries for AI Pipeline Analysis
// 1. Find slowest operations in last hour
requests
| where timestamp > ago(1h)
| summarize avg(duration), percentile(duration, 95) by name
| order by percentile_duration_95 desc
// 2. Find slow OpenAI calls
dependencies
| where name contains "openai"
| where duration > 5000 // over 5 seconds
| project timestamp, name, duration, customDimensions
// 3. Query by custom span attribute (model name)
requests
| where customDimensions.model == "gpt-4o"
| summarize count(), avg(duration) by bin(timestamp, 5m)
// 4. Error rate by service (cloud_RoleName)
requests
| where timestamp > ago(1h)
| summarize total=count(), errors=countif(success==false) by cloud_RoleName
| extend error_rate = round(100.0 * errors / total, 2) โก OpenTelemetry + App Insights Master Cheatsheet
configure_azure_monitor()APPLICATIONINSIGHTS_CONNECTION_STRINGspan.set_status(ERROR) + span.record_exception(e)Exercise โ Instrument a RAG Pipeline
30 min- Install
azure-monitor-opentelemetryand callconfigure_azure_monitor() - Set unique cloud role names for each service (API, embeddings, vector search)
- Create custom spans for embedding generation and LLM inference with token attributes
- Add error recording with
record_exception() - Send requests and observe the Application Map in Application Insights
- Write KQL to find the slowest operation across the pipeline
Summary
2 minOpenTelemetry provides vendor-neutral observability: Traces (where), Metrics (what), Logs (why). The Azure Monitor Distro sets everything up in one call. W3C traceparent links spans across services. Set unique service.name per service for Application Map. Create custom spans for AI-specific operations. Query Application Insights with KQL: incoming requests โ requests table, outgoing calls โ dependencies table.
๐ง Memory Tricks
Three pillars mnemonic โ TML: Traces (where), Metrics (what trend), Logs (why it failed)
Table mapping: "My server receives Requests. My client creates Dependencies." SERVER = requests. CLIENT = dependencies.
traceparent: "The GPS coordinate that finds your span in the global trace timeline."
OpenTelemetry + Application Insights
๐ Key Facts
- Three pillars (TML) โ Traces (where) | Metrics (what trend) | Logs (why)
- W3C traceparent โ HTTP header linking spans across service boundaries into one trace
- Server span โ KQL โ requests table (incoming HTTP to your service)
- Client span โ KQL โ dependencies table (outgoing: OpenAI, Cosmos, Key Vault)
- Trace ID โ App Insights โ operation_Id column โ joins all spans in a request
- Custom attr โ KQL โ span.set_attribute() โ customDimensions column
- Cloud role name โ service.name + service.namespace โ Application Map node
- configure_azure_monitor() โ One call sets up ALL three pillars (traces + metrics + logs)
๐ป Commands & Patterns
pip install azure-monitor-opentelemetry
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry.sdk.resources import Resource
# One call โ reads APPLICATIONINSIGHTS_CONNECTION_STRING env var
configure_azure_monitor(resource=Resource.create({
"service.name": "embedding-service",
"service.namespace": "rag-pipeline"
}))
# Custom span for AI operation
from opentelemetry import trace
from opentelemetry.trace import StatusCode
tracer = trace.get_tracer("rag-pipeline")
def run_inference(prompt):
with tracer.start_as_current_span("llm_inference") as span:
span.set_attribute("model", "gpt-4o")
span.set_attribute("prompt_tokens", len(prompt.split()))
try:
return call_openai(prompt)
except Exception as e:
span.set_status(StatusCode.ERROR, str(e))
span.record_exception(e); raise Analyze Logs and Set Up Alerts with KQL and Azure Monitor
Introduction to KQL
3 minKusto Query Language (KQL) is the query language for Azure Monitor Logs and Application Insights. Use it to query telemetry tables, find errors, calculate latency, and set up metric-based alerts โ all critical for AI app observability.
Core KQL Operators
10 minEssential Query Patterns
// Failed requests in last hour
requests
| where timestamp > ago(1h)
| where success == false
| project timestamp, name, resultCode, duration
// P95 latency by operation (bin = time bucket)
requests
| where timestamp > ago(24h)
| summarize p95=percentile(duration, 95) by
bin(timestamp, 1h), name
| render timechart
// Join requests with exceptions
requests
| where timestamp > ago(1h) and success == false
| join kind=leftouter exceptions
on operation_Id
| project timestamp, name, type, outerMessage
// Track OpenAI call latency via dependencies
dependencies
| where timestamp > ago(6h)
| where target contains "openai"
| summarize avg_ms=avg(duration), calls=count()
by bin(timestamp, 30m)
| render timechart Azure Monitor Alerts
8 minAlert Rule โ Action Group โ Notification
# Create action group (email notification)
az monitor action-group create \
--name ai-alerts-ag \
--resource-group rg \
--short-name aialerts \
--email-receiver name=oncall [email protected]
# Create log alert โ fires when error rate exceeds threshold
az monitor scheduled-query create \
--name high-error-rate \
--resource-group rg \
--scopes $APP_INSIGHTS_ID \
--condition-query "requests | where success==false
| summarize count() by bin(timestamp,5m)
| where count_ > 10" \
--condition-threshold 0 \
--condition-operator GreaterThan \
--evaluation-frequency 5m \
--window-size 5m \
--action-groups $ACTION_GROUP_ID Custom Metrics and Dashboards
6 minTrack AI-Specific Metrics
from applicationinsights import TelemetryClient
tc = TelemetryClient("YOUR_INSTRUMENTATION_KEY")
# Track token usage as custom metric
tc.track_metric("openai_tokens_used", total_tokens,
properties={"model": model, "operation": "embed"})
# Track cache hit/miss ratio
tc.track_metric("semantic_cache_hit", 1 if cache_hit else 0)
tc.flush()
# Query in KQL:
# customMetrics
# | where name == "openai_tokens_used"
# | summarize sum(value) by bin(timestamp, 1h) customMetrics KQL table. Use them to track AI-specific KPIs: token cost, cache hit rate, embedding latency, RAG relevance scores. Summary
2 minKQL: pipe-based queries on telemetry tables (requests, dependencies, exceptions, traces, customMetrics). Core operators: where โ summarize โ project โ render. Alerts: KQL signal + threshold โ action group โ notification. Custom metrics for AI KPIs (tokens, cache hits). bin() for time-series, ago() for relative time windows, percentile() for latency percentiles.