OpenTelemetry: the standard that changed the rules
Before OpenTelemetry, instrumenting for observability meant choosing a vendor (Datadog, New Relic, Dynatrace) and installing their proprietary agent on every service. Changing vendors meant re-instrumenting the entire application. The cost of that lock-in was real and calculable, and led many teams to renew contracts out of inertia rather than value.
OpenTelemetry (OTel) breaks that model by separating the instrumentation from the destination backend. You instrument once with OTel APIs and SDKs — available for all major languages — and configure the Collector to send data to any compatible backend. Migrating from Datadog to Grafana Tempo, or adding a second backend for comparison, becomes configuration, not code.
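The decoupling described above can be illustrated with a small Python sketch. It deliberately does not use the OTel API or SDK: `Span`, `SpanExporter`, the two exporter classes and the `DEMO_EXPORTERS` environment variable are hypothetical names invented for this illustration. It only shows the shape of the design, in which application code records telemetry against an interface and the destinations are chosen by configuration.

```python
import os
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Span:
    """Hypothetical, simplified span; real OTel spans carry much more context."""
    name: str
    duration_ms: float
    attributes: dict = field(default_factory=dict)


class SpanExporter(Protocol):
    """Anything that can receive a finished span."""
    def export(self, span: Span) -> None: ...


class ConsoleExporter:
    """Hypothetical backend A: prints spans to stdout."""
    def export(self, span: Span) -> None:
        print(f"[console] {span.name} {span.duration_ms:.1f} ms")


class InMemoryExporter:
    """Hypothetical backend B: keeps spans in a list for later inspection."""
    def __init__(self) -> None:
        self.spans: list[Span] = []

    def export(self, span: Span) -> None:
        self.spans.append(span)


# Registry of available destinations, keyed by the name used in configuration.
EXPORTERS = {"console": ConsoleExporter, "memory": InMemoryExporter}


def build_exporters(spec: str) -> list[SpanExporter]:
    """Turn a comma-separated setting such as "console,memory" into exporters."""
    return [EXPORTERS[name.strip()]() for name in spec.split(",") if name.strip()]


def handle_checkout(exporters: list[SpanExporter]) -> None:
    """Application code: it records a span and never names a backend."""
    span = Span("checkout", duration_ms=12.5, attributes={"http.status_code": 200})
    for exporter in exporters:
        exporter.export(span)


if __name__ == "__main__":
    # Adding or swapping a destination is a configuration change only.
    handle_checkout(build_exporters(os.environ.get("DEMO_EXPORTERS", "console")))
```

In this sketch, setting `DEMO_EXPORTERS=console,memory` sends the same span to both destinations without touching `handle_checkout`, which mirrors the "configuration, not code" property in miniature.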
In 2025, OTel is the de facto standard. The specifications for the three main signals (traces, metrics and logs) are stable, and the SDKs for traces and metrics are mature in the major languages; log SDK support is newer and its maturity still varies by language. Automatic instrumentation for the most widely used frameworks (Express, FastAPI, gRPC, Spring Boot, .NET) covers most cases without modifying application code. For new projects, adopting OTel should be the default, not a decision to evaluate.
SLOs: managing reliability with explicit contracts
Service Level Objectives are the most powerful tool for aligning engineering teams with business expectations about reliability. An SLO turns 'the system has to be available' into '99.5% of requests must complete in under 500 ms, measured over a 28-day window'. That specificity fundamentally changes how the team prioritises work.
The error budget is the most valuable consequence of SLOs: if your target is 99.5% success, you have a budget of 0.5% failures over the measurement window. While that budget is not exhausted, the team can deploy with confidence and accept some risk. When it is close to being exhausted, priorities shift from shipping changes to protecting reliability. Alerting changes too: you do not alert on individual errors, you alert on how fast the error budget is being consumed. This distinction eliminates the noise of low-severity alerts that nobody attends to and focuses attention on what matters.
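The arithmetic behind budget-based alerting can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed policy: the `Slo` class and function names are hypothetical, and the default of paging when 2% of the budget is consumed within one hour is an illustrative value that each team would tune.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    target: float      # required success ratio, e.g. 0.995
    window_days: int   # length of the SLO window, e.g. 28

    @property
    def error_budget(self) -> float:
        """Allowed failure ratio over the window (0.005 for a 99.5% target)."""
        return 1.0 - self.target


def burn_rate(slo: Slo, total: int, failed: int) -> float:
    """Observed error ratio divided by the budget.

    A burn rate of 1.0 spends the budget exactly by the end of the window;
    higher values exhaust it proportionally sooner.
    """
    if total == 0:
        return 0.0
    return (failed / total) / slo.error_budget


def burn_rate_threshold(slo: Slo, budget_fraction: float, alert_window_hours: float) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in `alert_window_hours`."""
    return budget_fraction * slo.window_days * 24 / alert_window_hours


def should_page(slo: Slo, total: int, failed: int,
                budget_fraction: float = 0.02,       # assumed policy value
                alert_window_hours: float = 1.0) -> bool:
    """Page only when the alert window's burn rate reaches the threshold.

    `total` and `failed` are request counts observed in the alert window.
    """
    threshold = burn_rate_threshold(slo, budget_fraction, alert_window_hours)
    return burn_rate(slo, total, failed) >= threshold
```

With a 99.5% target over 28 days, the threshold for the assumed policy is 0.02 × 28 × 24 = 13.44. An hour with 7,000 failures out of 100,000 requests burns at roughly 14 times the sustainable rate and pages; an hour with 600 failures burns at about 1.2 times and does not, even though it is above the 0.5% budget ratio.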
Implementing SLOs correctly requires choosing the right SLIs (Service Level Indicators): the metrics that truly represent user experience, not infrastructure metrics that are easy to measure but are only imperfect proxies for it. For HTTP services, success rate and 95th-percentile latency are usually the right indicators; for batch processing systems, throughput and queue lag. Choosing the right indicator is harder than the technical implementation.
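As a rough sketch of what computing these HTTP indicators looks like, the following Python works on a list of request records. The `Request` type, the rule that only 5xx responses count as failures, and the nearest-rank percentile method are assumptions made for illustration; a production system would normally compute these from histograms in the metrics backend.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time to complete the request


def success_rate(requests: list[Request]) -> float:
    """Share of requests that did not fail server-side (assumption: 5xx = failure)."""
    if not requests:
        raise ValueError("need at least one request")
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def good_event_ratio(requests: list[Request], threshold_ms: float = 500.0) -> float:
    """Share of requests that succeeded AND completed in under `threshold_ms`.

    This is the form an SLO such as "99.5% of requests in under 500 ms" measures.
    """
    if not requests:
        raise ValueError("need at least one request")
    good = sum(1 for r in requests if r.status < 500 and r.latency_ms < threshold_ms)
    return good / len(requests)
```

The good-event ratio is often the more convenient SLI for an SLO, because it is a single ratio that can be compared directly with the target, while the 95th percentile is better suited to dashboards and trend analysis.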
Signal correlation: from data to diagnosis
The three observability signals — metrics, logs and traces — have limited value in isolation. A metric showing high latency says there is a problem. A log with an error says where the problem is. A distributed trace shows the complete path taken by the failed request. Automatic correlation between these three signals is what reduces diagnosis time from hours to minutes.
Correlation works when all three signals share common context: a trace ID that appears both in the span of the trace and in the log generated during that request, and a service tag that links the latency metric to the specific service the trace identifies as the bottleneck. This does not happen on its own — it requires that the instrumentation is consistent and that the data pipeline (the OTel Collector, the log agent) propagates these identifiers without truncating them.
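The shared-context idea can be sketched with the Python standard library alone. In the example below, `current_trace_id` is a hypothetical stand-in for the active trace context that a tracing SDK would normally manage, and the logger names and JSON field names are assumptions; the technique shown is a logging filter that stamps every record with the trace ID and service name so the log backend can join logs to traces.

```python
import contextvars
import json
import logging
import secrets
import sys

# Hypothetical stand-in for the trace context a tracing SDK would maintain.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace ID and the service name onto every log record."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get() or "-"
        record.service = self.service
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line, keeping the identifiers intact."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "service": record.service,
            "trace_id": record.trace_id,
        })


def build_logger(service: str, stream) -> logging.Logger:
    logger = logging.getLogger(f"demo.{service}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceContextFilter(service))  # runs before formatting
    logger.addHandler(handler)
    return logger


def handle_request(logger: logging.Logger) -> str:
    """Simulate one request: set a trace ID, log inside it, then restore the context."""
    trace_id = secrets.token_hex(16)  # 32 hex characters, the width of a W3C trace ID
    token = current_trace_id.set(trace_id)
    try:
        logger.error("payment provider timed out")
    finally:
        current_trace_id.reset(token)
    return trace_id


if __name__ == "__main__":
    handle_request(build_logger("checkout", sys.stdout))
```

Using a context variable rather than a global keeps the ID correct when requests are handled concurrently, and emitting the full 32-character ID as its own field avoids the truncation problem mentioned above.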
The platforms that have best implemented this correlation in 2025 are the Grafana stack (with Tempo for traces, Loki for logs and Mimir or Prometheus for metrics), Honeycomb (with its wide-events model that unifies the three types in a single structure), and Datadog APM with its integrated Log Management. The choice between them depends more on team size, data volume and pricing model than on fundamental technical differences.
AIOps: where AI delivers real value in operations
AIOps is one of the most overloaded terms in the industry. In practice, there are three areas where ML models applied to operations demonstrate real and repeatable value: alert noise reduction (grouping related alerts into a single incident), anomaly detection in time series (identifying unusual patterns before they manifest as failures) and assisted root cause analysis (correlating incident timing with recent deployments, configuration changes and infrastructure events).
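To make the first of these areas concrete, here is a deliberately simple, rule-based Python sketch of alert grouping. It is not an ML model and not how any particular product works: the `Alert` and `Incident` types, the choice of grouping by service, and the five-minute window are all assumptions. Commercial tools extend the same idea with learned similarity and topology awareness.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since epoch
    service: str
    name: str


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)

    @property
    def last_seen(self) -> float:
        return self.alerts[-1].timestamp


def group_alerts(alerts: list, window_seconds: float = 300.0) -> list:
    """Fold alerts for the same service into one incident.

    An alert joins the service's open incident when it arrives within
    `window_seconds` of that incident's most recent alert; otherwise it
    opens a new incident.
    """
    incidents = []
    open_by_service = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        incident = open_by_service.get(alert.service)
        if incident is None or alert.timestamp - incident.last_seen > window_seconds:
            incident = Incident(service=alert.service)
            incidents.append(incident)
            open_by_service[alert.service] = incident
        incident.alerts.append(alert)
    return incidents
```

Even this heuristic turns a burst of latency and error-rate alerts from one service into a single incident, which is the effect on-call engineers care about; the harder problem the ML approaches address is recognising related alerts across different services.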
Anomaly detection has the greatest demonstrated impact in production. Systems that generate millions of metrics per minute cannot be monitored with manual thresholds: there are too many variables and too many seasonal patterns (peak hours, weekly cycles, effects of marketing campaigns) to define static thresholds that are sensitive but do not generate false positives. Time series models trained on historical system behaviour detect statistically significant deviations and, when the seasonality is modelled well, typically produce fewer false positives than static thresholds.
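The difference between a static threshold and a seasonal baseline can be shown with a minimal Python sketch. It uses a per-slot z-score, which is far simpler than the trained time series models used by real products; the function names, the z-score threshold of 3 and the assumption of regularly sampled data are all illustrative choices.

```python
import statistics
from collections import defaultdict


def seasonal_baseline(history: list, period: int) -> dict:
    """Compute (mean, standard deviation) per seasonal slot.

    `history` holds values sampled at a fixed interval, and a sample's slot
    is its index modulo `period` (for hourly data with a daily cycle,
    period=24 and the slot is the hour of day).
    """
    buckets = defaultdict(list)
    for index, value in enumerate(history):
        buckets[index % period].append(value)
    return {
        slot: (statistics.fmean(values), statistics.pstdev(values))
        for slot, values in buckets.items()
    }


def is_anomaly(value: float, slot: int, baseline: dict,
               z_threshold: float = 3.0, min_std: float = 1e-9) -> bool:
    """Flag a value that deviates too far from what is normal for its slot."""
    mean, std = baseline[slot]
    z_score = abs(value - mean) / max(std, min_std)  # min_std avoids division by zero
    return z_score > z_threshold
```

Given two weeks of hourly traffic that sits near 1,000 requests at midday and near 100 at night, a reading of about 1,000 is normal at noon and anomalous at 3 a.m., while a reading of 100 at noon is also anomalous. A single static threshold cannot express both cases: set high enough to ignore the daily peak, it misses the midday drop entirely.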
Automated root cause analysis is the area with the greatest potential and the most room to mature. Current tools (Moogsoft, BigPanda, Dynatrace Davis AI) are useful for incidents whose root cause is a recent, well-documented change, but have significant limitations for incidents with accumulated causes or in highly interdependent systems. Integrating LLMs with observability context to generate root cause hypotheses is a promising direction: the first products in this line (Datadog Bits AI, Grafana Sift) are in production but still in early adoption.
How long does your team take to diagnose an incident?
We design and implement your observability stack from scratch or improve the existing one: OTel, SLOs, correlation and noise reduction, with measurable improvement metrics from day one.