Observability Data Explosion
Teams adopt observability platforms (Datadog, New Relic, Splunk) to understand their distributed systems. The pitch: instrument everything, collect all the data, and you'll be able to debug any issue. So they do. Every service emits metrics, traces, and logs. The data volume explodes. The observability bill becomes one of the largest line items in the infrastructure budget — sometimes exceeding the cost of the infrastructure being observed. And despite all this data, teams still can't find the needle in the haystack when production breaks.
What people believe
“More observability data means better debugging and faster incident resolution.”
| Metric | Before | After | Delta |
|---|---|---|---|
| Monthly observability cost | $1-5K | $10-50K+ | +500-1000% |
| Observability cost as % of infra spend | 5-10% | 20-40% | +300% (about 4x) |
| Dashboards created vs actively used | Most used | 80% never viewed | 80% waste |
| Mean time to resolve (MTTR) | Expected to decrease | Flat or marginal improvement | Minimal |
Don't If
- Your observability bill exceeds 20% of your infrastructure cost
- Your team has more dashboards than they can review in a week
If You Must
1. Define SLOs first, then instrument only what's needed to measure them
2. Implement sampling for high-volume traces and logs; you don't need 100% of everything
3. Set cardinality budgets and enforce them in CI
4. Audit dashboards and alerts quarterly, and delete what nobody uses
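A cardinality budget check can be sketched roughly as follows. This is a minimal illustration, not any vendor's API: the metric name, the budget numbers, and the `(metric, labels)` sample format are all assumptions chosen for the example. The idea is to count distinct label sets per metric and flag any metric that exceeds its budget.

```python
from collections import defaultdict

# Hypothetical per-metric budgets: maximum number of distinct time series.
BUDGETS = {"http_requests_total": 50}
DEFAULT_BUDGET = 100  # assumed fallback for metrics without an explicit budget


def series_per_metric(samples):
    """Count distinct label sets (one per time series) for each metric name."""
    seen = defaultdict(set)
    for metric, labels in samples:
        seen[metric].add(frozenset(labels.items()))
    return {metric: len(label_sets) for metric, label_sets in seen.items()}


def budget_violations(samples, budgets=BUDGETS, default=DEFAULT_BUDGET):
    """Return {metric: (series_count, budget)} for metrics over budget."""
    violations = {}
    for metric, count in series_per_metric(samples).items():
        limit = budgets.get(metric, default)
        if count > limit:
            violations[metric] = (count, limit)
    return violations


if __name__ == "__main__":
    # A user_id label creates one series per user and blows the budget.
    samples = [
        ("http_requests_total", {"route": "/cart", "user_id": str(i)})
        for i in range(200)
    ]
    for metric, (count, limit) in budget_violations(samples).items():
        print(f"{metric}: {count} series exceeds budget of {limit}")
```

In a CI job, a non-empty result from `budget_violations` would be the signal to fail the build. Where the samples come from (a staging scrape, a test run) is left open here.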
Alternatives
- OpenTelemetry + open-source backends — Vendor-neutral instrumentation with Grafana/Prometheus/Jaeger — control your data and costs
- SLO-driven observability — Instrument for SLO measurement, not for 'collect everything' — focused and cost-effective
- Adaptive sampling — Sample more during incidents, less during normal operation — right data at the right time
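As a rough sketch of the adaptive sampling idea, the snippet below keeps a small fraction of traces during normal operation and everything once the error rate crosses a threshold. The rates, the threshold, and the function names are hypothetical. The keep/drop decision hashes the trace ID so that every service makes the same choice for the same trace, which is one common way to do head-based sampling, not the only one.

```python
import hashlib


def sample_rate(error_rate, base_rate=0.01, incident_rate=1.0, threshold=0.05):
    """Pick the fraction of traces to keep from the current error rate.

    All default values are illustrative assumptions, not recommendations.
    """
    return incident_rate if error_rate >= threshold else base_rate


def keep_trace(trace_id, rate):
    """Deterministically decide whether to keep a trace at the given rate."""
    if rate >= 1.0:
        return True
    if rate <= 0.0:
        return False
    # Map the trace ID to a stable value in [0, 1) and compare with the rate.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


if __name__ == "__main__":
    for error_rate in (0.001, 0.10):
        rate = sample_rate(error_rate)
        kept = sum(keep_trace(f"trace-{i}", rate) for i in range(10_000))
        print(f"error_rate={error_rate}: keeping {kept} of 10000 traces")
```

A real deployment would likely smooth the error rate over a time window and step the sampling rate gradually rather than flipping between two values.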
This analysis is wrong if:
- Collecting more observability data consistently reduces MTTR proportionally
- Observability costs grow slower than infrastructure costs as systems scale
- Teams with more dashboards and metrics resolve incidents faster than teams with fewer
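The second condition above can be checked against your own billing data. The sketch below compares relative growth of observability spend with infrastructure spend over the same period. The function names and the dollar figures are made up for illustration.

```python
def growth_rate(first, last):
    """Relative growth between two positive values (0.5 means +50%)."""
    if first <= 0:
        raise ValueError("first value must be positive")
    return (last - first) / first


def observability_outpaces_infra(obs_costs, infra_costs):
    """True if observability spend grew faster than infrastructure spend.

    Both arguments are cost series covering the same period, oldest first.
    """
    obs_growth = growth_rate(obs_costs[0], obs_costs[-1])
    infra_growth = growth_rate(infra_costs[0], infra_costs[-1])
    return obs_growth > infra_growth


if __name__ == "__main__":
    # Hypothetical monthly spend at the start and end of a year.
    obs = [10_000, 14_000]       # grows by 40%
    infra = [100_000, 120_000]   # grows by 20%
    print(observability_outpaces_infra(obs, infra))
```

If this returns `False` consistently as your system scales, that is evidence against the analysis for your environment.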
Sources
1. Datadog S-1 Filing: Datadog's revenue growth demonstrates how observability costs scale with customer infrastructure
2. Chronosphere, Observability Cost Report: Analysis showing observability costs growing 30-50% annually, often faster than infrastructure costs
3. Honeycomb, Observability Engineering: Framework for effective observability that avoids the 'collect everything' trap
4. CNCF, OpenTelemetry Project: Vendor-neutral observability standard that reduces lock-in and enables cost control