What Really Is a Metric?
A metric is not reality. It’s a lossy measurement with assumptions baked in.
I said it. I’ll own it. But when we talk about observability, what really is a metric?
The site is down. Customers are complaining. Your on-call engineer is in hour three of spelunking through dashboards that look like a Jackson Pollock painting. You’re paying six+ figures annually for an observability platform, and the most useful alert so far has been “Something is wrong. Probably.”
The observability vendors charge by the byte. They want you to send everything. The industry tells you to measure everything. So you instrument everything, send it all to your vendor, and wait for clarity.
Instead, you get an observability bill that’s higher than your AWS or GCP compute costs. And you still can’t answer basic questions: Is my application healthy? Are customers experiencing problems right now? Should I be paging someone?
Even with a top-tier vendor and unlimited budget, more data doesn’t equal more clarity. You’re drowning in metrics, dashboards, and alerts — but you still don’t know what actually matters.
Every alert should have a runbook (sometimes called a playbook): a guide for SREs, DevOps, on-call engineers, and software developers that prescribes potential remediations for a specific alert. The goal is to reduce MTTR and improve incident response with structured troubleshooting, verification steps, and escalation paths; it’s a place to build and share knowledge about an incident before it happens.
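As a minimal sketch, a Prometheus alerting rule can carry the runbook link as an annotation, so the page itself points at the remediation steps. The metric names, the 5% threshold, and the URL below are placeholders, not a prescription:

```yaml
# Hypothetical alerting rule: the runbook_url annotation travels with the page,
# so the on-call engineer lands on remediation steps instead of a blank dashboard.
# Metric names, the 5% threshold, and the URL are illustrative placeholders.
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="api", code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="api"}[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "API 5xx error ratio above 5% for 10 minutes"
          runbook_url: https://runbooks.example.com/api/high-error-rate
```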
Traditional infrastructure alerts page you when CPU hits 80%, but your users are fine. Meanwhile, degraded API performance goes unnoticed because no arbitrary threshold was crossed. An SLO (Service Level Objective) changes this: it’s a target reliability goal that measures what users actually experience, like “99.9% of requests succeed over 30 days.” Born from Google’s Site Reliability Engineering (SRE) practices, SLO-based alerting only pages when user experience is genuinely at risk, eliminating alert fatigue while catching real issues early.
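As a rough sketch of what that target looks like in PromQL (the http_requests_total counter and its code label are assumptions about your instrumentation), the underlying SLI is just the success ratio over the SLO window, which should stay at or above 0.999:

```yaml
# Hypothetical recording rule for the availability SLI behind
# "99.9% of requests succeed over 30 days". The recorded value is the
# success ratio; the SLO is met while it stays at or above 0.999.
groups:
  - name: slo-sli
    rules:
      - record: sli:http_availability:ratio_rate30d
        expr: |
          sum by (job) (rate(http_requests_total{job="api", code!~"5.."}[30d]))
            /
          sum by (job) (rate(http_requests_total{job="api"}[30d]))
```

A naive 30-day range query like this gets expensive fast; the error budget discussion below covers the cheaper pre-computed form.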
Traditional threshold alerts fire on every spike, creating alert fatigue. Burn rate alerting is different: it tracks how quickly you’re consuming your error budget and only alerts when errors are sustained enough to threaten your reliability target. This gives you early warnings before user experience degrades, while dramatically reducing noise.
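One common shape for this, borrowed from the multi-window, multi-burn-rate pattern in the Google SRE Workbook, pages only when a fast burn shows up in both a long and a short window. The recording rules and the 99.9% target here are assumptions, not a definitive implementation:

```yaml
# Sketch of a fast-burn alert for a 99.9% SLO over 30 days. A burn rate of
# 14.4x would exhaust the whole budget in roughly two days; requiring both the
# 1h and 5m windows to exceed it keeps brief spikes from paging anyone.
# The sli:http_error:ratio_rate1h / rate5m recording rules are assumed to exist.
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetFastBurn
        expr: |
          sli:http_error:ratio_rate1h{job="api"} > (14.4 * 0.001)
            and
          sli:http_error:ratio_rate5m{job="api"} > (14.4 * 0.001)
        labels:
          severity: page
        annotations:
          summary: "Burning the 30-day error budget at more than 14.4x the sustainable rate"
```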
Calculating an error budget over 30 days should be simple, but naive Prometheus queries time out on high-cardinality metrics. A Riemann sum-inspired technique fixes this: pre-compute error ratios at 5-minute intervals with recording rules, turning an expensive range query into a single fast aggregation. The result: accurate error budget calculations that scale.
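A minimal sketch of that idea, assuming the same http_requests_total counter as above: record the 5-minute error ratio continuously, then let the 30-day calculation read one small pre-computed series instead of every raw sample.

```yaml
# Hypothetical recording rule: the error ratio is evaluated every 5 minutes and
# stored as a single series, so a 30-day query reads roughly 8,640 pre-computed
# points instead of every raw high-cardinality sample.
groups:
  - name: slo-error-budget
    interval: 5m
    rules:
      - record: sli:http_error:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{job="api", code=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total{job="api"}[5m]))

# The 30-day error budget check then becomes one cheap ad-hoc aggregation:
#   avg_over_time(sli:http_error:ratio_rate5m{job="api"}[30d]) / 0.001
# A value of 1.0 means the entire 0.1% budget for the window has been spent.
```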
Why does a 5% error rate trigger an alert at 2 AM? Is it catastrophic during peak traffic or meaningless during low usage? Traditional static thresholds can’t tell you. SLO-based alerting asks a better question: “Are we consuming our error budget faster than planned?” This approach ties alerts directly to user-impacting reliability issues, eliminating arbitrary thresholds and reducing alert fatigue while catching real problems early.
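Under the same assumptions as the sketches above, that question reduces to a single recorded number instead of a static threshold:

```yaml
# Sketch: "are we consuming the budget faster than planned?" as one number.
# Built on the 5-minute error-ratio series from the previous example; the
# 0.001 divisor is the error budget for a 99.9% SLO. A value of 1 means errors
# are arriving exactly at the budgeted pace; sustained values above 1 mean the
# 30-day budget will run out early.
groups:
  - name: slo-burn
    rules:
      - record: slo:http_error_budget:burn_rate1h
        expr: |
          avg_over_time(sli:http_error:ratio_rate5m{job="api"}[1h]) / 0.001
```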