Alerting

I Wrote a Book About the Thing That Almost Broke Me

June 2, 2026 • Cardinality Cloud • 2 min read

The SRE On-Call Review Practice book cover

I remember the knock on the door.

I don’t know what I was doing. Probably something that felt urgent. What I know is that when someone knocked on the door of my office – a shed in the back yard – I started screaming. It was another interruption from another direction, and every neuron in my body had been trained to respond to interruption with panic.

I scared my friend. I scared myself. That’s when I knew I needed a therapist.

The Pages That Wouldn't Stop (And Why Faster Response Wasn't the Answer)

March 17, 2026 • Cardinality Cloud • 4 min read

We kept getting paged for latency.

The SRE team knew the drill. Shift load to replicas, scale the database, bounce connections. It worked, usually. Things settled down. The on-call engineer closed the incident and went back to sleep.

Then the same page fired three nights later.

sre on-call alerting toil prometheus

Runbook Template

October 30, 2025 • Cardinality Cloud • 5 min read

Every alert should have a Runbook (sometimes called a Playbook). A Runbook is a guide for SREs, DevOps, on-call engineers, and software developers that prescribes potential remediations for a specific alert. The goal is to reduce MTTR and improve incident response with structured troubleshooting, verification steps, and escalation paths. It is also a place to build and share knowledge about a potential event.

runbook alerting prometheus operations sre incident-response troubleshooting on-call mttr remediation monitoring devops pagerduty

What is an SLO and why should I use SLO-based alerts?

October 20, 2025 • Cardinality Cloud • 9 min read

Traditional infrastructure alerts page you when CPU hits 80%, even though your users are fine. Meanwhile, degraded API performance goes unnoticed because no arbitrary threshold was crossed. An SLO (Service Level Objective) changes this: it’s a target reliability goal that measures what users actually experience, like “99.9% of requests succeed over 30 days.” Born from Google’s Site Reliability Engineering (SRE) practices, SLO-based alerting pages only when user experience is genuinely at risk, reducing alert fatigue while catching real issues early.
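
To make the “99.9% of requests succeed over 30 days” example concrete, here is a minimal sketch of the error budget arithmetic behind an SLO. The function names and the traffic figure of 10 million requests are assumptions chosen for illustration; they do not come from the post.

```python
# Hypothetical helpers illustrating error budget arithmetic for an SLO.

def allowed_failures(slo_target: float, total_requests: int) -> int:
    """Requests that may fail in the window while still meeting the SLO."""
    return round((1.0 - slo_target) * total_requests)


def allowed_downtime_minutes(slo_target: float, window_days: int) -> float:
    """Equivalent budget expressed as minutes of full outage in the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    # Assumed traffic: 10 million requests over the 30-day window.
    print(allowed_failures(0.999, 10_000_000))
    print(round(allowed_downtime_minutes(0.999, 30), 1))
```

Under these assumptions, a 99.9% target leaves a budget of 0.1% of requests, or roughly 43 minutes of total outage in 30 days. That budget is what SLO-based alerts protect.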

faq slo alerting fundamentals

Why is burn rate alerting useful?

October 18, 2025 • Cardinality Cloud • 2 min read

Traditional threshold alerts fire on every spike, creating alert fatigue. Burn rate alerting is different - it tracks how quickly you’re consuming your error budget and only alerts when errors are sustained enough to threaten your reliability target. This gives you early warnings before user experience degrades, while dramatically reducing noise.
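
As a rough sketch of the idea, burn rate can be computed as the observed error ratio divided by the error ratio the SLO allows. The function names below are hypothetical, and the 14.4 threshold is an assumption borrowed from the commonly cited multi-window example for a 30-day SLO (about 2% of the budget consumed in one hour); the post itself may use different values.

```python
# Hypothetical burn rate check; names and threshold are illustrative assumptions.

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(short_window_ratio: float,
                long_window_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both a short and a long window exceed the threshold.

    Requiring both windows filters out brief spikes (the long window stays
    low) while still resetting quickly once the problem is resolved.
    """
    return (burn_rate(short_window_ratio, slo_target) >= threshold
            and burn_rate(long_window_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # A brief spike: 5% errors in the short window, 0.05% over the long one.
    print(should_page(0.05, 0.0005, 0.999))
    # A sustained problem: 2% errors in both windows.
    print(should_page(0.02, 0.02, 0.999))
```

A burn rate of 1.0 means the budget would be used up exactly at the end of the SLO window. Values well above 1.0 that persist across both windows are the sustained errors this style of alert is meant to catch.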

faq slo burn-rate alerting

Understanding SLO-Based Alerting

October 15, 2025 • Cardinality Cloud • 2 min read

Why does a 5% error rate trigger an alert at 2 AM? Is it catastrophic during peak traffic or meaningless during low usage? Traditional static thresholds can’t tell you. SLO-based alerting asks a better question: “Are we consuming our error budget faster than planned?” This approach ties alerts directly to user-impacting reliability issues, eliminating arbitrary thresholds and reducing alert fatigue while catching real problems early.

slo alerting prometheus

Free Prometheus Alert Rule and SLO Generator