Free Prometheus Alert Rule and SLO Generator

Tools for Prometheus monitoring: SLO-based PromQL generator, error budget calculator, and scaling to avoid OOMs.

I Wrote a Book About the Thing That Almost Broke Me

The SRE On-Call Review Practice book cover

I remember the knock on the door.

I don’t know what I was doing. Probably something that felt urgent. What I know is that when someone knocked on the door of my office – a shed in the back yard – I started screaming. It was another interruption from another direction, and every neuron in my body had been trained to respond to interruption with panic.

I scared my friend. I scared myself. That’s when I knew I needed a therapist.

I’ve been in this industry for 25 years. I’ve watched teams get paged into the ground, usually without anyone naming what was happening to them. Nobody had written the book I kept wishing existed. So I wrote it.

Every alert that fires during an on-call shift has exactly three valid responses. Only three.

Action it. It’s a real event. Acknowledge, assess impact, remediate, update the runbook.

Fix the alert rule. It’s not a real event, or the thresholds are wrong. The problem is in your alerting config, not in production.

Escalate. It belongs to another team. Route it correctly, then update the routing so it doesn’t come back to you.

The discipline is in actually choosing one, every time, without letting alerts pile up in a state where their meaning gets lost. This sounds simple. It is not easy. The book is about building the practice that makes it sustainable.
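As a rough illustration of that discipline, here is a minimal sketch of a shift review that flags any alert nobody has chosen a response for. It is not from the book: the `Response` enum, the `FiredAlert` record, the `unreviewed` helper, and the alert names are all hypothetical, picked only to make the three-response rule concrete.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Response(Enum):
    """The three valid responses to a fired alert (names are illustrative)."""
    ACTION = "action it"
    FIX_RULE = "fix the alert rule"
    ESCALATE = "escalate"


@dataclass
class FiredAlert:
    """Hypothetical record of one alert that fired during a shift."""
    name: str
    response: Optional[Response] = None  # None means no response chosen yet


def unreviewed(alerts: List[FiredAlert]) -> List[FiredAlert]:
    """Return the alerts that still have no chosen response."""
    return [alert for alert in alerts if alert.response is None]


# Made-up alert names for a single shift.
shift = [
    FiredAlert("HighErrorRate", Response.ACTION),
    FiredAlert("DiskAlmostFull", Response.FIX_RULE),
    FiredAlert("QueueBacklog"),  # nobody has decided yet
]

pending = [alert.name for alert in unreviewed(shift)]
print(pending)
```

In this sketch, anything left in `pending` at the end of a shift is an alert whose meaning is at risk of getting lost, which is the pile-up the practice is meant to prevent.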

The SRE On-Call Review Practice is the first book in the Observability Practitioner Series from Cardinality Cloud.

Read the full story and get your copy