What is an SLO and why should I use SLO-based alerts?
Traditional infrastructure alerts page you when CPU hits 80%, but your users are fine. Meanwhile, degraded API performance goes unnoticed because no arbitrary threshold was crossed. An SLO (Service Level Objective) changes this - it’s a reliability target defined in terms of what users actually experience, like “99.9% of requests succeed over 30 days.” Born from Google’s Site Reliability Engineering (SRE) practices, SLO-based alerting only pages when user experience is genuinely at risk, eliminating alert fatigue while catching real issues early.
Understanding the SLI/SLO/SLA Hierarchy
Before diving deeper into SLOs, it’s important to understand how they fit into the broader reliability framework:
SLI (Service Level Indicator)
A Service Level Indicator is a quantitative measure of service behavior. It’s the actual measurement you take. It can also be called a Key Performance Indicator (KPI). Common SLIs include:
- Availability: Ratio of successful requests to total requests
- Latency: Percentage of requests faster than a threshold (e.g., 95th percentile < 500ms)
- Error rate: Percentage of requests that return errors
- Throughput: Requests processed per second
- Durability: Percentage of data retained without loss
- Saturation: How “full” a service instance is, expressed as a ratio or percentage; it indicates when the service needs to scale up
Example SLIs:
```promql
# Availability SLI
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency SLI (requests under 500ms)
sum(rate(http_request_duration_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_count[5m]))
```
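The saturation bullet above can be expressed the same way. A minimal sketch, assuming node_exporter’s node_cpu_seconds_total metric is available (swap in whatever utilization metric your service actually exposes):

```promql
# Saturation SLI: average CPU utilization per instance over the last 5 minutes
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```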
SLO (Service Level Objective)
A Service Level Objective is a target value or range for an SLI, measured over a specific time window. It’s your internal goal.
Example SLOs:
- 99.9% of HTTP requests succeed (availability SLO)
- 99% of requests complete in < 500ms (latency SLO)
- 99.95% of writes are durable over 30 days (durability SLO)
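To make the SLI-to-SLO relationship concrete, here is a rough sketch (reusing the hypothetical http_requests_total counter from above) that compares a 30-day availability SLI against a 99.9% target; with the bool modifier it returns 1 when the objective is met and 0 when it is not:

```promql
# 30-day availability SLI checked against a 99.9% SLO target
sum(increase(http_requests_total{status=~"2.."}[30d]))
/
sum(increase(http_requests_total[30d]))
>= bool 0.999
```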
SLOs should be:
- Achievable: Based on historical performance with some buffer
- Meaningful: Aligned with actual user experience
- Measurable: Based on observable metrics you control
- Documented: Clear ownership and measurement methodology
SLA (Service Level Agreement)
A Service Level Agreement is an explicit or implicit contract with your users that includes consequences if the SLO is not met. This is the external goal. SLAs typically have:
- Less strict targets than internal SLOs (buffer for safety)
- Financial or contractual penalties for violations
- Legal implications
Example SLA: “We guarantee 99.5% uptime per month. If we fail to meet this, customers receive a 10% service credit.”
Key relationship: SLA ≤ SLO ≤ actual performance. Your internal SLO should be stricter than your external SLA to give you room to respond before violating customer agreements.
Why SLO-Based Alerting is Superior
Traditional monitoring alerts on arbitrary infrastructure thresholds: “CPU > 80%”, “Memory > 90%”, “Request count > 1000/s”. These alerts suffer from several problems:
Problems with Traditional Threshold Alerts
1. Disconnected from User Experience
- High CPU doesn’t necessarily mean users are impacted. In fact, you probably want your CPU cores to be well utilized.
- Low latency on individual components doesn’t guarantee good user experience.
- User-visible problems may not manifest as infrastructure-related issues.
2. Alert Fatigue
- Teams become numb to constant paging
- Important alerts get lost in noise
- On-call engineers learn to ignore certain alerts
- Engineers tell each other “It’s okay to ignore that alert!”
3. Lack of Business Context
- No connection to business objectives
- Unclear which issues require immediate attention. Where is this alert on the Eisenhower Matrix?
- Difficult to prioritize incident response
4. Static Thresholds Don’t Scale
- Same thresholds don’t work across dev/staging/prod
- Traffic patterns change over time
- Seasonal variations require constant tuning
5. Reactive Rather Than Predictive
- Alerts fire after problems occur
- No warning when burning through reliability budget
- Can’t distinguish between “annoying” and “urgent”
- It feels like everything is always on fire
Advantages of SLO-Based Alerting
1. User-Centric Focus
SLO-based alerts directly measure what users experience. If users aren’t impacted, you don’t page. If users are impacted, you always know.
Example: Your database CPU is at 95%, but request success rate is still 99.99%. Traditional alerts would page. SLO-based alerts would not, because users are unaffected.
2. Multi-Window Burn Rate Detection
Modern SLO alerting uses burn rates to determine urgency. Burn rate measures how quickly you’re consuming your error budget compared to the target consumption rate.
With a 99.9% SLO over 30 days:
- Normal consumption: 0.1% error budget = 43.2 minutes of allowed downtime per 30 days
- 1x burn rate: Consuming at the target rate (will exactly hit SLO)
- 14.4x burn rate: Will exhaust entire month’s budget in 2 days (CRITICAL)
- 6x burn rate: Will exhaust budget in 5 days (WARNING)
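Expressed as a query, burn rate is just the observed error ratio divided by the error budget. A rough sketch for the 99.9% SLO above, again using the hypothetical http_requests_total counter (values above 1 mean you are consuming budget faster than the target rate):

```promql
# 1-hour burn rate for a 99.9% SLO (error budget = 0.001)
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) / 0.001
```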
Multi-window approach requires burn rate to persist across multiple time windows:
- Fast burn (1h/5m): High burn rate sustained for 1 hour, confirmed by 5-minute window
- Slow burn (6h/30m): Moderate burn rate sustained for 6 hours, confirmed by 30-minute window
This dramatically reduces false positives while ensuring real issues are caught early.
3. Contextual Severity
Alert severity is determined by burn rate, providing automatic prioritization:
| Burn Rate | Alert Severity | Time to Budget Exhaustion | Response Time |
|---|---|---|---|
| 14.4x | CRITICAL (page) | 2 days | Immediate |
| 6x | WARNING (ticket) | 5 days | Next business day |
| 3x | INFO (log) | 10 days | Review in retrospective |
| 1x | Normal | 30 days (on target) | No action |
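How each severity reaches a human is up to your alert routing. A hedged sketch of an Alertmanager route that pages on critical and files a ticket for warnings (the receiver names are placeholders, not part of this site’s tooling):

```yaml
route:
  receiver: default
  routes:
    - matchers: ['severity="critical"']
      receiver: pager            # pages immediately
    - matchers: ['severity="warning"']
      receiver: ticket-queue     # handled next business day
```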
4. Reduces Alert Fatigue
You only page when error budget is genuinely at risk. Small, transient issues that self-resolve don’t generate alerts. This leads to:
- Higher signal-to-noise ratio
- More actionable alerts
- Healthier on-call rotation
- Faster incident response (people trust alerts)
5. Business Alignment
Error budget provides a shared language between engineering and business:
- Budget remaining? Ship faster, take more risks
- Budget exhausted? Focus on reliability, slow down releases
- Consistent budget surplus? SLO might be too strict, costing engineering time
This creates a quantitative framework for reliability vs. velocity trade-offs.
6. Predictive Early Warning
Burn rate alerts warn you before you exhaust your error budget, giving time to:
- Investigate root causes
- Implement fixes
- Roll back problematic changes
- Prevent SLA violations
Traditional alert: “Database is down” (reactive, users already impacted)
SLO alert: “Burning error budget 6x faster than target” (predictive, time to respond)
Common SLO Anti-Patterns to Avoid
1. Too Many Nines
Don’t default to 99.99% or “five nines” (99.999%) without justification. Each additional nine:
- Costs exponentially more in engineering effort
- Reduces innovation velocity
- May not improve actual user experience
Ask: Does going from 99.9% to 99.99% meaningfully improve user experience? For many services, users can’t perceive the difference between 99.9% (43 minutes downtime/month) and 99.99% (4.3 minutes/month).
2. Alerting on SLO Achievement
Don’t alert when you’re meeting your SLO. Only alert when error budget is at risk.
Wrong: Alert when availability drops below 99.9%
Right: Alert when burn rate indicates you’ll miss 99.9% over the SLO window
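In query terms the difference looks roughly like this (a sketch that assumes pre-computed job:api:error_ratio_* recording rules like the ones in the example further down):

```promql
# Wrong: only fires once the 30-day objective is already blown
avg_over_time(job:api:error_ratio_5m[30d]) > 0.001

# Right: fires while there is still time to react (14.4x fast-burn condition)
job:api:error_ratio_1h > (14.4 * 0.001)
and
job:api:error_ratio_5m > (14.4 * 0.001)
```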
3. Too Many SLOs
More SLOs = more cognitive load = less clarity. Start with 1-3 SLOs per user-facing service:
- One availability/success rate SLO
- One latency SLO (optional)
- One saturation SLO (when to scale)
4. Infrastructure-Based SLOs
SLOs should measure user experience, not infrastructure health.
Wrong: “99% of database queries complete successfully”
Right: “99% of user requests complete successfully”
The database might fail queries that get retried transparently. Focus on what users see.
5. Ignoring Error Budget
Error budget exists to be spent! If you’re consistently at 100% error budget remaining, your SLO is too strict and you’re over-investing in reliability.
Practical Implementation Steps
1. Choose User-Facing Metrics
Identify what users actually care about:
- Can they access the service? (Availability)
- Is it fast enough? (Latency)
- Is their data safe? (Durability)
2. Analyze Historical Performance
Look at 3-6 months of data:
- What’s your current performance?
- What are normal variance ranges?
- What level could you commit to?
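For an existing service, a query along these lines gives you a baseline (again assuming a standard http_requests_total counter; long ranges like 90d can be expensive, so run it ad hoc or against downsampled data):

```promql
# Success ratio over the last 90 days as a starting point for an SLO target
sum(increase(http_requests_total{status=~"2.."}[90d]))
/
sum(increase(http_requests_total[90d]))
```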
New service without historical data? Start with a 95% target and decide what counts as success. If the service exposes an HTTP API, a 500ms latency goal is a reasonable starting point. Then iterate.
3. Set Initial SLOs Conservatively
Start with achievable targets:
- If you’re at 99.95% historically, start with 99.5% or 99.9%
- Leave headroom for unexpected issues
- You can always tighten later
4. Define Error Budget
Error budget = 100% - SLO target over the time window
For 99.9% SLO over 30 days:
- Error budget = 0.1% = 43.2 minutes of allowed downtime
- Burn rate 1x = consuming 1.44 minutes/day
- Burn rate 14.4x = consuming 20.7 minutes/day (critical)
5. Implement Recording Rules
Pre-compute SLI measurements at regular intervals (usually 1-5 minutes). This enables efficient burn rate calculations without expensive queries. Generating these recording rules efficiently and consistently is exactly what this website is about.
6. Set Up Burn Rate Alerts
The tools on this website implement multi-window burn rate alerts:
- Fast burn (1h/5m windows): Page immediately
- Slow burn (6h/30m windows): Create ticket for next day
7. Monitor and Iterate
SLOs are a process, not a project. The weekly meeting where on-call engineers review the past week’s alerts and pages is a great time to do this. Review SLO performance at least quarterly:
- Are you consistently meeting SLOs? (Too easy?)
- Are you consistently missing SLOs? (Too strict?)
- Do alerts correlate with actual user impact?
Real-World Example
Scenario: E-commerce API service
SLO Definition:
- Metric: HTTP request success rate
- Target: 99.9% of requests succeed
- Window: 30 days rolling
- Error budget: 0.1% = 43.2 minutes of errors per month
Recording Rules:
```yaml
- record: job:api:error_ratio_5m
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))

- record: job:api:error_ratio_30m
  expr: avg_over_time(job:api:error_ratio_5m[30m])

- record: job:api:error_ratio_1h
  expr: avg_over_time(job:api:error_ratio_5m[1h])

- record: job:api:error_ratio_6h
  expr: avg_over_time(job:api:error_ratio_5m[6h])

# Fraction of the 0.1% error budget remaining over the 30-day window
- record: job:api:error_budget_remaining
  expr: 1 - (avg_over_time(job:api:error_ratio_5m[30d]) / 0.001)
```
Alerts:
```yaml
# Fast burn: 14.4x burn rate over 1h, confirmed by the 5m window
- alert: APIErrorBudgetFastBurn
  expr: |
    job:api:error_ratio_1h > (14.4 * 0.001)
    and
    job:api:error_ratio_5m > (14.4 * 0.001)
  for: 0m
  labels:
    severity: critical

# Slow burn: 6x burn rate over 6h, confirmed by the 30m window
- alert: APIErrorBudgetSlowBurn
  expr: |
    job:api:error_ratio_6h > (6 * 0.001)
    and
    job:api:error_ratio_30m > (6 * 0.001)
  for: 0m
  labels:
    severity: warning
```
Benefits Realized:
- Pages reduced by 80% (only 2-3 actionable pages per month)
- All pages correlated with actual user impact
- Team can confidently ship during business hours
- Clear framework for discussing reliability vs. features
When NOT to Use SLO-Based Alerting
SLO-based alerting may not be appropriate for:
1. Non-user-facing services - Internal batch jobs, data pipelines → Traditional threshold alerts may be simpler
2. Hard real-time requirements - Safety-critical systems, financial trading → Need instant detection, can’t tolerate burn rate windows
3. Very low traffic services - < 100 requests/day → Statistical significance issues, hard to measure SLIs accurately
4. Services without clear user-facing metrics - Infrastructure components → Define SLOs for services that depend on this component instead
Learn More
The foundational concepts are covered in the Google SRE books:
- Service Level Objectives (SRE Book Chapter 4)
- Implementing SLOs (SRE Workbook Chapter 2)
- Alerting on SLOs (SRE Workbook Chapter 5)
For a comprehensive practical guide, see Alex Hidalgo’s Implementing Service Level Objectives (O’Reilly, 2020), which provides detailed strategies for defining, measuring, and alerting on SLOs in real-world environments.
Additional resources:
- SLOs, SLIs, SLAs, oh my! - Atlassian’s beginner-friendly overview