Free Prometheus Alert Rule and SLO Generator

Tools for Prometheus monitoring: SLO-based PromQL generator, error budget calculator, and scaling to avoid OOMs.

Brought to you by Cardinality Cloud, LLC.

How does this tool efficiently calculate error budget over long SLO windows?

Calculating error budget over 30 days should be simple, but naive Prometheus queries time out on high-cardinality metrics. This tool uses a Riemann Sum-inspired technique that pre-computes 5-minute-window error ratios on every rule evaluation (every minute by default), turning an expensive range query into a single fast aggregation. The result: accurate error budget calculations that scale.

The Problem

Calculating error budget remaining over a 30-day window naively would require something like:

1 - (sum_over_time(rate(errors[5m])[30d:5m]) / sum_over_time(rate(total[5m])[30d:5m]) / error_budget)

This is computationally expensive and can cause Prometheus query timeouts, especially with high-cardinality metrics.

The Riemann Sum Solution

Instead, this tool generates a recording rule that pre-computes the error ratio at 5-minute intervals:

# Error ratio over 5m window (evaluated every 1m by default)
job:slo_burn:ratio_5m = rate(errors[5m]) / rate(total[5m])
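
In current Prometheus versions, recording rules live in YAML rule groups. A minimal sketch of the equivalent group, assuming hypothetical metric names errors_total and requests_total and a per-job aggregation to match the job: name prefix, might look like:

groups:
  - name: slo-burn
    interval: 1m  # evaluation interval: how often a new ratio sample is recorded
    rules:
      - record: job:slo_burn:ratio_5m
        expr: |
          sum by (job) (rate(errors_total[5m]))
            /
          sum by (job) (rate(requests_total[5m]))

The group's interval field determines how many ratio samples end up inside the 30-day window.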

Then, to calculate error budget remaining over the full 30-day window, we use:

1 - (avg_over_time(job:slo_burn:ratio_5m[30d]) / error_budget)
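
Here error_budget stands for 1 minus the SLO target. As a sketch with an assumed 99.9% availability SLO (an error budget of 0.001), the query instantiates to:

1 - (avg_over_time(job:slo_burn:ratio_5m[30d]) / 0.001)

A result of 0.6 means 60% of the 30-day budget is still unspent; 0 or below means the budget is exhausted.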

Why This Works: The Math

This technique is inspired by Riemann Sums from calculus. The error ratio accumulated over a time period (which, divided by the error budget, tells you how much budget has been consumed) is:

$$\int_0^T \text{error\_rate}(t) \, dt \, / \, \int_0^T \text{total\_rate}(t) \, dt$$

By sampling the error ratio at regular intervals (every evaluation, typically 1m), we’re approximating this integral as:

$$\frac{1}{n} \times \sum_{i=1}^{n} \frac{\text{error\_rate}(t_i)}{\text{total\_rate}(t_i)}$$

This is exactly what avg_over_time(job:slo_burn:ratio_5m[30d]) computes - it takes the average of all 5-minute error ratio samples over the 30-day window. The 5-minute window smooths out instantaneous spikes while the evaluation interval (1m) ensures we sample frequently enough to capture meaningful changes.
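
As a quick worked example with made-up numbers: if the recorded ratio averages 0.0004 over the 30-day window and the error budget is 0.001 (a 99.9% SLO), the budget-remaining query evaluates to

$$1 - \frac{0.0004}{0.001} = 1 - 0.4 = 0.6$$

that is, about 60% of the error budget is left.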

Key Advantages

  • Performance: Queries only pre-computed recording rule samples, not raw metrics
  • Accuracy: With 1-minute evaluation intervals, we get ~43,200 samples over 30 days, providing high accuracy
  • Scalability: Query cost stays flat as request rate grows, since only the pre-computed series are read; aggregating in the recording rule keeps that series count small even for high-cardinality raw metrics
  • Simplicity: Single avg_over_time() function instead of complex nested aggregations

The Trade-off

The accuracy depends on:

  • Evaluation interval (shorter = more samples = more accurate; see the arithmetic after this list)
  • Recording rule window (5m provides a good balance between smoothing and responsiveness)
  • Traffic variation (averaging per-interval ratios weights each interval equally rather than by request volume, so the result matches the true ratio of total errors to total requests only when traffic is reasonably steady)
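
The interval arithmetic behind the ~43,200 figure above, assuming the default 1-minute evaluation interval:

$$n = \frac{30 \text{ days}}{1 \text{ min}} = \frac{30 \times 24 \times 60 \text{ min}}{1 \text{ min}} = 43{,}200 \text{ samples}$$

A 5-minute evaluation interval would still yield 8,640 samples, which averages out fine over 30 days but samples short incidents more coarsely.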

With the default 1-minute evaluation interval, the approximation error is negligible for practical SLO monitoring.

Historical Context

This approach is similar to how Prometheus’s recording rules were designed - pre-compute expensive aggregations, then query the results. The connection to Riemann Sums makes the mathematical foundation clear: we’re numerically integrating the error ratio over time with an equal-width Riemann sum and small step sizes.