Brought to you by Cardinality Cloud, LLC.

PromQL Cheat Sheet - Complete Reference Guide

Quick reference for Prometheus Query Language (PromQL) with practical examples for monitoring, alerting, and SLO calculations. Covers essential functions, aggregations, and common patterns for effective Observability.

Metric Types

| Type | Description | Use Case |
|---|---|---|
| Counter | Monotonically increasing value; resets to zero on application restart | Track requests or bytes processed – basis for rates |
| Gauge | Can increase or decrease over time, does not reset | Size, queue depth, temperature, number of concurrent threads |
| Summary | A _sum and _count counter pair plus quantiles precomputed by the client | Sizes or latencies of rapid-fire events – quantiles can NOT be aggregated |
| Histogram | A set of bucket counters (plus _sum and _count) from which arbitrary quantiles are estimated | Sizes or latencies of rapid-fire events – can be aggregated |

Data Types

| Type | Description | PromQL Example |
|---|---|---|
| Scalar | Simple floating-point value | 3.14 |
| Instant Vector | Set of time series with a single sample per series | http_requests_total |
| Range Vector | Set of time series with a range of samples per series | http_requests_total[5m] |
| String | String literal (limited use) | "some text" |

Selectors & Filtering

Basic Selectors

Select all time series for a metric named http_requests_total:

    http_requests_total

Label Matchers

| Operator | Description | Example |
|---|---|---|
| = | Exact match | http_requests_total{status_code="500"} |
| != | Not equal | http_requests_total{status_code!="500"} |
| =~ | Regex match (anchored to the whole value) | http_requests_total{status_code=~"5.."} |
| !~ | Regex not match | http_requests_total{status_code!~"5.."} |

Multiple label filters:

    http_requests_total{method="GET", status=~"2.."}
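
To make the matcher semantics concrete, here is a minimal Python sketch (not PromQL) of how the four operators could be evaluated against one series' labels. The `matches` helper is hypothetical, and Python's `re` module stands in for the RE2 syntax that Prometheus actually uses. The detail worth noticing is that regex matchers must match the whole label value, not a substring.

```python
import re

def matches(labels, name, op, pattern):
    """Return True if the label set satisfies one PromQL-style matcher."""
    value = labels.get(name, "")  # a missing label behaves like an empty string
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        return re.fullmatch(pattern, value) is not None  # anchored at both ends
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown operator: {op}")

series = {"method": "GET", "status": "200"}
# Every matcher inside the braces must hold for the series to be selected.
selected = matches(series, "method", "=", "GET") and matches(series, "status", "=~", "2..")
print(selected)
print(matches(series, "status", "=~", "20"))  # "20" does not cover the whole value
```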

Time Ranges

Range Vector Selector:

    http_requests_total[5m]  # Last 5 minutes of data

Time Units:

  • ms - milliseconds
  • s - seconds
  • m - minutes
  • h - hours
  • d - days
  • w - weeks
  • y - years

Offset modifier: compare with data from the past:

    rate(http_requests_total[1h]) <= rate(http_requests_total[1h] offset 1w)

Operators

Arithmetic Operators

Basic math operators: +, -, *, /, %, ^

Example - Converting Bytes:

| Unit | Name | Sloppy Name | PromQL Expression |
|---|---|---|---|
| KiB | Kibibyte | Kilobyte | node_memory_bytes / 2^10 |
| MiB | Mebibyte | Megabyte | node_memory_bytes / 2^20 |
| GiB | Gibibyte | Gigabyte | node_memory_bytes / 2^30 |
| TiB | Tebibyte | Terabyte | node_memory_bytes / 2^40 |
| PiB | Pebibyte | Petabyte | node_memory_bytes / 2^50 |

Comparison Operators

| Operator | Description |
|---|---|
| == | Equal |
| != | Not equal |
| > | Greater than |
| < | Less than |
| >= | Greater than or equal |
| <= | Less than or equal |

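Comparisons filter results: only series whose value satisfies the condition are kept (append bool to the operator to get 0 or 1 instead):

    http_requests_total > 100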

Logical Operators

  • and - Intersection
  • or - Union
  • unless - Complement

Combine conditions:

    up{job="prometheus"} == 1 and on(instance) rate(http_requests_total[5m]) > 10
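
The set semantics of and, or, and unless can be sketched in Python. This is an illustration only: vectors are modeled as lists of (labels, value) pairs, and the helper names `vector_and`, `vector_or`, and `vector_unless` are hypothetical. Matching is done only on the labels listed in `on`, mirroring `on(instance)` above.

```python
def match_key(labels, on):
    """Project a label set onto the labels used for matching."""
    return tuple(labels.get(name, "") for name in on)

def vector_and(left, right, on):
    """Keep left-hand series that have a match on the right (intersection)."""
    right_keys = {match_key(labels, on) for labels, _ in right}
    return [(l, v) for l, v in left if match_key(l, on) in right_keys]

def vector_unless(left, right, on):
    """Keep left-hand series that have no match on the right (complement)."""
    right_keys = {match_key(labels, on) for labels, _ in right}
    return [(l, v) for l, v in left if match_key(l, on) not in right_keys]

def vector_or(left, right, on):
    """All left-hand series plus right-hand series with no match on the left (union)."""
    left_keys = {match_key(labels, on) for labels, _ in left}
    return left + [(l, v) for l, v in right if match_key(l, on) not in left_keys]

healthy = [({"instance": "a"}, 1.0), ({"instance": "b"}, 1.0)]
busy = [({"instance": "a", "job": "web"}, 42.0), ({"instance": "c", "job": "web"}, 17.0)]

print(vector_and(healthy, busy, on=["instance"]))     # only instance "a"
print(vector_unless(healthy, busy, on=["instance"]))  # only instance "b"
print(vector_or(healthy, busy, on=["instance"]))      # "a", "b", then "c"
```

Note that the values always come from the side a series was taken from; the other side only decides whether the series survives.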

Essential Functions

Rate Functions (for Counters)

Tip: Default to using rate() for alerting and graphs. Use irate() for volatile metrics where you want to see quick changes at high resolution (like CPU usage). Remember this as the “mad rate function.”

rate() - Per-second average rate of increase over the range (5m here), extrapolated to the edges of the window

    rate(http_requests_total[5m])

Use for counters that always increase. Handles counter resets. Always normalized to per-second.
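
As a rough illustration of reset handling, the following Python sketch computes a per-second rate from raw (timestamp, value) samples. It is a simplification under stated assumptions: the helper names are hypothetical, and the real rate() additionally extrapolates to the window boundaries, which this sketch leaves out.

```python
def counter_increase(samples):
    """Total increase across (timestamp, value) samples, compensating for resets."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur < prev:
            total += cur          # value dropped: the counter restarted from zero
        else:
            total += cur - prev
    return total

def simple_rate(samples):
    """Per-second rate between the first and last sample (no extrapolation)."""
    duration = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / duration

# The counter restarts between t=60 and t=120.
samples = [(0, 100), (60, 160), (120, 20), (180, 80)]
print(counter_increase(samples))  # 60 + 20 + 60
print(simple_rate(samples))
```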

irate() - Instant rate (last 2 samples within range)

    irate(http_requests_total[5m])

Uses the last two samples within the time range. More sensitive to short-term spikes. Good for CPU metrics. Normalized to per-second.

increase() - Total increase over time range

    increase(http_requests_total[1h])

Extrapolates total increase. Use for counters. Effectively rate() multiplied by the number of seconds in the time range.

Aggregation Functions

Reduces many time series into fewer time series.

| Function | Operation | Example | Description |
|---|---|---|---|
| sum() | Sum of all values | sum(rate(http_requests_total[5m])) | Sum all rates for system-wide throughput |
| avg() | Average of all values | avg(rate(http_requests_total[5m])) | Average rate across containers or pods |
| min() | Minimum value | min(rate(http_requests_total[5m])) | Lowest rate among all containers or pods |
| max() | Maximum value | max(rate(http_requests_total[5m])) | Highest rate among all containers or pods |
| count() | Count of elements | count(rate(http_requests_total[5m])) | Number of series, e.g. active containers or pods |
| stddev() | Population standard deviation | stddev(rate(http_requests_total[5m])) | Spread of the rates, in requests per second |
| stdvar() | Population variance | stdvar(rate(http_requests_total[5m])) | Variance of the rates – are all containers processing at similar rates? |
| topk() | Largest k values | topk(10, rate(http_requests_total[5m])) | 10 containers or pods with the highest rates – does not modify labels |
| bottomk() | Smallest k values | bottomk(10, rate(http_requests_total[5m])) | 10 containers or pods with the lowest rates – does not modify labels |
| quantile() | Value at qth quantile | quantile(0.5, rate(http_requests_total[5m])) | Median rate across all containers or pods |

Aggregation with BY and WITHOUT

Keep or drop specific labels.

Group by specific labels:

    sum by (job, instance) (rate(http_requests_total[5m]))

Exclude specific labels:

    sum without (pod) (rate(http_requests_total[5m]))
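
The difference between by and without comes down to how the grouping key is built. The Python sketch below models it with hypothetical helpers `sum_by` and `sum_without`. It is an illustration rather than Prometheus's implementation, and it ignores details such as the metric name being dropped.

```python
from collections import defaultdict

def sum_by(vector, keep):
    """Group on exactly the listed labels; all other labels are discarded."""
    groups = defaultdict(float)
    for labels, value in vector:
        groups[tuple((k, labels.get(k, "")) for k in keep)] += value
    return dict(groups)

def sum_without(vector, drop):
    """Group on every label except the listed ones."""
    groups = defaultdict(float)
    for labels, value in vector:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        groups[key] += value
    return dict(groups)

rates = [
    ({"job": "api", "instance": "a", "pod": "p1"}, 1.0),
    ({"job": "api", "instance": "a", "pod": "p2"}, 2.0),
    ({"job": "api", "instance": "b", "pod": "p3"}, 4.0),
]

print(sum_by(rates, keep=["job"]))       # one group holding the total
print(sum_without(rates, drop=["pod"]))  # one group per (instance, job)
```

without is often the safer choice in shared rules, since labels added later (for example a new `region` label) are preserved instead of being silently aggregated away.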

Time & Date Functions

Each of these except time() takes an optional instant vector of Unix timestamps as its argument, defaulting to vector(time()). All returned values are in UTC.

  • time() - Current Unix timestamp
  • minute() - Minute of hour (0-59)
  • hour() - Hour of day (0-23)
  • day_of_week() - Day of week (0-6, Sunday=0)
  • day_of_month() - Day of month (1-31)
  • days_in_month() - Number of days in month (28-31)
  • month() - Month (1-12)
  • year() - Current year

Example - Repeating Test Patterns

    - alert: TestAlert
      annotations:
        description: |
          This alert fires every 2 hours and resolves after 60 minutes.
        runbook_url: https://example.com
        summary: Test that alerts are working
      expr: |
        vector(1)
        unless (
          (hour() % 2 == 0)
        )
      labels:
        severity: none
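
To see why this expression produces the on/off pattern, here is a small Python sketch of its logic. The function name `alert_is_firing` is hypothetical. `unless` returns its left-hand side only when the right-hand side is empty, and `hour() % 2 == 0` is non-empty exactly during even UTC hours.

```python
def alert_is_firing(hour_utc):
    """Model of: vector(1) unless (hour() % 2 == 0)."""
    right_side_has_result = hour_utc % 2 == 0  # the filter keeps a sample on even hours
    return not right_side_has_result           # unless: fire only when the right side is empty

firing_hours = [h for h in range(24) if alert_is_firing(h)]
print(firing_hours)  # the odd UTC hours
```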

Example - Alert only during business hours:

Anti-Pattern: This works only if you live in Greenwich, England and ignore British Summer Time, because hour() is always UTC. Don’t do this; there are better ways.

    http_errors > 100 and on() hour() >= 9 and on() hour() < 17

Math Functions

Rounding:

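  • round() - Round to nearest integer, or to the nearest multiple of the optional second argument (default 1); ties round up
  • ceil() - Round up to the nearest integer (toward positive infinity)
  • floor() - Round down to the nearest integer (toward negative infinity)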
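
The behavior on negative numbers and on ties is easy to get wrong, so here is a short Python sketch. `promql_round` is a hypothetical helper that mimics the floor-based rounding PromQL uses; note that Python's built-in `round()` behaves differently because it rounds ties to the nearest even number.

```python
import math

def promql_round(value, to_nearest=1.0):
    """Round to the nearest multiple of to_nearest; ties round up."""
    return math.floor(value / to_nearest + 0.5) * to_nearest

print(promql_round(2.5), promql_round(-2.5))     # ties go up, also for negatives
print(promql_round(7, 5), promql_round(8, 5))    # nearest multiple of 5
print(math.ceil(-1.5), math.floor(-1.5))         # toward +inf and toward -inf
print(round(2.5))                                # Python rounds this tie to even
```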

Other Math:

  • abs() - Absolute value
  • sqrt() - Square root
  • exp() - Exponential
  • ln() - Natural logarithm
  • log2() - Log base 2
  • log10() - Log base 10

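Trigonometric functions are also available: sin(), cos(), tan(), asin(), acos(), atan(), sinh(), cosh(), tanh(), plus deg(), rad(), and pi().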

Quantile Functions

quantile() - quantile aggregation

    quantile(0.95, rate(http_requests_total[5m]))

Returns the 95th percentile of the per-series request rates across containers or pods. Use it to detect pods that are much slower or faster than the rest.

histogram_quantile() - Estimate quantile from histogram

    histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

Calculates the 95th percentile from a histogram, aggregating all pods or containers together. Useful for judging whether behavior is anomalous or expected.
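
The estimate comes from linear interpolation inside the bucket that contains the target rank. The Python sketch below shows that idea for classic (non-native) histograms. It is a simplified model that assumes positive bucket bounds, at least one finite bucket, and 0 < q < 1; the function and variable names are chosen for illustration.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q < 1:
        raise ValueError("this sketch only handles 0 < q < 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        return buckets[-2][0]  # rank lies beyond the last finite bound
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)

buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))  # interpolated between 0.5 and 1.0
print(histogram_quantile(0.99, buckets))  # capped at the highest finite bound
```

This is why bucket boundaries matter: the result can never be more precise than the width of the bucket the quantile falls into.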

Prediction Functions

predict_linear() - Linear prediction

Modeling Data is Hard: This uses simple linear regression to make predictions. If the data doesn’t look like a mostly straight line that could be modeled with $y = mx + b$, it won’t produce very good predictions. Free filesystem space often isn’t well modeled by this.

    predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)

Predicts free disk space 4 hours from now based on the last hour’s trend.
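
For intuition, the following Python sketch fits a least-squares line through (timestamp, value) samples and evaluates it some seconds into the future. It is an approximation of what predict_linear() does, with hypothetical names, and it assumes the evaluation time equals the timestamp of the last sample.

```python
def predict_linear(samples, seconds_ahead):
    """Fit y = m*x + b by least squares and evaluate it seconds_ahead from now."""
    now = samples[-1][0]                      # assumption: evaluate at the last sample
    xs = [t - now for t, _ in samples]        # time relative to the evaluation time
    ys = [v for _, v in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance             # m: change per second
    intercept = mean_y - slope * mean_x       # b: fitted value at the evaluation time
    return intercept + slope * seconds_ahead

# Free space in GiB, shrinking by 6 GiB every 10 minutes.
free_gib = [(0, 100.0), (600, 94.0), (1200, 88.0)]
print(predict_linear(free_gib, 4 * 3600))
```

A negative prediction means the fitted line crosses zero before the target time, which is the usual alert condition: `predict_linear(...) < 0`.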

deriv() - Derivative normalized to per-second

    deriv(node_memory_active_bytes[10m])

Rate of change over time. Useful for Gauge metrics. Can be negative. This example would show how fast memory is being consumed on a VM.

Sorting Functions

| PromQL Function | Description |
|---|---|
| sort() | Sort smallest first, greatest last |
| sort_desc() | Sort greatest first, smallest last |
| topk(5, ...) | Top 5 values |
| bottomk(5, ...) | Bottom 5 values |

Common Query Patterns

CPU Usage

CPU usage per instance:

CPU Usage is Tough to Track: With multi-core and fractional-core provisioning in Kubernetes and modern environments, it is most effective to count the number of CPU cores in use. Avoid using percentages here.

Node-Exporter Metrics:

    sum by (instance) (rate(node_cpu_seconds_total{mode!="idle", mode!="iowait", mode!="steal"}[5m]))

Kubernetes Metrics:

    sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))

Memory Usage

Memory usage percentage:

Node-Exporter Metrics:

    100 * (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
    /
    node_memory_MemTotal_bytes

Kubernetes Metrics:

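    100 * sum by (pod) (container_memory_working_set_bytes{container!=""})
    /
    sum by (pod) (kube_pod_container_resource_limits{resource="memory"})

The container!="" matcher excludes the pod-level cgroup series so usage is not counted twice.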

Disk Usage

Disk usage percentage:

Node-Exporter Metrics:

    100 * (node_filesystem_size_bytes - node_filesystem_free_bytes)
    /
    node_filesystem_size_bytes

Kubernetes PersistentVolume Metrics:

    100 * (1 - kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes)

HTTP Error Rate

Percentage of 5xx errors:

    100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))

Request Latency (p95)

95th percentile latency:

    histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

Kubernetes Pod Restarts

Pods restarting frequently:

    increase(kube_pod_container_status_restarts_total[30m]) > 1

Available Services

Count of healthy targets:

    sum(up{job="myapp"})

Percentage of healthy targets:

    100 * avg(up{job="myapp"})

Alert Rule Examples

High CPU Throttling Alert

CPU Usage Alerts: Generally CPU usage alerts are considered harmful. Instead, we want to know if Kubernetes is forcing our applications off the CPU for attempting to use more than their assigned limit.

    - alert: HighCPUThrottles
      expr: |
        100 * sum(
          increase(
            container_cpu_cfs_throttled_periods_total{container!=""}[5m]
          )
        ) by (container, pod, namespace)
        /
        sum(
          increase(
            container_cpu_cfs_periods_total[5m]
          )
        ) by (container, pod, namespace)
        > 25
      for: 5m
      labels:
        severity: warning
      annotations:
        # Only container, pod, and namespace survive the aggregation, so use those labels here.
        summary: "High CPU throttling on {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }})"
        # The expression already yields a percentage, so format it directly instead of using humanizePercentage.
        description: '{{ $value | printf "%.1f" }}% of CPU periods were throttled over the last 5m'

High Memory Alert

    - alert: HighMemory
      expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High memory usage on {{ $labels.instance }}"
        # The expression already yields a percentage, so format it directly instead of using humanizePercentage.
        description: 'Memory usage is {{ $value | printf "%.1f" }}%'

High Error Rate

    - alert: HighErrorRate
      expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High 5xx error rate"
        description: "Error rate is {{ $value | humanizePercentage }}"

Pod Crash Loop

    - alert: PodCrashLooping
      expr: increase(kube_pod_container_status_restarts_total[15m]) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
        description: "Pod in namespace {{ $labels.namespace }} restarting frequently"

Best Practices

  • Use rate() for counters: Always wrap counters in rate(), irate(), or increase(). A raw counter value only reflects the time since the last restart, so avoid graphing or alerting on it directly.
  • Choose appropriate time ranges: For rate(), use a range of at least 4x your scrape interval so the window still holds enough samples after a missed scrape. If scraping every 30s, use [2m] or longer; for 60s scrapes, use [5m].
  • Rate before aggregation: Always calculate rate() first, then sum() or aggregate as needed. PromQL helps enforce this because rate() requires a range vector. Taking the rate last is NOT THE SAME OPERATION: a reset in one counter makes the sum drop, which rate() misreads as a reset of the whole total, and you get spurious spikes.
  • Avoid high cardinality: Do not use labels with unbounded value sets such as user IDs, timestamps, email addresses, or IPs. If you need that level of detail, you really need tracing.
  • Use a 'for' clause in alerts: Always use a for duration to avoid flapping alerts from temporary spikes. 5m or more is a good place to start.
  • Use recording rules for complex queries: Pre-calculate expensive queries that are used in multiple dashboards or alerts, such as SLOs and error budgets.
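
The "rate before aggregation" rule can be demonstrated with a few numbers. The Python sketch below uses a hypothetical `counter_increase` helper with the same reset compensation that rate() applies, and compares both orders of operation for two pods, one of which restarts.

```python
def counter_increase(values):
    """Total increase across samples, treating any drop as a counter reset."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += cur if cur < prev else cur - prev
    return total

pod_a = [100, 110, 120, 130]  # steady counter
pod_b = [500, 510, 5, 15]     # restarts between the 2nd and 3rd sample

# Correct order: per-series increase first, then sum.
rate_then_sum = counter_increase(pod_a) + counter_increase(pod_b)

# Wrong order: sum the raw counters, then take the increase of the total.
summed = [a + b for a, b in zip(pod_a, pod_b)]
sum_then_rate = counter_increase(summed)

print(rate_then_sum)  # 30 + 25
print(sum_then_rate)  # inflated: the drop in the sum looks like a reset of everything
```

The summed series falls from 620 to 125 when pod_b restarts, so the reset compensation adds the full 125 back in, including pod_a's accumulated count that never reset.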

Quick Reference

Recording Rule Example

    groups:
      - name: example
        interval: 30s
        rules:
          - record: instance:node_cpu:avg_rate5m
            expr: avg by (instance) (rate(node_cpu_seconds_total[5m]))

Template Variables in Annotations

    {{ $labels.instance }}            # Label value
    {{ $value }}                      # Current value
    {{ $value | humanize }}           # Human-readable value (1000 -> 1k)
    {{ $value | humanizePercentage }} # Format a 0-1 ratio as a percentage
    {{ $value | humanizeDuration }}   # Format seconds as a duration
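
humanizePercentage expects a ratio between 0 and 1, not a value that has already been multiplied by 100. The Python sketch below approximates its formatting (multiply by 100, keep four significant digits) to show what happens in each case. The function name is hypothetical and the exact output format is an approximation of the Prometheus template function.

```python
def humanize_percentage(ratio):
    """Approximate humanizePercentage: scale a ratio to percent, 4 significant digits."""
    return f"{ratio * 100:.4g}%"

print(humanize_percentage(0.0534))  # a ratio, as intended
print(humanize_percentage(92.5))    # an already-scaled percentage gets scaled again
```

If an alert expression multiplies by 100, either drop that factor and compare against a ratio, or format the value with `printf` in the annotation instead.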

Further Resources


Generate alerts automatically: Use our Prometheus Alert Generator to create SLO-based alerting rules with multi-window burn rate detection.