SLO multi-window alerting: short and long windows together
Multi-window SLO alerting
This is the multi-window burn-rate alerting pattern from Google SRE / CRE. Two windows evaluated together catch both fast-and-loud and slow-and-stealthy failures.
the two windows
- A short window — 1 hour. Catches fast burn (an outage now).
- A long window — 6 hours. Catches slow burn (a chronic regression eating budget over a workday).
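A page fires only when both windows agree that the burn rate is elevated. Requiring agreement suppresses most transient false positives, because a brief spike moves the 1-hour average but barely moves the 6-hour one. A severe outage still pushes both windows over the threshold within minutes; a burn that sits only just above the threshold takes longer, up to the length of the long window.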
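To make the "both windows agree" condition concrete, here is a minimal Python sketch. It assumes burn rate is the observed error ratio divided by the error ratio the SLO allows. `WindowCounts`, `burn_rate`, and `should_page` are hypothetical names used for illustration, not part of the product's API.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts seen in one lookback window (hypothetical shape)."""
    total: int
    errors: int


def burn_rate(counts: WindowCounts, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if counts.total == 0:
        return 0.0  # assumption: no traffic counts as no burn
    allowed_error_ratio = 1.0 - slo_target
    return (counts.errors / counts.total) / allowed_error_ratio


def should_page(short: WindowCounts, long: WindowCounts,
                slo_target: float, threshold: float) -> bool:
    """Fire only when BOTH windows exceed the threshold."""
    return (burn_rate(short, slo_target) > threshold
            and burn_rate(long, slo_target) > threshold)


# A brief spike: the 1h window is hot (about 20x), the 6h window is not (about 3.5x).
spike = should_page(WindowCounts(100_000, 2_000), WindowCounts(600_000, 2_100),
                    slo_target=0.999, threshold=14.4)

# A sustained outage: both windows sit at about 20x.
outage = should_page(WindowCounts(100_000, 2_000), WindowCounts(600_000, 12_000),
                     slo_target=0.999, threshold=14.4)

print(spike, outage)
```

With these made-up counts the spike does not page and the sustained outage does, which is the behavior the two-window condition is meant to produce.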
the rule
alert:
  name: payments-availability-fast-burn
  slo: payments-availability-99.9
  rule: |
    burn_rate(short=1h) > 14.4
    AND
    burn_rate(long=6h) > 14.4
  channels: [pagerduty:payments-oncall]
14.4 is the standard threshold for using up 2% of a 30-day budget in 1 hour (0.02 × 720 h / 1 h = 14.4). We've pre-calibrated the table for the windows we recommend; see the dashboard's "Alert calculator" or the cheat-sheet on /blog/slo-burn-rates.
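The threshold comes from simple arithmetic: burn rate = (fraction of budget spent) × (SLO period) / (alert window). The sketch below assumes a 30-day period. `burn_rate_threshold` is a hypothetical helper for checking the numbers by hand, not the dashboard's calculator.

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_days: int = 30) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`."""
    period_hours = period_days * 24
    return budget_fraction * period_hours / window_hours


# 2% of a 30-day budget in 1 hour, and 5% in 6 hours.
fast = burn_rate_threshold(0.02, window_hours=1)
slow = burn_rate_threshold(0.05, window_hours=6)

print(round(fast, 1), round(slow, 1))
```

Under this formula, 14.4 over a 1-hour window corresponds to 2% of a 30-day budget, and 6.0 over a 6-hour window corresponds to 5%.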
a second, slower rule
You also want to know about chronic burn that won't trip the fast rule:
alert:
  name: payments-availability-slow-burn
  slo: payments-availability-99.9
  rule: |
    burn_rate(short=6h) > 6.0
    AND
    burn_rate(long=24h) > 6.0
  channels: [slack:#payments-oncall]
This routes to Slack, not PagerDuty; see alert-channels.
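One way to see why a slower rule can reasonably go to a chat channel rather than a pager: at a constant burn rate, a full 30-day budget lasts 30 days divided by the burn rate. A rough sketch follows. The helper name is hypothetical, and both the constant burn and the untouched starting budget are assumptions.

```python
def hours_until_budget_exhausted(burn_rate: float, period_days: int = 30) -> float:
    """Hours a full error budget lasts at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn rate must be positive")
    return period_days * 24 / burn_rate


fast_rule = hours_until_budget_exhausted(14.4)  # about 50 hours
slow_rule = hours_until_budget_exhausted(6.0)   # 120 hours, i.e. 5 days

print(round(fast_rule), round(slow_rule))
```

At the slow rule's threshold there are days, not hours, before the budget runs out, so a human can pick it up during working hours.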
why both, not one
A monitor-failure alert says "this thing failed." A burn-rate alert says "we are off track on our promise to users." The two-rule pattern keeps you on track without paging at every flap.
anti-pattern: alerting on raw SLI
Don't alert whenever the SLI dips below the target: that alert fires constantly, even on a healthy SLO. Short dips are supposed to spend budget; that's what budgets are for. Always alert on burn rate.
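A small illustration of the difference, using made-up hourly SLI samples for a service that is comfortably within budget. For brevity it checks a single window per sample instead of the two-window rule, and all names and values are hypothetical.

```python
SLO_TARGET = 0.999
PAGE_THRESHOLD = 14.4

# Hypothetical hourly SLI samples: healthy overall, with two small dips.
hourly_sli = [0.9995, 0.9989, 0.9996, 0.9988, 0.9997, 0.9994]


def burn_rate(sli: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return (1.0 - sli) / (1.0 - slo_target)


# Naive rule: alert whenever the SLI is below the target.
raw_sli_alerts = [sli for sli in hourly_sli if sli < SLO_TARGET]

# Burn-rate rule: alert only when budget is being spent dangerously fast.
burn_rate_alerts = [sli for sli in hourly_sli
                    if burn_rate(sli, SLO_TARGET) > PAGE_THRESHOLD]

print(len(raw_sli_alerts), len(burn_rate_alerts))
```

The naive rule fires on both dips even though the worst hour burns at only about 1.2x, far below any paging threshold; the burn-rate rule stays quiet.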