Product · SLO tracking

SLOs that say something useful.

Most SLO products are uptime checkboxes with extra fields. Our model is the one in Google's Site Reliability Workbook: a target, a window, an explicit good-event definition, and an error budget that is spent every time an event fails that definition. We page on burn rate, meaning how fast the budget is being consumed, not on raw failure counts.

yaml

name: payments-availability-99.9
target: 0.999
window: 28d
sli:
  source_monitor: payments-api-charge
  good_event:
    status_in: [200, 201, 204]
    latency_lt_ms: 1000

alert:
  rule: |
    burn_rate(short=1h) > 14.4
    AND
    burn_rate(long=6h) > 14.4
  channels: [pagerduty:payments-oncall]
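
To make the good-event definition and the error budget concrete, here is a minimal Python sketch of the arithmetic behind the config above. It is an illustration under stated assumptions, not the product's implementation: it assumes the two good_event conditions are combined with AND, that the budget is counted in events rather than minutes, and the `Event` class and both function names are hypothetical.

```python
from dataclasses import dataclass

# Values copied from the example config above.
TARGET = 0.999
GOOD_STATUSES = {200, 201, 204}
LATENCY_LIMIT_MS = 1000


@dataclass(frozen=True)
class Event:
    """One request observed by the source monitor (hypothetical shape)."""
    status: int
    latency_ms: float


def is_good(event: Event) -> bool:
    # Assumption: an event is good only if BOTH conditions hold.
    # latency_lt_ms is read as a strict "less than".
    return event.status in GOOD_STATUSES and event.latency_ms < LATENCY_LIMIT_MS


def budget_remaining(events: list[Event], target: float = TARGET) -> float:
    """Fraction of the error budget left for the given events.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    total = len(events)
    if total == 0:
        return 1.0
    bad = sum(1 for e in events if not is_good(e))
    allowed_bad = (1 - target) * total  # the error budget, in events
    return 1 - bad / allowed_bad
```

With a 0.999 target, 10,000 events carry a budget of 10 bad events, so 5 failures leave about half of the budget. A 200 response that takes exactly 1000 ms counts as bad under the strict comparison assumed here.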

Multi-window burn-rate alerting

The Google CRE pattern. A short window catches fast outages; a long window catches slow regressions. Both windows have to exceed the burn-rate threshold before an alert fires, so a brief spike that trips only the short window does not page anyone.
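
The two-window rule from the example config can be sketched in a few lines of Python. This is a hedged illustration, not the product's evaluator: it assumes burn rate is the observed error rate divided by the budgeted error rate (1 - target), and the function names and the `(bad, total)` tuple shape are hypothetical.

```python
def burn_rate(bad: int, total: int, target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly over the full SLO window.
    """
    if total == 0:
        return 0.0
    return (bad / total) / (1 - target)


def should_page(
    short: tuple[int, int],
    long: tuple[int, int],
    target: float = 0.999,
    threshold: float = 14.4,
) -> bool:
    """Fire only when BOTH windows burn faster than the threshold.

    `short` and `long` are (bad_events, total_events) counts for the
    1h and 6h windows respectively.
    """
    return (
        burn_rate(*short, target) > threshold
        and burn_rate(*long, target) > threshold
    )


def hours_to_exhaustion(rate: float, window_hours: float = 28 * 24) -> float:
    """Hours until a full budget is gone at a constant burn rate."""
    return window_hours / rate
```

For example, 20 bad events out of 1,000 in the last hour is a burn rate of about 20, but if the 6h window shows only 60 bad out of 6,000 (about 10), `should_page` returns `False`. At a sustained burn rate of 14.4, a 28-day budget would last roughly 46.7 hours.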

Amaia’s burn-rate cheat-sheet is a useful pocket reference. The full doc is slo-multi-window-alerting.

Status pages, included.

Every SLO can back a public or private status page component. NorthLoop runs a per-customer status page on top of this; see the NorthLoop case study for the operational shape.