Incident management: open, acknowledge, resolve, postmortem

Owner Amaia Larrea · Last updated 2026-04-04 · v3.0

incident · ack · resolve · postmortem · response

Incident management

When an alert fires, DevFlow opens an incident. The incident is the unit of "thing on fire": it groups subsequent alerts from the same monitor and the same root failure into one place, instead of paging on every failed check.
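To make the grouping concrete, here is a minimal sketch of the idea, not DevFlow's implementation. The names `Alert`, `Incident`, and `IncidentBook` are hypothetical, and the sketch assumes the monitor ID alone is the grouping key; matching on "same root failure" is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    monitor_id: str  # hypothetical identifier for the failing monitor
    message: str


@dataclass
class Incident:
    monitor_id: str
    alerts: list = field(default_factory=list)


class IncidentBook:
    """Keeps at most one open incident per monitor (assumed grouping key)."""

    def __init__(self):
        self.open_incidents = {}

    def record(self, alert):
        """Attach the alert to an incident; return True if one was opened."""
        incident = self.open_incidents.get(alert.monitor_id)
        opened = incident is None
        if opened:
            incident = Incident(alert.monitor_id)
            self.open_incidents[alert.monitor_id] = incident
        incident.alerts.append(alert)
        return opened
```

In this sketch only the call that returns `True` would page anyone; later alerts on the same monitor are appended to the existing incident.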

the lifecycle

Open: the first failing check on a monitor. The monitor's status becomes alerting and an incident is created.
Notified: the configured channel(s) deliver the notification. We retry channel delivery up to 3 times.
Acknowledged: a human has acknowledged the incident. The page-storm stops.
Resolved: either the monitor recovers (auto-resolve, on by default) or a human marks the incident resolved.
Postmortem: DevFlow writes a stub Markdown doc with the timeline and links it from the incident page.
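The lifecycle above can be read as a small state machine. The sketch below is illustrative only: the `State` enum, the `TRANSITIONS` table, and both helper functions are hypothetical names, and the allowed transitions are an assumption inferred from the list. The retry helper also assumes "up to 3 times" means one initial attempt plus up to three retries; the source does not say which reading is intended.

```python
from enum import Enum


class State(Enum):
    OPEN = "open"
    NOTIFIED = "notified"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"
    POSTMORTEM = "postmortem"


# Allowed transitions (an assumption based on the lifecycle list). An incident
# can resolve without ever being acknowledged, for example via auto-resolve.
TRANSITIONS = {
    State.OPEN: {State.NOTIFIED, State.RESOLVED},
    State.NOTIFIED: {State.ACKNOWLEDGED, State.RESOLVED},
    State.ACKNOWLEDGED: {State.RESOLVED},
    State.RESOLVED: {State.POSTMORTEM},
    State.POSTMORTEM: set(),
}


def advance(current, target):
    """Return the target state, or raise if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


def deliver_with_retry(send, max_retries=3):
    """Call send() once, then retry up to max_retries times on failure.

    Returns the number of attempts made on success, or None if all failed.
    """
    for attempt in range(1, max_retries + 2):
        if send():
            return attempt
    return None
```

Keeping the transitions in a table makes it easy to see that acknowledgement is optional on the path to resolution.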

auto-resolve

If a monitor passes 5 consecutive checks, the incident auto-resolves with a "DevFlow auto-resolved at HH:MM" note. Auto-resolve is tunable per project.
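The consecutive-pass rule can be sketched as below. This is an illustration under stated assumptions, not DevFlow's code: the function names are hypothetical, check results are modelled as a list of booleans (oldest first), and `required_passes` stands in for whatever the per-project setting is actually called.

```python
from datetime import datetime


def should_auto_resolve(check_results, required_passes=5):
    """True if the most recent `required_passes` checks all passed.

    check_results is a list of booleans, oldest first (True = passing).
    """
    if len(check_results) < required_passes:
        return False
    return all(check_results[-required_passes:])


def auto_resolve_note(resolved_at):
    """Format the note text quoted above for a given datetime."""
    return f"DevFlow auto-resolved at {resolved_at:%H:%M}"
```

A single failing check inside the window resets the count, because the last five results are no longer all passing.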

the postmortem stub

markdown

# Incident: payments-api-charge — 2026-04-15

- Started: 2026-04-15 14:32 UTC
- Resolved: 2026-04-15 14:51 UTC
- Duration: 19m
- Affected SLO: payments-availability-99.9 (consumed 47% of remaining error budget)

## timeline
- 14:32 — Monitor failed in us-east-1, eu-west-1, ap-southeast-1.
- 14:33 — PagerDuty paged on-call (Priya).
- 14:35 — Acknowledged.
- ...

## root cause
TBD by author.

## action items
TBD.

We don't replace tools like Jeli or FireHydrant — the stub is a starting point you copy out.
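For readers who post-process the stub, the summary lines at its top can be derived from the start and resolve times alone. The sketch below is a hypothetical helper, not part of DevFlow; it assumes both timestamps are naive datetimes already in UTC and that the duration is shown in whole minutes, as in the example stub.

```python
from datetime import datetime


def format_duration(started, resolved):
    """Render the gap between two datetimes as whole minutes, e.g. '19m'."""
    minutes = int((resolved - started).total_seconds() // 60)
    return f"{minutes}m"


def render_summary(started, resolved):
    """Build the Started / Resolved / Duration bullets of the stub."""
    fmt = "%Y-%m-%d %H:%M UTC"  # assumes the inputs are already in UTC
    return "\n".join([
        f"- Started: {started.strftime(fmt)}",
        f"- Resolved: {resolved.strftime(fmt)}",
        f"- Duration: {format_duration(started, resolved)}",
    ])
```

With the example incident's times (14:32 to 14:51 on 2026-04-15), `format_duration` yields `19m`, matching the stub.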

escalation

If the incident has not been acknowledged after fallback.after_minutes, the schedule's fallback fires. See on-call-schedules.
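The escalation check reduces to a single time comparison. The sketch below is hypothetical: `fallback_due` is an illustrative name, `after_minutes` mirrors the `fallback.after_minutes` setting, and the clock is assumed to start when the first notification is sent, which the source does not specify.

```python
from datetime import datetime, timedelta


def fallback_due(notified_at, acknowledged_at, now, after_minutes):
    """True if the fallback should fire: no ack and the window has elapsed."""
    if acknowledged_at is not None:
        return False  # someone acknowledged, so no escalation
    return now - notified_at >= timedelta(minutes=after_minutes)
```

An acknowledgement at any point before the check runs suppresses the fallback, regardless of how much time has passed since.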

See notification-throttling for mute / maintenance behaviour during a known incident.
See status-pages for surfacing the incident publicly.
