Incident management: open, acknowledge, resolve, postmortem
Owner Amaia Larrea · Last updated 2026-04-04 · v3.0
incident, ack, resolve, postmortem, response
Incident management
When an alert fires, DevFlow opens an incident. The incident is the unit of "thing on fire": it groups subsequent alerts on the same monitor and the same root failure into one place, so you are not paged on every failing check.
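The grouping idea can be sketched roughly as follows. This is an illustration only, not DevFlow's implementation: the `IncidentGrouper` class, the `failure` string used to stand in for "same root failure", and the return value are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    monitor: str
    failure: str                      # hypothetical label for the root failure
    alerts: list = field(default_factory=list)


class IncidentGrouper:
    """Groups alerts by (monitor, failure) so only the first one opens an incident."""

    def __init__(self):
        self.open_incidents = {}

    def on_alert(self, monitor: str, failure: str, message: str) -> bool:
        """Attach the alert to its incident; return True if a new incident was opened."""
        key = (monitor, failure)
        is_new = key not in self.open_incidents
        if is_new:
            self.open_incidents[key] = Incident(monitor, failure)
        self.open_incidents[key].alerts.append(message)
        return is_new


grouper = IncidentGrouper()
first = grouper.on_alert("payments-api-charge", "http-5xx", "check failed in us-east-1")
second = grouper.on_alert("payments-api-charge", "http-5xx", "check failed in eu-west-1")
```

Here only the first call opens an incident; the second alert is attached to the same one. How DevFlow actually decides that two alerts share a root failure is not covered on this page.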
the lifecycle
- Open: the first failing check on a monitor. The monitor becomes alerting and an incident is created.
- Notified: the configured channel(s) deliver the alert. We retry channel delivery up to 3 times.
- Acknowledged: a human has acknowledged the incident. The page-storm stops.
- Resolved: either the monitor recovers (auto-resolve, on by default) or a human marks the incident resolved.
- Postmortem: DevFlow writes a stub Markdown doc with the timeline and links it from the incident page.
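One way to picture the lifecycle is as a small state machine. The sketch below is a reading of the list above, not DevFlow's code: the `State` enum, `advance`, and `deliver` are hypothetical names. Two points are assumptions: that an incident may resolve before anyone acknowledges it, and that "up to 3 times" means three delivery attempts in total.

```python
from enum import Enum


class State(Enum):
    OPEN = "open"
    NOTIFIED = "notified"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"
    POSTMORTEM = "postmortem"


# Allowed moves between states. Resolving straight from OPEN or NOTIFIED
# is an assumption (auto-resolve could fire before a human responds).
TRANSITIONS = {
    State.OPEN: {State.NOTIFIED, State.RESOLVED},
    State.NOTIFIED: {State.ACKNOWLEDGED, State.RESOLVED},
    State.ACKNOWLEDGED: {State.RESOLVED},
    State.RESOLVED: {State.POSTMORTEM},
    State.POSTMORTEM: set(),
}

MAX_DELIVERY_ATTEMPTS = 3  # assumption: 3 attempts in total


def advance(current: State, target: State) -> State:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


def deliver(send, max_attempts: int = MAX_DELIVERY_ATTEMPTS) -> int:
    """Call send() until it returns True; return the attempt number that succeeded.

    Returns 0 if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if send():
            return attempt
    return 0
```

For example, `advance(State.OPEN, State.NOTIFIED)` returns `State.NOTIFIED`, while `advance(State.OPEN, State.POSTMORTEM)` raises `ValueError`.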
auto-resolve
If a monitor goes back to passing for 5 consecutive checks, the incident auto-resolves with a "DevFlow auto-resolved at HH:MM" note. Tunable per project.
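A minimal sketch of the consecutive-pass rule, assuming a failing check resets the count. `AutoResolver` and `resolve_note` are hypothetical names for illustration; only the threshold of 5 and the note text come from this page.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class AutoResolver:
    required_passes: int = 5  # documented default; tunable per project
    streak: int = 0

    def observe(self, check_passed: bool) -> bool:
        """Record one check result; return True once the streak is long enough."""
        # A failing check resets the streak to zero (an assumption).
        self.streak = self.streak + 1 if check_passed else 0
        return self.streak >= self.required_passes


def resolve_note(at: datetime) -> str:
    """Format the note attached to an auto-resolved incident."""
    return f"DevFlow auto-resolved at {at:%H:%M}"


resolver = AutoResolver()
results = [True, True, False, True, True, True, True, True]
decisions = [resolver.observe(r) for r in results]
```

In this run the failure in third position resets the count, so only the last check (the fifth pass in a row) returns `True`.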
the postmortem stub
```markdown
# Incident: payments-api-charge — 2026-04-15
- Started: 2026-04-15 14:32 UTC
- Resolved: 2026-04-15 14:51 UTC
- Duration: 19m
- Affected SLO: payments-availability-99.9 (consumed 47% of remaining error budget)
## timeline
- 14:32 — Monitor failed in us-east-1, eu-west-1, ap-southeast-1.
- 14:33 — PagerDuty paged on-call (Priya).
- 14:35 — Acknowledged.
- ...
## root cause
TBD by author.
## action items
TBD.
```

We don't replace tools like Jeli or FireHydrant; the stub is a starting point you copy out.
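To show how the stub's fields relate to each other, here is a rough sketch that renders the same layout from a start time, a resolve time, and a list of timeline events. `render_stub` is a hypothetical helper, not a DevFlow API. It leaves out the "Affected SLO" line because the error-budget calculation is not described here, and printing every duration in minutes is an assumption.

```python
from datetime import datetime

SEP = "\u2014"  # the dash character the stub uses between fields


def render_stub(monitor, started, resolved, timeline):
    """Build stub text from an incident's start, end, and (time, text) events."""
    minutes = int((resolved - started).total_seconds() // 60)
    lines = [
        f"# Incident: {monitor} {SEP} {started:%Y-%m-%d}",
        f"- Started: {started:%Y-%m-%d %H:%M} UTC",
        f"- Resolved: {resolved:%Y-%m-%d %H:%M} UTC",
        f"- Duration: {minutes}m",
        "## timeline",
    ]
    lines += [f"- {when:%H:%M} {SEP} {text}" for when, text in timeline]
    lines += ["## root cause", "TBD by author.", "## action items", "TBD."]
    return "\n".join(lines)


stub = render_stub(
    "payments-api-charge",
    datetime(2026, 4, 15, 14, 32),
    datetime(2026, 4, 15, 14, 51),
    [(datetime(2026, 4, 15, 14, 35), "Acknowledged.")],
)
print(stub)
```

With the start and resolve times from the example incident, the duration line comes out as `- Duration: 19m`, matching the stub.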
escalation
If nobody acknowledges the incident within fallback.after_minutes, the schedule's fallback fires. See on-call-schedules.
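The escalation check amounts to a deadline comparison. The sketch below uses a hypothetical `fallback_due` function and assumes the clock starts when the incident opens; this page does not say whether DevFlow measures from the open time or from the first notification.

```python
from datetime import datetime, timedelta
from typing import Optional


def fallback_due(
    opened_at: datetime,
    acknowledged_at: Optional[datetime],
    now: datetime,
    after_minutes: int,
) -> bool:
    """Return True if the fallback should fire for an unacknowledged incident."""
    if acknowledged_at is not None:
        return False  # someone has acknowledged; no escalation needed
    # Assumption: the deadline is measured from the time the incident opened.
    return now - opened_at >= timedelta(minutes=after_minutes)


opened = datetime(2026, 4, 15, 14, 32)
too_early = fallback_due(opened, None, datetime(2026, 4, 15, 14, 40), after_minutes=10)
overdue = fallback_due(opened, None, datetime(2026, 4, 15, 14, 42), after_minutes=10)
```

Here `after_minutes=10` is an example value only; the real setting comes from `fallback.after_minutes` on the schedule.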
related
- notification-throttling for mute / maintenance behaviour during a known incident.
- status-pages for surfacing the incident publicly.