2025-11-19 · Kelly Mendoza · 6-min read

How we write runbooks at DevFlow

Our internal runbook style guide, in case it's useful for your team.

what a runbook is for

A runbook is a checklist for an on-call engineer who has just been woken at 3am. It is not a wiki page about a system. The reader is tired, working from partial information, and short on time.

the four sections

Every runbook has exactly four sections, in this order:

Symptoms — what fired, what the on-call sees in their first 30 seconds.
First action — the thing to do before anything else. Often "page Priya". Sometimes "do nothing for 5 minutes".
Diagnosis — short branching tree of "if X then Y" with links to dashboards.
Resolution — the steps to fix, ordered by reversibility (least-destructive first).
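The four-section rule is simple enough to check mechanically. Below is a minimal sketch of such a check, not a tool we ship: it assumes each section starts on its own line with the section name followed by a period (the layout used in the example later in this post), and the function names are made up for illustration.

```python
# Hypothetical linter sketch: assumes each section begins a line as "Name."
REQUIRED_SECTIONS = ["Symptoms", "First action", "Diagnosis", "Resolution"]


def section_order(runbook_text):
    """Return the section labels found at the start of lines, in document order."""
    found = []
    for line in runbook_text.splitlines():
        stripped = line.strip()
        for name in REQUIRED_SECTIONS:
            if stripped.startswith(name + "."):
                found.append(name)
                break
    return found


def check_runbook(runbook_text):
    """Return a list of problems; an empty list means the structure is fine."""
    found = section_order(runbook_text)
    problems = []
    for name in REQUIRED_SECTIONS:
        count = found.count(name)
        if count == 0:
            problems.append(f"missing section: {name}")
        elif count > 1:
            problems.append(f"duplicate section: {name}")
    # All four present exactly once, so any mismatch is an ordering problem
    if not problems and found != REQUIRED_SECTIONS:
        problems.append("sections out of order: " + ", ".join(found))
    return problems
```

Something like this could run in CI against the runbook directory, so that a missing or misplaced section is caught at review time rather than at 3am.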

what we deliberately leave out

Architecture diagrams. They go on the wiki.
The history of why the system is the way it is. That goes on the wiki.
Speculation about root cause. The on-call doesn't have time.

a real example

Our runbook for "probe network: 1+ region failing":

Symptoms. PagerDuty page probe-region-down. Dashboard at status.devflow.io shows a region in degraded state.

First action. Open status.devflow.io and confirm the region. Post in #incident-room.

Diagnosis.
- Single region: skip to "single-region degraded" branch.
- Multiple regions: skip to "wide-blast-radius" branch.

Resolution.
- Single-region: drain probes from the affected region (devflow ops probe drain --region us-east-1). The remaining quorum will absorb the load. Wait 5 minutes. If the region recovers, leave the drain in place until the on-call hands off. If it does not recover, escalate to L2.
- Wide-blast-radius: page Priya immediately. Do not attempt a drain.
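To show how little branching the on-call has to hold in their head, the diagnosis and resolution above can be sketched as one small function. This is an illustration only: the function name and its parameters are hypothetical, and substituting the region name into the drain command is an assumption based on the us-east-1 example.

```python
def next_steps(failing_regions, recovered_after_wait=None):
    """Return the ordered actions for the probe-region-down runbook.

    failing_regions: region names confirmed on the status dashboard.
    recovered_after_wait: None before the 5-minute wait, then True or False.
    """
    if not failing_regions:
        raise ValueError("runbook applies only when at least one region is failing")

    if len(failing_regions) > 1:
        # Wide-blast-radius branch: escalate, never drain
        return ["page Priya immediately", "do not attempt drain"]

    # Single-region branch: drain first, then wait and reassess
    region = failing_regions[0]
    steps = [f"devflow ops probe drain --region {region}", "wait 5 minutes"]
    if recovered_after_wait is True:
        steps.append("leave drain in place until on-call hands off")
    elif recovered_after_wait is False:
        steps.append("escalate to L2")
    return steps
```

The whole runbook fits in two branches and one follow-up check. If a sketch like this needs more than a screen, the runbook is probably trying to do too much.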

what to read next

If you want the long version, the SRE Workbook chapter on incident response is the canonical text. Our runbook style is pulled directly from there.

— Kelly