We replaced our vendors' vendor (and you can too)
On the architectural decision to run our own probe network instead of leasing one.
the lesson from PagerDuty
When I joined PagerDuty in 2018, the alerting platform leaned heavily on a third-party WebSocket vendor. It worked great until the vendor had a 47-minute global outage and PagerDuty's core product went silent at 4am Pacific time.
The team that ran the vendor relationship had been negotiating better SLAs for months. None of that mattered while customers couldn't be paged.
what we decided about probes
When we started DevFlow, we made a non-obvious call: don't lease our probe network. Build it.
Most synthetic-monitoring vendors resell somebody else's edge. That's fine until your synthetic monitoring is itself part of your customers' incident-response loop. At that point, whatever monitors your monitoring probably belongs to that upstream provider too, and your customers are paying you to go down in lockstep with someone else's outage.
The 14-edge probe fleet is ours, on our metal, on three independent cloud providers. We can rebuild any region from scratch in under an hour. The probe's source code is available for our customers to read under NDA.
what this costs
A lot. Probably a quarter of our gross margin. We could have shipped the product six months earlier without it.
why it's worth it
Two reasons. One: our customers' SLOs depend on ours, and we're not willing to inherit anyone else's outage tail. Two: we can make product calls the resellers can't, like sub-10s checks at Scale tier, which aren't economically feasible when you pay a vendor per probe.
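To make the per-probe point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes one check runs from every one of the 14 edges on each interval, and the per-run price is a made-up placeholder, not a real vendor quote; only the run counts follow directly from the interval and the fleet size.

```python
SECONDS_PER_DAY = 24 * 60 * 60

EDGES = 14  # fleet size mentioned in this post
HYPOTHETICAL_PRICE_PER_RUN = 0.0001  # assumed placeholder, not a real vendor price


def probe_runs_per_day(interval_seconds: int, edges: int) -> int:
    """Probe executions per day for ONE check, assuming it runs from every edge."""
    return (SECONDS_PER_DAY // interval_seconds) * edges


def daily_cost(interval_seconds: int, edges: int, price_per_run: float) -> float:
    """Daily cost of one check under a per-run pricing model (hypothetical)."""
    return probe_runs_per_day(interval_seconds, edges) * price_per_run


if __name__ == "__main__":
    for interval in (60, 10):
        runs = probe_runs_per_day(interval, EDGES)
        cost = daily_cost(interval, EDGES, HYPOTHETICAL_PRICE_PER_RUN)
        print(f"{interval:>3}s interval: {runs} runs/day, ~{cost:.2f} per check per day")
```

Whatever the real price is, moving from a 60s to a 10s interval multiplies the per-check bill by six under this model, and that multiplier applies to every check a customer configures.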
— Priya