Troubleshooting: false positives and noise
False positives
A false positive is an alert that fired when the system was actually fine. It's a more general problem than flapping (troubleshooting-flapping-monitors); the cures overlap, but it is worth treating as a distinct problem.
diagnose first
Before tuning, ask: was the customer-facing request actually fine, or did our check just disagree?
- Check the upstream's own metrics for the same window. If they show it was 100% healthy, the alert really was a false positive.
- Check the assertion that failed. Was it about the customer experience, or an implementation detail?
If the upstream had a real blip — even one your customers wouldn't notice — that is a true positive. The right fix is upstream, not in DevFlow.
fix candidates
- Loosen the assertion to what customers actually depend on. A response that returns a stable contract but with a new optional field shouldn't fail your check.
- Move from monitor-failure paging to SLO burn-rate paging. A 0.5% blip doesn't trip a 14.4× burn rate over a 1h window — see slo-multi-window-alerting.
- Use multi-region checks with a fail-quorum (see multi-region-setup) to absorb single-edge weirdness.
- Use retry-policy for endpoints with known transient noise.
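
The burn-rate arithmetic behind the SLO bullet above can be sketched in a few lines. This is a minimal illustration, not DevFlow's implementation: the 99.9% SLO target is an assumption (the text does not state one), and `burn_rate` and `should_page` are hypothetical helper names.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Return how many times faster than sustainable the error budget is spent.

    error_rate: fraction of failed checks in the alert window (0.005 == 0.5%).
    slo_target: availability objective as a fraction (0.999 == 99.9%).
    """
    error_budget = 1.0 - slo_target  # allowed failure fraction
    return error_rate / error_budget


def should_page(error_rate: float, slo_target: float, threshold: float = 14.4) -> bool:
    """Page only when the window's burn rate reaches the threshold."""
    return burn_rate(error_rate, slo_target) >= threshold


if __name__ == "__main__":
    SLO_TARGET = 0.999  # assumed target, not taken from the text

    # A 0.5% blip over the 1h window burns budget at roughly 5x: no page.
    print(f"blip burn rate: {burn_rate(0.005, SLO_TARGET):.1f}x")
    print("pages:", should_page(0.005, SLO_TARGET))

    # Error rate needed to reach 14.4x at this target (about 1.44%).
    print(f"paging threshold: {14.4 * (1.0 - SLO_TARGET):.2%}")
```

The result depends heavily on the target. Under the same formula, a 0.5% blip against a 99.99% SLO is a 50× burn and would page, so check the numbers against your own objective before relying on this to absorb noise.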
TLS-related noise
If you keep getting handshake failures, check whether your server's cert chain includes the intermediate certificate. Browsers often mask a missing intermediate by using cached copies; curl and our probe don't. See mtls for the related debug command.
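
One way to reproduce the strict behaviour outside a browser is a verified handshake from Python's standard `ssl` module, which does not fetch missing intermediates on its own. The sketch below is not the DevFlow probe or the debug command from mtls; `check_chain` is a hypothetical helper and `example.com` is a placeholder hostname.

```python
import socket
import ssl


def check_chain(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Attempt a verified TLS handshake and summarise the outcome as a string."""
    # Default context: system trust roots, certificate and hostname checks on.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return f"ok: {tls.version()}"
    except ssl.SSLCertVerificationError as exc:
        # Must be caught before OSError, which it subclasses.
        return f"verify-failed: {exc.verify_message}"
    except OSError as exc:
        # DNS failure, refused connection, timeout, or another TLS error.
        return f"connect-failed: {exc}"


if __name__ == "__main__":
    # Placeholder hostname; substitute the endpoint your monitor checks.
    print(check_chain("example.com"))
```

A `verify-failed` result for a site that loads fine in a browser is consistent with a missing intermediate, but the same failure can come from an untrusted root or a hostname mismatch, so read the message before changing the server config.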
rate-limit-related noise
If your service rate-limits anonymous traffic, DevFlow's checks may receive 429 (Too Many Requests) responses. Service-account auth plus an IP allow-list (rest-api-authentication) is the usual fix on your side.
still seeing false positives?
    devflow monitor inspect <name> --window 7d --include-assertions

Prints assertion-by-assertion outcomes for every check. Pattern-match what's failing; it's usually obvious.