Measuring developer productivity IRL: practical tips for platform engineers
What should you measure, and how? Industry experts weigh in, sharing insights from their experience leading engineering organizations at scale.
September 27, 2024
5 mins
Alert fatigue is a problem that every SRE faces—too many false alarms, duplicated alerts, and unnecessary noise can wreak havoc on your ability to respond effectively. This post outlines practical strategies for managing alert fatigue, from adjusting thresholds and automating triage to maintaining clear on-call schedules.
You had to pass on that mountain hike with the boys because you’re on call this weekend. That’s okay, you’ll just catch up on a series while your wife is out at the salon with her friends. No alerts come through during the day—not so bad after all.
Just when you’re opening a bottle of wine for dinner at home, you get the dreaded alert. You ask your wife to finish up the steak while you run to your laptop and restart a service, hoping that’ll be the end of it. It works. You’re back downstairs, having a good time again.
"It’s okay, babe," your wife says when you have to run upstairs for the third time in the evening. All false alarms after that first incident. Your dinner got cold, and you missed the climax of the movie. You’re quite upset about this on-call shift.
Your phone will wake you up two more times during the night. If it’s another false alarm, the company phone better be insured because you’re throwing it out the window.
Alert fatigue is a very real issue for SREs and anyone who has to take on-call shifts. In this article you’ll learn about common causes, consequences, and strategies to prevent it.
Most organizations have a cap on the number of incidents a responder tackles before handing over the primary role to a colleague from the secondary rotation. That’s because mitigating a third incident in the same night is pretty tough for one person. It wouldn’t be reasonable to expect top-notch performance under these circumstances.
Alerts, however, are trickier. There’s no limit to how many you can get in one shift. Even subtle adjustments to your observability toolchain can lead to a flood of alerts in one day. Alert fatigue doesn’t creep in quietly over months; you can start noticing it during a single shift. Your attention to alerts diminishes, and you’re less ready to react after each one.
Alert fatigue stems from receiving too many alerts—whether due to duplicated alerts, false positives, or too much noise filtering through.
What do a 24-hour outage at GitHub and dozens of delayed flights at Delta have in common? Both were incidents caused by missed critical alerts. You tune your alerting system to be sensitive so that it doesn’t miss important issues. But if that sensitivity lets too much noise through, the misses simply move to the responder level, where a tired human overlooks the alert instead.
When you're bombarded with constant noise, it’s easy to overlook the one alert that actually matters. This can lead to serious consequences, like missing a critical service failure or a security breach that could have been prevented with better alert hygiene.
The more alerts you receive, the less ready you are to deal with them, which slows down your overall response time. You might put off acting on an alert because of recent false positives, and that delay can make an actual incident take longer to resolve than it should.
Dealing with a constant stream of unnecessary alerts can drain your mental energy, leaving you feeling exhausted and disengaged. Over time, this fatigue chips away at your productivity, making it harder to maintain focus and motivation in your day-to-day tasks.
As alert fatigue sets in, the likelihood of human error increases, and that directly impacts system reliability. If you're not operating at your best, even minor issues can snowball into bigger problems, putting the entire system at risk.
Tune your monitoring tools to avoid an overflow of alerts triggered by minor deviations. Set thresholds that reflect real performance issues, not just momentary blips, so your team isn’t responding to noise. However, keep a close eye on these thresholds and refine them constantly: if you loosen them too far, you’ll miss critical alerts.
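One common way to keep momentary blips from paging anyone is to require that a threshold stays breached for several consecutive samples before the alert fires. The sketch below illustrates that idea in plain Python. The class name `SustainedThreshold`, the 0.9 threshold, and the three-sample window are all assumptions chosen for illustration, not values from any particular monitoring tool.

```python
from collections import deque


class SustainedThreshold:
    """Fires only when a metric exceeds a threshold for N consecutive samples.

    Hypothetical helper for illustration; real monitoring tools usually
    expose this idea as a duration or evaluation-window setting.
    """

    def __init__(self, threshold: float, required_samples: int):
        self.threshold = threshold
        self.required_samples = required_samples
        # Keeps only the most recent `required_samples` breach flags.
        self._recent = deque(maxlen=required_samples)

    def observe(self, value: float) -> bool:
        """Record one sample and return True if the alert should fire."""
        self._recent.append(value > self.threshold)
        window_full = len(self._recent) == self.required_samples
        return window_full and all(self._recent)


# Example: CPU utilization samples with one isolated spike at the start.
rule = SustainedThreshold(threshold=0.9, required_samples=3)
samples = [0.95, 0.50, 0.92, 0.93, 0.97]
fired = [rule.observe(s) for s in samples]
```

With these sample values, the isolated spike at the start does not fire the alert, while the three consecutive breaches at the end do. The trade-off is detection delay: a longer window filters more noise but also postpones the page for a real problem.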
Ensure sufficient on-call coverage so engineers have time to recover between shifts, reducing alert overload and burnout. A well-balanced on-call rotation with defined handoff points helps prevent fatigue from creeping into the workweek. Consider offering additional time off to team members who go through difficult shifts so they can keep performing at their best.
Ensure alerts include enough context—such as estimated severity level, potential impact, and recommended actions—so responders can quickly assess the situation. An alert without the right information can lead responders to underestimate it or take the wrong path when starting an investigation.
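A lightweight way to enforce this is to check each alert definition for the context fields before it ships. The following is a minimal sketch, assuming a hypothetical `Alert` record whose field names (`severity`, `impact`, `recommended_action`) mirror the items listed above; your alerting tool's actual schema will differ.

```python
from dataclasses import dataclass

# Fields a responder needs in order to assess an alert quickly (assumed names).
REQUIRED_CONTEXT = ("severity", "impact", "recommended_action")


@dataclass
class Alert:
    """Hypothetical alert record used only for this illustration."""
    title: str
    service: str
    severity: str = ""
    impact: str = ""
    recommended_action: str = ""


def missing_context(alert: Alert) -> list[str]:
    """Return the names of required context fields that are empty."""
    return [name for name in REQUIRED_CONTEXT if not getattr(alert, name).strip()]


bare = Alert(title="High latency", service="checkout")
rich = Alert(
    title="High latency",
    service="checkout",
    severity="critical",
    impact="Customers cannot complete purchases",
    recommended_action="Check the payment gateway dashboard, then fail over",
)

bare_gaps = missing_context(bare)
rich_gaps = missing_context(rich)
```

A check like this could run in code review or CI for alert definitions, so an alert without severity, impact, and a suggested first step never reaches a responder.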
Leverage on-call automation tools to triage alerts automatically, allowing the system to escalate only the most critical issues to human responders. Modern on-call solutions also offer more granular alert routing, so you can make sure you ping only the most relevant contact for each alert.
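To make the triage idea concrete, here is a rough sketch of what such automation does conceptually: drop exact duplicates, page a human only for critical alerts, and send each alert to the team that owns the service. It does not represent how any specific on-call product works. The `IncomingAlert` type, the route table, and the team names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncomingAlert:
    """Hypothetical alert shape for illustration."""
    service: str
    check: str
    severity: str  # assumed values: "critical" or "warning"


# Assumed ownership map: service name -> on-call rotation to notify.
ROUTES = {"payments": "payments-oncall", "search": "search-oncall"}
DEFAULT_ROUTE = "platform-oncall"
PAGE_SEVERITIES = {"critical"}


def triage(alerts):
    """Split alerts into pages (wake someone up) and tickets (review later).

    Alerts with an identical (service, check, severity) fingerprint are
    treated as duplicates and only the first one is kept.
    """
    seen = set()
    pages, tickets = [], []
    for alert in alerts:
        fingerprint = (alert.service, alert.check, alert.severity)
        if fingerprint in seen:
            continue  # duplicate of an alert already handled
        seen.add(fingerprint)
        target = ROUTES.get(alert.service, DEFAULT_ROUTE)
        if alert.severity in PAGE_SEVERITIES:
            pages.append((target, alert))
        else:
            tickets.append((target, alert))
    return pages, tickets
```

For example, two identical critical latency alerts for `payments` would produce a single page to `payments-oncall`, while a warning for `search` would become a ticket rather than a page.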
Continuously review and optimize your alerting system to adapt to evolving infrastructure and team needs. What works today might not work tomorrow, so staying proactive with regular evaluations ensures your alerts remain relevant. Periodic tuning helps maintain a balance between catching critical issues and avoiding unnecessary interruptions.
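A periodic review is easier when it starts from data. The sketch below assumes you can export a history of alerts as `(rule_name, was_actionable)` pairs, where `was_actionable` records whether the responder actually had to do something. It then flags rules that are mostly noise. The function name, the 50% noise cutoff, the minimum sample size, and the rule names are illustrative assumptions.

```python
from collections import Counter


def noisy_rules(history, max_noise_ratio=0.5, min_alerts=5):
    """Return {rule: noise_ratio} for rules that mostly fire without need.

    `history` is an iterable of (rule_name, was_actionable) pairs.
    Rules with fewer than `min_alerts` firings are skipped, since a
    handful of events is too little to judge.
    """
    total = Counter()
    noise = Counter()
    for rule, actionable in history:
        total[rule] += 1
        if not actionable:
            noise[rule] += 1

    report = {}
    for rule, count in total.items():
        ratio = noise[rule] / count
        if count >= min_alerts and ratio > max_noise_ratio:
            report[rule] = ratio
    return report


# Made-up history purely for demonstration.
history = (
    [("disk_usage_high", False)] * 8
    + [("disk_usage_high", True)] * 2
    + [("api_5xx_rate", True)] * 6
    + [("cert_expiry", False)] * 2
)
flagged = noisy_rules(history)
```

In this made-up history, only `disk_usage_high` is flagged: eight of its ten firings needed no action. Rules flagged this way are candidates for a higher threshold, a longer evaluation window, or demotion from a page to a ticket.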
Rootly helps prevent alert fatigue by letting you group similar alerts to avoid noise and by helping you route alerts to the most relevant responder.
Book a demo with one of our reliability experts to learn how you can reduce duplicated alerts and keep your team motivated.