Most organizations today have an expansive set of tools to monitor their applications and services, tracking system metrics, events, logs, and more to keep abreast of how everything is doing. But it is humanly impossible to constantly supervise the dashboards of all these tools. So it makes sense that when a tool detects anything even remotely important, the team receives a notification about it. This, in turn, lets engineering teams know how reliable their systems are and be proactive about avoiding downtime.
But issues arise when engineers start getting flooded with alerts from their monitoring setup. The sheer volume of alerts that are merely informational, and not necessarily actionable, is much higher than the number of actual incidents that need immediate action.
So a typical day in the life of an on-call engineer means wading through an ocean of alerts on their incident management platform of choice. Engineers who have experienced this know how overwhelming it can get. The really important incidents start to get lost in the superfluous alert noise. This is Alert Fatigue.
Alert fatigue has become an increasingly painful and widespread problem for DevOps and SRE teams, given the sheer amount of data available to them. The whole point of having monitoring tools send alerts is to build a culture of proactive incident management, yet alert fatigue slowly chips away at that very objective.
You know you have a problem to fix when low-priority/warning alerts so greatly outnumber actionable ones that the real, high-severity incidents get detected much later, or not at all.
It follows that it is super important to make sure the on-call engineers who respond to these incidents are not overloaded with alert noise.
The problem then becomes finding a way to capture all the data while getting notified only about the actionable items, or, in essence, finding a tool that can distinguish between alerts and incidents.
No engineer wants to be woken up at 3 AM only to find out it's a false alarm.
Let’s take a look at this in an illustrative way.
This is Kevin, and he is an SRE (crowd cheers? Hahaha). He looks after services and makes sure they stay healthy. And to top it all off, he needs to do this without losing his sanity.
An alert woke him up. Another one woke him up even more.
And staying sane is a Herculean task when he's being woken up by a production alert at 1 AM.
He looks like a zombie himself, and the King of Pop's Thriller ringing on his phone keeps up with the theme of this unfortunate series of events.
Don't judge him. (Cause this is THRILLER 🧟 on loop).
So Kevin sees that the service has sent a warning for CPU usage. It will probably take a week for it to reach the critical stage. He takes steps to fix it by reaching out to his team, but the service keeps sending him notifications, disrupting his sleep.
While he understands that the alerting tool is just doing its job by pinging him ruthlessly until he wakes up to his responsibilities, he sees no reason to lose his sleep or sanity unless there's a serious production issue (he secretly prays that this isn't the case every time the phone rings).
Here's how he lost his sanity in just about an hour. I'm pretty sure he's a little sick of Thriller by now.
Timeline of D-Day:
Kevin sees that alerts are pouring in from Prometheus. He realises he can't keep dunking his phone in the coffee every time they flood in.
He decides to deal with the alert noise once and for all after resolving the prod issue.
Prometheus has been complaining about deployment rolling updates and some completely unrelated CPU usage issues every 10 seconds or so. He executes a runbook and fixes both issues (apparently this happens once every month).

Now he rolls up his sleeves and sets out to configure de-duplication rules for his alerts on his platform.
He sees that the payload for one specific alert has to do with the deployment of that service.
He writes a rule to de-duplicate the incident for deployment errors.
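The exact syntax varies from platform to platform, so here's a minimal Python sketch of the idea instead: repeated alerts that share the same fingerprint get folded into the incident that's already open, rather than paging Kevin again. The field names (`service`, `alertname`) and the 30-minute window are assumptions for illustration, not a real payload schema.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical in-memory store of open incidents, keyed by a "fingerprint"
# built from the parts of the alert payload worth de-duplicating on.
open_incidents: dict[tuple, dict] = {}

DEDUP_WINDOW = timedelta(minutes=30)   # assumed window; tune per service


def fingerprint(alert: dict) -> tuple:
    # Alerts describing the same underlying problem collapse to the same key.
    return (alert["service"], alert["alertname"])


def handle_alert(alert: dict) -> str:
    key = fingerprint(alert)
    now = datetime.now(timezone.utc)
    incident = open_incidents.get(key)

    # Same service + same alert inside the window: fold it into the open incident.
    if incident and now - incident["last_seen"] < DEDUP_WINDOW:
        incident["count"] += 1
        incident["last_seen"] = now
        return "deduplicated"          # no new page for Kevin

    # Otherwise open a fresh incident; this is the one that should notify.
    open_incidents[key] = {"count": 1, "last_seen": now}
    return "new incident"


# A second rolling-update alert from the same service gets folded in quietly:
handle_alert({"service": "checkout", "alertname": "DeploymentRollingUpdate"})  # "new incident"
handle_alert({"service": "checkout", "alertname": "DeploymentRollingUpdate"})  # "deduplicated"
```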
He writes a similar rule for the CPU usage alerts, and adds another that re-fires the incident only if the alert has occurred 50 times in a row.
The rule Kevin used for this:
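Again, platform syntax aside, the logic amounts to something like the sketch below. The 50-in-a-row threshold comes from Kevin's rule; the `notify()` helper and the payload field names are hypothetical placeholders for whatever his platform actually calls.

```python
from collections import Counter

CPU_RENOTIFY_THRESHOLD = 50            # Kevin's "50 times in a row" rule
cpu_alert_counts: Counter = Counter()


def notify(alert: dict, severity: str) -> None:
    # Placeholder for whatever actually pages Kevin (push, phone call, Thriller...).
    print(f"[{severity}] {alert['alertname']} on {alert['service']}")


def handle_cpu_alert(alert: dict) -> str:
    """De-duplicate repeated CPU usage warnings, but page again once the same
    alert has fired 50 times in a row, i.e. it clearly isn't going away on its own."""
    key = (alert["service"], alert["alertname"])   # illustrative fingerprint
    cpu_alert_counts[key] += 1
    count = cpu_alert_counts[key]

    if count == 1:
        notify(alert, severity="warning")          # first occurrence pages once
        return "new incident"
    if count % CPU_RENOTIFY_THRESHOLD == 0:
        notify(alert, severity="critical")         # escalate after every 50 repeats
        return "re-notified"
    return "deduplicated"                          # everything in between stays quiet
```

In practice the counting happens inside the incident management platform itself; the point is simply that only the first occurrence and every 50th repeat actually reach Kevin's phone.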
At least he won't hate Thriller now + No phone dunking + No coffee wastage + most importantly, No more alert noise!!!
TL;DR
Kevin finally manages to configure de-duplication rules for his Prometheus alerts and sets incident severities so that he's woken up only for the really _really_ important ones.
Kevin is smart. Be like Kevin.