Reducing On-call Alert Fatigue with Deduplication

January 8, 2020

What is alert fatigue?

Most organizations today have an expansive set of tools to monitor their applications and services, so that system metrics, events, logs, and the like are all tracked and teams can keep abreast of how their systems are doing. But it is humanly impossible to constantly supervise the various dashboards of these tools. So it makes sense that when these tools detect anything even remotely important, the team receives a notification about it. This in turn lets engineering teams know how reliable their systems are and be proactive in avoiding downtime.

But issues arise when engineers start to get flooded with alerts from their monitoring setup. Alerts that are mostly informational and not necessarily actionable far outnumber the actual incidents that need immediate action.

So a typical day in the life of an on-call engineer involves wading through an ocean of alerts on their incident management platform of choice. Engineers who have experienced this know how overwhelming it can get. The really important incidents start to get lost in the superfluous alert noise. This is Alert Fatigue.

Alert noise can kill on-call productivity

Alert fatigue has become an increasingly painful and widespread problem for DevOps and SRE teams, given the amount of data available to them. The whole point of using monitoring tools to send alerts is to build a culture of proactive incident management, but a constant flood of alerts slowly undermines that very objective.

You know you have a problem to fix if the volume of low-priority/warning alerts greatly exceeds the number of actionable alerts to such an extent that the real, high-severity incidents end up getting detected much later or not at all.
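One rough way to put a number on this is to count how many non-actionable alerts arrive for every actionable one. The snippet below is only a sketch: the `severity` field, its values, and the idea that only `critical` alerts are actionable are assumptions made for illustration, not something any particular monitoring tool guarantees.

```python
from collections import Counter

def noise_ratio(alerts, actionable_severities=("critical",)):
    """Return how many non-actionable alerts arrive per actionable alert.

    `alerts` is a list of dicts with a hypothetical "severity" field.
    """
    counts = Counter(alert["severity"] for alert in alerts)
    actionable = sum(counts[s] for s in actionable_severities)
    noise = sum(counts.values()) - actionable
    # With no actionable alerts at all, everything is noise.
    return noise / actionable if actionable else float("inf")

# Made-up sample: 95 informational/warning alerts and 5 critical ones.
sample = (
    [{"severity": "warning"}] * 45
    + [{"severity": "info"}] * 50
    + [{"severity": "critical"}] * 5
)
ratio = noise_ratio(sample)
print(f"{ratio:.0f} non-actionable alerts per actionable alert")
```

If that number keeps climbing week over week, it is a hint that the real incidents are getting harder to spot.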

It follows that it is very important to ensure that the on-call engineers who respond to these incidents are not overloaded with alert noise.

The problem then becomes finding a way to capture all the data while getting notified only about the actionable alerts or, in essence, finding a tool that can distinguish between alerts and incidents.

No engineer wants to be woken up at 3AM only to find out that it is a false alarm.

How Kevin Loses His Sanity Because of Alert Fatigue: An On-call Story

Let’s take a look at this in an illustrative way.

This is Kevin and he is an SRE (crowd cheers? Hahaha). He deals with services and makes sure they are healthy. And to top it all, he needs to do this while not losing his sanity.

An alert woke him up. Another one woke him up even more.

And this is a Herculean task when he is being woken up by a production alert at 1AM.

He looks like a zombie himself, and the King of Pop’s Thriller ringing on his phone keeps up with the theme of this unfortunate series of events.

Don't judge him. (Cause this is THRILLER 🧟 on loop).

So, Kevin sees that the service has sent a warning message for CPU usage. It will probably take a week for it to move into the critical stage. He takes steps to fix this by reaching out to his team, but the service continues to send him notifications, disrupting his sleep.

While he understands that the alerting tool is just doing its job by pinging him ruthlessly until he wakes up to his responsibilities, he sees no reason to lose his sleep or sanity unless there's a serious production issue (he secretly prays that this isn't the case every time the phone rings).

Here's how he lost his sanity in under three hours. I'm pretty sure he's a little sick of Thriller by now.

Timeline of D-Day:

  • 12:58:59AM Thriller
  • 01:00:22AM Sleep-deprived, yet slapping himself in the face to stay awake and read the audit logs
  • 01:21:31AM Wakes up from an unexpected snooze, finds the spacebar isn't working anymore due to a salivary short circuit
  • 01:30:01AM Copies spaces from websites using the mouse and pastes them into grep to filter logs
  • 01:36:03AM Eureka moment, followed by a thought of "Oh shoot, I'm desperate now"
  • 01:40:40AM Food delivery arrives. The high point of this incident so far.
  • 01:40:41AM Thriller
  • 01:47:12AM BURP
  • 01:52:15AM Coffee Refill.
  • 01:52:34AM Thriller
  • 02:00:44AM Thriller
  • 02:12:49AM Thriller
  • 02:33:52AM Thriller
  • 02:45:53AM Thriller
  • 02:52:53AM Thriller Thriller Thriller
  • 02:56:54AM Thriller Thriller Thriller Thriller Thriller
  • 03:03:00AM Plays dunk-the-phone-in-coffee. Sparks.
  • 03:08:17AM Wakes up the duck. Duck is not so thrilled.
  • 03:10:29AM Hot air to the face... either from the duck or the CPU exhaust
  • 03:27:05AM Manages to find the fix
  • 03:29:30AM Figures out that his phone survived the 6-inch dunk
  • 03:37:15AM Face hits the pillow as he contemplates throwing his phone out of the window

Kevin Configures De-duplication in Squadcast

Kevin sees that his alerts are pouring in from Prometheus. He realises that he can't keep dunking his phone in coffee whenever alerts flood in.

He decides to deal with the alert noise once and for all after resolving the prod issue.

He manages to configure deduplication rules on his platform.

Prometheus was complaining about deployment rolling updates and some completely unrelated CPU usage issues every 10 seconds or so. He executes a runbook and fixes both issues (apparently this happens once every month).

Now he rolls up his sleeves and decides to configure de-duplication for his alerts.

  • For deployment issues, he decides to group and de-duplicate alerts based on the impacted service.
  • For CPU usage issues, he decides to group and de-duplicate alerts based on the impacted service, but to create a new alert once the same event has occurred 50 times.
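To make the grouping idea concrete, here is a minimal sketch of de-duplicating alerts by impacted service and alert type. It is not Squadcast's rule syntax or Prometheus' payload format: the `service`, `kind`, and `message` fields and the service names are hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict

def dedup_key(alert):
    """Alerts that share this key are treated as duplicates of one incident."""
    return (alert["service"], alert["kind"])

def group_alerts(alerts):
    """Fold a list of alerts into one incident per dedup key."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[dedup_key(alert)].append(alert)
    return dict(incidents)

# Hypothetical alert payloads, trimmed to the fields used for grouping.
alerts = [
    {"service": "checkout", "kind": "deployment", "message": "rolling update stalled"},
    {"service": "checkout", "kind": "deployment", "message": "rolling update stalled"},
    {"service": "checkout", "kind": "cpu_usage", "message": "CPU above warning threshold"},
    {"service": "search", "kind": "deployment", "message": "rolling update stalled"},
]

incidents = group_alerts(alerts)
for key, grouped in incidents.items():
    print(key, len(grouped))
```

Four raw alerts collapse into three incidents here, and the on-call engineer gets paged once per incident instead of once per alert.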

He sees from its payload that one specific alert relates to the deployment of that service.

He writes a rule to de-duplicate the incident for deployment errors.

He writes a similar rule for the CPU usage alerts and adds another one to fire the incident again only once it has occurred 50 times in a row.
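The "fire again after 50 occurrences" behaviour can be sketched as a small counter per de-duplication key. This is an illustration under assumptions, not the platform's actual implementation: the class name, the `refire_after` parameter, and the choice to reset the counter after each re-fire are all hypothetical, and the sketch does not model incidents being resolved.

```python
from collections import Counter

class ThresholdDeduplicator:
    """Notify on the first alert for a key, then only after every N duplicates."""

    def __init__(self, refire_after=50):
        self.refire_after = refire_after
        self.open_keys = set()       # keys that already have an open incident
        self.suppressed = Counter()  # duplicates swallowed since the last notification

    def should_notify(self, key):
        if key not in self.open_keys:
            # First alert for this key: open an incident and notify.
            self.open_keys.add(key)
            return True
        self.suppressed[key] += 1
        if self.suppressed[key] >= self.refire_after:
            # Enough duplicates have piled up: notify again and start counting afresh.
            self.suppressed[key] = 0
            return True
        return False

dedup = ThresholdDeduplicator(refire_after=50)
key = ("checkout", "cpu_usage")  # hypothetical service and alert type
notifications = [dedup.should_notify(key) for _ in range(102)]
pages = sum(notifications)
print(f"{pages} pages for {len(notifications)} alerts")
```

In this run, 102 identical CPU alerts turn into 3 pages: the first alert, then one more after each batch of 50 duplicates.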

Rule that Kevin used for this:

At least he won't hate Thriller now + No phone dunking + No coffee wastage + most importantly, No more alert noise!!!

TL;DR

Kevin finally manages to configure de-duplication rules for his Prometheus alerts and sets incident severities so that he gets woken up only for the really, _really_ important ones.

Kevin is smart. Be like Kevin.

Written By:
Prakya Vasudevan
Akilan Elango
Last Updated:
October 4, 2024

Alert noise is a very common on-call complaint that leads to fatigue and burnout. This article is an attempt to help folks address the problem.
