Optimizing your alerts to reduce Alert Noise

April 13, 2020 (last updated October 4, 2024)
The word noise implies something unpleasant and unwanted. Combine that with on-call, and it adds a layer of annoyance to an already overwhelming process.

And this feeling doesn’t change whether you’re an old hand or just starting with your on-call duties. It’s difficult to stay motivated to be on top of things, especially when you get a ping or your phone rings for an incident that should not have paged you in the first place. 

Sometimes, the louder-than-life phone call is just about a “CPU has hit 50% usage” alert that you shouldn’t even be worried about. The frequency of informational alerts can drown out the valid critical ones. This is called Alert Fatigue.

This post outlines ways to minimize alert fatigue so that you don’t get woken up for alerts that can wait.

Alert noise reduction at the monitoring system

Reducing alert fatigue starts at your monitoring platform: setting the right thresholds for triggering alerts and deciding which of them actually need to reach your on-call platform is a good start.

  • Setting the right alerts

Collecting metrics is an important part of improving your observability, and hence your reliability. However, just because your sophisticated monitoring / observability platform can track 273 parameters doesn’t mean you need to set up alerts on all of them. Set up meaningful alerts on the signals that are core to your system’s reliability, and collect the rest as non-alerting data that can be used for preemptive analysis. This way, only the alerts that need immediate action trigger a notification from your on-call tool, while the rest are simply recorded to add context.
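
For example, with Prometheus you could alert only on an SLO-level symptom and keep supporting signals as recorded, non-alerting data for later analysis. This is a minimal sketch, assuming node_exporter-style CPU metrics and a generic http_requests_total counter; the metric names and thresholds are placeholders, not a prescription:

groups:
  - name: core-reliability
    rules:
      # Page only on the user-facing symptom that threatens the SLO
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
      # Keep CPU as recorded (non-alerting) data for preemptive analysis
      - record: instance:cpu_usage:avg5m
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)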

You can always review your SLO metrics, on-call command center, or monitoring dashboard (as frequently as makes sense to you) to check the overall health of the system.

  • Setting the right threshold

Even with the right alerts in place, you can end up with a flood of notifications only to find that the metric went back to normal quickly, either because of flapping or because of a temporary spike in user activity during the busiest part of the day.

In these cases, observe the behavior for a while and raise the threshold slightly above the usual flapping range. Say you have a CPU alert set at 70% but you see the value regularly flap between 69.8% and 70.8%; you can safely move the threshold to 71% or even 72% so that flapping and temporary spikes no longer generate unnecessary alerts.

It’s also a good idea to set up incremental alerts. For the same example, an additional alert when CPU usage hits 80% tells you that something is genuinely abnormal, perhaps a sudden increase in users or system load that requires you to scale your infrastructure. If these incremental alerts keep firing, it’s a clear indication that urgent action is needed.
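
Roughly, those two thresholds could be expressed as Prometheus alerting rules with a "for" duration, so that brief flaps don’t page anyone. This is a sketch assuming node_exporter-style CPU metrics; adapt the expression to whatever your platform actually exposes:

groups:
  - name: cpu-alerts
    rules:
      # Warning threshold raised just above the observed flapping range (69.8%–70.8%)
      - alert: CPUUsageHigh
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 72
        for: 10m
        labels:
          severity: warning
      # Incremental alert: something is genuinely abnormal and likely needs scaling
      - alert: CPUUsageCritical
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: critical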

Alert noise reduction with your on-call alerting tool

Setting the right alerts and threshold values in your monitoring tool will cut out a lot of noise. Another layer of noise reduction and alert optimization can be applied in your on-call tool.

Here we focus on Squadcast-specific features that can help you with alert noise reduction. However, you should find similar options in other incident management or on-call tools you’re currently using. If you’re yet to decide on an on-call tool, it would be wise to check whether it supports alert noise reduction before you commit to one.

  • Merging Duplicate Incidents

In most cases, the same alerts come in repeatedly, and if you’ve set this type of alert to notify your on-call team, it can get annoying very quickly.

For example, let's say there is a Prometheus alert rule that checks disk usage every hour and triggers an alert if it is above 50%, and another rule that triggers an alert if disk usage is above 70%.

The first rule is a heads-up that disk usage has crossed the halfway mark, and that over the next few days or weeks you should clean up log files to free up space or add more storage capacity. The second rule tells you that the cleanup or capacity addition needs to happen immediately (within a few hours) to maintain your system’s reliability.

However, getting that hourly alert for 50% disk usage until it crosses the 70% mark is not just annoying, it is also not very helpful, especially if it takes more than a couple of days to reach the next level. To ensure these warning alerts aren’t constantly calling the on-call engineer, you can configure deduplication rules so the on-call system knows how to merge duplicate events and notifies you only the first time.

In Squadcast, for each monitoring tool integrated with a service, you can set up deduplication rules based on any key-value pairs in the alert JSON: the incident title, description, hostname, or any other information available. You define these rules based on your monitoring needs, and Squadcast provides the platform to configure them your way, with support for regex and logical operations that can be combined into more complex expressions.
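
For the disk usage example above, a deduplication rule could express something like the following. The notation here is purely illustrative (modeled on the tagging rules shown later in this post); the exact fields and operators depend on your alert payload and your tool’s rule syntax:

If,
(incoming.hostname == existing.hostname && re(incoming.message, "^Disk usage above 50%"))
then,
merge the incoming alert into the open incident and notify only on the first occurrence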

  • Set up Tagging to route incidents to the right person(s)

Each service has its own team and an escalation matrix associated with it. However, not all alerts are equal: some are less important, some are critical, some need people from different teams, some need to be sent to the customer-facing team, some require management involvement, and so on.

So, apart from the default escalation policy associated with a service, you can use our Tagging rules as an engine to classify incidents and automatically route them to the right responder. Similar to the deduplication rules above, you can set up key:value pair tags based on the alert JSON and assign a color to each tag. You can then use a tag to override the default escalation policy and replace it with a user, a different escalation policy, or a squad.

This opens up a lot of possibilities in the way you handle incident management today. Just to explain the extent of flexibility this provides, here are a few examples:

Example 1:

Service: Infrastructure (SRE)
Escalation Policy: 
1st Layer - Primary on-call person(s)
2nd Layer - Secondary on-call person(s)
3rd Layer - The entire SRE squad
4th Layer - Management

Let's say a CPU alert of 70% usage is received for your backend or billing systems. Note that this is a high-severity incident and is definitely not the same as the “CPU usage above 50%” alert: your application cannot serve your users and the billing portal isn’t functioning. This needs to be resolved immediately, and you’ll need an SME involved. Waiting for the incident to progress through the usual on-call escalation would only delay resolution and cause more customer impact. You can set up your tagging and routing rules to accommodate such high-severity scenarios. Here’s what the Tagging and Routing rules would look like.

Tagging Rules:

If,
(payload.meta.cpu >= 70 && re(payload.meta.hostname, "^backend-server.*"))
then setup tags,
severity:critical (color:Red)
notify:sre-team

If,
(payload.meta.cpu >= 70 && re(payload.meta.hostname, "^billing.*"))
then setup tags,
severity:critical (color:Red)
notify:billing-critical-escalation


Routing Rules:

If,
(tags.severity = "critical" && tags.notify = "sre-team")
then route incident to,
sre-squad

If,
(tags.severity = "critical" && tags.notify = "billing-critical-escalation")
then route incident to,
Critical Billing Escalation policy


In this example, if the backend server reaches a critical level of CPU usage, we notify the entire SRE squad immediately. If the same happens on the billing server, we notify the Critical Billing Escalation policy, which may differ from the service’s default escalation policy shown above.

Example 2:

In Example 1, we saw how the entire team is notified for a critical incident. This example applies a similar approach to less severe incidents, where we can choose to notify just one person instead of an entire escalation policy or team.

This example is an actual use case we practice within Squadcast.

We set up our MongoDB Atlas alerts, specifically for query targeting:

If the query targeting value is less than 2000, the Tag “severity:low” is attached to the incident and it is automatically routed to the junior engineer responsible for optimizing the database queries.

If the query targeting value is above 2000, the Tag “severity:high” is attached to the incident and it is automatically routed to the senior engineer who will then optimize the complex database queries.
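
Based on that description, the rules might look roughly like the following sketch, using the same notation as Example 1. The payload field holding the query targeting value and the responders are placeholders, not actual Squadcast configuration: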

Tagging Rules:
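
If,
(payload.queryTargeting < 2000)
then setup tags,
severity:low (color:Yellow)

If,
(payload.queryTargeting > 2000)
then setup tags,
severity:high (color:Red)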

Routing Rules:
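
If,
(tags.severity = "low")
then route incident to,
<junior engineer responsible for query optimization>

If,
(tags.severity = "high")
then route incident to,
<senior engineer handling complex queries>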

These are just two of many ways you can choose to use Tagging and Routing rules. This will help you streamline your incident response process and get your MTTR down significantly.

  • Suppress not-so-important incidents

If you still want some alerts sent to your on-call tool, alerts that are worth recording but don’t need to page anybody, you can set up suppression rules in Squadcast.

You can define a suppression rule based on the content of the message or description of the incident. Any incident for that specific service matching the configured rules will be suppressed and nobody will be notified. This will still be recorded in Squadcast for future reference.
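
As an illustration, a suppression rule could match purely informational events by their message content. Again, this is illustrative notation only; the field names depend on your alert payload:

If,
re(payload.message, "^\[INFO\]")
then,
suppress the incident (record it in Squadcast, but notify no one)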

Similarly, you can set up maintenance mode (one-time or recurring) for a service and any alerts for the service during such maintenance windows will be automatically suppressed.

We hope these practices help you reduce alert noise and improve your on-call experience. We’d love to hear about other best practices you follow to make on-call better.

Written By: Raghu Chinnannan
