Faster Incident Resolution with Context Rich Alerts

June 9, 2021
A frequent problem faced by on-call engineers when critical outages occur is pinpointing the exact point of failure. Even though modern monitoring tools and incident management platforms provide context around each alert, there is still room for improvement. A relatively simple solution is to add labels to your alert payloads.

As an on-call engineer, you may have faced a situation where a major alert took a long time to triage, often because the alert payload was missing crucial information such as the hostname or cluster details in a Kubernetes setup.

In this blog, let's look at how adding labels that carry this important information in the payload can reduce MTTR (Mean Time to Resolve).

We’ll also explore how Prometheus Alertmanager, together with Squadcast’s Routing and Tagging Rules, can ensure that alert payloads with specific labels are sent to the concerned engineers for faster remediation of issues.

Note: A payload label can be used to classify the payload data and identify crucial information. While we cover a Kubernetes-specific example in this blog, the same approach works with other monitoring tools as well.

The screenshot below shows an example of a basic payload (without any labels).
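For illustration, a bare payload of this kind, sketched here in the labels/annotations shape that Prometheus Alertmanager uses (all names and values are hypothetical), might carry little more than:

# Illustrative only: a bare alert with almost no infrastructure context
labels:
  alertname: HighMemoryUsage
  severity: warning
annotations:
  summary: "Memory usage above threshold"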

As an on-call engineer, you will need more details about the alert such as:

  • IP address / hostname
  • Cluster identification, etc.

Since this information is not available in the payload, you have to manually fetch the IP address before you can troubleshoot the issue.

Your life would have been simpler had the alert itself included details such as the IP address, hostname, application name, severity level and environment name. You could also have ignored the alert if it came from a test/staging environment, since the payload would have carried environment-related labels.
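For comparison, a context-rich version of the same hypothetical alert might carry labels like these (again, every name and value here is illustrative, not taken from a real setup):

# Illustrative only: the same alert enriched with context labels
labels:
  alertname: HighMemoryUsage
  severity: warning
  environment: production
  hostname: worker-node-3
  cluster: prod-k8s-eu-west
  application: pubsub
  service-owner: john
annotations:
  summary: "Memory usage above threshold on worker-node-3"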

There is a relatively simple way to add labels to the payload using your preferred monitoring tool. In the following example, we will use Kubernetes labels along with Prometheus and Alertmanager to build context-rich alert payloads.

Below is an example of a Kubernetes Deployment manifest that carries labels for context-rich payloads.

Kubernetes Deployment manifest with labels

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pubsub
  namespace: laddertruck
  labels:
    name: "laddertruck"
    language: "ruby"
    language-version: "3.0.0"
    framework: "rails"
    framework-version: "5.2.1"
    team: "xyz"
    developed-by: "diane"
    service-owner: "john"

Note the various labels mentioned above.

How to decide which labels to add to the payload?

The label names will depend on your technology stack and the on-call team you have in place. Which labels to use should be decided by the on-call team, since they are the first responders when a critical outage occurs.

The labels listed below are some common ones your team can use to get started; a small sketch follows the list:

  • owner: This tag identifies the owner of the service.
  • language: The programming language the service is written in.
  • framework: The framework on which the service is built. This is vital if there are multiple services written in the same language but utilising different frameworks.
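As a minimal sketch, these starter labels could sit alongside the other metadata labels shown earlier; the values below are placeholders, not taken from the original example:

# Illustrative starter labels; adapt the names and values to your own stack
metadata:
  labels:
    owner: "john"        # hypothetical service owner
    language: "ruby"
    framework: "rails"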

So now, let's set up an alerting rule in Prometheus; when it fires, the alert is sent on to Alertmanager:

- alert: ContainerMemoryUsage
  expr: (sum(container_memory_working_set_bytes{namespace="default"}) BY (instance, name) / sum(container_spec_memory_limit_bytes{namespace="default"} > 0) BY (instance, name) * 100) > 90
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "Container Memory usage (instance {{ $labels.instance }})"
    description: "Container Memory usage is above 90%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

In the rule above, we define the alert condition and attach a severity label to every alert it fires.
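If different thresholds warrant different responses, a companion rule with a higher severity is a common pattern. The sketch below reuses the same expression with an arbitrarily chosen threshold and duration:

# Sketch: a critical-severity variant of the same rule (threshold and duration are illustrative)
- alert: ContainerMemoryUsageCritical
  expr: (sum(container_memory_working_set_bytes{namespace="default"}) BY (instance, name) / sum(container_spec_memory_limit_bytes{namespace="default"} > 0) BY (instance, name) * 100) > 95
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Container Memory usage critical (instance {{ $labels.instance }})"
    description: "Container Memory usage is above 95%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"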

So now, when an alert is sent to Squadcast from Alertmanager, all the relevant information is embedded in the payload: the severity, the deployment details and the other labels defined above. We can then use Squadcast Routing Rules to efficiently route the incident to the concerned person or team.
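The hand-off from Alertmanager to Squadcast is a standard webhook receiver. A minimal sketch of the Alertmanager configuration might look like the following; the URL is a placeholder that you would replace with the endpoint generated by your Squadcast service integration:

# alertmanager.yml (sketch; the webhook URL below is a placeholder)
route:
  receiver: 'squadcast'
  group_by: ['alertname', 'severity']

receivers:
  - name: 'squadcast'
    webhook_configs:
      - url: 'https://<your-squadcast-webhook-endpoint>'
        send_resolved: true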

Furthermore, we can use the annotations block of the alerting rule to send much more detailed and lengthy information in the alert payload, such as a link to the relevant runbook. In organisations that rely on playbooks/runbooks, on-call engineers can then start troubleshooting right away rather than searching for the relevant playbook when an incident occurs.
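As a sketch, a runbook link can be added as just one more annotation on the rule shown earlier; runbook_url is a widely used annotation name in the Prometheus community, and the URL here is hypothetical:

  annotations:
    summary: "Container Memory usage (instance {{ $labels.instance }})"
    runbook_url: "https://wiki.example.internal/runbooks/container-memory"   # hypothetical internal runbook page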


As seen here, the alert notification from Prometheus contains a link to the internal runbook.

Configuring Squadcast with the help of labels

In the screenshot below, you can see how to configure Squadcast to route alerts based on the labels attached to the incident payload. Here, the service we are configuring is called “K8s Cluster Monitoring”.

‘Tagging Rules’ define the labels that Squadcast will process. Once the ‘tags’ have been defined, Routing Rules can be used to send alerts to the concerned person, team or escalation policy.

All the labels defined in the payload are now recognised as ‘tags’ in Squadcast. Once this step is complete, we can define custom routing rules based on the ‘tags’.

In the screenshot above, we have created three tags based on the following criteria: “service-owner”, “squad” and “deployed-by”.

Now that these tags have been defined, we can go on to create granular incident routing rules for the service. In the screenshot above, we are creating a routing rule where, if the alert payload has ‘service-owner’ defined as ‘john’, the alert notification is sent directly to John.

Above, we can see a routing rule being created for alert payloads that carry the ‘team’ label, with the alert notification getting routed to that specific team.

In this instance, the alert will be routed to the person who deployed the feature (‘Diane’).

Previously, we saw alerts getting routed to specific individuals; however, it is also possible to route alerts to predefined escalation policies. The screenshot above shows an example of this.

Choose multiple tags and boolean operators (‘AND’) to make routing rules as specific as possible and to cover as many scenarios as possible.

In Squadcast, Routing Rules are executed in a top-down approach. Once a rule matches, the remaining rules are automatically ignored. The Execution Priority feature helps define the order in which the rules are evaluated.

Other Benefits of context-rich alerts

Having contextual information in each payload is of great help during post-mortems/reviews after major outages, since detailed incident timelines are created automatically. While the concept of context-rich alert payloads may seem simple, in the long term it can help improve the reliability of your system.

Below, we can see an incident in Squadcast with its associated labels. With this rule-based auto-tagging system, you can define customised tags based on incident payloads, and they get automatically assigned to incidents when they are triggered.

Conclusion

In many cases it is not feasible to reduce the ad-hoc complexity of your existing architecture. This is where the combination of context-rich alerts and intelligent routing helps drastically reduce both MTTA (Mean Time to Acknowledge) and MTTR.

The example provided above is just the tip of the iceberg - you can create custom labels and routing rules as well. As your infrastructure scales up with new users and dependencies, your MTTR will still be within acceptable limits thanks to better labeling and routing.

What do you struggle with as a DevOps/SRE? Do you have ideas on how incident response could be done better in your organization? We would be thrilled to hear from you! Leave us a comment or reach out over a DM via Twitter and let us know your thoughts.

Written By: Roshan Shetty, Nir Sharma
Last Updated: October 4, 2024
