A frequent problem on-call engineers face during critical outages is pinpointing exactly where the failure occurred. Even though modern monitoring tools and incident management platforms provide context around each alert, there is still room for improvement. A relatively simple solution is to add labels to your alert payloads.
As an on-call engineer, you may have faced a situation where a major alert took a long time to triage, often because the alert payload was missing crucial information such as the hostname or cluster details in a Kubernetes setup.
In this blog, let's look at how adding labels to surface important information within the payload can reduce MTTR (Mean Time to Respond).
We’ll also explore how Prometheus Alertmanager and Squadcast, with Routing and Tagging Rules, can ensure that alert payloads with specific labels are sent to the concerned engineers for faster remediation of issues.
Note: A payload label can be used to classify the payload data and identify crucial information. While we cover a Kubernetes-specific example in this blog, the same approach works with other monitoring tools as well.
The screenshot below is an example of a basic payload (without any labels).
As an on-call engineer, you would need more details about the alert, such as the IP address of the affected host, the cluster it belongs to, and the severity of the issue. Since this information is not available in the payload, you have to manually fetch the IP address before you can start troubleshooting the issue.
Your life would have been simpler if details such as the IP address, hostname, application name, severity level, and environment name were included within the alert itself. You could also have ignored the alert if it came from a test/staging environment, since the payload would have carried environment-related labels.
There is a relatively simple way to add labels to the payload using your preferred monitoring tool. In the following example, we will use Prometheus and Alertmanager to build context-rich alert payloads.
The screenshot below shows an example of a Prometheus Alertmanager configuration file that uses labels for context-rich payloads.
Prometheus Alertmanager config file
Note the various labels mentioned above.
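One common way to attach this kind of cluster-wide context to every alert Prometheus forwards to Alertmanager is the external_labels block in prometheus.yml. The snippet below is a minimal sketch rather than the exact file from the screenshot; the label names and values are illustrative.

```yaml
# prometheus.yml (sketch) -- external_labels are attached to every alert
# this Prometheus server sends to Alertmanager; names and values are illustrative
global:
  scrape_interval: 30s
  external_labels:
    cluster: prod-k8s-eu-west      # which Kubernetes cluster fired the alert
    environment: production        # prod / staging / test
    region: eu-west-1              # where the cluster runs
```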
Label names will depend on the technology stack and the on-call team you have in place. Which labels to use should be decided by the on-call team, since they are the first responders when a critical outage occurs.
The labels shown below are some common ones your team can use to get started:
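For example, a starter set might look like the following. The names and values are illustrative (they mirror the tags used later in this post), and note that Prometheus label names conventionally use underscores rather than hyphens.

```yaml
# A starter set of labels attached to an alert (illustrative values)
labels:
  severity: critical         # how urgent the alert is
  environment: production    # prod / staging / test
  cluster: prod-k8s-01       # Kubernetes cluster the alert came from
  service_owner: john        # engineer who owns the affected service
  squad: platform            # team responsible for the service
  deployed_by: diane         # engineer who shipped the latest change
```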
So now, let's set up an alerting rule in Prometheus (the alerts it generates are forwarded to Alertmanager):
In the rule above, we define the firing condition and attach a severity label to the alert.
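As a sketch, such a rule in a Prometheus rule file might look like this; the metric, threshold, and label values are illustrative, not the exact rule from the screenshot.

```yaml
# rules/k8s-alerts.yml (sketch) -- metric, threshold and label values are illustrative
groups:
  - name: kubernetes
    rules:
      - alert: HighPodMemoryUsage
        expr: container_memory_working_set_bytes{namespace="production"} > 1e9
        for: 5m
        labels:
          severity: critical        # picked up by Squadcast tagging/routing rules
          team: platform            # owning team (illustrative)
          deployment: checkout-api  # affected deployment (illustrative)
```

Prometheus evaluates the expression, and once it has been true for five minutes the alert fires with these labels attached and is forwarded to Alertmanager.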
Now, when an alert is sent from Alertmanager to Squadcast, all the relevant information is embedded in the payload: severity, deployment, and the other labels defined in the Prometheus configuration. We can use Squadcast Routing Rules to efficiently route the incident to the concerned person or team.
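For completeness, here is a sketch of how Alertmanager might forward these alerts to Squadcast through a webhook receiver. The URL is a placeholder for the endpoint Squadcast generates when you add the Prometheus/Alertmanager integration to your service, and the grouping intervals are illustrative.

```yaml
# alertmanager.yml (sketch) -- the webhook URL is a placeholder
route:
  receiver: squadcast
  group_by: ['alertname', 'cluster', 'severity']
  group_wait: 30s
  repeat_interval: 4h

receivers:
  - name: squadcast
    webhook_configs:
      - url: 'https://<your-squadcast-webhook-url>'   # copy from the Squadcast integration page
        send_resolved: true                           # also notify when the alert resolves
```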
Furthermore, we can use the annotations option in the alerting rule to include much more detailed, longer-form information in the alert payload. In organisations that rely on playbooks/runbooks, on-call engineers can start troubleshooting right away rather than searching for the relevant playbook when an incident occurs.
As seen here, the alert notification from Prometheus contains a link to the internal runbook.
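As a sketch, such annotations could be added to the alerting rule alongside its labels; the runbook URL below is a placeholder, and the templated text is illustrative.

```yaml
# annotations block added to the alerting rule above (sketch; URL is a placeholder)
annotations:
  summary: "{{ $labels.deployment }} is using excessive memory in {{ $labels.cluster }}"
  description: >-
    Pod memory usage in {{ $labels.deployment }} has stayed above 1GiB for
    more than 5 minutes. Check recent deploys and resource limits.
  runbook_url: "https://wiki.example.com/runbooks/high-pod-memory"
```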
In the screenshot below, you can see how to configure Squadcast to route alerts based on the labels attached to the incident payload. Here, the service we are configuring is called “K8s Cluster Monitoring”.
‘Tagging Rules’ are used to define the labels that will be processed by Squadcast. Once the ‘tags’ have been defined, Routing Rules can be used to send alerts to the concerned person, team, or escalation policy.
All the labels defined in the payload are now recognised as ‘tags’ in Squadcast. Once this step is complete, we can define custom routing rules based on the ‘tags’.
In the screenshot above, we have created three tags: “service-owner”, “squad”, and “deployed-by”.
Now that these tags have been defined, we can go on to create granular incident routing rules for the service. In the screenshot above, we are creating a routing rule where, if the alert payload has ‘service-owner’ defined as ‘john’, the alert notification is sent directly to John.
Above, we can see a routing rule being created for when the alert payload carries the ‘team’ label; the alert notification is then routed to that specific team.
In this instance, the alert will be routed to the person who deployed the feature (‘Diane’), based on the ‘deployed-by’ tag.
Previously, we have seen alerts getting routed to specific individuals; however, it is also possible to route alerts to predefined escalation policies, as shown in the screenshot above.
Combine multiple tags with boolean operators (‘AND’) to make routing rules as specific as possible and to cover as many scenarios as possible.
In Squadcast, Routing Rules are evaluated top-down: once a rule matches, the remaining rules are automatically ignored. The Execution Priority feature helps define the order in which the rules are evaluated.
Having contextual information around each payload is a great help during post-mortems and reviews after major outages, since detailed incident timelines are created automatically. The concept of context-rich alert payloads may seem simple, but in the long term it can help improve the reliability of your system.
Below, we can see an incident in Squadcast with its associated labels. With this rule-based auto-tagging system, you can define customised tags based on incident payloads that are automatically assigned to incidents when they are triggered.
In many cases, it is not feasible to reduce the ad-hoc complexity of your existing architecture. This is where the combination of ‘context-rich alerts’ and ‘intelligent routing’ helps drastically reduce MTTA (Mean Time to Acknowledge) and MTTR.
The example provided above is just the tip of the iceberg: you can create your own custom labels and routing rules as well. As your infrastructure scales with new users and dependencies, better labeling and routing will help keep your MTTR within acceptable limits.
What do you struggle with as a DevOps/SRE? Do you have ideas on how incident response could be done better in your organization? We would be thrilled to hear from you! Leave us a comment or reach out over a DM via Twitter and let us know your thoughts.