Are you an SRE or On-call engineer struggling to manage toil?
Toil is any repetitive, monotonous activity that can lead to frustration within an incident management team. At the business level, toil adds no functional value toward growth or productivity.
However, toil can be tackled with simple but effective automation strategies across every stage of the incident management process.
In this blog, we dig deeper into how to reduce toil by defining better IT alerting strategies within an alert management system.
Google’s SRE workbook defines toil as,
"the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows."
To reduce toil, we should first learn the characteristics of toil (identify it) and calculate the time spent resolving incidents manually (measure it).
Identifying Toil is basically understanding the overall characteristics of a routine task. It can be done by evaluating a task on the basis of:
Measuring Toil is simply computing human time spent on each toilsome activity.
It is done by analyzing certain trends:
With this analysis, we can prioritize toil to strike a balance between production tasks and routine operational tasks.
Note: The goal is to ensure that toil does not occupy more than 50% of an SRE’s time. This keeps the team focused on production-related engineering work.
Before we look into the causes of toil in detail, let’s take a quick look at its after-effects.
Whether it is an incident management task or any other activity, doing the same thing repeatedly over time often leaves you discontented with the job you do.
In some cases, toil even increases the attrition rate due to burnout, boredom, and alert fatigue among SREs, which can eventually slow down the overall development process.
Let's find out ways to reduce toil by first looking into the various causes that contribute to toil.
If alerts are repetitive and need to be resolved manually, managing them becomes a tiring task. Say your system notifies you that web requests at 6 AM are 3x higher than usual: that indicates healthy traffic to your website, but it poses no threat to the architecture. Such alerts merely report on system performance and need no manual intervention. Spending time suppressing these trivial alerts means risking missing the important ones that genuinely need manual attention, and manually suppressing too many alerts only adds to the toil.
Automation is key to IT alerting and to reducing toil at every stage of alert configuration. If an alert response can be automated, it should be automated on a priority basis. This goes a long way toward reducing alert noise.
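One simple automation at configuration time is to label alerts by severity so that purely informational signals, like the 6 AM traffic spike above, are routed to a dashboard or chat channel instead of paging anyone. Below is a minimal sketch written as a Prometheus alerting rule; the metric name http_requests_total and the 3x factor are placeholders, not a prescribed configuration.

```yaml
groups:
  - name: traffic-info
    rules:
      # Purely informational: traffic is roughly 3x its level at the same time
      # yesterday. Tagged severity: info so the alert router can send it to a
      # chat channel or dashboard instead of paging the on-call engineer.
      - alert: WebTrafficSpike
        expr: >
          sum(rate(http_requests_total[10m]))
            > 3 * sum(rate(http_requests_total[10m] offset 1d))
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "Web traffic is ~3x higher than the same time yesterday"
```

The decision about what pages a human and what merely gets logged then lives in the alert router’s configuration rather than in anyone’s head.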
A poorly configured alerting system generates either too many alerts or none at all. Both problems stem from sensitivity issues within the architecture.
Sensitivity issues come in two forms: over-sensitivity (marginal sensitivity) and under-sensitivity. Over-sensitivity is a condition where the system sends too many alerts; it occurs when alert conditions are set marginally, right at threshold levels.
For example, when the alert for response-time degradation in a database service is set at exactly 100 ms (an absolute value), even the slightest fluctuation generates a flood of alerts. Rather than setting such marginal conditions, we can use relative values, such as alerting only when response time degrades by more than 50% from its usual level.
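As a rough illustration, here is how the two styles differ when written as Prometheus alerting rules. This is only a sketch: db_avg_response_time_seconds is a placeholder gauge, and the 100 ms and 50% figures simply mirror the example above.

```yaml
groups:
  - name: db-latency
    rules:
      # Marginal, absolute threshold: fires on the slightest fluctuation
      # around the 100 ms mark and quickly becomes noisy.
      - alert: DBResponseTimeHigh
        expr: db_avg_response_time_seconds > 0.1
        for: 1m
        labels:
          severity: warning

      # Relative threshold: fires only when latency degrades by more than 50%
      # compared to the service's own average over the past day.
      - alert: DBResponseTimeDegraded
        expr: >
          db_avg_response_time_seconds
            > 1.5 * avg_over_time(db_avg_response_time_seconds[1d])
        for: 10m
        labels:
          severity: warning
```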
On the other hand, under-sensitivity is a condition where the system does not send any alerts at all, which poses a bigger problem: an issue can go undetected, and you risk running into a major outage with no means of getting to the root cause. In this case, the system might require re-engineering to root out such sensitivity issues.
Latency, Traffic, Errors, and Saturation are the golden signals of SRE that help in monitoring a system. Variations such as USE (Utilization, Saturation, and Errors) and RED (Rate, Errors, and Duration) can also be used to measure the key performance of the architecture.
While setting up alerts, the utilization of the database, CPU, and memory has to be estimated and optimized against these vital SRE signals.
For example, if the average load on the infrastructure consistently runs at 1.5x what the available CPU count can handle, the system will trigger an unusual number of alerts. This comes from not having proper optimization in place, and ignoring such basic saturation levels generates abnormalities that can ultimately result in outages.
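A saturation alert along these lines might look like the sketch below, assuming the standard node_exporter metrics (node_load5, node_cpu_seconds_total); the 1.5x factor is illustrative.

```yaml
groups:
  - name: node-saturation
    rules:
      # Fires when the 5-minute load average stays above 1.5x the number of
      # CPU cores, i.e. the machine is asked to do more work than it can schedule.
      - alert: HighCPUSaturation
        expr: >
          node_load5
            > on(instance) 1.5 * count by (instance) (node_cpu_seconds_total{mode="idle"})
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Load on {{ $labels.instance }} is above 1.5x its CPU count"
```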
Insufficient information in an alert means the system is struggling with a particular set of instructions but is not telling you specifically what the ongoing situation is. This leads to the extra toil of figuring out where the problem exists and what is contributing to an outage.
Let’s say you receive an alert stating “instance i-dk3sldfjsd CPU utilization high”. This alert does not convey enough information about the incident, such as the IP address or hostname. With such minimal information, the on-call engineer cannot respond to the incident directly; they may have to open the AWS console just to figure out the server’s actual IP address before troubleshooting can even begin. In this scenario, the time taken to log on to the server and resolve the issue is substantially higher.
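One way to avoid this is to enrich the alert with context at configuration time. The sketch below assumes a hypothetical cpu_usage_percent metric whose series carry hostname and private_ip labels, and the runbook URL is a placeholder; the point is simply that whatever identifying labels your exporters provide can be surfaced in the alert’s annotations.

```yaml
groups:
  - name: cpu-usage
    rules:
      - alert: InstanceCPUUtilizationHigh
        expr: cpu_usage_percent > 90
        for: 10m
        labels:
          severity: critical
        annotations:
          # Surface everything the responder needs, so nobody has to open the
          # cloud console just to find out which box is misbehaving.
          summary: "High CPU on {{ $labels.hostname }} ({{ $labels.instance }})"
          description: >
            CPU utilization is {{ $value }}% on {{ $labels.hostname }}
            (private IP {{ $labels.private_ip }}).
            Runbook: https://example.com/runbooks/high-cpu
```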
While configuring alerts, instead of setting tight thresholds, look at the “Trend/Historical Rolling Number” of system performance. This can be done by calculating the rate of change in system performance, which gives a clear picture of where to set the right thresholds. Almost all modern monitoring systems can record this rate of change.
For example, consider conditions like CPU utilization consistently greater than 70-80%, server response time above 4-6 ms, or a log query count greater than 100-125. Instead of alerting on every breach of these raw values, express the thresholds in terms of percentiles of the system’s normal performance range, such as the 95th percentile. This reduces alert volume drastically and helps the system stay reliable.
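For instance, a latency alert expressed against the 95th percentile rather than a raw threshold might look like the following sketch, assuming a Prometheus histogram named http_request_duration_seconds; the 500 ms bound is illustrative.

```yaml
groups:
  - name: latency-percentiles
    rules:
      # Alert on the 95th percentile of request latency over the last 5 minutes,
      # so a handful of slow outliers does not page anyone, but a genuine shift
      # in the latency distribution does.
      - alert: P95LatencyHigh
        expr: >
          histogram_quantile(0.95,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))
            > 0.5
        for: 15m
        labels:
          severity: warning
```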
Check out how Squadcast’s Past Incidents feature assists incident responders by presenting them with a list of similar past incidents related to the service they are currently investigating.
Additional Reading: Optimizing your alerts to reduce Alert Noise
With their predictive characteristics, proactive alerts play a vital role in understanding system performance.
Before we expand further on proactive alerts, here’s a quick look at the different kinds of alerts and their implications.
In an alert management system, the foremost step is to categorize alerts so that we can monitor the system’s health in a strategic order. There are three categories of alerts:
Additional Reading: Curb alert noise for better productivity: How-To's and Best Practices
In SRE practice, an alerting policy is a set of rules or conditions we define in a monitoring system. These rules notify the engineering team when there is a system abnormality. Alerting policies play a vital role in maintaining the performance and health of the system architecture.
Alert-as-code is a technique that defines all the system alerts, or the entire alerting policy, in the form of code. This helps the monitoring tool pinpoint incidents more precisely.
This alert-as-code configuration can be done while building the system with an infrastructure-as-code approach.
For a better understanding, we would like to cite our own Squadcast infrastructure as an example of alert-as-code configuration. Internally at Squadcast, we use Kube-Prometheus to deploy Prometheus inside our architecture, and with that configuration we create and modify all the alerting rules for our infrastructure. Every change we make to the monitoring setup is version-controlled with Git and stored on GitHub.
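To give a flavor of what this looks like in practice, here is a minimal sketch of an alerting rule expressed as a Prometheus Operator PrometheusRule manifest of the kind Kube-Prometheus picks up; the names and the rule itself are illustrative, not our actual production configuration.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: service-availability-rules
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
    - name: service-availability
      rules:
        # Fires when a scrape target has been unreachable for 5 minutes.
        - alert: TargetDown
          expr: up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.job }}/{{ $labels.instance }} is down"
```

Because the manifest lives in Git, every change to an alerting rule goes through the same review and rollback workflow as any other code change.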
Alert-as-code also helps with predictive analysis and root cause analysis to scrutinize the underlying reason for an incident. Some of its other use cases are:
Note: While detecting anomalies, a programmatic alerting policy creates alerts only when there is a deviation from the historical performance of the system.
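One common way to express such a deviation-based rule is a z-score against the metric’s own history; the sketch below uses a placeholder requests_per_second gauge and an illustrative 3-sigma bound.

```yaml
groups:
  - name: anomaly-detection
    rules:
      # Fires only when current traffic deviates by more than three standard
      # deviations from its own one-day average, i.e. a genuine anomaly rather
      # than a fixed absolute threshold being crossed.
      - alert: TrafficAnomaly
        expr: >
          abs(
            requests_per_second - avg_over_time(requests_per_second[1d])
          ) > 3 * stddev_over_time(requests_per_second[1d])
        for: 10m
        labels:
          severity: warning
```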
Squadcast offers highly configurable features that help on-call teams streamline high-priority alerts and stay productive.
Additional Reading: Alert Intelligence - 11 Tips for Smarter Alert Management
The right alerts, backed by the necessary automation strategies, pave the way for a more effective, toil-free incident management ecosystem. These practices go a long way in reducing operational toil and can ultimately enhance the productivity of the team.