Major outages are bound to occur in even the most well-maintained infrastructure and systems. Being able to quickly classify the severity level of an incident allows your on-call team to respond more effectively.
Imagine a scenario where your on-call team is getting critical alerts every 15 minutes, user complaints are piling up on social media, and, since your platform is inoperative, revenue losses are mounting every minute. How do you go about getting your application back on track? This is where understanding incident severity, priority, and severity-level classification can be invaluable. In this blog we look at severity levels and how they can improve your incident response process.
In most cases, the impact on the end user is a measure of the severity of an incident. Information about the error that is coming directly from the monitoring tool helps in classifying the severity level. Every organization will have defined levels of severity and procedures that work well for them. To get started with defining severity levels of incidents, we must first understand how to categorize them.
You should ask two major questions:
Identifying the most crucial workflows of your apps or services is one of the first steps in defining severity levels. It helps you identify what counts as an incident. Using "SEV" levels, we can classify incidents according to their severity. Major incidents are assigned lower SEV numbers (such as SEV-1) and require a rapid response.
Every company must understand its own business, its team, and the kind of SEV-level descriptions that work best for it. Later in this blog, we have a table that you may use to define severity levels for your organization.
It may appear as if incident severity and priority are one and the same. Isn't it reasonable to prioritize dealing with a catastrophic event over a minor one? In reality, it's more complicated than that for most businesses.
Once information about the error has been received, the incident commander will assign a level of priority to the incident. It could be P1 (priority level 1) for issues that need to be fixed as soon as possible. Severity describes the impact on the user, while priority is the order in which the on-call engineers will work on the issues affecting the infrastructure.
For example, on an e-commerce platform, customers being unable to check out their shopping cart is a high-severity issue. In this specific case, it is a high-priority incident as well. On the other hand, if there is a typo in the brand logo or the font size is too large, it is a high-priority incident without being a high-severity incident, because customers can still continue to shop on the website.
Let us consider another example: an event causes your app to crash. Because it prevents users from doing what they need to do, it has a high severity rating. However, if that incident affects only 0.01 percent of your users, it may not be considered a high priority when other incidents are affecting a greater number of users.
It's important to know when the two measurements are aligned and when they are not. A high priority does not necessarily imply a high severity, and a high severity does not necessarily imply a high priority.
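One way to keep the two measurements separate is to record them as independent fields and order the work queue by priority. The following is a minimal Python sketch of that idea. The `Incident` class, the `work_queue` function, and the specific SEV and P numbers given to each example incident are illustrative assumptions, not a prescribed scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    title: str
    severity: int  # 1 = most severe user impact (SEV-1); values here are assumed
    priority: int  # 1 = work on this first (P1); values here are assumed


def work_queue(incidents):
    """Order incidents by priority, using severity only as a tie-breaker."""
    return sorted(incidents, key=lambda i: (i.priority, i.severity))


incidents = [
    Incident("Crash affecting 0.01% of users", severity=1, priority=2),
    Incident("Typo in brand logo", severity=4, priority=1),
    Incident("Checkout unavailable", severity=1, priority=1),
]

queue = [i.title for i in work_queue(incidents)]
print(queue)
```

With these assumed values, the logo typo is worked on before the crash even though its severity is much lower, which is the kind of misalignment described above.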
Not all situations are the same, and not all companies manage them in the same manner. In addition to the consequences of an event, you'll need to consider the following when establishing severity levels and the procedures and expectations that go with them.
A reliability platform like Squadcast and an e-commerce platform will have different ways of defining severity. As each of these has users with different requirements and tolerance levels, it is critical to first understand what the user expectations are.
One must take into consideration the following before deciding on severity levels:
At certain times of the week, your customer traffic may be low. If an incident occurs at that time, few of your users will be affected. For example, if the shopping cart of an e-commerce site is not functional for certain hours of the day when the traffic is comparatively low, not many users will be affected.
You may be using a microservice-based architecture that has multiple redundancies and can easily scale up with higher user load. In such a scenario, the failure of one component may not be considered a high-severity incident, as it can easily be replaced by a redundant instance. On the other hand, if a service that cannot be easily replicated goes down, such as the authentication service in some architectures, it automatically becomes a high-severity incident: even if the other components are working fine, your users won't be able to use the product.
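The reasoning above can be sketched as a small rule that looks at how much redundancy is left and whether every user depends on the component. This is only an illustration: the `Component` fields, the `failure_severity` function, and the "low", "medium", and "high" labels are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    healthy_replicas: int   # redundant instances still serving traffic (assumed field)
    blocks_all_users: bool  # True if no user can use the product without it (assumed field)


def failure_severity(component: Component) -> str:
    """Return a coarse severity label for a failure in the given component."""
    if component.healthy_replicas > 0:
        # Redundancy absorbs the failure, so user impact should be small.
        return "low"
    if component.blocks_all_users:
        # No fallback is left and every user journey depends on this component.
        return "high"
    return "medium"


search = Component("search", healthy_replicas=2, blocks_all_users=False)
auth = Component("authentication", healthy_replicas=0, blocks_all_users=True)

print(failure_severity(search))
print(failure_severity(auth))
```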
Since each service has its own service-level objective (SLO), which defines the level of performance expected of it, we can use the SLO to determine the severity level. For example, if a service's SLO is based on transaction rate and the number of successful transactions falls below a certain threshold, we can classify it as a high-severity incident.
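As a rough sketch of the transaction-rate example, the check below averages the successful transactions per minute over a short window and compares the result with a threshold. The threshold of 500 per minute, the window contents, and the function name `is_high_severity` are all hypothetical; a real threshold would come from your own SLO.

```python
from statistics import mean
from typing import Sequence

# Hypothetical threshold: successful transactions per minute below which
# the incident is treated as high severity.
HIGH_SEVERITY_THRESHOLD = 500


def is_high_severity(successful_per_minute: Sequence[int],
                     threshold: int = HIGH_SEVERITY_THRESHOLD) -> bool:
    """Return True if the average successful-transaction rate is below the threshold."""
    if not successful_per_minute:
        raise ValueError("need at least one sample")
    return mean(successful_per_minute) < threshold


healthy_window = [620, 640, 610]
degraded_window = [480, 300, 120]

print(is_high_severity(healthy_window))
print(is_high_severity(degraded_window))
```

Averaging over a window rather than reacting to a single sample is a design choice made here to avoid classifying a brief dip as a major incident.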
Check out our documentation on SLOs if you wish to know more.
Severity definitions are organization-specific. An incident that is classified as SEV-1 in one organization may have a lower severity rating in another. Some organizations have just three levels of severity. The general rule is that the more user journeys or workflows an incident affects, the higher its severity level.
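The general rule can be illustrated with a small function that maps the share of affected workflows to one of three levels. The cut-off values (75 percent and 25 percent) and the function name `sev_level` are assumptions chosen for the example; each organization would pick its own.

```python
def sev_level(affected_workflows: int, total_workflows: int) -> str:
    """Map the share of affected workflows to a three-level SEV scale (assumed cut-offs)."""
    if total_workflows <= 0:
        raise ValueError("total_workflows must be positive")
    if not 0 <= affected_workflows <= total_workflows:
        raise ValueError("affected_workflows must be between 0 and total_workflows")

    share = affected_workflows / total_workflows
    if share >= 0.75:
        return "SEV-1"
    if share >= 0.25:
        return "SEV-2"
    if share > 0:
        return "SEV-3"
    return "no incident"


print(sev_level(4, 4))
print(sev_level(1, 4))
print(sev_level(1, 10))
```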
Some organizations may also categorize severity levels on the basis of the SLIs (service-level indicators) or SLOs (service-level objectives) being affected. The table below lists one of many possible ways to define severity levels.
It is essential to properly classify incident severity levels to get a head start on solving infrastructure issues. Working with previously defined severity levels helps on-call teams quickly triage major issues. As we have seen in this blog, each organization will have its own specific way of deciding upon the severity and priority of incidents.
As the nature and scale of your infrastructure grow and the needs of your user base evolve over time, you may want to revisit and modify the definitions of severity levels. Continuous learning is an essential part of good incident response. We hope this blog is helpful for you in setting the path for better incident response in your organization.