Classifying Severity Levels for Your Organization

July 5, 2022

Major outages are bound to occur in even the most well-maintained infrastructure and systems. Being able to quickly classify the severity level of an incident allows your on-call team to respond more effectively.

Imagine a scenario where your on-call team is getting critical alerts every 15 minutes, user complaints are piling up on social media, and, because your platform is inoperative, revenue losses are mounting every minute. How do you go about getting your application back on track? This is where understanding incident severity, priority and severity level classification can be invaluable. In this blog, we look at severity levels and how they can improve your incident response process.

Severity and Priority: How Are They Different?

In most cases, the impact on the end user is a measure of the severity of an incident. Information about the error that is coming directly from the monitoring tool helps in classifying the severity level. Every organization will have defined levels of severity and procedures that work well for them. To get started with defining severity levels of incidents, we must first understand how to categorize them.

You should ask two major questions:

  • Are major workflows now affected?
  • Does it interfere with a user’s ability to complete an essential task?

Identifying the most crucial workflows of your apps or services is one of the first steps in defining severity levels. It helps you decide what counts as an incident. Using "SEV" criteria, we may classify incidents according to their severity. Major incidents are given lower SEV numbers (SEV-1 being the most severe) and require rapid response.

Every company must understand its own business, its team and the kind of SEV-level descriptions that work best for it. Later in this article, there is a table that you may use as a starting point to define severity levels for your organization.

It may appear as if incident severity and priority are one and the same. Isn't it reasonable to prioritize dealing with a catastrophic event over a minor one? In reality, it's more complicated than that for most businesses.

Once information about the error has been received, the incident commander will assign a level of priority to the incident. It could be P1 (priority level 1) for issues that need to be fixed as soon as possible. Severity describes the impact on the user, while priority is the order in which the on-call engineers will work on the issues affecting the infrastructure.

For example, on an e-commerce platform, if customers are not able to check out their shopping cart, this is a severe issue. In this specific case, it is a high-priority incident as well. On the other hand, if there is a typo in the brand logo or the font size is too large, it can be a high-priority incident, because the brand is visibly affected, without being a high-severity incident. Customers can still continue to shop on the website.

Let us consider another example: an event causes your app to crash, which prevents users from doing what they need to do. It has a high severity rating. However, if that incident affects only 0.01 percent of your users, it may not be considered a higher priority when other incidents are affecting a greater number of users.

It's important to know when the two measurements are aligned and when they are not. When something is given a high priority, it doesn't necessarily follow that it is of high severity, and the reverse is also true.
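One way to keep the two measurements separate is to record them as independent fields on each incident. The following is a minimal sketch in Python; the `Incident` class, the `work_queue` helper and the specific ratings are illustrative assumptions, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    title: str
    severity: int  # 1 = highest impact on users (SEV-1)
    priority: int  # 1 = work on this first (P1)


# Illustrative ratings only; each organization assigns its own.
incidents = [
    Incident("App crash affecting 0.01% of users", severity=2, priority=3),
    Incident("Typo in brand logo", severity=5, priority=1),
    Incident("Checkout fails for all customers", severity=1, priority=1),
]


def work_queue(items):
    """Order incidents by priority first, using severity as a tie-breaker."""
    return sorted(items, key=lambda incident: (incident.priority, incident.severity))


for incident in work_queue(incidents):
    print(f"P{incident.priority} SEV-{incident.severity}: {incident.title}")
```

In this sketch the logo typo is worked on before the low-reach crash even though its severity is much lower, which is exactly the kind of mismatch described above.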

Severity Level Classification for Your Organization

Not all situations are the same, and not all companies manage them in the same manner. In addition to the consequences of an event, you'll need to consider the following when establishing severity levels and the procedures and expectations that go with them.

A reliability platform like Squadcast and an e-commerce platform will have different ways of defining severity. As each of these has users with different requirements and tolerance levels, it is critical to first understand what the user expectations are.

How to Determine Severity Levels?

One must take into consideration the following before deciding on severity levels:

High and low traffic periods for your service

At certain times of the week, your customer traffic may be low. If an incident occurs at that time, few of your users will be affected. For example, if the shopping cart of an e-commerce site is not functional for certain hours of the day when the traffic is comparatively low, not many users will be affected.
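As a rough illustration of how traffic patterns change the impact of the same outage, the sketch below sums expected active users over the hours an outage spans. The hourly figures and the function name are hypothetical; you would substitute your own traffic data.

```python
# Hypothetical average active users for each hour of the day (0-23).
HOURLY_ACTIVE_USERS = {hour: 200 for hour in range(24)}
HOURLY_ACTIVE_USERS.update({hour: 5000 for hour in range(9, 21)})  # assumed daytime peak


def estimated_user_hours_affected(start_hour: int, duration_hours: int) -> int:
    """Sum the expected active users over each hour the outage covers."""
    return sum(
        HOURLY_ACTIVE_USERS[(start_hour + offset) % 24]
        for offset in range(duration_hours)
    )


night = estimated_user_hours_affected(start_hour=2, duration_hours=2)
peak = estimated_user_hours_affected(start_hour=12, duration_hours=2)
print(f"Night outage exposure: {night}, peak outage exposure: {peak}")
```

With these assumed numbers, a two-hour cart outage at noon exposes far more users than the same outage at 2 a.m., which may justify a different severity level.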

The architecture of your infrastructure

You may be using a microservice-based architecture that has multiple redundancies and can easily scale up with higher user load. In such a scenario, the failure of one component will not be considered a high-severity incident, as it can be easily replaced with a redundant instance. By contrast, if a service that cannot always be easily replicated, such as the authentication service, goes down, it automatically becomes a high-severity incident: even if the other components are working fine, your users won't be able to use the product.
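The reasoning above can be sketched as a small check that looks at remaining redundancy before rating a failure. The service names, the replica counts, the three rating labels and the `failure_severity` function are all assumptions made for illustration.

```python
def failure_severity(service: str, healthy_replicas: dict, critical_services: set) -> str:
    """Rate a component failure by remaining redundancy and criticality."""
    remaining = healthy_replicas.get(service, 0)
    if remaining > 0:
        return "low"  # a redundant instance can take over
    if service in critical_services:
        return "high"  # nothing can take over and users are blocked
    return "medium"  # no redundancy left, but users can still use the product


# Hypothetical inventory of healthy replicas left after the failure.
healthy_replicas = {"search": 2, "auth": 0, "recommendations": 0}
critical_services = {"auth"}

for name in healthy_replicas:
    print(name, failure_severity(name, healthy_replicas, critical_services))
```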

Using SLOs to determine severity levels

Since each service has its own specific service-level objective (SLO), which sets the level of reliability or performance it is expected to meet, we can use it to determine the severity level. For example, if a particular service's SLO is based on transaction success rate and the share of successful transactions drops below a certain threshold, we can classify it as a high-severity incident.
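A minimal sketch of this idea, assuming an SLO expressed as a transaction success rate: the target of 99 percent, the critical threshold of 90 percent and the returned labels are placeholder values chosen for the example, not recommendations.

```python
def severity_from_success_rate(
    successful: int,
    total: int,
    slo_target: float = 0.99,          # assumed SLO target
    critical_threshold: float = 0.90,  # assumed cut-off for a high-severity incident
) -> str:
    """Map a measured transaction success rate to a coarse severity label."""
    if total == 0:
        return "no-data"  # nothing to measure yet
    rate = successful / total
    if rate < critical_threshold:
        return "high"
    if rate < slo_target:
        return "elevated"
    return "within-slo"


print(severity_from_success_rate(successful=850, total=1000))
print(severity_from_success_rate(successful=950, total=1000))
print(severity_from_success_rate(successful=995, total=1000))
```

Two thresholds are used here so that a small dip below the SLO is flagged without being treated the same way as a large drop.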

Check out our documentation on SLOs if you wish to know more.

Levels of Severity

Severity definitions are organization-specific. An incident that is classified as SEV-1 in one organization may have a lower severity rating in another. Some organizations use just three levels of severity. The general rule is that the more user journeys or workflows an incident affects, the higher its severity level will be.

Some organizations may also categorize severity levels on the basis of the SLIs (service-level indicators) or SLOs (service-level objectives) being affected. The table below lists one of many possible ways to define severity levels.

| Severity level | Description |
| --- | --- |
| SEV-1 | Incidents are usually considered SEV-1 if large-scale failures are occurring in your infrastructure that negatively affect most users. Critical services are disrupted or unavailable. Database read/write errors, security breaches and other such issues might fall under this level. An outage of a third-party service (such as Google SSO) that leaves users unable to sign in is also often considered a SEV-1 issue. |
| SEV-2 | A SEV-2 incident is usually declared when user experience is severely affected. This can include unacceptably high levels of latency or a significant breach of SLAs/SLOs. These kinds of incidents have the potential to cause major revenue loss for your organization. Any incident that affects more than 70 percent of the users can be classified as SEV-2. |
| SEV-3 | An occurrence that has just a minimal impact on the infrastructure but nonetheless creates high load or latency issues for your users. This can include unacceptably long website load times, timeouts for shopping carts and other similar issues. |
| SEV-4 | An issue that affects customer experience but doesn't have a major impact on the service's operation. This can include inconsistent page load times, display problems in different browsers and similar issues. |
| SEV-5 | Low-level mistakes, such as formatting or display issues that do not impair usability, are classified as SEV-5. This can include typos in product descriptions, incorrect colors being displayed in brand logos and other issues of that nature. |
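The levels above can be turned into a first-pass triage helper. This is only a sketch of one possible mapping: the `IncidentFacts` fields and the order of the checks are assumptions, and the 70 percent figure is taken from the SEV-2 description above.

```python
from dataclasses import dataclass


@dataclass
class IncidentFacts:
    critical_service_down: bool = False
    percent_users_affected: float = 0.0
    slo_breached: bool = False
    high_latency: bool = False
    degraded_experience: bool = False


def classify(facts: IncidentFacts) -> str:
    """Return the first (most severe) level whose criteria match."""
    if facts.critical_service_down:
        return "SEV-1"
    if facts.percent_users_affected > 70 or facts.slo_breached:
        return "SEV-2"
    if facts.high_latency:
        return "SEV-3"
    if facts.degraded_experience:
        return "SEV-4"
    return "SEV-5"  # cosmetic issues that do not impair usability


print(classify(IncidentFacts(critical_service_down=True)))
print(classify(IncidentFacts(percent_users_affected=80)))
print(classify(IncidentFacts(high_latency=True)))
```

A helper like this gives responders a consistent starting point; the incident commander can still override the result based on context.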

Conclusion

It is essential to properly classify incident severity levels to get a head start on solving infrastructure issues. Working with previously defined severity levels helps on-call teams quickly triage major issues. As we have seen in this blog, each organization will have its own specific way of deciding upon the severity and priority of incidents.

As the nature and scale of your infrastructure grow and the needs of your user base evolve over time, you may want to revisit and modify the definitions of severity levels. Continuous learning is an essential part of good incident response. We hope this blog helps you set the path for better incident response in your organization.

Written By:
Nir Sharma
SRE
Incident Management
Best Practices
Copyright © Squadcast Inc. 2017-2024
Last Updated:
November 20, 2024