Better Incident Response: Incident Classification & Setting Severities with Tags

February 20, 2020

Adding an incident classification step to your incident management software and process can significantly reduce both MTTR and the stress of the first few minutes of an incident.

How to implement incident classification

Apart from setting up on-call schedules and adopting best practices for handling various kinds of incidents, incident management is also about constantly refining processes and benchmarks to achieve higher system reliability. One way to refine processes is to classify incidents, for example by severity.

Every team defines severities in its own way, but those definitions usually evolve from a basic classification framework. The most common starting point is the SEV-1 to SEV-5 scale, outlined below:

  • SEV-1 incidents are critical and have a very large impact on the customer experience. They are typically major incidents that cause outages and hinder product or service usability for a large percentage of customers.
  • SEV-2 incidents are also critical but less severe than SEV-1 incidents. They impact a smaller percentage of customers but still impede product usage.
  • SEV-3 incidents can be minor but may have a significant impact if not addressed immediately. They may involve degradation of product stability without impacting product usage right away.
  • SEV-4 incidents are minor incidents indicating that the product is not performing to the required standard, without necessarily impacting product usability.
  • SEV-5 incidents are minor bugs that need to be fixed but don’t affect product usability.
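The scale above can be sketched as a small decision function. This is only an illustration: the function name, its parameters, and especially the 50% cut-off between a "large" and a "smaller" percentage of customers are assumptions made for the example, not thresholds defined in this article or by any tool.

```python
from enum import IntEnum


class Severity(IntEnum):
    """SEV-1 is the most severe level, SEV-5 the least."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


def classify(blocks_usage: bool, customers_affected_pct: float,
             degrades_stability: bool, below_standard: bool) -> Severity:
    """Map a few coarse signals onto the SEV-1 to SEV-5 scale.

    The 50% cut-off is hypothetical; every team should pick its own.
    """
    if blocks_usage:
        # Usage is impeded: the share of affected customers decides SEV-1 vs SEV-2.
        return Severity.SEV1 if customers_affected_pct >= 50 else Severity.SEV2
    if degrades_stability:
        # Stability is degrading, but usage is not impacted yet.
        return Severity.SEV3
    if below_standard:
        # Not performing to the required standard, usability unaffected.
        return Severity.SEV4
    # Minor bug that still needs fixing.
    return Severity.SEV5
```

Using an `IntEnum` keeps the levels comparable, so a check such as `severity <= Severity.SEV2` can pick out the critical incidents.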

However, severity alone may not capture other factors, such as how urgently the incident needs to be solved or how it can affect other parts of the system. Some incident management tools attempt to solve this by adding other forms of classification, like incident urgency and incident criticality. Many solutions allow incident severity as the only form of classification, and in some cases it has to be set manually instead of being assigned automatically based on the incoming alert context.

There’s a clear opportunity to improve incident response processes with better incident classification. Implemented the right way, it can bring down MTTR significantly, reduce the toil involved in routing incidents manually, and add more context to an incident during the primary analysis.

At Squadcast, we chose to add more flexibility to this process by creating a custom rule-based auto-tagging system instead of offering just a dropdown to manually select or assign tags. We define tags as key-value pairs. For example, the key could be severity and the possible values could be SEV0, SEV1, SEV2, etc., or the key could be Team and the possible values could be Backend, Frontend, Database, etc. With the Tagging and Routing features in Squadcast, you can set pretty much any kind of custom tag, and tags are assigned automatically based on the rules you define on top of the attributes passed in the incident payload. You can then use these tags to set routing rules, ensuring that the right responder is notified at the right time to bring down the resolution time.

In Part 2 of the Kevin Series, we illustrate how to use tags to set severities in Squadcast. We have more use-case-based articles lined up to show you other ways to implement incident classification using tags. Stay tuned!

P.S. In case you were wondering, Kevin has previously also set up his own alert deduplication rules to reduce alert noise in Squadcast. 

Severities and Auto-Routing with Incident Tags

It's a warm afternoon on February 13th, and Kevin is lazily dreaming about how his date is going to pan out the next day. His dream is suddenly disrupted by a torrent of incoming database incidents. What's more annoying is that most of them are not particularly critical or even related to the class of issues he generally handles.

Kevin’s got a new ringtone for incidents: Love Me Do, in keeping with the Valentine's Day spirit.

Also, he works with Kai, who is expected to handle all the low-severity incidents and typically everything that comes in with regard to query optimization.

Kevin realises that he could spend his time more effectively by:

  • Classifying his incidents by the type or class they fit into
  • Assigning severities to get to critical incidents faster
  • Automatically routing incidents based on tags to ensure that the right responder is alerted

This would allow more time for Kevin’s daydreaming!

Given that they work in a relatively small company where on-call rotations are rather erratic, or handled by both of them when fires happen, he decides to make this process a whole lot better simply by routing more efficiently.

Plus, anticipating the same barrage of incidents while he’s on his date tomorrow, he decides to take matters into his own hands. He sees that the database incident is a query optimisation incident, and not even a severe one at that, judging by the visited_returned_ratio value in the payload:

    {
      "payload": {
        "id": 23,
        "issue": "SLOW_QUERY_PERF",
        "metric": {
          "visited_returned_ratio": 1300.2334,
          "time_interval": 10
        },
        "summary": "Slow query performance",
        "cluster_name": "cluster-prod-0-awsumdb",
        "cluster_id": 9,
        "hostnames": [
          "rpl0-awsumdb.cluster-prod-0-awsumdb.db.com",
          "rpl2-awsumdb.cluster-prod-0-awsumdb.db.com"
        ],
        "link": "",
        "created": "2020-02-13T13:00:00.116Z",
        "status": "open"
      }
    }

He then writes a rule that automatically adds tags to the incident, giving it more context and classifying it better:

Rule: re(payload.issue, "QUERY") && payload.metric.visited_returned_ratio < 5000


Tags assigned:

  • issueType : optimisation
  • severity : low
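To make the rule's logic concrete, here is a minimal Python sketch of how it evaluates against the payload shown earlier. It is not Squadcast's implementation: the function names are hypothetical, and treating `re(...)` as an unanchored regular-expression search is an assumption about the rule syntax.

```python
import re

# Payload from the article, trimmed to the fields the rule reads.
incident = {
    "payload": {
        "issue": "SLOW_QUERY_PERF",
        "metric": {"visited_returned_ratio": 1300.2334, "time_interval": 10},
    }
}


def query_optimisation_rule(event: dict) -> bool:
    """Python rendering of:
    re(payload.issue, "QUERY") && payload.metric.visited_returned_ratio < 5000
    """
    payload = event["payload"]
    # Assumption: re(...) matches anywhere in the string, like re.search.
    issue_matches = re.search("QUERY", payload["issue"]) is not None
    return issue_matches and payload["metric"]["visited_returned_ratio"] < 5000


def auto_tag(event: dict) -> dict:
    """Return the key-value tags to attach when the rule matches."""
    tags = {}
    if query_optimisation_rule(event):
        tags.update({"issueType": "optimisation", "severity": "low"})
    return tags


print(auto_tag(incident))  # {'issueType': 'optimisation', 'severity': 'low'}
```

An incident whose ratio is 5000 or higher, or whose issue does not contain QUERY, gets no tags from this rule and would need rules of its own.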

Now he's finally done ensuring that the incidents are at least classified. With a satisfied smirk, he sits back and admires his work of art. A quick thought crosses his mind and he rubs his hands in devious mischief.

He now uses routing rules and the issueType tag to automatically route such incidents to the right person going forward, in this case Kai, so that Kevin is no longer disturbed by these kinds of issues.
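Tag-based routing of this kind can be sketched as a lookup from tag values to responders. The table layout, the `route` function, and the fallback to Kevin for unmatched incidents are all assumptions made for illustration; in Squadcast, routing rules are configured in the product rather than written as code.

```python
# Hypothetical routing table: ((tag key, tag value), responder).
ROUTES = [
    (("issueType", "optimisation"), "Kai"),
]

# Assumption: anything no rule matches still goes to Kevin.
DEFAULT_RESPONDER = "Kevin"


def route(tags: dict) -> str:
    """Return the responder for the first routing rule whose tag matches."""
    for (key, value), responder in ROUTES:
        if tags.get(key) == value:
            return responder
    return DEFAULT_RESPONDER


print(route({"issueType": "optimisation", "severity": "low"}))  # Kai
```

Because routing reads only the tags, the classification rules can change later without touching the routing table.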

Kevin thoughtfully concludes that this is quite possibly the best gift he could give to his single friend on Valentine's Day.

In fact, he believes he has cracked the secret "gifting" code for any occasion for his on-call team members (flaunts an evil grin).

Read More on Severity Level Classification

Written By: Prakya Vasudevan