Best Practices in Incident Management

May 7, 2020

In an always-on world, companies look to systems and processes to keep their services up and running at all times. The most important part of maintaining this uptime is having an Incident Management process in place to restore your services in the event of an interruption or unplanned downtime.

Incident Management processes are typically used by SRE, DevOps, NOC and other IT teams to respond to incidents that affect services and to restore uptime. Teams that follow ITIL and ITSM practices have similar processes in place, with slightly different terminology.

For the purpose of categorizing the different aspects of Incident Management, we can go over the different stages of an Incident Lifecycle.

[Figure: Stages of an Incident Lifecycle]
Note: This does not indicate the various inter-dependencies of each stage in the cycle and is only depicted in a simple format to provide a holistic overview of the Incident Lifecycle.

What is IT incident management?

Incident management is the process of managing an event that disrupts the normal function of a system, network, or process. Incidents can be caused by hardware or software problems and can result from a single event or a series of events. While the process varies with the size of the organization, most organizations handle incidents through a series of processes that share one goal: to identify the root cause of the incident and take corrective action.

An organization’s Incident Management process is meant to tie these stages together seamlessly and cover the entire lifecycle of the incident, from incident trigger to post-incident reviews and postmortems. It is also important to note that these practices are meant to be dynamic and to evolve constantly with the people, systems, and architectures involved.

This post outlines some best practices to keep in mind while implementing or improving your processes.

Incident Detection & Classification: 

  • The initial details you receive about an incident while on-call save a lot of time in the triage and mitigation process. Configuring the right data fields and Event Tags to automate this level of classification is a must.
  • Set up Deduplication rules to group all similar alerts together. This also ensures that your on-call team is not notified for the same incident repeatedly.
  • In the details field, send only the information that helps with remediation.
  • Make sure you add any other important data manually as a part of the classification process after the team has been alerted.
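
As a rough illustration of the deduplication and tagging points above, here is a minimal Python sketch. It is not tied to any particular incident management tool: the `Alert` and `Incident` classes, the `dedup_key` rule (same service and same check) and the sample tags are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str
    check: str
    message: str
    tags: dict = field(default_factory=dict)


@dataclass
class Incident:
    dedup_key: tuple
    alerts: list = field(default_factory=list)
    tags: dict = field(default_factory=dict)


def dedup_key(alert):
    # Hypothetical rule: alerts from the same service and check count as similar.
    return (alert.service, alert.check)


def group_alerts(alerts):
    """Group similar alerts into one incident per deduplication key."""
    incidents = {}
    for alert in alerts:
        key = dedup_key(alert)
        if key not in incidents:
            incidents[key] = Incident(dedup_key=key)
        incident = incidents[key]
        incident.alerts.append(alert)
        # Merge tags, keeping the first value seen for each tag name.
        for name, value in alert.tags.items():
            incident.tags.setdefault(name, value)
    return list(incidents.values())


# Made-up sample alerts.
alerts = [
    Alert("checkout", "http_5xx", "5xx rate above 5%", {"env": "prod"}),
    Alert("checkout", "http_5xx", "5xx rate above 9%", {"region": "eu-west"}),
    Alert("search", "latency", "p99 above 2s", {"env": "prod"}),
]
incidents = group_alerts(alerts)
for incident in incidents:
    print(incident.dedup_key, len(incident.alerts), incident.tags)
```

With these sample alerts the two `checkout` alerts collapse into a single incident, so the on-call responder would be notified once for it rather than twice.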

Incident Alerting: 

  • Make sure to send alerts only for relevant and actionable events even if other events are also being sent into your incident management tool.
  • Configure Deduplication and Suppression Rules so that you are not notified about unimportant alerts. Otherwise, the noise can cause severe alert fatigue and hurt your team’s response times and productivity.
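
One way to picture suppression rules is as a list of field patterns that are checked before anyone is paged. The sketch below is only an assumption about how such rules might look: the rule format, the field names (`env`, `severity`, `message`) and the sample events are hypothetical.

```python
import re

# Hypothetical suppression rules. An event that matches every pattern in any
# one rule is still recorded, but nobody is notified about it.
SUPPRESSION_RULES = [
    {"env": r"^(dev|staging)$"},
    {"severity": r"^info$"},
    {"message": r"(?i)scheduled maintenance"},
]


def is_suppressed(event, rules=SUPPRESSION_RULES):
    """Return True if the event matches all field patterns of any rule."""
    for rule in rules:
        if all(
            re.search(pattern, str(event.get(name, "")))
            for name, pattern in rule.items()
        ):
            return True
    return False


def events_to_notify(events):
    """Keep only the events that should page the on-call team."""
    return [event for event in events if not is_suppressed(event)]


# Made-up sample events.
events = [
    {"env": "prod", "severity": "critical", "message": "Database unreachable"},
    {"env": "staging", "severity": "critical", "message": "Database unreachable"},
    {"env": "prod", "severity": "info", "message": "Deploy finished"},
    {"env": "prod", "severity": "warning", "message": "Scheduled maintenance window started"},
]
notify = events_to_notify(events)
print(len(notify), "of", len(events), "events would notify the on-call team")
```

All four events still reach the tool; only the production database outage is treated as actionable in this example.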

Incident Prioritization: 

  • A crucial form of incident classification is prioritization. This helps the on-call team understand the severity of the issue at first glance. Configure automation to assign priority to every incident routed to your alerting tool.
  • The prioritization matrix followed by an organization should always be linked to service and customer impact. This gives the on-call team the clarity needed to understand the situation.
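
A prioritization matrix of this kind can be written down as a simple lookup table. The sketch below assumes three impact levels and four priority labels (`P1` to `P4`); both the levels and the specific mapping are illustrative, and every organization would define its own.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical matrix: (service impact, customer impact) -> priority label.
PRIORITY_MATRIX = {
    (Level.HIGH, Level.HIGH): "P1",
    (Level.HIGH, Level.MEDIUM): "P1",
    (Level.MEDIUM, Level.HIGH): "P1",
    (Level.HIGH, Level.LOW): "P2",
    (Level.MEDIUM, Level.MEDIUM): "P2",
    (Level.LOW, Level.HIGH): "P2",
    (Level.MEDIUM, Level.LOW): "P3",
    (Level.LOW, Level.MEDIUM): "P3",
    (Level.LOW, Level.LOW): "P4",
}


def assign_priority(service_impact, customer_impact):
    """Look up the priority for a given service and customer impact."""
    return PRIORITY_MATRIX[(service_impact, customer_impact)]


print(assign_priority(Level.HIGH, Level.MEDIUM))
print(assign_priority(Level.LOW, Level.LOW))
```

Keeping the matrix as data rather than nested conditionals makes it easy to review with stakeholders and to change without touching the automation that applies it.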

Triage and Collaboration: 

  • Configure your incident routing and escalation policies to always reach the right responder. Assign tags to indicate severity or priority and configure routing rules to ensure that the first responder is always the right responder.
  • In a high-fire situation, the ease with which you can communicate can make or break customer perception and, ultimately, the impact on your bottom line. Having a platform-specific collaboration space reduces the time lost assembling elsewhere to discuss the incident.
  • If you use Slack for this, assign a dedicated channel for all incident-related discussion. Keeping the conversation in one place helps reduce MTTR.
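
The routing and escalation ideas above can be sketched as ordered rules that map incident tags to an escalation policy. Everything named here is hypothetical: the policy names, the team names, the tag keys and the timings are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class EscalationStep:
    notify: str         # who to page (hypothetical team or role name)
    after_minutes: int  # minutes without acknowledgement before this step fires


# Hypothetical routing rules, checked in order; the first match wins.
ROUTING_RULES = [
    ({"service": "payments", "priority": "P1"}, "payments-critical"),
    ({"service": "payments"}, "payments-default"),
    ({}, "platform-default"),  # an empty match acts as the catch-all
]

ESCALATION_POLICIES = {
    "payments-critical": [
        EscalationStep("payments-oncall", 0),
        EscalationStep("payments-lead", 5),
        EscalationStep("engineering-manager", 15),
    ],
    "payments-default": [
        EscalationStep("payments-oncall", 0),
        EscalationStep("payments-lead", 30),
    ],
    "platform-default": [EscalationStep("platform-oncall", 0)],
}


def route(tags):
    """Return the name of the first escalation policy whose rule matches."""
    for match, policy in ROUTING_RULES:
        if all(tags.get(name) == value for name, value in match.items()):
            return policy
    raise LookupError("no routing rule matched")


def who_to_notify(policy, minutes_unacknowledged):
    """List everyone who should have been paged by now under this policy."""
    steps = ESCALATION_POLICIES[policy]
    return [s.notify for s in steps if s.after_minutes <= minutes_unacknowledged]


policy = route({"service": "payments", "priority": "P1"})
print(policy, who_to_notify(policy, 6))
```

Putting the most specific rules first and ending with a catch-all means every incident reaches someone, while high-priority incidents on critical services escalate faster.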

Incident Communication: 

  • It’s important to keep both customers and customer-facing internal teams informed of all mitigation activities. This is easier when you automate communication updates and manage them from one place.
  • Add the relevant teams as stakeholders so they can see what is being done to mitigate an incident. You can also provide additional details on a private status page for internal folks.
  • Maintain a Public Status Page and update it constantly. The first thing a user does when facing service issues is look at your status page, so make sure it has all the essential information a user needs to understand the impact of the issue.

Incident Resolution: 

  • Automate as much as possible. Connect your tools to take action directly from within the incident management platform itself. Little steps go a long way.
  • Document every attempt at resolution or mitigation as soon as you have taken the step. What you perceive to be a small problem might not be one for someone else on your team.
  • Maintain a repository of Runbooks and RCAs / Incident Reviews for you and your team to go back and review resolution steps for similar incidents in the future.

Incident Review & Remediation: 

  • Start with an auto-generated incident timeline which has a chronological list of everything that was recorded during a live incident.
  • Drive a collaborative Incident Review process, complete with Root Cause Analysis (RCA), to reach a fine-grained understanding of any incident as quickly as possible.
  • Always run an Incident Review process for Medium and High severity incidents. Remember to be Blameless. At the time of a crisis, it’s important to focus on the `What`, `Why`, `How` and `What Next` rather than the `Who`.
  • Maintain a checklist of tasks that have to be completed for longer-term remediation.
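
To make the idea of an auto-generated timeline concrete, here is a minimal sketch that sorts recorded events chronologically and derives the time from trigger to resolution. The event fields (`at`, `actor`, `action`) and the sample data are invented for the example; a real tool would record these during the live incident.

```python
from datetime import datetime, timedelta, timezone


def utc(hour, minute):
    # Helper for the made-up sample data below.
    return datetime(2020, 1, 15, hour, minute, tzinfo=timezone.utc)


def build_timeline(events):
    """Return the recorded events as chronologically ordered text lines."""
    ordered = sorted(events, key=lambda e: e["at"])
    return [
        "{} UTC  {}: {}".format(e["at"].strftime("%H:%M"), e["actor"], e["action"])
        for e in ordered
    ]


# Events are listed out of order on purpose, as they might arrive from
# different tools.
events = [
    {"at": utc(10, 12), "actor": "alice", "action": "acknowledged incident"},
    {"at": utc(10, 4), "actor": "monitoring", "action": "incident triggered"},
    {"at": utc(10, 41), "actor": "alice", "action": "rolled back release"},
    {"at": utc(10, 47), "actor": "alice", "action": "resolved incident"},
]

timeline = build_timeline(events)
for line in timeline:
    print(line)

# Time from the first recorded event to the last one.
time_to_resolve = max(e["at"] for e in events) - min(e["at"] for e in events)
print("Time to resolve:", time_to_resolve)
```

A timeline like this records what happened and when, which supports a blameless review: the discussion starts from the sequence of events rather than from who was involved.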

Ensuring that you learn from every incident should be the biggest takeaway from your Incident Response process.

Written By:
Prakya Vasudevan