
Better incident management while working remotely: The Squadcast way

January 7, 2021

With the onset of remote work due to COVID-19, remote incident management has become the norm for businesses worldwide. Organisations that were used to gathering in war rooms now have to coordinate teams through Slack, MS Teams or other collaboration tools. This unexpected and unplanned transition has created a unique set of problems.

Now that we have had a few months of experience handling incidents remotely, here are some best practices we have found effective. These practices are already recommended for incident management in general, but we believe they matter even more when teams work remotely, and this list is a good starting point for staying on top of incidents and preventing major outages.

In this blog, we list ideas that you can implement immediately: communicating better among stakeholders, having detailed plans for dealing with outages, and documenting and learning from past failures. Here are the ways you can make this transition work in your favor and keep on-call as stress-free as possible.

Have a strong communication plan:

This includes using Slack, MS Teams or any other collaboration tool to communicate about incidents. It is essential to have a contingency plan in place in case your usual communication software goes down. No one wants to spend hours on phone calls to fix issues. A remote incident management team is like a pit-stop crew, but one whose members are miles apart and sometimes in different time zones.

The Slack outage in the first week of 2021 underlines how important it is to keep communication channels open. Private status pages are invaluable to the engineers already working on a fix (especially in larger teams). They also help your PR and communications teams by giving an accurate picture of the size of the outage and the progress being made. A public status page lets your customers know which parts of your product are still operational and indicates the progress being made on returning to full functionality.
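The contingency idea above can be sketched as an ordered list of channels that are tried one after another. This is only an illustration: `notify_with_fallback`, `send_chat` and `send_sms` are hypothetical names, and the two senders are stand-ins for whatever chat and SMS integrations you actually use.

```python
from typing import Callable, Sequence

# A channel is any callable that takes a message and raises on failure.
Channel = Callable[[str], None]


def notify_with_fallback(message: str, channels: Sequence[tuple[str, Channel]]) -> str:
    """Try each channel in order and return the name of the first that succeeds."""
    errors = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:  # any failure moves on to the next channel
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all channels failed: " + "; ".join(errors))


# Hypothetical stand-ins for real integrations (chat webhook, SMS gateway).
def send_chat(message: str) -> None:
    raise ConnectionError("chat tool is down")


sms_outbox: list[str] = []


def send_sms(message: str) -> None:
    sms_outbox.append(message)


used = notify_with_fallback(
    "DB latency high", [("chat", send_chat), ("sms", send_sms)]
)
print(used)  # sms
```

Keeping the channel order in one place means the fallback path can be rehearsed before the primary tool actually goes down.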

Have an information repository of your system in hand:

Earlier, if you needed any piece of information about your system, it was as simple as walking a few desks over and asking the person concerned. Now, if that person is unavailable on Slack, the information you need to fix the outage quickly is hard to get. A centralized information system with all the essential details is invaluable. Before the pandemic hit, too many organisations kept important information on post-it notes stuck all over the place. Needless to say, this won't work when your team is remote. You need a searchable repository of vital information to save precious time and effort.
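To make "searchable" concrete, here is a toy keyword index over a handful of notes. It is a sketch only: a real team would more likely use a wiki or documentation tool with built-in search, and the document titles and contents below are made up for illustration.

```python
from collections import defaultdict


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercased word to the titles of the documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for title, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,:;()")].add(title)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the titles that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Made-up notes standing in for your own system documentation.
docs = {
    "Payments DB failover": "Promote the replica and update the connection string.",
    "Cache flush": "Flush the cache cluster after a bad deploy.",
}
index = build_index(docs)
print(search(index, "replica"))    # {'Payments DB failover'}
print(search(index, "the cache"))  # {'Cache flush'}
```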

Have dry-runs/simulations of catastrophic failures:

Running a dry-run or simulation shows how effectively your team can handle a severe failure while remote. It can reveal areas for improvement in your incident response strategy.

Automate more:

Some things are quick fixes or easy to tackle when you are physically present in the office. These may be scripts that are run manually or meetings that can be avoided. Reducing toil is a long-term goal that assumes greater importance when working remotely. Burnout from working remotely is a serious issue, and tackling toil with automation should be a high priority. Automation should ideally cover running scripts, monitoring clusters, scheduling maintenance, and auto-configuring cloud-based virtual machines when the need arises.

Having detailed runbooks will be of great help when a major incident occurs. Automated runbooks can be a game changer when it comes to diagnosing and fixing systems that have gone offline. Whether you are using Ansible, Rundeck or any other tool, even the simplest runbook is better than fixing things manually and starting from scratch every time. You can read more about runbooks in this blog.
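Even without a dedicated tool, a runbook can start life as an ordered list of steps that are run and logged in sequence. The sketch below assumes a made-up two-step runbook; `Step`, `run_runbook` and the checks are hypothetical, and it does not represent how Ansible or Rundeck model runbooks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    description: str
    action: Callable[[], bool]  # returns True when the step succeeded


def run_runbook(steps: list[Step]) -> list[str]:
    """Run steps in order, stop at the first failure, and return a log."""
    log = []
    for number, step in enumerate(steps, start=1):
        ok = step.action()
        log.append(f"{number}. {step.description}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break
    return log


# Hypothetical system state and actions; real ones would call your tooling.
state = {"disk_free_gb": 2, "service_up": False}


def free_disk() -> bool:
    state["disk_free_gb"] += 10
    return state["disk_free_gb"] >= 5


def restart_service() -> bool:
    state["service_up"] = True
    return state["service_up"]


runbook_log = run_runbook([
    Step("Free disk space", free_disk),
    Step("Restart service", restart_service),
])
print("\n".join(runbook_log))
```

Stopping at the first failed step keeps the log honest about how far the fix actually got.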

Fight alert fatigue (even more proactively):

Alert fatigue is arguably even more damaging for remote teams than for co-located ones. Configuring monitoring tools and tuning alerting thresholds play a very important role in reducing alert noise. Our team also tackles alert fatigue proactively by creating deduplication, event routing and tagging rules. Mandatory off days for on-call engineers also help considerably in avoiding burnout.
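As a rough illustration of what a deduplication rule does, the sketch below drops repeats of the same alert that arrive within a time window. The `Alert` fields, the `(service, check)` key and the five-minute window are all assumptions made for the example, not a description of any particular platform's rule engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    timestamp: float  # seconds since an arbitrary start


def deduplicate(alerts: list[Alert], window_seconds: float = 300.0) -> list[Alert]:
    """Keep an alert only if no alert with the same key was kept within the window."""
    last_kept: dict[tuple[str, str], float] = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        previous = last_kept.get(key)
        if previous is None or alert.timestamp - previous >= window_seconds:
            kept.append(alert)
            last_kept[key] = alert.timestamp
    return kept


incoming = [
    Alert("api", "cpu", 0),
    Alert("db", "disk", 30),
    Alert("api", "cpu", 60),   # repeat within 5 minutes, dropped
    Alert("api", "cpu", 400),  # outside the window, kept
]
kept = deduplicate(incoming)
print([(a.service, a.timestamp) for a in kept])
# [('api', 0), ('db', 30), ('api', 400)]
```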

Coordinate with dev teams before deployment:

Monitor your infrastructure during major deployments, and have rollbacks in place in case things go wrong. Because the most catastrophic failures can happen during deployments, you need a way to monitor system health during that time and initiate a rollback if required.
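One simple way to connect monitoring to rollbacks is to watch a health signal during the rollout and trigger the rollback after several bad samples in a row. In this sketch the error-rate samples, the 5% threshold and the `rollback` callback are hypothetical; in practice the samples would come from your monitoring tool and the callback from your deployment tooling.

```python
from typing import Callable, Iterable


def watch_deployment(
    error_rates: Iterable[float],
    rollback: Callable[[], None],
    threshold: float = 0.05,
    max_bad_samples: int = 3,
) -> bool:
    """Return True if the deployment stayed healthy, False if it was rolled back."""
    consecutive_bad = 0
    for rate in error_rates:
        # Count consecutive samples above the threshold; a good sample resets it.
        consecutive_bad = consecutive_bad + 1 if rate > threshold else 0
        if consecutive_bad >= max_bad_samples:
            rollback()
            return False
    return True


rollbacks: list[str] = []
samples = [0.01, 0.08, 0.02, 0.09, 0.10, 0.12]  # made-up error rates
healthy = watch_deployment(samples, lambda: rollbacks.append("rolled back"))
print(healthy, rollbacks)  # False ['rolled back']
```

Requiring consecutive bad samples avoids rolling back on a single noisy reading.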

Have a clear incident chain of command and roles:

Have you planned for contingencies when your usual leadership is on leave or unreachable? A clear incident chain of command prevents last-minute confusion in a time-sensitive and stressful situation.
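A chain of command can be written down as an ordered list, so that the fallback when someone is unreachable is decided in advance rather than during the incident. The names, roles and the `available` flag below are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Responder:
    name: str
    role: str
    available: bool


def pick_incident_commander(chain: list[Responder]) -> Optional[Responder]:
    """Return the first available person in the chain, or None if nobody is."""
    for responder in chain:
        if responder.available:
            return responder
    return None


# Hypothetical chain, ordered from the usual lead to the last fallback.
chain = [
    Responder("Asha", "Head of SRE", available=False),  # on leave
    Responder("Ben", "Senior SRE", available=True),
    Responder("Chen", "Engineering Manager", available=True),
]
commander = pick_incident_commander(chain)
print(commander.name if commander else "nobody reachable")  # Ben
```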

Invest in an incident management platform:

If you haven't done so already, investing in a dedicated incident management platform will go a long way towards making on-call less stressful, with the help of features like escalation policies and alert deduplication rules. Many such platforms also have dashboards that let you track the performance of your on-call team as well as the quality of service. There are still on-call teams that use spreadsheets to track schedules. While this was manageable (though not recommended) in pre-COVID times, the situation now requires more clarity and efficiency. Easy-to-use on-call schedules in incident management platforms can be a great help to your team in planning their workload. Since engineers know beforehand whether they will be on-call, they can plan their other activities accordingly. A healthy rotation in on-call schedules also helps prevent burnout.
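To show what a healthy rotation looks like in its simplest form, here is a round-robin weekly schedule. It is a sketch with placeholder engineer names and a fixed weekly shift length; real platforms also handle overrides, time zones and holidays, which this ignores.

```python
from datetime import date, timedelta


def build_rotation(
    engineers: list[str], start: date, weeks: int
) -> list[tuple[date, str]]:
    """Assign one engineer per week, cycling through the list in order."""
    schedule = []
    for week in range(weeks):
        shift_start = start + timedelta(weeks=week)
        schedule.append((shift_start, engineers[week % len(engineers)]))
    return schedule


# Placeholder names; the start date is a Monday.
rotation = build_rotation(["Asha", "Ben", "Chen"], date(2021, 1, 4), weeks=4)
for shift_start, engineer in rotation:
    print(shift_start.isoformat(), engineer)
# 2021-01-04 Asha
# 2021-01-11 Ben
# 2021-01-18 Chen
# 2021-01-25 Asha
```

Publishing the schedule several weeks ahead is what lets engineers plan around their shifts.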

After a major outage, automated incident timelines are invaluable for remote teams trying to work out which measures were taken to fix things. At Squadcast, we rely on the automated incident timeline for a real-time view of progress towards incident resolution. Automated timelines are also a great help when writing incident postmortems afterwards. It becomes much easier to identify the strengths and weaknesses of your on-call response when you are armed with a detailed timeline of events.
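At its core, an incident timeline is a list of timestamped events that can be replayed in order. The sketch below is a minimal stand-in to show the idea, not a description of Squadcast's implementation; the class name, fields and sample events are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class IncidentTimeline:
    title: str
    events: list[tuple[datetime, str, str]] = field(default_factory=list)

    def record(self, actor: str, action: str, at: Optional[datetime] = None) -> None:
        """Append an event, stamping it with the current UTC time if none is given."""
        self.events.append((at or datetime.now(timezone.utc), actor, action))

    def render(self) -> str:
        """Return the title followed by the events in chronological order."""
        lines = [self.title]
        for at, actor, action in sorted(self.events, key=lambda event: event[0]):
            lines.append(f"{at:%H:%M} {actor}: {action}")
        return "\n".join(lines)


timeline = IncidentTimeline("Checkout API outage")
utc = timezone.utc
timeline.record("Ben", "rolled back release", at=datetime(2021, 1, 7, 10, 20, tzinfo=utc))
timeline.record("monitoring", "alert triggered", at=datetime(2021, 1, 7, 10, 2, tzinfo=utc))
timeline.record("Ben", "acknowledged", at=datetime(2021, 1, 7, 10, 5, tzinfo=utc))
print(timeline.render())
# Checkout API outage
# 10:02 monitoring: alert triggered
# 10:05 Ben: acknowledged
# 10:20 Ben: rolled back release
```

Because events are sorted at render time, entries added late (for example, by someone filling in details after the incident) still appear in the right place for the postmortem.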

As stated earlier, an incident response team during a major outage is like the pit crew of a Formula 1 team, trying to get as much done as possible in the shortest amount of time. Like a pit crew, incident management teams do their best work when each member knows exactly what they need to look after.

We hope this list is as useful to you as it has been to us. This is not an exhaustive list of best practices for managing incidents while working remotely, so we would love to hear from you. What other practices or ways of working have helped you tackle incidents remotely? Leave us a comment or reach out over a DM via Twitter and let us know your thoughts.

Written By: Nir Sharma

Last Updated: October 4, 2024
