How to avoid on-call burnout

December 20, 2019

Why is on-call so stressful?

It sucks to be on-call when processes are not well defined and streamlined, especially around the holidays.

You really don't want to hear your phone going off repeatedly right when you're sitting down for Christmas dinner with your loved ones or getting to unwrap the good presents (the ones with the sparkly wrapping paper :P).

Your on-call team’s stress levels reflect the health of your system, the cleanliness of your code and the culture of your organization. So it's incredibly important to do everything in your power to make things easier for your on-call team, because that brings a host of goodness to your overall engineering team.

Don't leave your on-call team feeling like this sorry little Charmander.

What can you do to make on-call easier?

The first course of action is to define a framework with a clear set of rules to follow. Around the holiday season in particular, a pre-holiday checklist helps.

Create sensible Schedules and Rotations with more people to share the load:

In most cases, the stress of on-call falls on just a few engineers. On-call burnout is a serious issue in the SRE and DevOps world, and more so around the holidays, given how few people are willing to be on-call at this time (in the case of startups, just one or two).

To start with, expand your on-call team so that the stress doesn’t fall on just a few. It’s important for everyone to get their vacation time off, and distributing the load across a larger team will go a long way.

Have a foolproof system in place to override those Schedules / Rotations when needed:

You can automatically override schedules when an alert or incident is clearly meant for a specific person or team, or when it is obvious what action will resolve it immediately. This can be done using custom automated Incident Tags that route notifications directly to the relevant folks or trigger pre-defined actions or scripts.

Overriding Schedules with Automated Incident Tags to Route to the right responders in Squadcast.
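
To make the idea concrete, here is a minimal sketch of tag-based routing in Python. It is not Squadcast's API: the Incident class, the rule table, the tag names and the team names are all made up for illustration. It only shows the general shape of "if a tag matches, skip the schedule and notify this team instead".

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    title: str
    tags: dict[str, str] = field(default_factory=dict)


# Hypothetical routing rules: (tag key, tag value) -> responder or team.
# The first matching rule wins.
ROUTING_RULES = [
    (("component", "database"), "dba-team"),
    (("component", "payments"), "payments-team"),
]


def route(incident: Incident, scheduled_on_call: str) -> str:
    """Return who should be notified for this incident."""
    for (key, value), responder in ROUTING_RULES:
        if incident.tags.get(key) == value:
            return responder  # a tag match overrides the schedule
    return scheduled_on_call  # no rule matched: fall back to the schedule


incident = Incident("Replica lag above threshold", {"component": "database"})
print(route(incident, scheduled_on_call="alice"))
```

In this sketch the incident goes to the database team even though alice is the scheduled on-call, and an incident with no matching tag still falls back to whoever is on the schedule.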

Use “Vacation Mode” to hand-off on-call shifts for both planned & unplanned time off:

Schedules and rotations bring some order to on-call, but they still don't account for people taking time off. Being able to let someone take over your shift for an emergency or a planned vacation is a boon. It’s important that the on-call schedules accurately reflect this.

Some best practices here:

  1. Let your team know well in advance of a planned vacation so that the necessary changes can be made to the on-call schedules and rotations.
  2. If you are the primary on-call for any services or systems, set yourself on Vacation Mode and ask someone else to take your on-call shift before your vacation begins.
  3. Be ready to do the same favour for someone else when they need it, if you are available. Track your on-call hours as well as those of others on the team so that nobody is overburdened.
  4. If you have an emergency and need someone else to take your on-call shift at short notice, ask them whether they have the bandwidth to do that for you. Ideally, pick someone who hasn’t been on-call for a while.

Using Vacation Mode for On-call Schedules in Squadcast.
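
The "pick someone who hasn’t been on-call for a while" rule is easy to automate if you track when each person's last shift ended. The sketch below assumes you keep that in a simple dictionary; the names and dates are invented, and a real setup would read them from your scheduling tool.

```python
from datetime import date
from typing import Optional


def pick_cover(last_on_call: dict[str, date], unavailable: set[str]) -> Optional[str]:
    """Pick the available engineer whose last shift ended longest ago."""
    candidates = {
        name: ended for name, ended in last_on_call.items() if name not in unavailable
    }
    if not candidates:
        return None  # nobody can cover: escalate to a manager instead
    return min(candidates, key=candidates.get)


# Hypothetical data: when each engineer last finished an on-call shift.
last_on_call = {
    "alice": date(2019, 12, 18),
    "bob": date(2019, 11, 29),
    "carol": date(2019, 12, 6),
}

# alice is the one asking for cover, so she is excluded.
print(pick_cover(last_on_call, unavailable={"alice"}))
```

You would still ask the suggested person whether they have the bandwidth; the script only tells you who to ask first.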

Following a “No Deploys” practice for your Engineering teams during the weekends and holidays:

This builds on the essential No Deploy Fridays practice that is common knowledge in the on-call community. In today's world, it should be possible for your infrastructure to recognize a failed deploy and roll back automatically. While this may not be the case for all systems and teams, the least you can do is have practices in place that help teams quickly recognize when an error occurs.

It is general practice to be available for at least a full working day after a new deploy, simply to be aware of how the push is functioning and to be able to respond quickly if it isn’t.
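
One way to enforce a "No Deploys" window is a small gate that your deploy pipeline calls before it ships anything. The following is a sketch under assumptions: the freeze dates are placeholders for your own holiday calendar, and how you wire the check into your CI/CD tool is up to you.

```python
from datetime import date

# Hypothetical freeze dates; replace with your organization's holiday calendar.
HOLIDAY_FREEZE = {date(2019, 12, 24), date(2019, 12, 25), date(2019, 12, 31)}


def deploy_allowed(day: date, freeze_fridays: bool = True) -> bool:
    """Return False on weekends, listed holidays and (optionally) Fridays."""
    if day in HOLIDAY_FREEZE:
        return False
    weekday = day.weekday()  # Monday is 0, Sunday is 6
    if weekday >= 5:
        return False  # Saturday or Sunday
    if freeze_fridays and weekday == 4:
        return False  # No Deploy Fridays
    return True


if not deploy_allowed(date.today()):
    print("Deploys are frozen today. Try again on the next working day.")
```

Blocking Fridays as well as weekends leaves at least one working day after every deploy, which matches the availability rule above.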

“Always code as if the guy who ends up maintaining, or testing your code will be a violent psychopath who knows where you live.” ~ Dave Carhart

Making Incidents Context Rich:

Half the stress of on-call stems from having little to no information about why something went down. More hours are often spent looking for context on an incident than actually resolving it, leading to a higher Mean-Time-To-Resolve (MTTR).

How can you add context to incidents?

  • Make sure all relevant Tags are attached to your incidents, either automatically or manually. Example: Backend issue / Frontend issue; Severity: High / Low.
  • Make sure severities are clearly defined and updated for every incident. With this level of clarity, your on-call team will be able to understand whether something needs to be done immediately or whether they have some time to find a fix after their holidays.
  • On-call teams struggle with switching between various tools to find the information they need. One way to fix this is to carefully configure your alert source integrations within your incident management tool, so that useful contextual info is automatically added to every incident. For example, your knowledge base, runbooks or any useful information from your monitoring, logging, tracing or visualization tools can add significant context to an incident and help you decide faster how to react. This could be time series data, graphs, or post-mortems of similar incidents addressed in the past.

Tagging Incidents to make them more context-rich in Squadcast.
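
As a rough illustration of automatic tagging, the sketch below takes a raw alert and attaches a layer tag, a severity and a runbook link before anyone gets paged. Everything in it is an assumption made for the example: the alert fields, the service names, the 5% error-rate threshold and the runbook paths are not taken from any real tool.

```python
# Hypothetical lookups; in practice these would come from your own service catalog.
BACKEND_SERVICES = {"payments-api", "orders-db"}
RUNBOOK_LINKS = {"payments-api": "wiki/runbooks/payments-api"}
HIGH_SEVERITY_ERROR_RATE = 0.05  # assumed threshold, tune it to your own SLOs


def enrich(alert: dict) -> dict:
    """Return a copy of the alert with tags and a runbook link attached."""
    service = alert.get("service", "unknown")
    error_rate = alert.get("error_rate", 0.0)
    incident = dict(alert)  # leave the original alert untouched
    incident["tags"] = {
        "layer": "backend" if service in BACKEND_SERVICES else "frontend",
        "severity": "high" if error_rate >= HIGH_SEVERITY_ERROR_RATE else "low",
    }
    incident["runbook"] = RUNBOOK_LINKS.get(service)  # None if no runbook exists yet
    return incident


print(enrich({"service": "payments-api", "error_rate": 0.12}))
```

The point is that whoever is on-call opens the incident and already sees where the problem is, how urgent it is and where the fix is documented.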

Proactive Incident Management using SLOs and Error Budgets:

A proactive incident management approach entails understanding which incidents are likely to occur and having a plan in place. A reactive approach, on the other hand, means scrambling to find the right things to do when an incident occurs, because you are taken by surprise. One useful way to be proactive rather than reactive is to understand trends from your Service Level Objectives (SLOs) and error budget graphs. By correlating the consumption of your error budget with the incidents that have occurred, you should be able to predict potential customer-impacting downtimes.
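
The arithmetic behind an error budget is simple enough to sketch. Assuming an availability SLO measured over a rolling 30-day window (both the window and the sample incident durations below are made up), the budget is just the share of the window the SLO allows you to be down.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime the SLO tolerates over the window (assumes slo < 1)."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: list[float], window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return 1.0 - sum(downtime_minutes) / budget


budget = error_budget_minutes(0.999)
remaining = budget_remaining(0.999, [12.0, 9.6])  # two hypothetical incidents
print(f"budget: {budget:.1f} min, remaining: {remaining:.0%}")
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime, so the two sample incidents (21.6 minutes in total) have already used up half of it. Watching that number fall incident by incident is what lets you see trouble coming.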

Based on the types of incidents that have occurred, you can then formulate automatable scripts to resolve and mitigate them.

Having a Resolution & Remediation Plan in place:

There are many reasons why services fail. Some are known, some are unknown. It is easier to fight fires knowing there’s always a solution.

The first step of incident resolution is to minimize customer impact as soon as possible. The next step is to figure out a longer-term remediation for the incident. This comes from a practice of maintaining playbooks or a knowledge base for different types of incidents that can guide on-call folks.

Squadcast Actions: It’s always good to have a predefined remediation plan in place. Make sure you integrate all the tools that you would use to take action, like your CI/CD platform or infrastructure automation tools, so that you can take those actions immediately and directly from your incident management platform when an incident occurs. For example, you can roll back a feature to its previous version or rebuild a project in response to a firing alert. If you have these things established then, in most cases, you should be able to take care of an incident before your customer is impacted by it.

Runbooks: In cases where you already know the resolution steps for an incident, having an executable script can save you a lot of time. With runbooks, resolution is just a click away, rather than a manual and repetitive process.

Using Squadcast Actions to reduce your MTTR.

Using Squadcast Runbooks for faster recovery.
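
To show what "an executable script per known incident type" can look like, here is a small runbook registry in Python. It is a sketch, not how Squadcast Runbooks or Actions are implemented: the incident types, the host field and the handler bodies are placeholders, and a real handler would call your own tooling.

```python
from typing import Callable

# Maps an incident type to the function that knows how to resolve it.
RUNBOOKS: dict[str, Callable[[dict], str]] = {}


def runbook(incident_type: str):
    """Decorator that registers a function as the runbook for an incident type."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        RUNBOOKS[incident_type] = func
        return func
    return register


@runbook("disk_full")
def clear_temp_files(incident: dict) -> str:
    # Placeholder: a real runbook would run your cleanup job here.
    return f"cleared temp files on {incident['host']}"


@runbook("bad_deploy")
def roll_back(incident: dict) -> str:
    # Placeholder: a real runbook would trigger a rollback in your CI/CD tool.
    return f"rolled back {incident['service']} to the previous version"


def resolve(incident: dict) -> str:
    """Run the matching runbook, or say that a human needs to step in."""
    handler = RUNBOOKS.get(incident["type"])
    if handler is None:
        return "no runbook found: escalate to a human"
    return handler(incident)


print(resolve({"type": "disk_full", "host": "web-1"}))
```

Known incident types get resolved by the registered script, while anything without a runbook is escalated instead of being silently ignored.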

There are plenty of ways to reduce on-call burnout and make your on-call experience better, but understanding why these things are important and communicating this to the broader engineering team is crucial. It's important to know that the sanity of your on-call team reflects the health of your systems and the culture of your organization as a whole.

So it becomes a prime responsibility of the entire team to give on-call folks a good experience. Let’s take this to heart today and improve the way we do incident management!

Written By:
Prakya Vasudevan
Copyright © Squadcast Inc. 2017-2024

Last Updated:
November 20, 2024

Incident management is stressful and can lead to on-call burnout, even more so during the holidays. This is a checklist of things to watch out for to make sure your on-call team remains calm if an incident occurs.
