Chapter 10

Best Practices for On-Call Scheduling Software

March 27, 2024
10 min read

In the fast-paced, ever-changing information technology sector, 24-hour on-call service is necessary for any business that depends on uninterrupted application and service delivery. Technical issues can arise at any time, and having the right people on call ensures that problems are resolved promptly and effectively.

At a global scale where operations span the full spectrum of time zones, with multiple systems hosted at many sites, having the right tool to manage and organize your on-call schedule and teams is essential.

Coordinating the resources required to run an on-call service at this scale is a real challenge. On-call scheduling software can help organizations sort through the complexities, improve operations, and deliver a better user experience.

This article will discuss the benefits of on-call scheduling software and explore schedule design and best practices for managing on-call rotations at scale.

Summary of key on-call scheduling software concepts

The table below summarizes five important concepts related to on-call scheduling software and 24-hour on-call service.

| Concept | Description |
| --- | --- |
| Schedule design | A schedule should consider customers' locations and time zones. With that information in place, you can start forming a team of experts with the skills and knowledge of the systems that are most prone to incidents. Historical data from previous incidents will help identify these areas. |
| Staff availability and schedule rotation | Recurring on-call schedules can improve staff management by reducing confusion and uncertainty. They also help anticipate upcoming absences and emergencies, ensuring the right people are available to handle issues. |
| Incident detection | Effective and accurate alerts are critical to detecting incidents. Without them, SREs are left to monitor systems on their own and could easily miss a serious problem. Always ensure you have alerts configured and that your monitoring and alerting system is working as expected. |
| Special occasions | On-call teams receive more calls during peak periods such as holidays, end of month, end of year, Black Friday, seasonal sales, and other events. This is due to increased traffic and transaction volume, which can lead to incidents. |
| Escalations | As a standby manager, you must know when, how, and to whom to escalate when certain conditions are met for an incident. You should be able to escalate and delegate incident handling easily and logically. |

On-call scheduling in detail

The sections below explore each key aspect of on-call scheduling in detail.

Schedule design

When designing on-call schedules, you need to consider the time zones and skill sets of the team members on call for the given time.

Moreover, fairly balancing the workload among the teams and distributing the on-call duties is essential to avoid burnout and provide an environment promoting work/life balance for the teams.

There are two main approaches for schedule design: “follow-the-sun” and “rotation”.

The best approach for your team depends on your team size and on whether your operations span multiple time zones or reside in one central location.


Follow-the-sun schedule

With "follow-the-sun," on-call shifts are structured around a 24-hour day divided into four six-hour shifts. Each shift is assigned to a team in a different location and time zone, so that every team is on call during its own daytime hours.

This way, the teams can provide 24/7 coverage with a rotating schedule, ensuring someone is always available to respond to any issues. This approach allows for a more distributed workload and helps to prevent burnout among team members.

A handover procedure occurs at the end of each shift, and each shift's events are documented in a central repository for follow-up.

For instance, if a company's headquarters are in Chicago, the US site reliability engineering (SRE) team can be on call from 10 am to 4 pm Central Daylight Time (CDT, UTC-5). Then, the Sydney team takes over from 8 am to 2 pm Sydney time (4 pm to 10 pm CDT), followed by the Singapore team from 11 am to 5 pm Singapore time (10 pm to 4 am CDT). Finally, the London team takes the last shift, from 9 am to 3 pm London time (4 am to 10 am CDT), before handing it over to the US team at 10 am CDT. These conversions use the UTC offsets in effect in late March (Chicago on CDT, Sydney on AEDT, London on GMT); the local hours move by an hour whenever one of the sites changes its clocks.

See the following table for a better understanding of the concept:

| City | Time Zone | Start Time (Local) | End Time (Local) | Start Time (CDT) | End Time (CDT) |
| --- | --- | --- | --- | --- | --- |
| Chicago | CDT | 10:00 am | 4:00 pm | 10:00 am | 4:00 pm |
| Sydney | AEDT | 8:00 am | 2:00 pm | 4:00 pm | 10:00 pm |
| Singapore | SGT | 11:00 am | 5:00 pm | 10:00 pm | 4:00 am |
| London | GMT | 9:00 am | 3:00 pm | 4:00 am | 10:00 am |
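
A shift table like the one above can be checked mechanically for gaps. The following is a minimal sketch, not a feature of any particular tool. It assumes fixed UTC offsets that match the local hours in the table (Chicago UTC-5, Sydney UTC+11, Singapore UTC+8, London UTC+0); the names `UTC_OFFSET`, `shift_in_utc`, `site_on_call`, and `uncovered_hours` are illustrative.

```python
# Assumed fixed UTC offsets in hours. A production schedule should use a
# time zone database instead, so clock changes are handled automatically.
UTC_OFFSET = {"Chicago": -5, "Sydney": 11, "Singapore": 8, "London": 0}

# Local start hour of each site's shift; every shift lasts six hours.
LOCAL_START_HOUR = {"Chicago": 10, "Sydney": 8, "Singapore": 11, "London": 9}
SHIFT_HOURS = 6


def shift_in_utc(site: str) -> tuple[int, int]:
    """Return the (start, end) hour of a site's shift expressed in UTC."""
    start = (LOCAL_START_HOUR[site] - UTC_OFFSET[site]) % 24
    return start, (start + SHIFT_HOURS) % 24


def site_on_call(utc_hour: int):
    """Return the site whose shift contains the given UTC hour, or None."""
    for site in UTC_OFFSET:
        start, _ = shift_in_utc(site)
        if (utc_hour - start) % 24 < SHIFT_HOURS:
            return site
    return None


def uncovered_hours() -> list[int]:
    """List the UTC hours of the day that no site covers."""
    return [hour for hour in range(24) if site_on_call(hour) is None]
```

An empty list from `uncovered_hours()` means the four shifts cover the full day. Rerunning the check after each clock change is one way to catch gaps or overlaps before they reach the live schedule.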

Keep the trade-offs of follow-the-sun in mind: each handover between teams can increase incident response time, and reaching the previous team after a handover can be difficult if extra information or clarification is needed.

Rotation schedule

A more traditional approach for non-distributed teams is to work with a sufficiently large team divided into groups. Each group should rotate night shifts among all of its members. An on-call schedule that cycles monthly or quarterly should also be created, depending on the team's preference.

Longer rotation periods are generally preferable to shorter ones; specifically, quarterly rotations are more favorable than monthly ones. Changing shifts less often reduces fatigue and allows for a more stable sleep cycle, which improves work/life balance and quality of life for team members.

Prioritizing the well-being of team members is crucial to ensure optimal performance and productivity.

The following table demonstrates what a non-distributed rotation schedule might look like:

| Group | Q1 | Q2 | Q3 | Q4 |
| --- | --- | --- | --- | --- |
| Group A | 10am - 4pm | 4pm - 10pm | 10pm - 4am | 4am - 10am |
| Group B | 4pm - 10pm | 10pm - 4am | 4am - 10am | 10am - 4pm |
| Group C | 10pm - 4am | 4am - 10am | 10am - 4pm | 4pm - 10pm |
| Group D | 4am - 10am | 10am - 4pm | 4pm - 10pm | 10pm - 4am |
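
The pattern in the table above is a cyclic shift: each quarter, every group moves to the next time slot. A minimal sketch of how such a table could be generated follows; the function names `build_rotation` and `every_slot_covered` are illustrative and not part of any scheduling product.

```python
SHIFTS = ["10am - 4pm", "4pm - 10pm", "10pm - 4am", "4am - 10am"]
GROUPS = ["Group A", "Group B", "Group C", "Group D"]


def build_rotation(groups, shifts):
    """Map each group to its shift for every period by cycling through the shifts.

    The group at index i starts on shift i and moves one slot later each period.
    """
    n = len(shifts)
    return {
        group: [shifts[(i + period) % n] for period in range(n)]
        for i, group in enumerate(groups)
    }


def every_slot_covered(rotation, shifts):
    """Check that in each period every shift is held by exactly one group."""
    periods = len(shifts)
    return all(
        sorted(row[period] for row in rotation.values()) == sorted(shifts)
        for period in range(periods)
    )
```

Calling `build_rotation(GROUPS, SHIFTS)` reproduces the four rows shown above, and `every_slot_covered` confirms that no time slot is left without a group in any quarter.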

Staff availability and rotation of on-call schedules

Common mistakes and pitfalls when creating on-call schedules and managing standby shifts include:

  • Neglecting time zones.
  • Disregarding team members' preferences, which impairs work-life balance.
  • Inadequate communication and documentation of the procedures that the team must follow.

When planning on-call coverage, consider the capacity and availability of team members, as well as special occasions that may affect their availability. Communication channels must also be in place for unexpected events.

Efficient communication allows for timely adjustments to the schedule when someone becomes unavailable.

Account for flexibility and individual preferences. Managing remote teams requires anticipating holidays, cultural differences, and absences that can reduce the capacity for specific shifts.

Incident detection

You must always ensure that the right alerts are in place and that your monitoring systems are tested and fully functioning.

Furthermore, you must ensure that the right people are alerted for rapid response without delays. The on-call scheduling software you use must be able to interact with your monitoring systems and utilize data from other sources, such as CI/CD pipeline results and anything else that is supported by your tech stack.

An SRE should implement incident classification, because not all incidents are equal. A scale of severity levels and the ability to assign tags to an incident are integral to directing incidents to specific responders. With Squadcast, you can use tags to route incident alerts and assign the right issues to the right teams.

Squadcast On-Call Software
Squadcast’s tagging and escalation features
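
To illustrate the idea of tag-based routing in general terms, here is a minimal sketch. It does not use or represent the Squadcast API; the `Incident` class, the routing rules, and the team names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    title: str
    severity: int  # assumed scale: 1 is the most severe
    tags: dict[str, str] = field(default_factory=dict)


# Hypothetical rules: the first rule whose tags all match the incident wins.
ROUTING_RULES = [
    ({"service": "database"}, "database-oncall"),
    ({"service": "web", "env": "production"}, "web-oncall"),
]
DEFAULT_TEAM = "sre-oncall"


def route(incident: Incident) -> str:
    """Return the team that should be alerted for this incident."""
    for required_tags, team in ROUTING_RULES:
        if all(incident.tags.get(key) == value for key, value in required_tags.items()):
            return team
    return DEFAULT_TEAM
```

Falling back to a default team keeps untagged or unexpected incidents from going unnoticed, at the cost of occasionally paging a team that then has to reassign the alert.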

Escalation

Depending on the type and severity of an incident, different on-call teams can participate in tandem, or more experienced subject matter experts (SMEs) can take over when certain conditions are met.

Consider a scenario where a bank’s website experiences problems and a portion of customers cannot see the balances of their accounts on their dashboard after logging in.

Monitoring software informs the team of HTTP 500 errors and sends alerts to the on-call team of system administrators to investigate.

The administrators quickly discover that the connection between the backend and the database cannot be established due to authentication errors.

At that point, the incident is escalated to an SME, a senior database engineer, who is alerted to the escalation through Squadcast and joins the incident's task force.

After investigating the logs produced, the engineer discovers that the issue lies with an expired certificate between one of the load balancers and the database servers and proceeds to renew and replace the certificate.

All customers can now see their account balances, and the incident has been resolved.

On-Call Schedule escalation format
Sequence diagram of event escalation

An escalation happens when specific conditions are met. In the case above, on-call system administrators could not determine the root cause of the error and needed the SME to help resolve the incident.

Policies and procedures must be in place to ensure smooth incident escalations. It is vital to have a variety of communication methods, ensuring team members are notified reliably and on time.
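
One common way to express such a policy is as an ordered list of levels, each notified once an incident has gone unacknowledged for a set time. The sketch below assumes time-based conditions, which is only one possible trigger (in the bank example, the administrators escalated manually). The responder names and the 15- and 30-minute thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    responder: str
    after_minutes: int  # minutes without acknowledgement before notifying


# Hypothetical policy, ordered from first responder to last resort.
POLICY = [
    EscalationLevel("on-call system administrators", 0),
    EscalationLevel("senior database engineer (SME)", 15),
    EscalationLevel("engineering manager", 30),
]


def notified_responders(policy, minutes_unacknowledged):
    """Return every responder who should have been notified by now."""
    return [
        level.responder
        for level in policy
        if level.after_minutes <= minutes_unacknowledged
    ]
```

Writing the policy down as data makes it easy to review and to test before an incident, for example by checking who would be paged after 20 minutes of silence.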

Choosing the right solution for you

The sections above show why the right software matters for on-call scheduling. Many open-source tools can look like a good choice, so it is worth examining the limitations of open-source on-call software that can turn into pain points if they are not considered.

Open-source software often comes without guaranteed support. Especially for specialized software such as on-call scheduling tools, the community of developers behind a project may be small and not very active. Projects are often abandoned, or remain limited in functionality and ridden with bugs for years before another version is released.

Squadcast avoids these issues. It integrates with a wide variety of applications, so you remain in control of your on-call scheduling no matter your tech stack. With enterprise-grade features, an active and growing community, and professional support, the platform is continuously improved and kept up to date.

10 best practices for on-call scheduling software and schedule design

Consider the following practices to ensure the software's successful implementation and operational adoption.

Communicate clearly to build shared understanding

Team members should clearly understand the on-call software and its functions. A centralized platform for communication and documentation can streamline the process and improve team productivity.

Create fair rotations

Ensure on-call duties are distributed evenly among team members. A fair and predictable rotation balances the workload and helps avoid burnout.
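
A simple way to keep a rotation even is round-robin assignment, followed by a check of how many shifts each person received. This is a minimal sketch with hypothetical member names; real schedules would also need to account for absences and preferences.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(members, shifts):
    """Pair each shift with the next team member in a repeating order."""
    return list(zip(shifts, cycle(members)))


def shift_counts(assignments):
    """Count how many shifts each member was assigned."""
    return Counter(member for _, member in assignments)


def is_balanced(assignments):
    """A rotation is treated as balanced if no one has more than one extra shift."""
    counts = shift_counts(assignments).values()
    return max(counts) - min(counts) <= 1
```

With three members and seven daily shifts, one person necessarily takes a third shift; starting the next cycle with a different member evens this out over time.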

Plan sufficient coverage

Make sure every shift has enough responders with the right skills. Account for time zones, holidays, planned absences, and peak periods such as seasonal sales or end of year, when incidents are more likely.

Be flexible

Prioritize being flexible regarding individual team members' preferences and strive to maintain a work-life balance while ensuring adequate coverage. A positive work environment increases team satisfaction and reinforces the operability of the on-call schedule.

Have escalation plans

Create a clear incident escalation path and ensure on-call engineers know the procedures. Test regularly to identify gaps and facilitate confidence in the escalation process.

Set response time targets

Set acceptable, pre-agreed response time targets for different incidents and hold team members accountable for meeting them. These targets must be realistic and aligned with the customer's and management's expectations.
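
Targets like these are easiest to hold people to when they are written down per severity and checked automatically. The sketch below uses hypothetical severity names and target values; the actual numbers should come from what customers and management have agreed to.

```python
from datetime import datetime, timedelta

# Hypothetical acknowledgement targets per severity level.
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=5),
    "high": timedelta(minutes=15),
    "low": timedelta(hours=4),
}


def met_target(severity: str, triggered_at: datetime, acknowledged_at: datetime) -> bool:
    """Return True if the incident was acknowledged within its target."""
    return acknowledged_at - triggered_at <= RESPONSE_TARGETS[severity]
```

Tracking the share of incidents that meet their target over time shows whether the targets are realistic or need to be renegotiated.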

Maintain documentation

Maintain up-to-date documentation for systems, services, and incident resolution runbooks to reduce mean time to resolution (MTTR).
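
MTTR is the average time between an incident being opened and being resolved. As a minimal sketch, assuming each incident is recorded as a pair of timestamps (the function name and record format are illustrative), it can be computed like this:

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average the resolution time over (opened_at, resolved_at) pairs."""
    durations = [resolved_at - opened_at for opened_at, resolved_at in incidents]
    if not durations:
        raise ValueError("at least one resolved incident is required")
    return sum(durations, timedelta()) / len(durations)
```

Comparing this figure before and after a documentation or runbook update gives a rough indication of whether the update helped, although a small number of incidents can skew the average.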

Invest in training and skill development

Train yourself and your team on the methods and tools used for incident resolution and keep up to date with the latest developments.

Conduct post-mortems

Conduct post-incident reviews, or post-mortems as they are called, to learn from each incident and improve your response. Share the knowledge gained with the team and implement safeguards to prevent recurring incidents.

Implement tools and monitoring

Use reliable monitoring tools and alerting systems with proper configurations, together with specialized on-call management software or services.


Conclusion

SRE professionals know how critical it is to have 24/7 on-call staff ready to address technical issues promptly.

This is where on-call scheduling software comes in handy. Using the right tools, you can efficiently manage and organize your on-call staff, ensuring the right people are available to handle any issues.

With the ability to design schedules based on customers' locations and time zones, rotate staff availability, and set up alerts and escalation policies, you can reduce staff confusion and uncertainty, leading to faster and more effective resolution of incidents.

Using the right on-call scheduling software can help you anticipate upcoming staff absences and emergencies and prepare for special occasions that increase the volume of system transactions and raise the likelihood of incidents.

By following best practices and using on-call scheduling software, you can ensure uninterrupted service delivery to customers, integrate with the existing processes and applications in your tech stack, and avoid common mistakes and pitfalls when managing standby shifts and handling incidents.

Copyright © Squadcast Inc. 2017-2024