In the fast-paced information technology sector, 24-hour on-call service is necessary for any business that depends on uninterrupted application and service delivery. Technical issues can arise at any time, and having the right people on call ensures that problems are resolved promptly and effectively.
At a global scale where operations span the full spectrum of time zones, with multiple systems hosted at many sites, having the right tool to manage and organize your on-call schedule and teams is essential.
Coordinating the resources required to run on-call service effectively at this scale is a real challenge. On-call scheduling software can help organizations sort through the complexities, improve operations, and deliver a better user experience.
This article will discuss the benefits of on-call scheduling software and explore schedule design and best practices for managing on-call rotations at scale.
Summary of key on-call scheduling software concepts
The table below summarizes five important concepts related to on-call scheduling software and 24-hour on-call service.
On-call scheduling in detail
The sections below explore each key aspect of on-call scheduling in detail.
Schedule design
When designing on-call schedules, you need to consider the time zones and skill sets of the team members on call at any given time.
Moreover, fairly balancing the workload among the teams and distributing the on-call duties is essential to avoid burnout and provide an environment promoting work/life balance for the teams.
There are two main approaches for schedule design: “follow-the-sun” and “rotation”.
The best approach for your team depends on its size and on whether your operations span multiple time zones or reside in one central location.
Follow-the-sun schedule
With "follow-the-sun," the 24-hour day is divided into four six-hour on-call shifts. Each shift is assigned to a team in a different location and time zone, so every team covers part of its own working day.
This way, the teams can provide 24/7 coverage with a rotating schedule, ensuring someone is always available to respond to any issues. This approach allows for a more distributed workload and helps to prevent burnout among team members.
A handover procedure occurs at the end of each shift, and each shift's events are documented in a central repository for follow-up.
For instance, if a company's headquarters are in Chicago, the US site reliability engineering (SRE) team can be on call from 10 am to 4 pm Central Time (CT). Then, the Sydney team takes over from 8 am to 2 pm Sydney time (4 pm to 10 pm CT), followed by the Singapore team from 11 am to 5 pm Singapore time (10 pm to 4 am CT). Finally, the London team takes the last shift, from 9 am to 3 pm London time (4 am to 10 am CT), before handing it over to the US team at 10 am CT. The local times shown are approximate because the offsets between sites shift when Chicago, Sydney, or London enters or leaves daylight saving time.
See the following table for a better understanding of the concept:
Keep the trade-offs of follow-the-sun in mind: handing over between teams can increase incident response time, and reaching the previous team after the handover can be challenging if extra information or clarification is needed.
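As a minimal sketch of the example above, the lookup below maps an hour of the day in headquarters (Chicago) time to the team covering it and checks that the four shifts leave no gap. The names `SHIFTS`, `team_on_call`, and `coverage_gaps` are illustrative assumptions, not part of any scheduling product.

```python
# Shift start hours in headquarters (Chicago) local time, from the example above.
SHIFTS = [
    (10, "Chicago"),
    (16, "Sydney"),
    (22, "Singapore"),
    (4, "London"),
]
SHIFT_LENGTH_HOURS = 6


def team_on_call(hq_hour: int) -> str:
    """Return the team covering the given hour (0-23) in headquarters time."""
    if not 0 <= hq_hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    for start, team in SHIFTS:
        # Hours elapsed since this shift started, wrapping past midnight.
        if (hq_hour - start) % 24 < SHIFT_LENGTH_HOURS:
            return team
    raise RuntimeError("schedule has a coverage gap")


def coverage_gaps() -> list[int]:
    """Return the hours of the day that no shift covers."""
    return [
        hour
        for hour in range(24)
        if not any((hour - start) % 24 < SHIFT_LENGTH_HOURS for start, _ in SHIFTS)
    ]
```

Running a gap check like `coverage_gaps()` whenever shift boundaries change is a cheap way to confirm that the schedule still provides 24-hour coverage.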
Rotation schedule
A more traditional approach for non-distributed teams is to work with a sufficiently large team divided into groups. Each group should rotate night shifts among all of its members. An on-call schedule that cycles monthly or quarterly should also be created, depending on the team's preference.
Longer rotation periods are preferable to shorter ones; specifically, quarterly rotations are more favorable than monthly ones. This approach reduces fatigue and allows for a more stable sleep cycle, improving work/life balance and quality of life for team members.
Prioritizing the well-being of team members is crucial to ensure optimal performance and productivity.
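The rotation described above can be sketched in a few lines. This is an illustration under assumptions: the function names are hypothetical, one member covers each night, and one group is on call per period (a month or a quarter).

```python
def night_shift_roster(group: list[str], nights: int) -> list[str]:
    """Assign one member per night, cycling through the group in order."""
    if not group:
        raise ValueError("group must not be empty")
    return [group[night % len(group)] for night in range(nights)]


def group_for_period(groups: list[str], period_index: int) -> str:
    """Return the group on call for a given period (month or quarter), in rotation."""
    if not groups:
        raise ValueError("groups must not be empty")
    return groups[period_index % len(groups)]
```

For example, `night_shift_roster(["Ana", "Ben", "Chloe"], 5)` cycles back to the first member on the fourth night, so each member takes a night shift before anyone takes a second one.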
The following table demonstrates what a non-distributed rotation schedule might look like:
Staff availability and rotation of on-call schedules
Common mistakes and pitfalls when creating on-call schedules and managing standby shifts include:
- Neglecting time zones.
- Disregarding team members' preferences, which impairs work-life balance.
- Inadequate communication and documentation of the procedures that the team must follow.
When planning an on-call schedule, it's essential to consider the capacity and availability of team members, as well as special occasions that may affect their availability. Communication channels should also be in place for unexpected events.
Efficient communication allows for timely adjustments to the schedule. Keeping these factors in mind will help ensure a successful outcome.
Account for flexibility and individual preferences. Managing remote teams requires anticipating holidays, cultural differences, and absences that can reduce the capacity for specific shifts.
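One way to account for absences is to skip unavailable members when filling a shift. The sketch below assumes a hypothetical `unavailable` mapping of member names to the dates they cannot cover; it is an illustration, not a feature of any particular tool.

```python
from datetime import date
from typing import Optional


def pick_on_call(
    rotation: list[str],
    unavailable: dict[str, set[date]],
    day: date,
) -> Optional[str]:
    """Return the first member in rotation order who is available on the given day."""
    for member in rotation:
        if day not in unavailable.get(member, set()):
            return member
    # Nobody is available: the shift needs manual attention.
    return None
```

Returning `None` rather than silently picking someone makes an uncovered shift visible, so it can be resolved before the day arrives.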
Incident detection
You must always ensure that the right alerts are in place and that your monitoring systems are tested and fully functioning.
Furthermore, you must ensure that the right people are alerted for rapid response without delays. The on-call scheduling software you use must be able to interact with your monitoring systems and utilize data from other sources, such as CI/CD pipeline results and anything else that is supported by your tech stack.
An SRE should implement incident classification, since not all incidents are equal. A scale of severity levels and the ability to assign tags to an incident are integral to routing incidents to specific responders. With Squadcast, you can use tags to route incident alerts and assign the right issues to the right teams.
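Severity levels and tag-based routing can be sketched generically as follows. This is not Squadcast's API: the severity scale (1 is most severe), the tag names, and the team names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    title: str
    severity: int  # assumed scale: 1 = most severe, 3 = least severe
    tags: set[str] = field(default_factory=set)


# Hypothetical mapping of tags to responder teams.
ROUTES = {"database": "database-team", "network": "network-team"}
DEFAULT_TEAM = "sysadmin-on-call"


def route(incident: Incident) -> list[str]:
    """Return the teams to alert, based on the incident's tags and severity."""
    teams = {ROUTES[tag] for tag in incident.tags if tag in ROUTES}
    # The default team is paged for the most severe incidents and for
    # incidents whose tags match no route.
    if incident.severity == 1 or not teams:
        teams.add(DEFAULT_TEAM)
    return sorted(teams)
```

Falling back to a default team means an incident with missing or unknown tags still reaches a responder.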
Escalation
Depending on the type and severity of an incident, different on-call teams can participate in tandem, or more experienced subject matter experts (SMEs) can take over when certain conditions are met.
Consider a scenario where a bank’s website experiences problems and a portion of customers cannot see the balances of their accounts on their dashboard after logging in.
Monitoring software informs the team of HTTP 500 errors and sends alerts to the on-call team of system administrators to investigate.
The administrators quickly discover that the connection between the backend and the database cannot be established due to authentication errors.
At that point, the incident is escalated to an SME, a senior database engineer, who is alerted to the escalation through Squadcast and joins the incident’s task force.
After investigating the logs produced, the engineer discovers that the issue lies with an expired certificate between one of the load balancers and the database servers and proceeds to renew and replace the certificate.
All customers can now see their account balances, and the incident has been resolved.
An escalation happens when specific conditions are met. In the case above, on-call system administrators could not determine the root cause of the error and needed the SME to help resolve the incident.
Policies and procedures must be in place to ensure smooth incident escalations. It is vital to have a variety of communication methods, ensuring team members are notified reliably and on time.
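An escalation policy is often expressed as an ordered list of responders with conditions attached. The sketch below assumes a single condition, minutes elapsed without acknowledgement. The delays and the third tier are hypothetical, while the first two roles come from the bank scenario above.

```python
# Each tier: (minutes without acknowledgement before notifying, who to notify).
ESCALATION_POLICY = [
    (0, "on-call system administrator"),
    (15, "senior database engineer"),
    (30, "engineering manager"),  # hypothetical final tier
]


def notified_so_far(minutes_unacknowledged: int) -> list[str]:
    """Return everyone who should have been notified after the given delay."""
    return [
        responder
        for delay, responder in ESCALATION_POLICY
        if minutes_unacknowledged >= delay
    ]
```

In practice, an escalation can also be triggered manually, as in the scenario above where the administrators could not determine the root cause themselves.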
Choosing the right solution for you
Now that we understand the importance of using the right software for on-call scheduling, it is worth noting that many open-source tools can seem like a good choice. Let's investigate the limitations of open-source on-call software that can turn into pain points if not considered.
With open-source software, support is rarely guaranteed. Especially for specialized software such as on-call scheduling tools, the community of developers behind a project may lack members and activity. Projects are often abandoned, or remain limited in functionality and ridden with bugs for years before another version is released.
Squadcast bypasses these issues by integrating with a wide variety of applications; you can rest assured that you will be in total control of your on-call scheduling, no matter your tech stack. With enterprise-grade features, an active and growing community, and professional support, the platform is continuously improving and staying up to date.
10 best practices for on-call scheduling software and schedule design
Consider the following practices to ensure the software's successful implementation and operational adoption.
Communicate clearly to build shared understanding
Team members should clearly understand the on-call software and its functions. A centralized platform for communication and documentation can streamline the process and improve team productivity.
Create fair rotations
Ensure on-call duties are distributed evenly among team members. A fair and predictable rotation system balances the workload and helps avoid burnout.
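A simple way to check fairness is to count shifts per person over a period. The sketch below is one possible measure, with a hypothetical function name: the gap between the busiest and the least busy team member.

```python
from collections import Counter


def shift_imbalance(assignments: list[str], members: list[str]) -> int:
    """Return the gap in shift count between the most and least loaded members."""
    if not members:
        raise ValueError("members must not be empty")
    # Start every member at zero so people with no shifts are still counted.
    counts = Counter({member: 0 for member in members})
    counts.update(assignments)
    return max(counts.values()) - min(counts.values())
```

A result of 0 or 1 indicates an even distribution; larger values suggest that the rotation should be rebalanced.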
Plan sufficient coverage
Ensure that every shift is staffed with enough team members who have the right skill sets. Account for time zones, holidays, and planned absences so that capacity does not drop below what a shift requires.
Be flexible
Prioritize being flexible regarding individual team members' preferences and strive to maintain a work-life balance while ensuring adequate coverage. A positive work environment increases team satisfaction and reinforces the operability of the on-call schedule.
Have escalation plans
Create a clear incident escalation path and ensure on-call engineers know the procedures. Test regularly to identify gaps and build confidence in the escalation process.
Set response time targets
Set acceptable, pre-agreed response time targets for different incidents and hold team members accountable for meeting them. These targets must be realistic and aligned with the customer's and management's expectations.
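Response time targets are easiest to enforce when they are written down as data. In the sketch below, the severity scale and the target durations are placeholder assumptions; real values should come from the agreement with customers and management.

```python
from datetime import datetime, timedelta

# Hypothetical targets per severity level (1 = most severe).
RESPONSE_TARGETS = {
    1: timedelta(minutes=5),
    2: timedelta(minutes=15),
    3: timedelta(hours=1),
}


def met_target(severity: int, triggered_at: datetime, acknowledged_at: datetime) -> bool:
    """Return True if the incident was acknowledged within its severity's target."""
    return acknowledged_at - triggered_at <= RESPONSE_TARGETS[severity]
```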
Maintain documentation
Maintain up-to-date documentation for systems, services, and incident resolution runbooks to reduce mean time to resolution (MTTR).
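MTTR can be tracked with a small calculation over incident records. The sketch below assumes each incident is represented as a pair of timestamps (opened, resolved); the function name is illustrative.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Return the average of (resolved - opened) across the given incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)
```

Comparing this figure before and after a documentation or runbook update gives a rough indication of whether the change helped.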
Invest in training and skill development
Train yourself and your team on the methods and tools used for incident resolution and keep up to date with the latest developments.
Conduct post-mortems
Conduct post-incident reviews (also called post-mortems) to learn from each incident and improve your response. Share the knowledge gained with the team and implement safeguards to prevent recurring incidents.
Implement tools and monitoring
Utilize practical monitoring tools and alerting systems with proper configurations. Use specialized on-call management software or services.
Conclusion
SRE professionals know how critical it is to have 24/7 on-call staff ready to address technical issues promptly.
This is where on-call scheduling software comes in handy. Using the right tools, you can efficiently manage and organize your on-call staff, ensuring the right people are available to handle any issues.
With the ability to design schedules based on customers' locations and time zones, rotate staff availability, and set up alerts and escalation policies, you can reduce staff confusion and uncertainty, leading to faster and more effective resolution of incidents.
Using the right on-call scheduling software can help you anticipate upcoming staff absences and emergencies and prepare for special occasions that may increase the volume of system transactions and the likelihood of incidents.
By following best practices and using on-call scheduling software, you can ensure uninterrupted customer service delivery, integrate with existing processes and applications in your tech stack, and avoid common mistakes and pitfalls while managing standby shifts and handling incidents.