While the number of active internet users and people consuming digital products has been on the rise for a while, it is the combination of rising user expectations and competitive digital experiences that has pushed organizations to deliver highly Reliable products and services.
The bottom line is that customers have the right to expect reliable software, and the right to expect the product to work when they need it. And it is the responsibility of organizations to build Reliable products.
But having said that, no software can be 100% reliable; even achieving 99.9% reliability is a monumental task. As engineering infrastructure grows more complex by the day, incidents become inevitable. Triaging and remediating issues quickly, with minimal impact, is what makes all the difference.
Let’s look back at some notable outages from the past that have had a major impact on both businesses and end users alike.
October 2021 - A mega outage took down Facebook, WhatsApp, Messenger, Instagram, and Oculus VR for almost five hours, leaving users unable to access any of those products.
November 2021 - A Google Cloud outage cascaded downstream, causing failures across multiple GCP products and indirectly impacting many non-Google companies.
December 2022 - An incident affecting Amazon’s search functionality impacted at least 20% of all global users for almost an entire day.
January 2023 - Most recently, the Federal Aviation Administration (FAA) suffered an outage caused by failed scheduled maintenance, leading to 32,578 flight delays and a further 409 cancellations. Needless to say, the monetary impact was massive: share prices of numerous U.S. air carriers fell steeply in the immediate aftermath.
These are just a few of the major outages that have impacted users on a global scale. In reality, incidents such as these are far more frequent. While businesses and business owners bear the brunt of such outages, end users feel the impact too, resulting in a poor User/Customer Experience (UX/CX).
Here are some interesting stats on the consequences of poor CX/UX:
And that is why resolving incidents quickly is CRITICAL! But the (literally) million-dollar question is: how do you deal with incidents effectively? Let’s address this by first probing into the challenges of Incident Management.
Evolving business and user needs have directly impacted Incident Management practices.
“...you want teams to be able to reach for the right tool at the right time, not to be impeded by earlier decisions about what they think they might need in the future.” - Steve McGhee, Reliability Advocate, SRE, Google Cloud
Over the years, the scope of activities associated with Incident Management has only grown. Most of this evolution can be bucketed into one of four categories: Technology, People, Process, and Tools.
Now is the ideal time to address issues that are holding engineering teams back from doing Incident Management the right way.
Service ownership and visibility are the foremost factors preventing engineering teams from making the most of their time during incident triage. This is largely a result of the adoption of distributed applications, in particular microservices.
A sprawling number of services makes it hard to track service health and the respective owners. Tool sprawl (a large number of tools within the tech stack) makes it even more difficult to track dependencies and ownership.
Achieving a respectable level of automation is still a distant dream for most incident response teams. Automating incident management workflows across the infrastructure stack makes a great deal of difference in improving MTTA and MTTR.
The tasks that are still manual, yet have great potential for automation during incident response, are:
Poor collaboration during an incident is a major factor keeping response teams from doing what they do best. The process of informing members within the team, across teams, within the organization, and outside of the organization must be simplified and organized.
Activities that can improve with better collaboration are:
One of the most important responsibilities of the response team is to provide complete transparency into incident impact, triage, and resolution for internal and external stakeholders as well as business owners. The problems:
Now, the timely question to probe is: what should Engineering teams start doing? And how can organizations support them in their Reliability journey?
The facets of Incident Management today can be broadly classified into 3 categories:
Addressing the difficulties and devising appropriate processes and strategies around these categories can help engineering teams improve their Incident Management by 90%. Certainly sounds ambitious, so let's understand this in more detail.
On-Call is the foundation of a good Reliability practice. There are two main aspects to On-Call alerting, and they are highlighted below.
a. Centralizing Incident Alerting & Monitoring
The crucial aspect of On-Call Alerting is the ability to bring all alerts into a single, centralized command centre. This is important because a typical tech stack comprises multiple alerting tools monitoring different services (or parts of the infrastructure), put in place by different users. An ecosystem that brings such alerts together makes Incident Management far more organized.
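To make the idea concrete, here is a minimal sketch of normalizing alerts from different monitoring tools into one common shape before they reach a central command centre. The payload fields and the Prometheus-style branch are simplified assumptions for illustration, not the exact schema of any specific tool or of Squadcast itself.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Alert:
    source: str        # which monitoring tool raised it
    service: str       # affected service
    severity: str      # normalized severity: "critical" | "warning" | "info"
    message: str
    received_at: datetime


def normalize(source: str, payload: dict) -> Alert:
    """Map a tool-specific webhook payload onto the common Alert shape."""
    if source == "prometheus":
        # Simplified Alertmanager-style payload: labels + annotations.
        labels = payload.get("labels", {})
        return Alert(
            source=source,
            service=labels.get("service", "unknown"),
            severity=labels.get("severity", "warning"),
            message=payload.get("annotations", {}).get("summary", ""),
            received_at=datetime.now(timezone.utc),
        )
    # Generic fallback for any other tool in the stack.
    return Alert(
        source=source,
        service=payload.get("service", "unknown"),
        severity=payload.get("severity", "info"),
        message=str(payload.get("message", payload)),
        received_at=datetime.now(timezone.utc),
    )
```

Once every tool's alerts are flattened into one shape like this, routing, deduplication, and reporting can all operate on a single stream instead of per-tool logic.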
b. On-Call Scheduling & intelligent routing
While organized alerting is a great first step, effective Incident Response is all about having an On-Call Schedule in place and routing alerts to the relevant On-Call responder, and, in case of inaction or non-resolution, escalating them to the next most appropriate engineer (or user).
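As a sketch of what such routing and escalation might look like, the snippet below walks a hypothetical escalation policy until someone acknowledges the alert. The responder names, timeouts, and the notify/acknowledged helpers are placeholders, not a real scheduling API.

```python
import time

# Hypothetical escalation policy: who to page and how long to wait for an ack.
ESCALATION_POLICY = [
    {"responder": "primary-oncall@example.com", "ack_timeout_s": 300},
    {"responder": "secondary-oncall@example.com", "ack_timeout_s": 300},
    {"responder": "engineering-manager@example.com", "ack_timeout_s": 600},
]


def notify(responder: str, alert: dict) -> None:
    # Stand-in for paging via phone, push, or chat.
    print(f"Paging {responder} about: {alert.get('message')}")


def acknowledged(alert: dict) -> bool:
    # In a real system this would check the incident state in a datastore.
    return alert.get("acknowledged", False)


def route_with_escalation(alert: dict, poll_interval_s: int = 30) -> None:
    """Page each level of the policy until the alert is acknowledged."""
    for step in ESCALATION_POLICY:
        notify(step["responder"], alert)
        waited = 0
        while waited < step["ack_timeout_s"]:
            if acknowledged(alert):
                return
            time.sleep(poll_interval_s)
            waited += poll_interval_s
    print("No acknowledgement at any level; raising a high-severity incident.")
```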
While On-Call scheduling and alert routing are the fundamentals, it is Incident Response that gives structure to Incident Management.
a. Alert noise reduction and correlation
Oftentimes, teams get notified of unnecessary events. More commonly, during resolution, engineers keep getting notified about similar and related alerts that are better handled as one collective incident rather than individually. With the right practices in place, alert fatigue can be managed through automation rules that suppress and deduplicate alerts.
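The sketch below illustrates the idea under stated assumptions: alerts matching suppression rules are dropped, and alerts sharing a fingerprint within a time window are folded into one open incident. The field names, window length, and rules are illustrative, not any product's actual behaviour.

```python
from datetime import datetime, timedelta, timezone

DEDUP_WINDOW = timedelta(minutes=15)

# Example suppression rules: drop informational noise and non-production events.
SUPPRESSION_RULES = [
    lambda a: a.get("severity") == "info",
    lambda a: a.get("environment") == "staging",
]

_open_incidents: dict = {}


def fingerprint(alert: dict) -> str:
    """Group alerts by the service and check they relate to."""
    return f'{alert.get("service")}:{alert.get("check")}'


def ingest(alert: dict) -> str:
    now = datetime.now(timezone.utc)

    if any(rule(alert) for rule in SUPPRESSION_RULES):
        return "suppressed"

    key = fingerprint(alert)
    incident = _open_incidents.get(key)
    if incident and now - incident["last_seen"] < DEDUP_WINDOW:
        # Same service/check seen recently: count it against the open incident.
        incident["count"] += 1
        incident["last_seen"] = now
        return "deduplicated"

    _open_incidents[key] = {"count": 1, "last_seen": now}
    return "new_incident"
```

With rules like these in front of the paging layer, responders are notified once per underlying problem rather than once per symptom.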
b. Integration & Collaboration
Integrating the tools in the infrastructure stack directly into the response process is possibly the simplest and easiest way to organize Incident Response. Collaboration can improve by establishing integrations with:
Engineering Reliability into a product requires the entire organization to adopt the SRE mindset and buy into the ideology. While On-Call is at one end of the spectrum, we at Squadcast believe that SRE (Site Reliability Engineering) is at the other end.
But what exactly is SRE?
For starters, SRE should not be confused with what DevOps stands for. While DevOps focuses on Principles, SRE focuses on Activities. SRE is fundamentally about taking an engineering approach to systems operations in order to achieve better reliability and performance. It puts a premium on monitoring, tracking bugs, and building systems and automation that solve problems in the long term.
While Google was the birthplace of SRE, many top technology companies such as LinkedIn, Netflix, Amazon, Apple, and Facebook have adopted it and benefited greatly from doing so.
POV: Gartner predicts that, by 2027, 75% of enterprises will use SRE practices organization-wide, up from 10% in 2022.
What difference will SRE make?
Today, users expect nothing but the very best. A dedicated focus on SRE practices will help in:
How does SRE add value to the business?
SRE adds a ton of value to any business that is digital-first. Some of the key points are listed below:
The bottom line is, Reliability has evolved. You have to be proactive and preventive.
Teams will have to fix things faster and keep getting better at it.
And on that note, let’s look at the different SRE aspects that engineering teams can adopt for better Incident Management:
a. Automated response actions
Automating manual tasks and eliminating toil is one of the fundamental tenets on which SRE is built. Be it automating workflows with Runbooks or automating response actions, SRE is a big advocate of automation, and response teams benefit widely from having it in place.
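As a rough sketch of an automated response action, the snippet below maps known alert types to a scripted runbook step and only escalates to a human when automation cannot help or fails. The alert field names, the commands, and the service name are hypothetical examples.

```python
import subprocess

# Hypothetical mapping from an alert's "check" name to a scripted runbook step.
RUNBOOK_ACTIONS = {
    "disk_usage_high": ["sh", "-c", "find /var/log -name '*.gz' -mtime +7 -delete"],
    "service_unresponsive": ["systemctl", "restart", "my-service"],
}


def auto_remediate(alert: dict) -> bool:
    """Try the mapped runbook action; return True if it succeeded."""
    command = RUNBOOK_ACTIONS.get(alert.get("check", ""))
    if command is None:
        return False  # no automation available for this alert type
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0


def handle(alert: dict) -> None:
    if auto_remediate(alert):
        print(f"Auto-remediated {alert.get('check')}; logging the action for review.")
    else:
        print(f"Escalating {alert.get('check')} to the on-call engineer.")
```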
b. Transparency
SRE advocates for providing complete visibility into the health status of services and this can be achieved by the use of Status Pages. It also puts a premium on the need to have greater transparency and visibility of service ownership within the organization.
c. Blameless culture
When an incident occurs, SRE stresses examining the process rather than blaming the individuals involved. This blameless culture goes a long way in fostering a healthy team culture and promoting team harmony. RCAs conducted in this spirit are called Incident Retrospectives or Postmortems.
d. SLO and Error Budget tracking
This is all about using a metric-driven approach to balance Reliability and Innovation. It encourages the use of SLIs to keep track of service health. By actively tracking SLIs, SLOs and Error Budgets can be kept in check, ensuring that none of the customer SLAs are breached.
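To illustrate the arithmetic, here is a minimal sketch of deriving an availability SLI and error budget consumption for an SLO window. The target, request counts, and thresholds are made-up numbers for the example.

```python
SLO_TARGET = 0.999           # 99.9% of requests should succeed
TOTAL_REQUESTS = 10_000_000  # requests served in the SLO window
FAILED_REQUESTS = 7_200      # requests that violated the SLI

sli = 1 - FAILED_REQUESTS / TOTAL_REQUESTS        # measured availability
error_budget = (1 - SLO_TARGET) * TOTAL_REQUESTS  # allowed bad requests (10,000 here)
budget_consumed = FAILED_REQUESTS / error_budget  # fraction of budget used (0.72 here)

print(f"SLI: {sli:.4%}")                       # 99.9280%
print(f"Error budget: {error_budget:,.0f} bad requests allowed")
print(f"Budget consumed: {budget_consumed:.1%}")

if budget_consumed >= 1.0:
    print("Error budget exhausted: freeze risky releases and focus on reliability work.")
elif budget_consumed >= 0.75:
    print("Budget running low: slow down feature rollouts.")
```

The point of the exercise is the policy tied to the numbers: while the budget holds, teams ship features; once it burns down, the balance shifts toward reliability work.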
To summarize what you’ve just read, Squadcast is the only integrated platform that unites On-Call Alerting and Incident Management along with SRE workflows under one roof. Be it setting up On-Call Schedules, leveraging Event Intelligence for Alert Suppression, or automating Incident Response, we have it all covered.
If these Incident Management workflows align with your needs, feel free to go ahead and Sign up for a 2-week free trial. If you want to know more about Squadcast, then you can schedule a call with our Sales team for a quick demo.