Going from Zero to SRE

September 14, 2021

Traditionally, developing applications and running them in production were seen as completely separate worlds, usually the focus and concern of different teams. This separation gives rise to the proverbial wall between development and operations, where developers “throw” their code over the wall and expect operations teams to run and manage it in production. The result is teams with different and conflicting goals: development teams prioritize building and shipping new features, while operations teams focus on system stability and see code changes as potential threats.

As systems grew larger and larger, this arrangement did not scale, and a new way of doing things needed to emerge. In 2009, Patrick Debois coined the term “DevOps”, a set of principles and practices that emphasizes people and culture, with the goal of improving collaboration between development and operations teams.

A few years earlier, facing a similar situation, Google put Benjamin Treynor in charge of a team with the goal of ensuring that Google’s websites were available, reliable, and as serviceable as possible. Since Benjamin was a software engineer, he approached an operations problem from a software engineering perspective, giving birth to what is known today as Site Reliability Engineering (SRE): “what happens when you ask a software engineer to design an operations team”. SRE rose to fame around 2016, after Google published the book Site Reliability Engineering: How Google Runs Production Systems, which describes many of the practices Google uses to run its systems in production.

SRE is all about running reliable production systems. Big companies like Google, Facebook, or Amazon run large production systems and face challenges that most companies rarely encounter. Despite that, the way they run their systems can help us run ours. But even armed with all that information, how do we go about starting our SRE journey?

Firefighting

In the beginning, teams are reactive. There are many moving parts, the systems are highly complex, and failures appear left, right, and center. The world is chaotic, and it seems close to impossible to apply any sort of engineering practice to the situation. At this stage, teams don’t have many options: they need to keep responding to a myriad of crises while trying to reserve some time, energy, and focus to improve the situation. Putting the brakes on feature development can help stabilize a system and give teams some relief. Adding people who can focus on automation can help reduce toil.

This is a difficult stage for teams. It requires a dual perspective, since they need to keep systems running while building a new approach to running operations. It is also a reflection of operations teams being brought into the product cycle late. They may learn of a new service or a new set of requirements just before the service needs to go live or, even worse, when something is already live and failing miserably.

Gatekeeping

Usually, one of the first measures put in place to deal with firefighting is gatekeeping. The goal is to make every change to production pass through, and be approved by, the SRE team. On a small scale, this approach can work for a while. But as systems and teams grow, SRE teams become a choke point, limiting change and slowing down other teams.

At this stage, SRE teams work to become more engaged with development teams by implementing processes. While this reflects positive engagement, it can degenerate into an us-versus-them struggle, leading teams to circumvent the processes, and the SRE teams, altogether.

Partnering

When frameworks are in place (e.g. SLOs and error budgets) that can be used to measure reliability and decide how to act, SRE teams can remove themselves from the critical path to production. They can then partner with development teams and work together to meet the desired reliability criteria. At the same time, SRE teams stay continually involved in deciding how to measure the impact on user experience. At this stage, SREs are involved earlier in the cycle and work alongside development teams on the reliability of the product right from the start.
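To make the idea of SLOs and error budgets concrete, here is a minimal sketch in Python. It assumes a request-based availability SLO; the class name, field names, and numbers are hypothetical and chosen only for illustration, not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO window (hypothetical, illustrative model)."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int
    failed_requests: int

    def availability(self) -> float:
        """SLI: fraction of requests in the window that succeeded."""
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.total_requests

    def allowed_failures(self) -> float:
        """Error budget, expressed as the number of requests allowed to fail."""
        return (1 - self.slo_target) * self.total_requests

    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        allowed = self.allowed_failures()
        if allowed == 0:
            return 1.0 if self.failed_requests == 0 else 0.0
        return 1 - self.failed_requests / allowed


if __name__ == "__main__":
    window = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"availability:     {window.availability():.5f}")
    print(f"allowed failures: {window.allowed_failures():.0f}")
    print(f"budget remaining: {window.budget_remaining():.0%}")
```

With a 99.9% target over 1,000,000 requests, roughly 1,000 failures are allowed, so 250 failures leave about three quarters of the budget. A number like this gives SRE and development teams a shared basis for deciding whether to ship or to stabilize.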

Relationships between teams become a lot less antagonistic. SRE teams are regularly sought out for cooperation, since they enable much higher satisfaction and long-term value.

Enabling

When reliability is part of the process, SREs are involved in the full lifecycle of a service, from creation to decommissioning. SREs can act as consultants, raising awareness of business goals, reliability, and security. Processes are in place that allow teams to measure how reliable their services are and to decide how to act when necessary. This scales well, since teams can operate independently while still being able to rely on SREs when needed. Development teams can pull in SREs, and SREs don’t need to impose demands. A voluntary, data-supported engagement between teams becomes enjoyable and sustainable.
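One way a team might encode "how to act when necessary" is a small error budget policy that it can apply on its own, without an SRE approving each change. The sketch below is an assumption for illustration: the function name, the action labels, and the 25% threshold are hypothetical, and a real policy would be agreed between the teams involved.

```python
def release_decision(budget_remaining: float, caution_below: float = 0.25) -> str:
    """Map the unspent fraction of an error budget to a suggested action.

    budget_remaining: 1.0 means the budget is untouched, 0.0 means it is
    fully spent, and negative values mean the SLO has been missed.
    caution_below: hypothetical threshold under which reliability work
    takes priority over new features.
    """
    if budget_remaining <= 0:
        # Budget exhausted: pause feature releases until reliability recovers.
        return "freeze"
    if budget_remaining < caution_below:
        # Budget running low: keep shipping fixes, prioritize reliability work.
        return "prioritize-reliability"
    # Healthy budget: the team can release at its normal pace.
    return "ship"


if __name__ == "__main__":
    for remaining in (0.8, 0.1, -0.3):
        print(f"{remaining:+.0%} of budget left -> {release_decision(remaining)}")
```

Because the decision is driven by data every team can see, development teams can act independently and bring in SREs only when the policy, or their own judgment, calls for it.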

Enabling SRE teams to become highly functional is a long process. It takes time, and some steps are “mandatory”. Starting in firefighting mode is almost inevitable for most organizations. It is then that they realize a new approach to managing operations needs to emerge. At this stage, some time has to be set aside so that teams can work on processes and tools that ease the pain of constant firefighting. A lightweight gatekeeping stage can help bring teams together and identify the processes and automation that pave the way to the partnering stage. The enabling stage is the end goal, where teams work together, armed with tools that help measure reliability and processes that help prioritize work and decide how to proceed when reliability is at risk.

While some short-circuiting can accelerate the process, some stages build on others and use them as foundations. High-level management support can help accelerate the evolution by allowing teams to prioritize the work that will take them to the enabling stage. Focus on success stories to help build momentum. A lot of SRE is cultural: a different way of doing things. Successes help build a narrative and motivate people to follow the same path.

Creating an SRE team can seem like a daunting task, and there are several topologies that can be adopted. Each organization needs to find what best suits its needs. But there are some guiding principles to help navigate this journey:

  • Start small and iterate - in the beginning, keep it simple. Select a small group of people to implement SRE practices. Find what works and what doesn’t. Iterate and onboard new practices.
  • Find the right people - SRE teams should have a mix of domain knowledge. Whether through internal or external hiring, aim to build teams with a diverse set of skills.
  • Scan for the right qualities - depending on the scope and topology of the team, the skill set might differ. Some good qualities to scan for are the ability to see the big picture, the ability to troubleshoot, good communication skills, a desire to remove toil, a knack for digging deeper, and excellent teamwork.
  • Training - a lot of SRE work is cultural. The ability to communicate well and provide training to teams is paramount.
  • Guidelines and Governance - SRE teams help facilitate the adoption of a reliability mindset. They can help establish guidelines that make it easier for teams to onboard reliability practices.

Whether through a separate SRE team or any other topology, it’s important to start simple and iterate. Introduce SRE principles, assess them, find out what works, and then incorporate it. Navigating from Firefighting to Enabling is a journey, and it will be slightly different for each organization.

Written By:
Ricardo Castro

Last Updated:
November 20, 2024
Establishing a formal SRE practice can be either a 'nice-to-have' or a 'must-have', depending on organization size and team structure, among other important factors. In this blog post, Ricardo Castro shares his thoughts on the key SRE principles that every organization must incorporate, and when to incorporate them in its SRE journey.
