Got a DevOps horror story? Tell us about your worst on-call nightmares this Halloween and get featured! Click Here
Blog
SRE
Managing technical risk effectively with Error Budgets

Managing technical risk effectively with Error Budgets

October 14, 2019
Managing technical risk effectively with Error Budgets
In This Article:
Our Products
On-Call Management
Incident Response
Continuous Learning
Workflow Automation

If this sounds familiar, count yourself as part of the Eternal Conundrum Club of Reliability Engineering. It’s fairly easy for us as humans to communicate the impact of a new feature release as opposed to the utility of cleaning up a part of the system that would cause us to worry less. This is because feature releases are more often than not, tied with business impact, ie. potential to generate revenue, but it’s a little harder to quantify and justify the business impact resulting from engineering reliability into your offering.

Alas, without a healthy balance between release velocity and reliability engineering, you’re going to end up with not only unhappy customers but also unhappy teams. Often, teams realize this problem too late - when business outcomes have already taken a hit.

Hello Error Budgets!

The key to resolving this dilemma is to add more context. This can support your decision making by helping you understand the tradeoffs between release velocity and reliability engineering. Error budgets are one such example of a quantifiable measure that can aid this decision making in real-time. Once we accept that risk cannot be avoided, but can be managed, it becomes easier to understand the utility and application of error budgets.

Error budgets tie back to the concept of SLIs, SLOs, and SLAs. Here’s a quick outline of what they mean -

Service Level Agreement (SLA)  is a promise between you and your customers on the acceptable availability of your system over a certain time period and it outlines business implications and compensation in the event of a breach. Essentially, not maintaining your SLA is going to hurt your bottom line.

Service Level Objectives (SLOs) are just another variation of SLAs which are owned internally by the product and engineering teams. SLOs are typically stricter than SLAs to ensure that an SLO breach precedes an SLA breach. Every customer-impacting service must have an SLO which will serve as its quantitative reliability measure. This is defined in alignment with business stakeholders as it is imperative for them to be on top of the SLAs they can promise to customers. They also need to quantify and explain to their engineering team about the cost attached to downtime and its adverse effects on the business reputation.

Service Level Indicators(SLIs) are metrics that can be monitored and can act as quantifiable indicators of the quality of the service you provide to your customers.They are a direct measurement of a service’s behavior and are typically documented in service level agreements (SLAs), as well as in service level objectives(SLOs).

This is where Error Budgets come in.

Typically  availability is calculated as:

Availability = Uptime / (Uptime + Downtime)

Let’s say a failed request covers errors, no responses and slow responses (eg. if it’s too slow for a customer and they move on to something else) which are direct indicators of the customer experience.Now, the Error Budget for a service can be calculated as:

Error Budget = (1 - Availability) = Failed Requests / (Total Number of  Requests)

So if an SLO for a service is indicated as an Availability of 99.5%, then this service has an error budget of 0.5% which specifies the amount of total downtime you are allowed.

Settling the debate

Error Budgets are the single metric that can be utilized to determine whether the system can take on the additional risk of deploying a new feature or if the team should rather be focussed on making the system more reliable. If the error budget is close to being spent, then the product team should ideally be extra cautious about deploying new features. It is liberating to have error budgets as a key decision making metric to solve this conundrum in any organization, no matter the size.

Here’s how you can calculate your error budget -

You could start by keeping a close watch on the error budget consumption for all the dependant services of a new feature release. The decision to balance release velocity with reliability engineering tasks then becomes a more objective one. When the error budgets are nearly consumed, the team should collectively agree to focus on improving the reliability of the current system.

Squadcast is an incident management tool that’s purpose-built for SRE. Create a blameless culture by reducing the need for physical war rooms, centralize SLO dashboards, unify internal and external SLIs and automate incident resolution with Squadcast Actions and create a knowledge base to effectively handle incidents.

Written By:
October 14, 2019
Prakya Vasudevan
Prakya Vasudevan
October 14, 2019
SRE
Incident Management
Share this blog:
In This Article:
Get reliability insights delivered straight to your inbox.
Get ready for the good stuff! No spam, no data sale and no promotion. Just the awesome content you signed up for.
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
If you wish to unsubscribe, we won't hold it against you. Privacy policy.
Get reliability insights delivered straight to your inbox.
Get ready for the good stuff! No spam, no data sale and no promotion. Just the awesome content you signed up for.
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
If you wish to unsubscribe, we won't hold it against you. Privacy policy.
Get the latest scoop on Reliability insights. Delivered straight to your inbox.
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
If you wish to unsubscribe, we won't hold it against you. Privacy policy.
Squadcast is a leader in Incident Management on G2 Squadcast is a leader in Mid-Market IT Service Management (ITSM) Tools on G2 Squadcast is a leader in Americas IT Alerting on G2 Best IT Management Products 2024 Squadcast is a leader in Europe IT Alerting on G2 Squadcast is a leader in Enterprise Incident Management on G2 Users love Squadcast on G2
Squadcast is a leader in Incident Management on G2 Squadcast is a leader in Mid-Market IT Service Management (ITSM) Tools on G2 Squadcast is a leader in Americas IT Alerting on G2 Best IT Management Products 2024 Squadcast is a leader in Europe IT Alerting on G2 Squadcast is a leader in Enterprise Incident Management on G2 Users love Squadcast on G2
Squadcast is a leader in Incident Management on G2 Squadcast is a leader in Mid-Market IT Service Management (ITSM) Tools on G2 Squadcast is a leader in Americas IT Alerting on G2
Best IT Management Products 2024 Squadcast is a leader in Europe IT Alerting on G2 Squadcast is a leader in Enterprise Incident Management on G2
Users love Squadcast on G2
Copyright © Squadcast Inc. 2017-2024

Managing technical risk effectively with Error Budgets

Oct 14, 2019
Last Updated:
October 4, 2024
Share this post:
Managing technical risk effectively with Error Budgets

Tradeoffs are hard. Think about the time when you had to choose between two equally compelling options - (a) addressing technical debt or (b) pushing out that long-awaited feature release, and risk breaking production. Or when your team couldn’t agree on where to draw the line on improving request latency versus shipping a major new update.

Table of Contents:

    If this sounds familiar, count yourself as part of the Eternal Conundrum Club of Reliability Engineering. It’s fairly easy for us as humans to communicate the impact of a new feature release as opposed to the utility of cleaning up a part of the system that would cause us to worry less. This is because feature releases are more often than not, tied with business impact, ie. potential to generate revenue, but it’s a little harder to quantify and justify the business impact resulting from engineering reliability into your offering.

    Alas, without a healthy balance between release velocity and reliability engineering, you’re going to end up with not only unhappy customers but also unhappy teams. Often, teams realize this problem too late - when business outcomes have already taken a hit.

    Hello Error Budgets!

    The key to resolving this dilemma is to add more context. This can support your decision making by helping you understand the tradeoffs between release velocity and reliability engineering. Error budgets are one such example of a quantifiable measure that can aid this decision making in real-time. Once we accept that risk cannot be avoided, but can be managed, it becomes easier to understand the utility and application of error budgets.

    Error budgets tie back to the concept of SLIs, SLOs, and SLAs. Here’s a quick outline of what they mean -

    Service Level Agreement (SLA)  is a promise between you and your customers on the acceptable availability of your system over a certain time period and it outlines business implications and compensation in the event of a breach. Essentially, not maintaining your SLA is going to hurt your bottom line.

    Service Level Objectives (SLOs) are just another variation of SLAs which are owned internally by the product and engineering teams. SLOs are typically stricter than SLAs to ensure that an SLO breach precedes an SLA breach. Every customer-impacting service must have an SLO which will serve as its quantitative reliability measure. This is defined in alignment with business stakeholders as it is imperative for them to be on top of the SLAs they can promise to customers. They also need to quantify and explain to their engineering team about the cost attached to downtime and its adverse effects on the business reputation.

    Service Level Indicators(SLIs) are metrics that can be monitored and can act as quantifiable indicators of the quality of the service you provide to your customers.They are a direct measurement of a service’s behavior and are typically documented in service level agreements (SLAs), as well as in service level objectives(SLOs).

    This is where Error Budgets come in.

    Typically  availability is calculated as:

    Availability = Uptime / (Uptime + Downtime)

    Let’s say a failed request covers errors, no responses and slow responses (eg. if it’s too slow for a customer and they move on to something else) which are direct indicators of the customer experience.Now, the Error Budget for a service can be calculated as:

    Error Budget = (1 - Availability) = Failed Requests / (Total Number of  Requests)

    So if an SLO for a service is indicated as an Availability of 99.5%, then this service has an error budget of 0.5% which specifies the amount of total downtime you are allowed.

    Settling the debate

    Error Budgets are the single metric that can be utilized to determine whether the system can take on the additional risk of deploying a new feature or if the team should rather be focussed on making the system more reliable. If the error budget is close to being spent, then the product team should ideally be extra cautious about deploying new features. It is liberating to have error budgets as a key decision making metric to solve this conundrum in any organization, no matter the size.

    Here’s how you can calculate your error budget -

    You could start by keeping a close watch on the error budget consumption for all the dependant services of a new feature release. The decision to balance release velocity with reliability engineering tasks then becomes a more objective one. When the error budgets are nearly consumed, the team should collectively agree to focus on improving the reliability of the current system.

    Squadcast is an incident management tool that’s purpose-built for SRE. Create a blameless culture by reducing the need for physical war rooms, centralize SLO dashboards, unify internal and external SLIs and automate incident resolution with Squadcast Actions and create a knowledge base to effectively handle incidents.

    What you should do now
    • Schedule a demo with Squadcast to learn about the platform, answer your questions, and evaluate if Squadcast is the right fit for you.
    • Curious about how Squadcast can assist you in implementing SRE best practices? Discover the platform's capabilities through our Interactive Demo.
    • Enjoyed the article? Explore further insights on the best SRE practices.
    • Schedule a demo with Squadcast to learn about the platform, answer your questions, and evaluate if Squadcast is the right fit for you.
    • Curious about how Squadcast can assist you in implementing SRE best practices? Discover the platform's capabilities through our Interactive Demo.
    • Enjoyed the article? Explore further insights on the best SRE practices.
    • Get a walkthrough of our platform through this Interactive Demo and see how it can solve your specific challenges.
    • See how Charter Leveraged Squadcast to Drive Client Success With Robust Incident Management.
    • Share this blog post with someone you think will find it useful. Share it on Facebook, Twitter, LinkedIn or Reddit
    • Get a walkthrough of our platform through this Interactive Demo and see how it can solve your specific challenges.
    • See how Charter Leveraged Squadcast to Drive Client Success With Robust Incident Management
    • Share this blog post with someone you think will find it useful. Share it on Facebook, Twitter, LinkedIn or Reddit
    • Get a walkthrough of our platform through this Interactive Demo and see how it can solve your specific challenges.
    • See how Charter Leveraged Squadcast to Drive Client Success With Robust Incident Management
    • Share this blog post with someone you think will find it useful. Share it on Facebook, Twitter, LinkedIn or Reddit
    What you should do now?
    Here are 3 ways you can continue your journey to learn more about Unified Incident Management
    Discover the platform's capabilities through our Interactive Demo.
    See how Charter Leveraged Squadcast to Drive Client Success With Robust Incident Management.
    Share the article
    Share this blog post on Facebook, Twitter, Reddit or LinkedIn.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Experience the benefits of Squadcast's Incident Management and On-Call solutions firsthand.
    Compare our plans and find the perfect fit for your business.
    See Redis' Journey to Efficient Incident Management through alert noise reduction With Squadcast.
    Discover the platform's capabilities through our Interactive Demo.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Experience the benefits of Squadcast's Incident Management and On-Call solutions firsthand.
    Compare Squadcast & PagerDuty / Opsgenie
    Compare and see if Squadcast is the right fit for your needs.
    Compare our plans and find the perfect fit for your business.
    Learn how Scoro created a solid foundation for better on-call practices with Squadcast.
    Discover the platform's capabilities through our Interactive Demo.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Experience the benefits of Squadcast's Incident Management and On-Call solutions firsthand.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Learn how Scoro created a solid foundation for better on-call practices with Squadcast.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Discover the platform's capabilities through our Interactive Demo.
    Enjoyed the article? Explore further insights on the best SRE practices.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Experience the benefits of Squadcast's Incident Management and On-Call solutions firsthand.
    Enjoyed the article? Explore further insights on the best SRE practices.
    Written By:
    October 14, 2019
    October 14, 2019
    Share this post:
    Subscribe to our LinkedIn Newsletter to receive more educational content
    Subscribe now
    ant-design-linkedIN

    Subscribe to our latest updates

    Enter your Email Id
    Thank you! Your submission has been received!
    Oops! Something went wrong while submitting the form.
    FAQs
    More from
    Prakya Vasudevan
    On-call On-boarding Checklist
    On-call On-boarding Checklist
    May 20, 2020
    Best Practices in Incident Management
    Best Practices in Incident Management
    May 7, 2020
    Configure an Intuitive Service Dashboard & Reduce Response Time
    Configure an Intuitive Service Dashboard & Reduce Response Time
    April 30, 2020
    Learn how organizations are using Squadcast
    to maintain and improve upon their Reliability metrics
    Learn how organizations are using Squadcast to maintain and improve upon their Reliability metrics
    mapgears
    "Mapgears simplified their complex On-call Alerting process with Squadcast.
    Squadcast has helped us aggregate alerts coming in from hundreds...
    bibam
    "Bibam found their best PagerDuty alternative in Squadcast.
    By moving to Squadcast from Pagerduty, we have seen a serious reduction in alert fatigue, allowing us to focus...
    tanner
    "Squadcast helped Tanner gain system insights and boost team productivity.
    Squadcast has integrated seamlessly into our DevOps and on-call team's workflows. Thanks to their reliability...
    Alexandre Lessard
    System Analyst
    Martin do Santos
    Platform and Architecture Tech Lead
    Sandro Franchi
    CTO
    Squadcast is a leader in Incident Management on G2 Squadcast is a leader in Mid-Market IT Service Management (ITSM) Tools on G2 Squadcast is a leader in Americas IT Alerting on G2 Best IT Management Products 2022 Squadcast is a leader in Europe IT Alerting on G2 Squadcast is a leader in Mid-Market Asia Pacific Incident Management on G2 Users love Squadcast on G2
    Squadcast awarded as "Best Software" in the IT Management category by G2 🎉 Read full report here.
    What our
    customers
    have to say
    mapgears
    "Mapgears simplified their complex On-call Alerting process with Squadcast.
    Squadcast has helped us aggregate alerts coming in from hundreds of services into one single platform. We no longer have hundreds of...
    Alexandre Lessard
    System Analyst
    bibam
    "Bibam found their best PagerDuty alternative in Squadcast.
    By moving to Squadcast from Pagerduty, we have seen a serious reduction in alert fatigue, allowing us to focus...
    Martin do Santos
    Platform and Architecture Tech Lead
    tanner
    "Squadcast helped Tanner gain system insights and boost team productivity.
    Squadcast has integrated seamlessly into our DevOps and on-call team's workflows. Thanks to their reliability metrics we have...
    Sandro Franchi
    CTO
    Revamp your Incident Response.
    Peak Reliability
    Easier, Faster, More Automated with SRE.