Introducing our open source SLO Tracker - A simple tool to track SLOs and Error Budget

September 7, 2021

A core tenet of a strong SRE culture is managing Error Budgets responsibly. However, you can only calculate Error Budgets after agreeing on the expected service SLOs with all the relevant stakeholders.

The SLO Tracker is designed to help organizations manage their Service Level Objectives (SLOs) and Service Level Agreements (SLAs) effectively by providing a centralized platform for tracking and analysis.

Once you have defined organization-wide SLOs and the Service Level Indicators (SLIs) that track them, calculating Error Budgets is just a numbers game. In short, these metrics are the foundation of a strong SRE culture, and I cannot stress enough how much they promote accountability, trust and timely innovation.

What are SLOs and SLIs?

Let’s understand this with an example. If your Service Level Indicator (SLI) is “xyz is true”, then the Service Level Objective (SLO), an organization-level objective, will read “xyz is true for a given % of time”. The corresponding Service Level Agreement (SLA), meant for external end users, is a legal contract that says “if xyz is not true for that % of time, then so and so will be compensated”.

Service Level Indicators (SLIs) are specific metrics used to measure the performance of a service. These indicators inform the Service Level Objectives (SLOs), which are internal goals, and Service Level Agreements (SLAs), which are external commitments.

Error Budgets let you track downtime in real time, along with a burn rate. The Error Budget is calculated as “1 - SLO”. So, a yearly SLO of 99.99% means it is acceptable for the service to be down for no more than 52.56 minutes in a year (0.01% of 525,600 minutes).
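
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the burn rate shown here uses one common definition (share of budget spent divided by share of the window elapsed), which is an assumption rather than something SLO Tracker is confirmed to use.

```python
def error_budget_minutes(slo_percent: float, window_days: float = 365.0) -> float:
    """Allowed downtime in minutes for an SLO over a window: (1 - SLO) * window."""
    if not 0.0 < slo_percent <= 100.0:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return (1 - slo_percent / 100) * window_minutes


def burn_rate(downtime_minutes: float, elapsed_days: float,
              slo_percent: float, window_days: float = 365.0) -> float:
    """Budget spend relative to an even spend; 1.0 means the budget
    would run out exactly at the end of the window."""
    budget = error_budget_minutes(slo_percent, window_days)
    if budget == 0 or elapsed_days <= 0:
        raise ValueError("need a non-zero budget and a positive elapsed time")
    budget_fraction_used = downtime_minutes / budget
    window_fraction_elapsed = elapsed_days / window_days
    return budget_fraction_used / window_fraction_elapsed


print(round(error_budget_minutes(99.99), 2))  # 52.56
```

A burn rate above 1.0 means the budget is being spent faster than the window allows, which is usually the signal to slow down risky changes.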

The development team can utilize the SLO and Error Budget to prioritize tasks, whether it's preventing issues or addressing system instabilities.

Ensuring service uptime is just one of the SRE objectives that contribute to user satisfaction. A few other basic indicators concerning end user requirements could be:

  • App load time should be less than 3 seconds
  • Load time for every feature in the app should be less than 3 seconds
  • Fewer buggy features rolled out: no more than 2 bugs reported by users in a span of 20 days
  • ‘Update time’ for data inputs to be reflected should be less than 4 seconds
  • Retrieval of data within the app should take less than 2 seconds
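
To make an indicator like "app load time under 3 seconds" measurable, you can compute the share of observations that meet the threshold and compare it with a target. The sketch below assumes made-up sample data, a hypothetical `sli_compliance` helper and an illustrative target; none of these come from SLO Tracker.

```python
from typing import Sequence


def sli_compliance(samples: Sequence[float], threshold_seconds: float) -> float:
    """Percentage of samples strictly below the threshold."""
    if not samples:
        raise ValueError("need at least one sample")
    good = sum(1 for s in samples if s < threshold_seconds)
    return 100.0 * good / len(samples)


# Made-up load times in seconds, for illustration only.
load_times = [1.2, 2.8, 3.4, 0.9, 2.1, 4.0, 1.7, 2.5, 2.9, 1.1]
target_percent = 85.0  # illustrative target, an assumption

compliance = sli_compliance(load_times, threshold_seconds=3.0)
print(f"{compliance:.1f}% of loads under 3 s; SLO met: {compliance >= target_percent}")
```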

Put simply, there needs to be a balance between what is acceptable to the end user and what is actually deliverable, keeping in mind the effort and budget needed. Understanding where end users are willing to compromise on the experience is key. Once you know that, identifying the right target thresholds for the chosen indicators becomes easier.

Impact of SLOs on organizational SLAs

The ideal way to start is by doing just enough to reduce the number of complaints raised by end users about a particular feature in the app. For example, when a user is trying to retrieve a huge data set from the app, they will usually accept a slight delay. In such a case, promising a 99% SLO for this indicator is both unnecessary and unrealistic. A more sensible target would be an SLO of around 85%. If users continue complaining even after this threshold is met, the indicators and objectives, along with their thresholds, can be revisited.

In addition to this, having telemetry and observability in place is very important. Without tracking these indicators, you will not be able to measure end user experience against SLO thresholds. This also gives you a sense of the other dependent factors and how their performance can affect the overall performance of a feature or the application in general.

Defining SLOs is a journey and not a destination. You should constantly refine your SLOs because many factors change with time, such as the user base, the size of your app and user expectations. Above all, your SLOs should be defined to achieve user satisfaction.

Challenges in SLO monitoring

Over the years of setting up SLOs, I have repeatedly run into the challenge of dealing with False Positives. No matter how efficient or accurate they are, monitoring tools will sometimes flag an event as an issue even though no SLO has been violated, thus triggering a false positive. So keep in mind that building an efficient, battle-tested and trustworthy platform takes time.

In the early days, I noticed teams getting a lot of false positives, which eat into the Error Budget. I always yearned for a feature that would let me easily mark events as false positives and get those precious minutes back into the Error Budget. This helps in practicing observability with actionable data.

Another basic challenge engineers face is tracking all the defined SLIs. Since SLOs are monitored by multiple tools in the observability stack, teams without a unified dashboard that accurately tracks the Error Budget remain oblivious to the Error Budget burn rate.

Thus, a single source of truth, with multiple SLOs (across all services) tracked in one place, will ensure greater reliability. In most cases, services depend on one another, so outages are inevitable. The aim here is not just to 'not fail'. Instead, it is about failing measurably, with enough insight to fix the problem and ensure it does not happen again.

The challenges can be summarized as:

  • Lack of a centralized dashboard for tracking SLIs (from multiple alert sources)
  • Too many ‘False Positives’ eating into the error budget
  • Short retention period of metrics stored in Prometheus (or other monitoring tools)

Tackling these challenges started off as a hobby and became my passion. And that is how this open-source project came into existence.

Introducing the SLO Tracker

As someone who has experienced the challenges of SLO monitoring firsthand, I built this open-source project, “SLO Tracker”, as a simplified means to track the defined SLOs, Error Budgets, and Error Budget burn rate with intuitive graphs and visualizations. The dashboard is easy to set up and makes it simple to aggregate SLI metrics coming in from different sources.

You first need to set up your target SLOs. The Error Budget is then calculated and allocated based on them. The SLO Tracker currently:

  • Provides a unified dashboard for all the SLOs that have been set up, in turn giving insights into the SLIs being tracked
  • Gives you a clear visualisation of the Error Budget and alerts you when the Error Budget burn rate threshold is breached
  • Supports webhook integrations with various observability tools (Prometheus, Pingdom, New Relic); whenever an alert is received from these tools, the tracker recalculates the allocated Error Budget and deducts the time
  • Lets you claim falsely spent Error Budget back by marking erroneous SLO violation alerts as False Positives
  • Supports manual alert creation from the web app when a violation is not caught:
    • Either because your monitoring tool missed it for some reason
    • Or because your monitoring tool is not integrated with SLO Tracker
  • Displays basic analytics for SLO violation distribution (SLI distribution graph)
  • Is easy to set up and lightweight, since it only stores and computes what matters (SLO violation alerts) and not the bulk of the data (every single metric)
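
The bookkeeping described in this list (deducting time when an alert arrives and giving it back when the alert is marked as a False Positive) can be sketched roughly as follows. This is not SLO Tracker's actual code or data model: the `Violation` and `SloBudget` names, the 30-day window and the sample alerts are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Violation:
    """One SLO violation alert (hypothetical shape, not the tracker's schema)."""
    source: str              # e.g. "prometheus", "pingdom" or "manual"
    minutes: float           # downtime attributed to this alert
    false_positive: bool = False


@dataclass
class SloBudget:
    """Tracks the Error Budget for one SLO over a fixed window."""
    slo_percent: float
    window_days: float = 30.0  # assumed window length
    violations: List[Violation] = field(default_factory=list)

    @property
    def allocated_minutes(self) -> float:
        return (1 - self.slo_percent / 100) * self.window_days * 24 * 60

    @property
    def spent_minutes(self) -> float:
        # Alerts marked as false positives do not count against the budget.
        return sum(v.minutes for v in self.violations if not v.false_positive)

    @property
    def remaining_minutes(self) -> float:
        return self.allocated_minutes - self.spent_minutes

    def record(self, source: str, minutes: float) -> Violation:
        violation = Violation(source, minutes)
        self.violations.append(violation)
        return violation


budget = SloBudget(slo_percent=99.9)       # 43.2 minutes over 30 days
budget.record("prometheus", 10.0)
flaky = budget.record("pingdom", 5.0)
budget.record("manual", 3.0)               # violation entered by hand
print(f"remaining: {budget.remaining_minutes:.1f} min")

flaky.false_positive = True                # reclaim the falsely spent minutes
print(f"after reclaim: {budget.remaining_minutes:.1f} min")
```

Storing only the violation alerts, rather than every metric sample, keeps this kind of ledger small, which matches the lightweight design the list describes.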

How to set this up?

  • A docker-compose file is already part of the project repo. You can bring all the components up with it.
  • Once all the components are up, users can start adding SLOs from the frontend.
  • The "Alert Sources" button lists the webhook links for all supported integrations. Users can add these webhook URLs to their respective monitoring tools.

Final Thoughts

I hope this blog helped you understand the pain points around SLO and Error Budget tracking. In keeping with the SRE ideology of automating as many ops tasks as possible, we built this SLO Tracker.

While this started off as a tool for internal use, we have now made it open-source so that everyone can use it, offer suggestions, submit code patches or contribute in any other way that makes this a better tool. Let’s make the path to reliability a smoother ride for everyone :)

Written By:
Roshan Shetty
Last Updated: August 27, 2024
One of the tools we use internally at Squadcast for SLO and Error Budget tracking is now open-source. In keeping with the SRE ideology of automating as many ops tasks as possible, we built this SLO Tracker, and we have open-sourced it so that the SRE community can use it too. We look forward to your feedback, suggestions and patches :)
