The tenet of a strong SRE culture lies in responsibly managing Error Budgets. However, you can only calculate error budgets after establishing the expected service SLOs in agreement with all the relevant stakeholders.
The SLO Tracker is designed to help organizations manage their Service Level Objectives (SLOs) and Service Level Agreements (SLAs) effectively by providing a centralized platform for tracking and analysis.
After defining organization-wide SLOs, and the subsequent Service Level Indicators (SLIs) - to track SLAs, calculating Error Budgets is just a numbers-game. In short, these metrics are the foundation to establish a strong SRE culture and I cannot stress enough on how it promotes accountability, trust and timely innovation.
Let’s understand this with an example. Assuming that your Service Level Indicators (SLIs) are - “xyz is true”, then the Service Level Objectives (SLOs) which are organization-level objectives will read - “xyz is true for a % of time” and the corresponding Service Level Agreements (SLAs) meant for external/ end users are legal contracts that say - “if xyz is not true for a % of time, then, so and so will be compensated”.
Service Level Indicators (SLIs) are specific metrics used to measure the performance of a service. These indicators inform the Service Level Objectives (SLOs), which are internal goals, and Service Level Agreements (SLAs), which are external commitments.
Typically, Error Budgets allow you to track downtime as real time with a burn rate. It is calculated as “1-(Service Level Objectives)”. So, an SLO of 99.99% yearly means it is acceptable for the service to be down for no more than 52.56 minutes in a year.
The development team can utilize the SLO and Error Budget to prioritize tasks, whether it's preventing issues or addressing system instabilities.
Ensuring service uptime is just one of the SRE objectives to ensure user gratification. A few other basic indicators concerning end user requirements could be:
Put in simple words, there needs to be a balance between what is acceptable to the end user versus what is actually deliverable keeping in mind the effort and budget needed. Understanding where end users can compromise with the experience is key. Based on that, identifying the right target thresholds for the identified indicators would be easier.
The ideal way to start off is by doing just enough to reduce the number of complaints raised by the end user for a particular feature in the app. For example, when a user is trying to retrieve a huge data set from the app, they would be ready to accept a slight tolerance/ delay. In such a case, promising a 99% SLO for this indicator is both unnecessary and unrealistic. A more sensible target would be around 85% SLO. Even after satisfying this threshold, if users continue complaining, then the indicators and objectives along with their thresholds can be revisited.
In addition to this, having telemetry and observability in place is very important. Without tracking these indicators, you will not be able to measure end user experience against SLO thresholds. This also gives you a sense of the other dependent factors and how their performance can affect the overall performance of a feature or the application in general.
Defining SLOs is a journey and not a destination. You should constantly refine your SLOs because with time, many factors change such as the user base, size of your app and user expectations, etc. Hence your SLOs should be defined mainly to achieve user satisfaction.
Over the years of setting up SLOs, I have come up against this routine challenge of dealing with False Positives. No matter how efficient or accurate, monitoring tools will sometimes flag an event as an issue in spite of no violation of SLOs. Thus triggering a false positive. So keep in mind that building an efficient, battle-tested and trustingly insightful platform takes time.
During the early days, I’ve noticed teams getting a lot of false positives, which will eat into the Error Budget. And I’ve always yearned for a feature that can help me easily mark events as false positives so as to get precious minutes back into the Error Budget. This helps in practicing observability with actionable data.
Another basic challenge faced by engineers in organizations is tracking all the defined SLIs. Since SLOs are monitored by multiple tools in the observability stack, not maintaining a unified dashboard to accurately track the error budget will make them oblivious to the error budget burn rate.
Thus a single source of truth with multiple SLOs (across all services) tracked in one place, will ensure greater reliability. In most cases, services will be dependent on one another and thus outages are inevitable. The aim here is not just to 'not fail'. Instead, it is about failing measurably and with enough insights to mend it, we can ensure it does not happen again.
The challenges can be summarized as:
Tackling these challenges which started off as a hobby, became my passion. And that is how this open-source project came into existence.
As someone who painstakingly experienced the challenges with SLO monitoring, I built this open source project “SLO tracker” - as a simplified means to track the defined SLOs, Error Budgets, and Error Budget burn rate with intuitive graphs and visualizations. This dashboard is easy to set up and makes it simple to aggregate SLI metrics coming in from different sources.
You will be required to first set up your target SLOs. The Error Budget will be calculated and allocated based on that. The SLO Tracker currently:
I hope this blog helped you understand the annoyance around SLO and Error Budget tracking. In keeping up with the SRE ideology of automating as many ops tasks as possible, we built this SLO Tracker.
While this started off as a tool for internal use, we have now made it open-source for everyone to use, provide suggestions, code patches or contribute in any way that can make this a better tool. Let’s make the path to reliability a smoother ride for everyone :)