Site reliability engineering (SRE) is a discipline in which automated software systems are built to manage the development operations (DevOps) of a product or service. In other words, SRE automates the functions of an operations team via software systems.
The main purpose of SRE is to encourage the deployment and proper maintenance of large-scale systems. In particular, site reliability engineers are responsible for ensuring that a given system’s behavior consistently meets business requirements for performance and availability.
Furthermore, whereas traditional operations teams and development teams often have opposing incentives, site reliability engineers are able to align incentives so that both feature development and reliability are promoted simultaneously.
In this article, we’ll cover the key principles of SRE best practices, provide examples of those principles, and include relevant details and illustrations to clarify them.
No system can be expected to have perfect performance. It’s important to create reasonable expectations about system performance for both internal stakeholders and external users.
For services that are directly user-facing, such as static websites and streaming, two common and important ways to measure performance are time availability and aggregate availability.
Here is an example of calculating time availability and aggregate availability for a service.
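A minimal sketch with hypothetical figures (the uptime, downtime, and request counts below are invented for illustration):

```python
# Time availability: fraction of the measurement window during which the service was up.
total_minutes = 30 * 24 * 60            # 43,200 minutes in a 30-day month
downtime_minutes = 43                   # hypothetical recorded downtime
time_availability = (total_minutes - downtime_minutes) / total_minutes
print(f"Time availability: {time_availability:.4%}")            # ~99.90%

# Aggregate availability: fraction of requests served successfully.
total_requests = 2_000_000              # hypothetical request volume
failed_requests = 1_500
aggregate_availability = (total_requests - failed_requests) / total_requests
print(f"Aggregate availability: {aggregate_availability:.4%}")  # ~99.93%
```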
For other services, additional factors are important, including speed (latency), accuracy (correctness), and volume (throughput).
An example calculation for latency is as follows:
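As one illustrative sketch (the response times below are hypothetical), latency can be summarized as an average or as a percentile over a window of requests:

```python
import math

# Hypothetical response times, in milliseconds, for a window of recent requests.
response_times_ms = [120, 85, 240, 310, 95, 150, 480, 130, 105, 175]

# Average (mean) latency over the window.
average_latency = sum(response_times_ms) / len(response_times_ms)

# Nearest-rank 95th percentile: the smallest value such that at least 95%
# of the samples are at or below it.
ordered = sorted(response_times_ms)
rank = math.ceil(0.95 * len(ordered))
p95_latency = ordered[rank - 1]

print(f"Average latency: {average_latency:.0f} ms")   # 189 ms
print(f"95th-percentile latency: {p95_latency} ms")   # 480 ms
```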
Choosing key metrics makes explicit how the performance of a service is evaluated, and therefore which factors pose a risk to service health. In the above example, identifying latency as a key metric indicates that average response time is an essential property of the service. Thus, a risk to the reliability of the service is “slowness,” or high latency.
In addition to measuring risks, it’s important to clearly define which risks the system can tolerate without compromising quality and which risks must be addressed to ensure quality.
Two common types of measurements that address failure are mean time to failure (MTTF) and mean time between failures (MTBF).
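A minimal sketch, assuming hypothetical outage counts and repair times over one quarter (here MTBF is treated as MTTF plus the mean time to repair):

```python
# Hypothetical figures for a 90-day quarter.
hours_in_quarter = 90 * 24              # 2,160 hours
failures = 4                            # outages observed during the quarter
total_repair_hours = 8                  # cumulative time spent restoring service

uptime_hours = hours_in_quarter - total_repair_hours

mttf = uptime_hours / failures          # average uptime before a failure occurs
mttr = total_repair_hours / failures    # average time to restore service
mtbf = mttf + mttr                      # average time from one failure to the next

print(f"MTTF: {mttf:.0f} h, MTTR: {mttr:.0f} h, MTBF: {mtbf:.0f} h")
# MTTF: 538 h, MTTR: 2 h, MTBF: 540 h
```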
The most robust way to define failures is to set SLOs, monitor your services for violations in SLOs, and create alerts and processes for fixing violations. These are discussed in the following sections.
The development of new production features always introduces new potential risks and failures; aiming for a 100% risk-free service is unrealistic. The way to align the competing incentives of pushing development and maintaining reliability is through error budgets.
An error budget provides a clear metric that allows a certain proportion of failure from new releases in a given planning cycle. If the number or length of failures exceeds the error budget, no new releases may occur until a new planning period begins.
The following is an example error budget.
Suppose the development team plans to release 10 new features during the quarter, and the following occurs:
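Suppose, purely for illustration, that the service carries a 99.5% availability SLO, which over a 90-day quarter allows roughly 10.8 hours of downtime. The first six releases ship without incident, but release 7 causes a 3-hour outage and release 8 a 6-hour outage, together consuming about 9 of the 10.8 budgeted hours. With most of the error budget spent, the team slows down, adds extra testing and canarying, and ships the final two features only after the problematic changes are fixed. A short sketch of the arithmetic:

```python
# Hypothetical error-budget arithmetic for a 99.5% availability SLO over a 90-day quarter.
slo = 0.995
hours_in_quarter = 90 * 24                          # 2,160 hours
error_budget_hours = (1 - slo) * hours_in_quarter   # ~10.8 hours of allowed downtime

outage_hours = [3, 6]                               # downtime caused by releases 7 and 8
consumed = sum(outage_hours)
remaining = error_budget_hours - consumed

print(f"Budget: {error_budget_hours:.1f} h, consumed: {consumed} h, remaining: {remaining:.1f} h")
# Budget: 10.8 h, consumed: 9 h, remaining: 1.8 h
```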
In this way, the error budget has ensured an acceptable feature release velocity while not compromising reliability or degrading user experience.
The best way to set performance expectations is to set specific targets for different system risks. These targets are called service level objectives, or SLOs. For example, a service might target 99.9% availability measured monthly, 95% of requests served within 300 ms, 99.99% of reads returning correct data, or a sustained throughput of 1,000 queries per second; the appropriate target depends on which risk is being measured.
Depending on the service, some SLOs may be more complicated than a single number. For example, a database may exhibit 99.9% correctness on reads, but the 0.1% of errors it incurs may always involve the most recently written data. If a customer relies heavily on data recorded in the past 24 hours, then the service is not reliable for that customer. In this case, it makes sense to create a tiered SLO based on the customer’s needs: for example, 99.99% correctness for reads of data written within the past 24 hours and 99.9% correctness for reads of older data.
One of the main purposes of establishing SLOs is to track how reliability affects revenue. Revisiting the sample error budget from the section above, suppose there is a projected service revenue of $500,000 for the quarter. This can be used to translate the SLO and error budget into real dollars. Thus, SLOs are also a way to measure objectives that are indirectly related to system performance.
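As a rough sketch (assuming, for illustration, that lost availability translates proportionally into lost revenue), the revenue at risk at different SLO levels can be estimated as follows:

```python
# Hypothetical translation of SLO targets into revenue at risk, assuming that
# unavailability costs revenue in direct proportion.
projected_revenue = 500_000   # projected quarterly revenue, in dollars

for slo in (0.95, 0.99, 0.999):
    revenue_at_risk = (1 - slo) * projected_revenue
    print(f"SLO {slo:.1%}: ~${revenue_at_risk:,.0f} of revenue at risk")

# SLO 95.0%: ~$25,000 of revenue at risk
# SLO 99.0%: ~$5,000 of revenue at risk
# SLO 99.9%: ~$500 of revenue at risk
```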
Using SLOs to track indirect metrics, such as revenue, allows one to assess the cost of improving a service. In this case, spending $10,000 to improve the SLO from 95% to 99% protects roughly $20,000 in revenue and is a worthwhile business decision. On the other hand, spending $10,000 to improve the SLO from 99% to 99.9% protects only about $4,500 and is not.
One characteristic that distinguishes SREs from traditional DevOps is the ability to scale up the scope of a service without scaling the cost of the service. Called sublinear growth, this is accomplished via automation.
In a traditional development-operations split, the development team pushes new features, while the operations team dedicates 100% of its time to maintenance. Thus, a pure operations team will need to grow 1:1 with the size and scope of the service it is maintaining: If it takes O(10) system engineers to serve 1000 users, it will take O(100) engineers to serve 10K users.
In contrast, an SRE team operating according to best practices will devote at least 50% of its time to developing systems that remove repetitive, manual effort (toil) from the operations workload. Examples include automating server provisioning and configuration, automating failover and the restart of unhealthy instances, automating rollbacks of bad releases, and building self-service tools for routine requests; a small sketch of one such automation follows.
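For illustration only, here is a minimal sketch of that kind of automation: a loop that probes each instance’s health endpoint and restarts any instance that fails, rather than having an engineer do so by hand. The instance list, health path, and restart_instance helper are hypothetical.

```python
import urllib.request

# Hypothetical fleet; in practice this would come from a service registry.
INSTANCES = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
HEALTH_PATH = "/healthz"   # hypothetical health-check endpoint


def is_healthy(host, timeout=2.0):
    """Return True if the instance answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(f"http://{host}{HEALTH_PATH}", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def restart_instance(host):
    """Placeholder for the platform-specific restart call (hypothetical)."""
    print(f"Restarting unhealthy instance {host}")


for host in INSTANCES:
    if not is_healthy(host):
        restart_instance(host)
```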
To maintain reliability, it is imperative to monitor the relevant analytics for a service and use monitoring to detect SLO violations. As mentioned earlier, some important metrics include availability, latency, correctness, and throughput.
Durability, meaning how reliably data remains intact and retrievable over time, is also sometimes measured.
A good way to implement monitoring is through dashboards. An effective dashboard will display SLOs, include the error budget, and present the different risk metrics relevant to the SLO.
Another good way to implement monitoring is through logs. Logs that are both searchable in time and categorized via request are the most effective. If an SLO violation is detected via a dashboard, a more detailed picture can be created by viewing the logs generated during the affected timeframe.
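As a small illustration (the log format and records below are made up), pulling one request category out of the affected time window might look like this:

```python
from datetime import datetime

# Hypothetical structured log records: (timestamp, request_type, status, latency_ms).
logs = [
    ("2024-05-01T10:02:11", "checkout", 500, 2300),
    ("2024-05-01T10:03:45", "search",   200,  120),
    ("2024-05-01T10:04:02", "checkout", 500, 1900),
    ("2024-05-01T11:15:30", "checkout", 200,  180),
]

window_start = datetime.fromisoformat("2024-05-01T10:00:00")
window_end = datetime.fromisoformat("2024-05-01T10:30:00")

# Select every "checkout" request logged during the affected window.
affected = [
    rec for rec in logs
    if rec[1] == "checkout" and window_start <= datetime.fromisoformat(rec[0]) <= window_end
]

for timestamp, request_type, status, latency_ms in affected:
    print(timestamp, request_type, status, f"{latency_ms} ms")
```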
The type of monitoring discussed above that tracks the internal analytics of a service is called whitebox monitoring. Sometimes it’s also important to monitor the behavior of a system from the “outside,” which means testing the workflow of a service from the point of view of an external user; this is called blackbox monitoring. Blackbox monitoring may reveal problems with access permissions or redundancy.
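A minimal sketch of a blackbox probe, assuming a hypothetical user-facing URL: it exercises the service exactly as an external user would and reports whether the expected response arrived in time.

```python
import time
import urllib.request

PROBE_URL = "https://example.com/login"   # hypothetical user-facing endpoint
TIMEOUT_SECONDS = 5


def probe():
    """Fetch the endpoint as an external user would; return True on a timely HTTP 200."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(PROBE_URL, timeout=TIMEOUT_SECONDS) as resp:
            elapsed = time.monotonic() - start
            return resp.status == 200 and elapsed < TIMEOUT_SECONDS
    except OSError:
        return False


if __name__ == "__main__":
    print("probe ok" if probe() else "probe failed")
```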
One of the best ways for SREs to reduce effort is to automate alerting and ticketing on top of monitoring. The resulting SRE process is much more efficient than a traditional operations process.
A traditional operations response may look like this: a user notices that the service has become slow and files a support ticket; the ticket is triaged and assigned to an operations engineer; that engineer manually inspects the database, identifies a likely cause, and emails the database administrators and the development team to agree on a fix; only after several rounds of communication is the change finally applied.
In contrast, an SRE operations response may look like this: a monitoring alert fires automatically when query latency violates its SLO threshold; the alert pages the on-call engineer and links to the relevant dashboard and logs; an automated system proposes a change to the database, such as a configuration adjustment, which the on-call engineer reviews and applies.
By using an automated system for alerting and proposing changes to the database, the communication required, the number of people involved, and time to resolution are all reduced.
The following code block is a sketch, in Python, of latency and throughput thresholds with automated alerts triggered upon detected violations; the threshold values and the send_alert stub are hypothetical.
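```python
import math
import time

# Hypothetical SLO thresholds for this sketch.
LATENCY_THRESHOLD_MS = 300        # 95th-percentile latency must stay below this
THROUGHPUT_THRESHOLD_QPS = 500    # sustained throughput must stay above this


def send_alert(message):
    """Stub: a real system would page the on-call engineer or open a ticket here."""
    print(f"[ALERT {time.strftime('%Y-%m-%d %H:%M:%S')}] {message}")


def p95_latency_ms(latencies_ms):
    """Nearest-rank 95th-percentile latency for one monitoring window."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def check_slos(latencies_ms, request_count, window_seconds):
    """Compare one monitoring window against the latency and throughput thresholds."""
    latency = p95_latency_ms(latencies_ms)
    throughput = request_count / window_seconds

    if latency > LATENCY_THRESHOLD_MS:
        send_alert(f"Latency SLO violation: p95 = {latency:.0f} ms "
                   f"(threshold {LATENCY_THRESHOLD_MS} ms)")
    if throughput < THROUGHPUT_THRESHOLD_QPS:
        send_alert(f"Throughput SLO violation: {throughput:.0f} qps "
                   f"(threshold {THROUGHPUT_THRESHOLD_QPS} qps)")


# Example: one 60-second window of hypothetical measurements triggers both alerts.
check_slos(latencies_ms=[120, 250, 340, 180, 410, 95, 505, 220],
           request_count=24_000, window_seconds=60)
```

In this sketch each monitoring window is checked independently; a production system would typically also track how quickly the error budget is being consumed before paging anyone.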
The best way to ensure that systems remain reliable is to keep them simple. SRE teams should be hesitant to add new code, preferring instead to modify and delete code where possible. Every additional API, library, and function that one adds to production software increases dependencies in ways that are difficult to track, introducing new points of failure.
Site reliability engineers should aim to keep their code modular. That is, each function in an API should serve only one purpose, as should each API in a larger stack. This type of organization makes dependencies more transparent and also makes diagnosing errors easier.
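As a brief, hypothetical illustration of that modularity (the order-processing functions below are invented), compare a function that does several jobs with single-purpose functions whose dependencies are explicit:

```python
# Monolithic: one function that validates, writes, and notifies is harder to
# test and hides its dependencies.
def process_order(order, db, mailer):
    if not order.get("items"):
        raise ValueError("empty order")
    db.save(order)
    mailer.send(order["email"], "Order received")


# Modular: each function serves one purpose, so dependencies are explicit and
# failures are easier to diagnose.
def validate_order(order):
    if not order.get("items"):
        raise ValueError("empty order")


def save_order(order, db):
    db.save(order)


def notify_customer(order, mailer):
    mailer.send(order["email"], "Order received")
```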
As part of incident management, playbooks for typical on-call investigations and solutions should be authored and published where every on-call engineer can find them. Playbooks for a particular scenario should describe the incident (and possible variations), list the associated SLOs, reference appropriate monitoring tools and codebases, offer proposed solutions, and catalog previous approaches.
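For illustration, a playbook entry could be kept as structured data so alerts can link directly to it; every field and value below is hypothetical:

```python
# Hypothetical playbook entry; in practice this might live in a wiki or a YAML file.
playbook_entry = {
    "incident": "Elevated p95 latency on the checkout service",
    "variations": ["latency spike after deploy", "latency spike under traffic surge"],
    "related_slos": ["95% of checkout requests served within 300 ms"],
    "monitoring": ["latency dashboard", "request logs filtered by service=checkout"],
    "codebases": ["checkout-service", "payment-gateway-client"],
    "proposed_solutions": [
        "Roll back the most recent release",
        "Scale out the checkout service",
        "Shed non-critical traffic",
    ],
    "previous_incidents": ["2024-03-12 latency spike: resolved by rollback"],
}
```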
Just as an SRE codebase should emphasize simplicity, so should an SRE release process. Simplicity is encouraged through a couple of principles:
All code releases should be submitted within a version control system to allow for easy reversions in the event of erroneous, redundant, or ineffective code.
The process of submitting releases should be accompanied by a clear and visible code review process. Basic changes may not require approval, whereas more complicated or impactful changes will require approval from other site reliability engineers or technical leads.
The main principles of SRE are embracing risk, setting SLOs, eliminating work via automation, monitoring systems, keeping things simple, and outlining the release engineering process.
Embracing risk involves clearly defining failure and setting error budgets. The best way to do this is by creating and enforcing SLOs, which track system performance directly and also help identify the potential costs of system improvement. The appropriate SLO depends on how risk is measured and the needs of the customer. Enforcing SLOs requires monitoring, usually through dashboards and logs.
Site reliability engineers focus on project work, in addition to development operations, which allows for services to expand in scope and scale while maintaining low costs. This is called sublinear growth and is achieved through automating repetitive tasks. Monitoring that automates alerting creates a streamlined operations process, which increases reliability.
Site reliability engineers should keep systems simple by reducing the amount of code written, encouraging modular development, and publishing playbooks with standard operating procedures. SRE release processes should be hermetic and push small, frequent changes using version control and code reviews.