The Guide to SRE Principles

March 31, 2023

Site reliability engineering (SRE) is a discipline in which automated software systems are built to manage the development operations (DevOps) of a product or service. In other words, SRE automates the functions of an operations team via software systems. 

The main purpose of SRE is to encourage the deployment and proper maintenance of large-scale systems. In particular, site reliability engineers are responsible for ensuring that a given system’s behavior consistently meets business requirements for performance and availability.

Furthermore, whereas traditional operations teams and development teams often have opposing incentives, site reliability engineers are able to align incentives so that both feature development and reliability are promoted simultaneously.

Basic SRE principles 

In this article, we’ll cover SRE best practices, provide some examples of those key principles, and include relevant details and illustrations to clarify these examples.

Principle: Embrace risk
Description: No system can be expected to have perfect performance. It’s important to identify potential failure points and create mitigation plans. Additionally, it’s important to budget a certain percentage of business costs to address these failures in real time.
Example: A week consists of 168 hours of potential availability. The business sets an expectation of 165 hours of uptime per week to account for both planned maintenance and unplanned failures.

Principle: Set service level objectives (SLOs)
Description: Set reasonable expectations for system performance to ensure that customers and internal stakeholders understand how the system is supposed to perform at various levels. Remember that no system can be expected to have perfect performance.
Examples:
  • The website is up and running 99% of the time.
  • 99% of all API requests return a successful response.
  • The server output matches client expectations 99% of the time.
  • 99% of all API requests are delivered within one second.
  • The server can handle 10,000 requests per second.

Principle: Eliminate work through automation
Description: Automate as many tasks and processes as possible. Engineers should focus on developing new features and enhancing existing systems at least as often as addressing real-time failures.
Example: Production code automatically generates alerts whenever an SLO is violated. The automated alerts send tickets to the appropriate incident response team with relevant playbooks to take action.

Principle: Monitor systems
Description: Use tools, such as Squadcast, to monitor system performance. Observe performance, incidents, and trends.
Examples:
  • A dashboard that displays the proportion of client requests and server responses that were delivered successfully in a given time period.
  • A set of logs that displays the expected and actual output of client requests and server responses in a given time period.

Principle: Keep things simple
Description: Release frequent, small changes that can be easily reverted to minimize production bugs. Delete unnecessary code instead of keeping it for potential future use. The more code and systems that are introduced, the more complexity created; it’s important to prevent accidental bloat.
Example: Changes in code are always pushed via a version control system that tracks code writers, approvers, and previous states.

Principle: Outline the release engineering process
Description: Document your established processes for development, testing, automation, deployments, and production support. Ensure that the process is accessible and visible.
Example: A published playbook lists the steps to address reboot failure. The playbook contains references to relevant SLOs, dashboards, previous tickets, sections of the codebase, and contact information for the incident response team.

Embrace risk

No system can be expected to have perfect performance. It’s important to create reasonable expectations about system performance for both internal stakeholders and external users.

Key metrics

For services that are directly user-facing, such as static websites and streaming, two common and important ways to measure performance are time availability and aggregate availability.

This article provides an example of calculating time availability for a service.
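As a rough sketch, both availability measures reduce to a ratio. The function names below are illustrative, the 165-of-168-hour figures come from the weekly uptime example in the principles table, and the request counts are made up for demonstration.

```python
def time_availability(uptime_hours: float, total_hours: float) -> float:
    """Fraction of the measurement period during which the service was up."""
    return uptime_hours / total_hours


def aggregate_availability(successful_requests: int, total_requests: int) -> float:
    """Fraction of all requests that were served successfully."""
    return successful_requests / total_requests


# Weekly example: 165 hours of uptime out of 168 possible hours.
print(f"{time_availability(165, 168):.2%}")  # prints 98.21%

# Hypothetical request counts, chosen only for illustration.
print(f"{aggregate_availability(990, 1000):.2%}")  # prints 99.00%
```

Time availability suits services where being reachable is what matters, while aggregate availability weights the result by how many requests were actually affected.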

For other services, additional factors are important, including speed (latency), accuracy (correctness), and volume (throughput).

An example calculation for latency is as follows:

  • Suppose 10 different users send identical HTTP requests to your website, and they are all served properly.
  • The return times are monitored and recorded as follows: 1 ms, 3 ms, 3 ms, 4 ms, 1 ms, 1 ms, 1 ms, 5 ms, 3 ms, and 2 ms.
  • The average response time, or latency, is 24 ms / 10 returns = 2.4 ms.
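The same calculation can be written as a short Python sketch. The variable names are illustrative; the sample values are the ten return times listed above.

```python
from statistics import mean

# Return times, in milliseconds, for the ten requests in the example.
latencies_ms = [1, 3, 3, 4, 1, 1, 1, 5, 3, 2]

# Average latency is the total response time divided by the number of requests.
average_latency_ms = mean(latencies_ms)
print(f"{average_latency_ms} ms")  # prints 2.4 ms
```

An average can hide slow outliers, which is one reason percentile-based latency targets are also common.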

Choosing key metrics makes explicit how the performance of a service is evaluated, and therefore what factors pose a risk to service health. In the above example, identifying latency as a key metric indicates average return time as an essential property of the service. Thus, a risk to the reliability of the service is “slowness,” or high latency.

Define failure

In addition to measuring risks, it’s important to clearly define which risks the system can tolerate without compromising quality and which risks must be addressed to ensure quality.

This article provides an example of two types of measurements that address failure: mean time to failure (MTTF) and mean time between failures (MTBF).

The most robust way to define failures is to set SLOs, monitor your services for violations in SLOs, and create alerts and processes for fixing violations. These are discussed in the following sections.

Error budgets

The development of new production features always introduces new potential risks and failures; aiming for a 100% risk-free service is unrealistic. The way to align the competing incentives of pushing development and maintaining reliability is through error budgets.

An error budget provides a clear metric that allows a certain proportion of failure from new releases in a given planning cycle. If the number or length of failures exceeds the error budget, no new releases may occur until a new planning period begins.

The following is an example error budget.

Planning cycle: Quarter
Total possible availability: 2,190 hours
SLO: 99% time availability
Error budget: 1% time availability = 21.9 hours

Suppose the development team plans to release 10 new features during the quarter, and the following occurs:

  • The first feature doesn’t cause any downtime.
  • The second feature causes downtime of 10 hours until fixed.
  • The third and fourth features each cause downtime of 6 hours until fixed.
  • At this point, the error budget for the quarter has been exceeded (10 + 6 + 6 = 22 > 21.9), so the fifth feature cannot be released.

In this way, the error budget has ensured an acceptable feature release velocity while not compromising reliability or degrading user experience.
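The release gate described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production policy: the class and method names are hypothetical, and it assumes a 99% SLO, which is the target that yields the 21.9-hour budget used in the example (1% of 2,190 hours).

```python
from dataclasses import dataclass, field


@dataclass
class ErrorBudget:
    total_hours: float   # total possible availability in the planning cycle
    slo_percent: float   # availability target, e.g. 99.0 (assumed here)
    downtime_hours: list = field(default_factory=list)

    @property
    def budget_hours(self) -> float:
        # The budget is the share of the period that the SLO leaves uncovered.
        return self.total_hours * (100 - self.slo_percent) / 100

    @property
    def spent_hours(self) -> float:
        return sum(self.downtime_hours)

    def can_release(self) -> bool:
        # Releases stop once accumulated downtime exceeds the budget.
        return self.spent_hours <= self.budget_hours

    def record_release(self, downtime: float) -> None:
        self.downtime_hours.append(downtime)


budget = ErrorBudget(total_hours=2190, slo_percent=99.0)
for downtime in [0, 10, 6, 6]:  # downtime caused by features one through four
    budget.record_release(downtime)

print(budget.budget_hours, budget.spent_hours, budget.can_release())
```

After the fourth feature, 22 hours have been spent against a 21.9-hour budget, so `can_release()` returns `False` and the fifth feature waits for the next planning cycle.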

Set service level objectives (SLOs)

The best way to set performance expectations is to set specific targets for different system risks. These targets are called service level objectives, or SLOs. The following table lists examples of SLOs based on different risk measurements.

Time availability: Website running 99% of the time
Aggregate availability: 99% of user requests processed
Latency: 1 ms average response time per request
Throughput: 10,000 requests handled every second
Correctness: 99% of database reads accurate

Depending on the service, some SLOs may be more complicated than just a single number. For example, a database may exhibit 99.9% correctness on reads but have the 0.1% of errors it incurs always be related to the most recent data. If a customer relies heavily on data recorded in the past 24 hours, then the service is not reliable. In this case, it makes sense to create a tiered SLO based on the customer’s needs. Here is an example:

Level 1 (records within the last 24 hours): 99.99% read accuracy
Level 2 (records within the last 7 days): 99.9% read accuracy
Level 3 (records within the last 30 days): 99% read accuracy
Level 4 (records within the last 6 months): 95% read accuracy
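One way to encode a tiered SLO like this is as an ordered lookup from record age to required accuracy. The sketch below is illustrative only: the names are hypothetical, and "6 months" is approximated as 182 days, which is an assumption not stated in the tiers themselves.

```python
from datetime import timedelta

# (maximum record age, required read accuracy in percent), youngest tier first.
TIERS = [
    (timedelta(hours=24), 99.99),
    (timedelta(days=7), 99.9),
    (timedelta(days=30), 99.0),
    (timedelta(days=182), 95.0),  # assumption: 6 months treated as 182 days
]


def required_read_accuracy(record_age: timedelta):
    """Return the accuracy target for a record of the given age, or None
    if the record is older than every tier and so has no stated target."""
    for max_age, target in TIERS:
        if record_age <= max_age:
            return target
    return None


print(required_read_accuracy(timedelta(hours=3)))   # 99.99
print(required_read_accuracy(timedelta(days=20)))   # 99.0
```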

Costs of improvement

One of the main purposes of establishing SLOs is to track how reliability affects revenue. Revisiting the sample error budget from the section above, suppose there is a projected service revenue of $500,000 for the quarter. This can be used to translate the SLO and error budget into real dollars. Thus, SLOs are also a way to measure objectives that are indirectly related to system performance.

SLO    | Error budget | Revenue lost
95%    | 5%           | $25,000
99%    | 1%           | $5,000
99.9%  | 0.1%         | $500
99.99% | 0.01%        | $50

Using SLOs to track indirect metrics, such as revenue, allows one to assess the cost of improving a service. In this case, spending $10,000 on improving the SLO from 95% to 99% is a worthwhile business decision because it recovers $20,000 in revenue that would otherwise be lost. On the other hand, spending $10,000 on improving the SLO from 99% to 99.9% is not, because it recovers only $4,500.
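The comparison can be expressed as a small calculation. The following sketch assumes, as the table does, that revenue lost scales directly with the error budget; the function names are illustrative.

```python
QUARTERLY_REVENUE = 500_000  # projected service revenue from the example


def revenue_at_risk(slo_percent: float, revenue: float = QUARTERLY_REVENUE) -> float:
    """Revenue lost if the whole error budget for this SLO is consumed."""
    # Rounded to cents to avoid floating-point noise in the subtraction.
    return round(revenue * (100 - slo_percent) / 100, 2)


def improvement_pays_off(current_slo: float, target_slo: float, cost: float) -> bool:
    """True when the revenue recovered by a tighter SLO exceeds its cost."""
    recovered = revenue_at_risk(current_slo) - revenue_at_risk(target_slo)
    return recovered > cost


print(improvement_pays_off(95, 99, 10_000))    # True: $20,000 recovered
print(improvement_pays_off(99, 99.9, 10_000))  # False: $4,500 recovered
```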

Eliminate work through automation

One characteristic that distinguishes SREs from traditional DevOps is the ability to scale up the scope of a service without scaling the cost of the service. Called sublinear growth, this is accomplished via automation.

In a traditional development-operations split, the development team pushes new features, while the operations team dedicates 100% of its time to maintenance. Thus, a pure operations team will need to grow 1:1 with the size and scope of the service it is maintaining: If it takes O(10) system engineers to serve 1,000 users, it will take O(100) engineers to serve 10,000 users.

In contrast, an SRE team operating according to best practices will devote at least 50% of its time to developing systems that remove the basic elements of effort from the operations workload. Some examples of this include the following:

  • A service that detects which machines in a large fleet need software updates and that schedules software reboots in batches over regular time intervals.
  • A “push-on-green” module that provides an automatic workflow for the testing and release of new code to relevant services. 
  • An alerting system that automates ticket generation and notifies incident response teams.
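To make the first example concrete, here is a minimal sketch of batched reboot scheduling. Everything in it is hypothetical: the fleet is modeled as a dictionary from machine name to installed version, and the function only produces a schedule rather than rebooting anything.

```python
from datetime import datetime, timedelta


def schedule_reboots(machines, target_version, batch_size, start, interval):
    """Group outdated machines into batches and assign each batch a start time.

    machines maps machine name -> installed version (hypothetical fleet model).
    Returns a list of (scheduled_time, [machine names]) tuples.
    """
    outdated = [name for name, version in machines.items() if version != target_version]
    schedule = []
    for i in range(0, len(outdated), batch_size):
        batch_number = i // batch_size
        schedule.append((start + interval * batch_number, outdated[i:i + batch_size]))
    return schedule


fleet = {"web-1": "1.0", "web-2": "2.0", "web-3": "1.0", "web-4": "1.0"}
plan = schedule_reboots(
    fleet,
    target_version="2.0",
    batch_size=2,
    start=datetime(2023, 3, 31, 0, 0),
    interval=timedelta(hours=1),
)
for when, batch in plan:
    print(when.isoformat(), batch)
```

Rebooting in small batches at spaced intervals keeps most of the fleet serving traffic while updates roll out.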

Monitor systems

To maintain reliability, it is imperative to monitor the relevant analytics for a service and use monitoring to detect SLO violations. As mentioned earlier, some important metrics include:

  • The amount of time that a service is up and running (time availability)
  • The number of requests that complete successfully (aggregate availability)
  • The amount of time it takes to serve a request (latency)
  • The proportion of responses that deliver expected results (correctness)
  • The volume of requests that a system is currently handling (throughput)
  • The percentage of available resources being consumed (saturation)

Durability, the length of time that data is stored accurately, is sometimes measured as well.

Dashboards

A good way to implement monitoring is through dashboards. An effective dashboard will display SLOs, include the error budget, and present the different risk metrics relevant to the SLO.

Example of an effective SRE dashboard (source)

Logs

Another good way to implement monitoring is through logs. Logs that are both searchable in time and categorized via request are the most effective. If an SLO violation is detected via a dashboard, a more detailed picture can be created by viewing the logs generated during the affected timeframe.

Example of a monitoring log (source)
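As a rough illustration of logs that are searchable in time and categorized by request, the sketch below filters in-memory log entries by a time window and an optional request ID. The entry fields (`timestamp`, `request_id`, `message`) and the sample data are assumptions made for this example, not the format of any particular logging tool.

```python
from datetime import datetime


def logs_in_window(entries, start, end, request_id=None):
    """Return entries whose timestamp falls within [start, end],
    optionally restricted to a single request ID."""
    return [
        entry
        for entry in entries
        if start <= entry["timestamp"] <= end
        and (request_id is None or entry["request_id"] == request_id)
    ]


# Hypothetical log entries for demonstration.
entries = [
    {"timestamp": datetime(2023, 3, 31, 9, 0), "request_id": "r1", "message": "ok"},
    {"timestamp": datetime(2023, 3, 31, 9, 5), "request_id": "r2", "message": "timeout"},
    {"timestamp": datetime(2023, 3, 31, 10, 0), "request_id": "r2", "message": "ok"},
]

affected = logs_in_window(
    entries,
    start=datetime(2023, 3, 31, 9, 0),
    end=datetime(2023, 3, 31, 9, 30),
    request_id="r2",
)
print([entry["message"] for entry in affected])  # prints ['timeout']
```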

Whitebox versus blackbox

The type of monitoring discussed above that tracks the internal analytics of a service is called whitebox monitoring. Sometimes it’s also important to monitor the behavior of a system from the “outside,” which means testing the workflow of a service from the point of view of an external user; this is called blackbox monitoring. Blackbox monitoring may reveal problems with access permissions or redundancy.

Automated alerts and ticketing

One of the best ways for SREs to reduce effort is to use automation during monitoring for alerts and ticketing. The SRE process is much more efficient than a traditional operations process.

A traditional operations response may look like this:

  1. A web developer pushes a new update to an algorithm that serves ads to users.
  2. The developer notices that the latest push is reducing website traffic due to an unknown cause and manually files a ticket about reduced traffic with the web operations team.
  3. A system engineer on the web operations team receives a ticket about the reduced traffic issue. After troubleshooting, the issue is diagnosed as a latency issue caused by a stuck cache. 
  4. The web operations engineer contacts a member of the database team for help. The database team looks into the codebase and identifies a fix for the cache settings so that data is refreshed more quickly and latency is decreased.
  5. The database team updates the cache refresh settings, pushes the fix to production, and closes the ticket.

In contrast, an SRE operations response may look like this:

  1. The ads SRE team creates a deployment tool that monitors three different traffic SLOs: availability, latency, and throughput.
  2. A web developer is ready to push a new update to an algorithm that serves ads and uses the SRE deployment tool to do so.
  3. Within minutes, the deployment tool detects reduced website traffic. It identifies a latency SLO violation and creates an alert.
  4. The on-call site reliability engineer receives the alert, which contains a proposal for updated cache refresh settings to make processing requests faster.
  5. The site reliability engineer accepts the proposed changes, pushes the new settings to production, and closes the ticket.

By using an automated system for alerting and proposing changes to the database, the communication required, the number of people involved, and time to resolution are all reduced. 

The following Python sketch defines latency and throughput SLO thresholds and prints an alert whenever a violation is detected.

import statistics
import time

# Latency SLO threshold in seconds, plus a list that records observed latencies
LATENCY_SLO_THRESHOLD = 0.1
REQUEST_LATENCIES = []

# Throughput SLO threshold in requests per second, plus a request counter
# and the start time of the current measurement window
THROUGHPUT_SLO_THRESHOLD = 10000
REQUEST_COUNT = 0
WINDOW_START = time.monotonic()

# Record one served request so that both SLO checks can see it
def record_request(latency_seconds):
    global REQUEST_COUNT
    REQUEST_LATENCIES.append(latency_seconds)
    REQUEST_COUNT += 1

# Check if the latency SLO is violated and send an alert if it is
def check_latency_slo():
    if len(REQUEST_LATENCIES) < 2:
        return  # not enough samples to estimate a percentile
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile
    latency_99th_percentile = statistics.quantiles(REQUEST_LATENCIES, n=100)[-1]
    if latency_99th_percentile > LATENCY_SLO_THRESHOLD:
        print(f"Latency SLO violated! 99th percentile response time is {latency_99th_percentile} seconds.")

# Check if the throughput SLO is violated and send an alert if it is
def check_throughput_slo():
    elapsed_seconds = time.monotonic() - WINDOW_START
    if elapsed_seconds <= 0:
        return  # avoid dividing by zero immediately after startup
    # Requests per second since the window started; the threshold is treated
    # as the maximum load the service is expected to handle
    current_throughput = REQUEST_COUNT / elapsed_seconds
    if current_throughput > THROUGHPUT_SLO_THRESHOLD:
        print(f"Throughput SLO violated! Current throughput is {current_throughput} requests per second.")

Example of automated alert calls

Keep things simple

The best way to ensure that systems remain reliable is to keep them simple. SRE teams should be hesitant to add new code, preferring instead to modify and delete code where possible. Every additional API, library, and function that one adds to production software increases dependencies in ways that are difficult to track, introducing new points of failure.

Site reliability engineers should aim to keep their code modular. That is, each function in an API should serve only one purpose, as should each API in a larger stack. This type of organization makes dependencies more transparent and also makes diagnosing errors easier.

Playbooks

As part of incident management, playbooks for typical on-call investigations and solutions should be authored and published publicly. Playbooks for a particular scenario should describe the incident (and possible variations), list the associated SLOs, reference appropriate monitoring tools and codebases, offer proposed solutions, and catalog previous approaches.

Outline the release engineering process

Just as an SRE codebase should emphasize simplicity, so should an SRE release process. Simplicity is encouraged through three principles:

  • Smaller size and higher velocity: Rather than large, infrequent releases, aim for a higher frequency of smaller ones. This allows the team to observe changes in system behavior incrementally and reduces the potential for large system failures.
  • Self-service: An SRE team should completely own its release process, which should be automated effectively. This both eliminates work and encourages small-size, high-velocity pushes.
  • Hermetic builds: The process for building a new release should be hermetic, or self-contained. That is to say, the build process must be locked to known versions of existing tools (e.g., compilers) and not be dependent on external tools.

Version control

All code releases should be submitted within a version control system to allow for easy reversions in the event of erroneous, redundant, or ineffective code.

Code reviews

The process of submitting releases should be accompanied by a clear and visible code review process. Basic changes may not require approval, whereas more complicated or impactful changes will require approval from other site reliability engineers or technical leads.

Recap of SRE principles

The main principles of SRE are embracing risk, setting SLOs, eliminating work via automation, monitoring systems, keeping things simple, and outlining the release engineering process.

Embracing risk involves clearly defining failure and setting error budgets. The best way to do this is by creating and enforcing SLOs, which track system performance directly and also help identify the potential costs of system improvement. The appropriate SLO depends on how risk is measured and the needs of the customer. Enforcing SLOs requires monitoring, usually through dashboards and logs. 

Site reliability engineers focus on project work, in addition to development operations, which allows for services to expand in scope and scale while maintaining low costs. This is called sublinear growth and is achieved through automating repetitive tasks. Monitoring that automates alerting creates a streamlined operations process, which increases reliability. 

Site reliability engineers should keep systems simple by reducing the amount of code written, encouraging modular development, and publishing playbooks with standard operating procedures. SRE release processes should be hermetic and push small, frequent changes using version control and code reviews.

Written By:
Squadcast Community
Rajeev Ram
Last Updated:
October 16, 2024
Share this post:
The Guide to SRE Principles

Learn about SRE Guideline and SRE Principles. How to measure system performance for user-facing services, set service level objectives to define availability, and use error budgets to balance development and reliability.

Table of Contents:

    Site reliability engineering (SRE) is a discipline in which automated software systems are built to manage the development operations (DevOps) of a product or service. In other words, SRE automates the functions of an operations team via software systems. 

    The main purpose of SRE is to encourage the deployment and proper maintenance of large-scale systems. In particular, site reliability engineers are responsible for ensuring that a given system’s behavior consistently meets business requirements for performance and availability.

    Furthermore, whereas traditional operations teams and development teams often have opposing incentives, site reliability engineers are able to align incentives so that both feature development and reliability are promoted simultaneously.

    Basic SRE principles 

    In this article, we’ll cover SRE best practices, provide some examples of those key principles, and include relevant details and illustrations to clarify these examples.

    Principle Description Example
    Embrace risk No system can be expected to have perfect performance. It’s important to identify potential failure points and create mitigation plans. Additionally, it’s important to budget a certain percentage of business costs to address these failures in real time. A week consists of 168 hours of potential availability. The business sets an expectation of 165 hours of uptime per week to account for both planned maintenance and unplanned failures.
    Set service level objectives (SLOs) Set reasonable expectations for system performance to ensure that customers and internal stakeholders understand how the system is supposed to perform at various levels. Remember that no system can be expected to have perfect performance.
    • The website is up and running 99% of the time.
    • 99% of all API requests return a successful response.
    • The server output matches client expectations 99% of the time.
    • 99% of all API requests are delivered within one second.
    • The server can handle 10,000 requests per second.
    Eliminate work through automation Automate as many tasks and processes as possible. Engineers should focus on developing new features and enhancing existing systems at least as often as addressing real-time failures. Production code automatically generates alerts whenever an SLO is violated. The automated alerts send tickets to the appropriate incident response team with relevant playbooks to take action.
    Monitor systems Use tools, such as Squadcast, to monitor system performance. Observe performance, incidents, and trends.
    • A dashboard that displays the proportion of client requests and server responses that were delivered successfully in a given time period.
    • A set of logs that displays the expected and actual output of client requests and server responses in a given time period.
    Keep things simple Release frequent, small changes that can be easily reverted to minimize production bugs. Delete unnecessary code instead of keeping it for potential future use. The more code and systems that are introduced, the more complexity created; it’s important to prevent accidental bloat. Changes in code are always pushed via a version control system that tracks code writers, approvers, and previous states.
    Outline the release engineering process Document your established processes for development, testing, automation, deployments, and production support. Ensure that the process is accessible and visible. A published playbook lists the steps to address reboot failure. The playbook contains references to relevant SLOs, dashboards, previous tickets, sections of the codebase, and contact information for the incident response team.

    Embrace risk

    No system can be expected to have perfect performance. It’s important to create reasonable expectations about system performance for both internal stakeholders and external users.

    Key metrics

    For services that are directly user-facing, such as static websites and streaming, two common and important ways to measure performance are time availability and aggregate availability.

    This article provides an example of calculating time availability for a service.

    For other services, additional factors are important, including speed (latency), accuracy (correctness), and volume (throughput).

    An example calculation for latency is as follows:

    • Suppose 10 different users serve up identical HTTP requests to your website, and they are all served properly.
    • The return times are monitored and recorded as follows: 1 ms, 3 ms, 3 ms, 4 ms, 1 ms, 1 ms, 1 ms, 5 ms, 3 ms, and 2 ms.
    • The average response time, or latency, is 24 ms / 10 returns = 2.4 ms.

    Choosing key metrics makes explicit how the performance of a service is evaluated, and therefore what factors pose a risk to service health. In the above example, identifying latency as a key metric indicates average return time as an essential property of the service. Thus, a risk to the reliability of the service is “slowness” or low latency.

    Define failure

    In addition to measuring risks, it’s important to clearly define which risks the system can tolerate without compromising quality and which risks must be addressed to ensure quality.

    This article provides an example of two types of measurements that address failure: mean time to failure (MTTF) and mean time between failures (MTBF).

    The most robust way to define failures is to set SLOs, monitor your services for violations in SLOs, and create alerts and processes for fixing violations. These are discussed in the following sections.

    Error budgets

    The development of new production features always introduces new potential risks and failures; aiming for a 100% risk-free service is unrealistic. The way to align the competing incentives of pushing development and maintaining reliability is through error budgets.

    An error budget provides a clear metric that allows a certain proportion of failure from new releases in a given planning cycle. If the number or length of failures exceeds the error budget, no new releases may occur until a new planning period begins.

    The following is an example error budget.

    Planning cycle Quarter
    Total possible availability 2,190 hours
    SLO 99.9% time availability
    Error budget 0.1% time availability = 21.9 hours

    Suppose the development team plans to release 10 new features during the quarter, and the following occurs:

    • The first feature doesn’t cause any downtime.
    • The second feature causes downtime of 10 hours until fixed.
    • The third and fourth features each cause downtime of 6 hours until fixed.
    • At this point, the error budget for the quarter has been exceeded (10 + 6 + 6 = 22 > 21.9), so the fifth feature cannot be released.

    In this way, the error budget has ensured an acceptable feature release velocity while not compromising reliability or degrading user experience.

    Set service level objectives (SLOs)

    The best way to set performance expectations is to set specific targets for different system risks. These targets are called service level objectives, or SLOs. The following table lists examples of SLOs based on different risk measurements.

    Time availability Website running 99% of the time
    Aggregate availability 99% of user requests processed
    Latency 1 ms average response rate per request
    Throughput 10,000 requests handled every second
    Correctness 99% of database reads accurate

    Depending on the service, some SLOs may be more complicated than just a single number. For example, a database may exhibit 99.9% correctness on reads but have the 0.1% of errors it incurs always be related to the most recent data. If a customer relies heavily on data recorded in the past 24 hours, then the service is not reliable. In this case, it makes sense to create a tiered SLO based on the customer’s needs. Here is an example:

    Level 1 (records within the last 24 hours) 99.99% read accuracy
    Level 2 (records within the last 7 days) 99.9% read accuracy
    Level 3 (records within the last 30 days) 99% read accuracy
    Level 4 (records within the last 6 months) 95% read accuracy

    Costs of improvement

    One of the main purposes of establishing SLOs is to track how reliability affects revenue. Revisiting the sample error budget from the section above, suppose there is a projected service revenue of $500,000 for the quarter. This can be used to translate the SLO and error budget into real dollars. Thus, SLOs are also a way to measure objectives that are indirectly related to system performance.

    SLO Error Budget Revenue Lost
    95% 5% $25,000
    99% 1% $5,000
    99.90% 0.10% $500
    99.99% 0.01% $50

    Using SLOs to track indirect metrics, such as revenue, allows one to assess the cost for improving a service. In this case, spending $10,000 on improving the SLO from 95% to 99% is a worthwhile business decision. On the other hand, spending $10,000 on improving the SLO from 99% to 99.9% is not.


    Eliminate work through automation

    One characteristic that distinguishes SREs from traditional DevOps is the ability to scale up the scope of a service without scaling the cost of the service. Called sublinear growth, this is accomplished via automation.

    In a traditional development-operations split, the development team pushes new features, while the operations team dedicates 100% of its time to maintenance. Thus, a pure operations team will need to grow 1:1 with the size and scope of the service it is maintaining: If it takes O(10) system engineers to serve 1,000 users, it will take O(100) engineers to serve 10,000 users.

    In contrast, an SRE team operating according to best practices will devote at least 50% of its time to developing systems that remove the basic elements of effort from the operations workload. Some examples of this include the following:

    • A service that detects which machines in a large fleet need software updates and that schedules software reboots in batches over regular time intervals.
    • A “push-on-green” module that provides an automatic workflow for the testing and release of new code to relevant services. 
    • An alerting system that automates ticket generation and notifies incident response teams.
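
As an illustration of the first item in the list above, the sketch below plans reboots for out-of-date machines in fixed-size batches. It is a simplified example, not a production scheduler: the fleet inventory is a plain dictionary, and `schedule_reboots` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta


def schedule_reboots(fleet, target_version, batch_size, start, interval):
    """Plan reboots for out-of-date machines in fixed-size batches.

    fleet: dict mapping machine name -> installed version (hypothetical inventory).
    Returns a list of (scheduled time, [machine names]) tuples.
    """
    outdated = sorted(name for name, version in fleet.items() if version != target_version)
    plan = []
    for i in range(0, len(outdated), batch_size):
        batch = outdated[i:i + batch_size]
        # Each batch starts one interval after the previous one.
        plan.append((start + interval * (i // batch_size), batch))
    return plan
```

Limiting each batch to a small share of the fleet keeps most machines serving traffic while updates roll out.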

    Monitor systems

    To maintain reliability, it is imperative to monitor the relevant analytics for a service and use monitoring to detect SLO violations. As mentioned earlier, some important metrics include:

    • The amount of time that a service is up and running (time availability)
    • The number of requests that complete successfully (aggregate availability)
    • The amount of time it takes to serve a request (latency)
    • The proportion of responses that deliver expected results (correctness)
    • The volume of requests that a system is currently handling (throughput)
    • The percentage of available resources being consumed (saturation)

    Sometimes durability, the length of time that data is stored accurately, is also measured.
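
Several of these metrics can be derived directly from per-request records. The sketch below assumes a hypothetical `Request` record with three fields and computes aggregate availability, correctness, average latency, and throughput over one observation window. Time availability, saturation, and durability need other data sources and are left out.

```python
from dataclasses import dataclass


@dataclass
class Request:
    succeeded: bool         # request completed without error
    correct: bool           # response matched the expected result
    latency_seconds: float  # time taken to serve the request


def summarize(requests, window_seconds):
    """Compute basic service metrics over one observation window."""
    total = len(requests)
    if total == 0:
        return None  # nothing observed, so no metric can be computed
    return {
        "aggregate_availability": sum(r.succeeded for r in requests) / total,
        "correctness": sum(r.correct for r in requests) / total,
        "average_latency_seconds": sum(r.latency_seconds for r in requests) / total,
        "throughput_per_second": total / window_seconds,
    }
```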

    Dashboards

    A good way to implement monitoring is through dashboards. An effective dashboard will display SLOs, include the error budget, and present the different risk metrics relevant to the SLO.

    Example of an effective SRE dashboard (source)
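
The numbers behind such a dashboard panel can be computed from two counters. The following sketch assumes a request-based availability SLO. `error_budget_report` and its field names are hypothetical and not tied to any particular dashboard tool.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarize error budget usage for a request-based availability SLO.

    slo_target is a fraction, e.g. 0.99 for a 99% SLO.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures > 0:
        consumed = failed_requests / allowed_failures
    else:
        consumed = float("inf") if failed_requests else 0.0
    return {
        "slo_target": slo_target,
        "observed": 1 - failed_requests / total_requests,
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,  # 1.0 means the budget is exhausted
        "slo_met": failed_requests <= allowed_failures,
    }
```

Showing `budget_consumed` alongside the SLO makes it clear how much room is left for risky changes before the objective is missed.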

    Logs

    Another good way to implement monitoring is through logs. Logs that are both searchable by time and categorized by request are the most effective. If an SLO violation is detected via a dashboard, a more detailed picture can be created by viewing the logs generated during the affected timeframe.

    Example of a monitoring log (source)
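
Narrowing logs to the affected timeframe, and optionally to a single request, can be sketched as a simple filter. The entry layout here (a dictionary with `timestamp`, `request_id`, and `message` keys) is an assumption made for the example. Real logging systems provide their own query interfaces.

```python
from datetime import datetime


def logs_in_window(entries, start, end, request_id=None):
    """Return log entries inside [start, end], optionally for one request.

    entries: iterable of dicts with hypothetical keys
    "timestamp" (datetime), "request_id" (str), and "message" (str).
    """
    return [
        entry for entry in entries
        if start <= entry["timestamp"] <= end
        and (request_id is None or entry["request_id"] == request_id)
    ]
```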

    Whitebox versus blackbox

    The type of monitoring discussed above that tracks the internal analytics of a service is called whitebox monitoring. Sometimes it’s also important to monitor the behavior of a system from the “outside,” which means testing the workflow of a service from the point of view of an external user; this is called blackbox monitoring. Blackbox monitoring may reveal problems with access permissions or redundancy.
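
A blackbox check can be as small as a scripted request made the way an external user would make it. The sketch below uses Python's standard `urllib` module. The `probe` function, its result fields, and the one-second latency limit are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Fetch a URL the way an external client would and return the HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, but with an error status


def probe(url, fetch=http_status, max_latency_seconds=1.0):
    """Run one blackbox check: the service must answer 200 quickly enough."""
    started = time.monotonic()
    try:
        status = fetch(url)
    except OSError:
        status = None  # connection refused, DNS failure, timeout, and similar
    elapsed = time.monotonic() - started
    healthy = status == 200 and elapsed <= max_latency_seconds
    return {"url": url, "status": status, "latency_seconds": elapsed, "healthy": healthy}
```

Passing the fetch function as a parameter lets the same probe be exercised in tests without network access. A 403 status from such a probe, for example, would point to the kind of access permission problem that internal metrics can miss.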

    Automated alerts and ticketing

    One of the best ways for SREs to reduce effort is to use automation during monitoring for alerts and ticketing. Compared with a traditional operations process, an automated SRE process involves fewer handoffs and resolves incidents faster, as the two scenarios below illustrate.

    A traditional operations response may look like this:

    1. A web developer pushes a new update to an algorithm that serves ads to users.
    2. The developer notices that the latest push is reducing website traffic due to an unknown cause and manually files a ticket about reduced traffic with the web operations team.
    3. A system engineer on the web operations team receives a ticket about the reduced traffic issue. After troubleshooting, the issue is diagnosed as a latency issue caused by a stuck cache. 
    4. The web operations engineer contacts a member of the database team for help. The database team looks into the codebase and identifies a fix for the cache settings so that data is refreshed more quickly and latency is decreased.
    5. The database team updates the cache refresh settings, pushes the fix to production, and closes the ticket.

    In contrast, an SRE operations response may look like this:

    1. The ads SRE team creates a deployment tool that monitors three different traffic SLOs: availability, latency, and throughput.
    2. A web developer is ready to push a new update to an algorithm that serves ads and uses the SRE deployment tool to do so.
    3. Within minutes, the deployment tool detects reduced website traffic. It identifies a latency SLO violation and creates an alert.
    4. The on-call site reliability engineer receives the alert, which contains a proposal for updated cache refresh settings to make processing requests faster.
    5. The site reliability engineer accepts the proposed changes, pushes the new settings to production, and closes the ticket.

    By using an automated system for alerting and proposing changes to the database, the communication required, the number of people involved, and time to resolution are all reduced. 

    The following Python sketch implements latency and throughput thresholds with automated alerts triggered upon detected violations. It uses only the standard library and keeps recent request samples in memory; here, an alert is simply a printed message.

    
    import statistics
    import time
    from collections import deque
    
    # Latency SLO: 99th percentile response time must not exceed 0.1 seconds
    LATENCY_SLO_THRESHOLD = 0.1
    
    # Throughput SLO: the service is expected to handle at least 10,000 requests per second
    THROUGHPUT_SLO_THRESHOLD = 10000
    
    # Sliding window of (timestamp, latency in seconds) samples, one per request
    WINDOW_SECONDS = 60
    REQUESTS = deque()
    
    # Record a completed request and drop samples that fell out of the window
    def record_request(latency_seconds, now=None):
        now = time.time() if now is None else now
        REQUESTS.append((now, latency_seconds))
        while REQUESTS and REQUESTS[0][0] < now - WINDOW_SECONDS:
            REQUESTS.popleft()
    
    # Check if the latency SLO is violated and send an alert if it is
    def check_latency_slo():
        latencies = [latency for _, latency in REQUESTS]
        if len(latencies) < 2:
            return  # not enough samples to estimate a percentile
        # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile
        latency_99th_percentile = statistics.quantiles(latencies, n=100)[98]
        if latency_99th_percentile > LATENCY_SLO_THRESHOLD:
            print(f"Latency SLO violated! 99th percentile response time is {latency_99th_percentile:.3f} seconds.")
    
    # Check if the throughput SLO is violated and send an alert if it is
    def check_throughput_slo():
        current_throughput = len(REQUESTS) / WINDOW_SECONDS
        if current_throughput < THROUGHPUT_SLO_THRESHOLD:
            print(f"Throughput SLO violated! Current throughput is {current_throughput:.0f} requests per second.")
    
    Example of automated alert calls

    Keep things simple

    The best way to ensure that systems remain reliable is to keep them simple. SRE teams should be hesitant to add new code, preferring instead to modify and delete code where possible. Every additional API, library, and function that one adds to production software increases dependencies in ways that are difficult to track, introducing new points of failure.

    Site reliability engineers should aim to keep their code modular. That is, each function in an API should serve only one purpose, as should each API in a larger stack. This type of organization makes dependencies more transparent and also makes diagnosing errors easier.

    Playbooks

    As part of incident management, playbooks for typical on-call investigations and solutions should be authored and published publicly. Playbooks for a particular scenario should describe the incident (and possible variations), list the associated SLOs, reference appropriate monitoring tools and codebases, offer proposed solutions, and catalog previous approaches.

    Outline the release engineering process

    Just as an SRE codebase should emphasize simplicity, so should an SRE release process. Simplicity is encouraged through a few principles:

    • Smaller size and higher velocity: Rather than large, infrequent releases, aim for a higher frequency of smaller ones. This allows the team to observe changes in system behavior incrementally and reduces the potential for large system failures.
    • Self-service: An SRE team should completely own its release process, which should be automated effectively. This both eliminates work and encourages small-size, high-velocity pushes.
    • Hermetic builds: The process for building a new release should be hermetic, or self-contained. That is to say, the build process must be locked to known versions of existing tools (e.g., compilers) and not be dependent on external tools.
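
One small piece of a hermetic build, checking that the tools on a build machine match the versions the build is locked to, can be sketched as follows. This covers only version pinning, not full isolation from external tools. The function name and the tool and version data are hypothetical.

```python
def check_pinned_tools(pinned, installed):
    """Compare installed tool versions against the versions the build is locked to.

    Both arguments map tool name -> version string (hypothetical data).
    Returns a list of human-readable mismatches; empty means the check passed.
    """
    problems = []
    for tool, wanted in sorted(pinned.items()):
        found = installed.get(tool)
        if found is None:
            problems.append(f"{tool}: missing (pinned to {wanted})")
        elif found != wanted:
            problems.append(f"{tool}: found {found}, pinned to {wanted}")
    return problems
```

A release pipeline could refuse to start a build whenever this list is not empty, so that two builds of the same revision always use the same toolchain.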

    Version control

    All code releases should be committed to a version control system to allow for easy reversions in the event of erroneous, redundant, or ineffective code.

    Code reviews

    The process of submitting releases should be accompanied by a clear and visible code review process. Basic changes may not require approval, whereas more complicated or impactful changes will require approval from other site reliability engineers or technical leads.

    Recap of SRE principles

    The main principles of SRE are embracing risk, setting SLOs, eliminating work via automation, monitoring systems, keeping things simple, and outlining the release engineering process.

    Embracing risk involves clearly defining failure and setting error budgets. The best way to do this is by creating and enforcing SLOs, which track system performance directly and also help identify the potential costs of system improvement. The appropriate SLO depends on how risk is measured and the needs of the customer. Enforcing SLOs requires monitoring, usually through dashboards and logs. 

    Site reliability engineers focus on project work, in addition to development operations, which allows for services to expand in scope and scale while maintaining low costs. This is called sublinear growth and is achieved through automating repetitive tasks. Monitoring that automates alerting creates a streamlined operations process, which increases reliability. 

    Site reliability engineers should keep systems simple by reducing the amount of code written, encouraging modular development, and publishing playbooks with standard operating procedures. SRE release processes should be hermetic and push small, frequent changes using version control and code reviews.

    Read more on Best SRE Practices
