An organisation with mature Site Reliability Engineering (SRE) principles may conjure images of engineers with years of experience in DevOps and system administration, with a suite of specialised tools and experts dissecting each service outage. For an organisation that is thinking of implementing SRE principles, this is an intimidating image and may seem unattainable. The truth is that everyone can get started on their SRE journey by following a few elementary principles, which are outlined here. While we are not claiming that this is the only way forward when nobody on your team has SRE as a job title or role, it's a good place to start.
In this blog, we go over some of the most basic steps you can take in your journey towards SRE adoption. Unlocking the full value of SRE practices, however, requires a deeper commitment than just investing in tools or training; it needs organisation-wide cultural change as well. We also look at some of the ways you can implement SRE principles, including error budgets, automation, and more. At the heart of great SRE practice lies a willingness to break down existing silos, to communicate and coordinate across cross-functional teams, and to automate redundant processes.
Now that you have decided to adopt some SRE practices, the question comes up: where do I start? To start off, we look at some of the most useful concepts and practices in SRE and how they make deployments seamless and your engineers happy.
An error budget can be defined as the maximum amount of time your system can be down without facing consequences. These consequences can derive from external legal agreements (SLAs) or from internal organisational goals (SLOs). Error budgets are important because they help your development and IT operations teams ensure that shipping new features does not come at the price of downtime and inconvenience for your users. For example, if your product is running faultlessly, your developers can spend the error budget to create and deploy more features. If you have run out of error budget for the month, publishing new features takes a back seat until operational issues have been resolved.
Deciding the error budget will depend on the kind of service you are providing. Are you running a banking platform that needs to be available almost 24/7? Are you running an OTT platform specialising in live streaming sports? An effective error budget must also take into account external problems that you won't have control over, such as internet connectivity going down, your remote machines becoming the victim of a DDoS attack, and similar unforeseen problems. Your error budget will be derived from the SLOs that you have picked. In the next section of this blog, we look at how you can pick SLOs that would be most helpful for your organisation.
An error budget is calculated with the following formula.
Error Budget = (1 - SLO of the service)
For instance, a service with a 99.9% SLO has a 0.1% error budget.
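To make the formula concrete, here is a minimal Python sketch that turns an SLO into an error budget and then into an allowed amount of downtime. The function names and the 30-day window are assumptions chosen for illustration; use whatever measurement window your SLO is defined over.

```python
from datetime import timedelta


def error_budget(slo: float) -> float:
    """Return the error budget as a fraction, given an SLO such as 0.999."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("SLO must be a fraction in (0, 1]")
    return 1.0 - slo


def allowed_downtime(slo: float, window: timedelta) -> timedelta:
    """Translate the error budget into downtime allowed over a window."""
    return window * error_budget(slo)


if __name__ == "__main__":
    # Assumption: the SLO is measured over a rolling 30-day window.
    window = timedelta(days=30)
    budget = allowed_downtime(0.999, window)
    # A 99.9% SLO over 30 days leaves roughly 43.2 minutes of downtime.
    print(f"Allowed downtime: {budget.total_seconds() / 60:.1f} minutes")
```

Expressing the budget in minutes rather than as a percentage tends to make it easier to discuss with people outside the engineering team.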
While error budgets are usually used as an unambiguous criterion for SREs to accept changes from the development team, they are still useful for organisations without SREs. Development teams can use the error budget to decide whether to deploy changes and whether to work on riskier features.
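A development team without SREs could encode that decision as a simple check. The sketch below is one possible policy, not a standard: the function names, the three outcomes and the 50% threshold for risky work are all assumptions made for illustration.

```python
def budget_remaining(slo: float, window_minutes: float, downtime_minutes: float) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("SLO must be a fraction strictly between 0 and 1")
    budget_minutes = (1.0 - slo) * window_minutes
    return 1.0 - downtime_minutes / budget_minutes


def release_decision(remaining: float, risky_threshold: float = 0.5) -> str:
    """Map the remaining budget to a release policy (hypothetical thresholds)."""
    if remaining <= 0:
        return "freeze"         # budget exhausted: fix operational issues first
    if remaining < risky_threshold:
        return "low-risk-only"  # budget running low: hold back riskier features
    return "ship"               # plenty of budget: deploy as usual


if __name__ == "__main__":
    window = 30 * 24 * 60  # assumption: a 30-day window, in minutes
    for downtime in (10, 30, 50):
        remaining = budget_remaining(0.999, window, downtime)
        print(downtime, "min down ->", release_decision(remaining))
```

The thresholds matter less than agreeing on them in advance, so the decision to slow down is made by policy rather than argued over during an incident.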
The adoption of SRE practices should ideally evolve organically from your existing processes. The first question to ask is “Which metrics are most important to my customers?” If you run a customer-facing website, the most common would be uptime, latency and volume. Those measurements can become your SLIs (service level indicators), and the targets you set for them become your SLOs (service level objectives). SLOs should ideally shape up as a natural outcome of customer requirements. It is of course tempting to let your IT staff determine your SLOs, but for a more sustainable solution the SLOs must come from things that directly impact your customers. The easiest way to pick SLOs is to work backwards from your business objectives. Each metric you pick must be something that you can improve upon and that does not depend on external factors. It is always better to start with fewer SLOs and pick the ones that you feel are most important for your product. The vital thing to remember is that an SLO must be quantifiable.
“SLOs have a formidable use as metric-based indicators that show you what needs to be improved in your systems, its capabilities, and where you can get your best “bang for buck” when it comes to focusing your work efforts. However, SLOs must be influenced by data, and that data can only come from your customers. A lot of IT professionals tend to think that they know the best metrics, and they do; the only problem is that they are the best metrics for monitoring systems, not for improving customer satisfaction,” says Adam Hammond, DevOps Engineer at Megaport, in his extensive blog called “Choosing SLOs that users need, not the ones you want to provide”.
Finally, when you define your SLOs, remember that a good SLO should be S.M.A.R.T.
Additional reading: a narrative case study, “How small changes to your SLOs can be SMART for your business”
SLOs are a great way to generate metrics about what is important to the business or its customers. Those insights are critical even if you don’t have a separate SRE team.
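As a rough illustration of what "quantifiable" means in practice, the following Python sketch computes two SLIs from a list of request records and compares them with SLO targets. The `Request` shape, the 300 ms latency threshold and the target values are hypothetical; real numbers should come from what your customers need.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    ok: bool           # whether the request succeeded
    latency_ms: float  # how long the request took


def availability_sli(requests: Sequence[Request]) -> float:
    """Fraction of requests that succeeded."""
    if not requests:
        raise ValueError("need at least one request")
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float) -> float:
    """Fraction of requests served within the latency threshold."""
    if not requests:
        raise ValueError("need at least one request")
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the calculation.
    sample = [Request(True, 100), Request(True, 200),
              Request(True, 400), Request(False, 900)]
    # Hypothetical SLO targets for each SLI.
    slos = {"availability": 0.999, "latency": 0.95}
    slis = {"availability": availability_sli(sample),
            "latency": latency_sli(sample, threshold_ms=300)}
    for name, value in slis.items():
        status = "met" if value >= slos[name] else "missed"
        print(f"{name}: SLI={value:.3f} target={slos[name]} ({status})")
```

Each SLI here is a ratio of good events to total events, which keeps it directly comparable with a percentage target.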
If you have been running a production system then toil will be familiar to you. Google’s SRE book defines Toil as “The kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” All organisations that are scaling up their product have had to deal with toil at some point in their expansion. The first step in tackling toil in your organisation is to identify it properly. Sometimes this can be challenging, as toil is often disguised as seemingly important work. A significant aspect of good SRE implementation is recognising toil early and automating it away. There is a cultural aspect to toil as well. Many teams may have included “busy work” in their schedules without realising its potential for long-term damage. Naturally, all production systems require human intervention to run optimally, but the number of people managing them should not grow linearly with the addition of every new user, virtual machine or service. Toil also has a detrimental effect on your engineering team's morale and productivity.
Here are some examples of toil commonly faced by growing organisations:
An effective way of identifying toil is to survey your engineers. Here are some sample questions you can ask your engineering team to pinpoint problem areas.
Q: Over the past month, approximately how much of your time was spent on toil?
Q: What, according to you, were the top five sources of toil?
Q: Is there toil that could be automated but that you are not able to automate due to cultural reasons?
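Once the answers are in, even a small script can show where to focus. The sketch below is a minimal example of summarising such a survey; the response format (a `toil_percent` figure and a `top_sources` list per engineer) and the sample answers are assumptions invented for illustration.

```python
from collections import Counter
from statistics import mean


def summarise_toil_survey(responses, top_n=3):
    """Return the average share of time spent on toil and the most cited sources."""
    average = mean(r["toil_percent"] for r in responses)
    sources = Counter(s for r in responses for s in r["top_sources"])
    return average, sources.most_common(top_n)


if __name__ == "__main__":
    # Made-up survey answers, only to show the expected input shape.
    responses = [
        {"toil_percent": 40, "top_sources": ["manual deploys", "ticket triage"]},
        {"toil_percent": 20, "top_sources": ["manual deploys"]},
        {"toil_percent": 30, "top_sources": ["certificate renewals", "manual deploys"]},
    ]
    average, top_sources = summarise_toil_survey(responses)
    print(f"Average time on toil: {average:.0f}%")
    for source, votes in top_sources:
        print(f"{votes} engineer(s) cited: {source}")
```

The sources cited by the most engineers are usually the best first candidates for automation.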
Without a separate SRE team, toil is doubly dangerous: not only is your limited engineering time being spent on activity that adds no value, but your engineers are probably spending more time on it than an experienced SRE would, since it’s work that they don’t specialise in.
Automation is a substantial aspect of good SRE practices. It is used to eliminate the tasks that have been identified as toil in the system. This can include running scripts when certain events occur, monitoring clusters, automating full-scale code deployments (infrastructure as code) and auto-configuring virtual machines in the cloud. SREs automate to keep their workload under control and to ensure that it does not increase linearly with the number of users or machines they are maintaining.
Automation helps in three ways:
With good automation in place, you can delay the need for specialist SREs and infrastructure people.
Here are some other best practices that are good to follow:
Teams and organisations of all sizes can begin adopting SRE best practices. If you are anticipating rapid expansion of your product or are plagued by frequent breakdowns, exploring the SRE process makes sense. It is important to remember that SRE is a journey and what works for others may not be the best solution for you. While it’s vital to learn from others, every organisation has its own unique journey towards adopting SRE. After adopting a dedicated incident management platform, it becomes easier to track the problem areas in your digital infrastructure, including by tracking important metrics like MTTA (mean time to acknowledge), MTTR (mean time to resolve) and others. This transition to a dedicated platform is the next big step on your SRE journey. More than just a set of metrics and practices, SRE envisions a culture where problems are resolved in a “blameless” environment, and where issues can be raised and fixed in a transparent manner.
This article is inspired by a talk originally given at LISA'19 by Squadcast with the same title "How to SRE without an SRE on your team".
What do you struggle with as a DevOps/SRE? Do you have ideas on how incident response could be done better in your organization? We would be thrilled to hear from you! Leave us a comment or reach out over a DM via Twitter and let us know your thoughts.