The last decade has seen widespread adoption of SRE (site reliability engineering) practices modelled on those laid out by Google. Many SysAdmins have observed this trend and are now considering becoming SREs, which raises the question: how much do the skills of an SRE and a SysAdmin overlap?
Both roles are concerned with IT operations, and their responsibilities overlap significantly. Broadly, Google defines SRE as software engineering principles applied to IT operations at scale. What does this mean in practice? SRE applies a few key principles to IT operations, such as accepting measured risk, reducing toil through automation and learning from failure. It also frequently involves technologies that may be new to some SysAdmins.
In this post we look at the growth areas and skills a SysAdmin needs to pick up to become an SRE. The transition requires some changes in mindset and some new technical skills, but it shouldn't be difficult for an experienced SysAdmin. Here are the changes you need to make to move successfully into an SRE role.
As a SysAdmin, the primary focus of your work has been to maintain order and keep the systems under your care running smoothly. SysAdmins have traditionally focused on keeping their infrastructure stable and secure and on eliminating any risk of failure. SREs, on the other hand, recognize that some amount of failure is inevitable. The error budget is an SRE concept that quantifies how much downtime your infrastructure can have before you are in breach of an SLO (service level objective). Armed with that knowledge, an SRE can decide whether to support agility and allow riskier changes while budget remains, or to be more cautious and risk averse once it is nearly spent. This allows SREs to leverage risk for the benefit of the product rather than futilely attempting to eliminate it and potentially becoming a bottleneck.
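To make the error budget idea concrete, here is a minimal sketch of the arithmetic for an availability SLO. The 99.9% target, the 30-day window and the function names are illustrative assumptions, not values prescribed by Google or by this post, and a real policy would be agreed with the other stakeholders.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows over a window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return how many minutes of budget are left; negative means the SLO is breached."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


# Hypothetical example: a 99.9% SLO over 30 days with 30 minutes of downtime so far.
budget = error_budget_minutes(0.999)
left = budget_remaining(0.999, downtime_minutes=30)
print(f"budget: {budget:.1f} min, remaining: {left:.1f} min")
# prints "budget: 43.2 min, remaining: 13.2 min"
```

With most of the budget already spent, as in this example, a team might choose to slow down risky releases until the window rolls over. With plenty left, it can afford to ship faster.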
Much of SRE concerns itself with removing toil. In this context, toil refers to tasks that are repetitive and don't add any enduring value to the upkeep of your infrastructure. Removing toil often means automating jobs that are repetitive and time-consuming. By capping toil at no more than half of their working time, SREs free up the rest to improve other aspects of the system. Improvements in system stability and performance are encouraged, and creative solutions can materialise. SysAdmins are all too familiar with the repetitive configuration of hardware and software to fit the needs of their organisation. Most experienced SysAdmins have developed automation practices that work well within their own organisation but are not standardised. As an SRE you are expected to know standardisation practices that work across organisations of all types and across the major tech stacks. Automation using software such as Puppet, Chef and Ansible helps minimise repetitive steps and frees SysAdmins for more substantive and thorough work.
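One way to check the toil cap described above is to tag the tasks in a work log and compute the share of hours they take up. The sketch below assumes a hypothetical `Task` record and made-up hours. How a team actually classifies and tracks toil will vary.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    hours: float
    is_toil: bool  # True for repetitive work with no enduring value


def toil_fraction(tasks: list[Task]) -> float:
    """Return the fraction of logged hours spent on toil (0.0 if nothing is logged)."""
    total = sum(t.hours for t in tasks)
    if total == 0:
        return 0.0
    return sum(t.hours for t in tasks if t.is_toil) / total


# Hypothetical week of work for one engineer.
week = [
    Task("manually rotate logs", 6, is_toil=True),
    Task("restart stuck services by hand", 6, is_toil=True),
    Task("write deployment automation", 6, is_toil=False),
    Task("capacity planning", 2, is_toil=False),
]

TOIL_CAP = 0.5  # the "half of the work" limit mentioned above
fraction = toil_fraction(week)
print(f"toil: {fraction:.0%}, over cap: {fraction > TOIL_CAP}")
# prints "toil: 60%, over cap: True"
```

A result over the cap suggests that the most time-consuming toil tasks are the first candidates for automation.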
Automation is a substantial part of good SRE practice. It is applied to the tasks that have been identified as toil. This can include running scripts when certain events occur, monitoring clusters, automating full-scale deployments (for example with infrastructure as code) and automatically configuring virtual machines in the cloud. SREs automate to keep their workload under control and to ensure that it does not grow linearly with the number of users or machines they maintain. Other benefits of automation include more reliable deployments, improved performance and an overall reduction in cost.
SysAdmins are familiar with the RCA (Root Cause Analysis) process - when a failure occurs, the root cause is identified and a solution is put in place. The SRE practices described by Google go beyond the root cause and seek to understand the weaknesses in the system that led to the breakdown. Blameless postmortems encourage you to look for flaws in the existing reporting and operational processes rather than for someone to blame. Good SRE practice insists on keeping people in the loop when failure occurs, including your customers. This is a cultural shift for SysAdmins, who rarely keep customers informed when things go down. These practices also include a formal, written incident postmortem process. The conclusions from an incident postmortem must then be fed back into the planning process for future deployments. Failure takes on a fresh perspective from an SRE’s viewpoint - it is an opportunity to learn from your mistakes and do better next time around.
SRE culture demands much greater collaboration with other parts of the organisation. While SLOs bring greater transparency to operations, achieving consensus on those objectives and deciding on the next step can often be challenging. Business teams, product management, developers and SREs all have slightly different goals and incentives. Bridging the gap between these stakeholder perspectives may require conflict resolution skills. Explaining the trade-off between feature development and stability, and how error budgets can help decide between them, requires strong communication skills. Finally, good negotiation skills help ensure that SRE goals are accepted in the face of pressure from Business, Product or Development.
Transitioning from being a SysAdmin to an SRE requires brushing up on or acquiring various technical skills.
SysAdmins and SREs are both expected to be drivers of reliability and of change that benefits customers. If you are a SysAdmin, you have doubtless carried out many operations at the systems level, and that experience will be invaluable to you as an SRE. The necessary areas of growth include learning to adapt to change, since the SRE practices in vogue today may very well change tomorrow. An SRE is someone who brings practices that have been a mainstay of software development at scale to the operations side. This crossover pays dividends for the organisation, which finds solutions to recurrent problems without investing in more manpower and hardware. The future of SRE is bright, as more organisations seek to cut costs and streamline their IT operations.