From SysAdmin to SRE: How to evolve your skillset

December 16, 2020

The last decade has seen widespread adoption of SRE practices based on the best practices laid out by Google. Many SysAdmins have observed this trend and are now considering becoming SREs, which raises the question: how much do the skills of an SRE and a SysAdmin overlap?

Both roles are concerned with IT operations, and there is significant overlap in their responsibilities. Broadly, Google defines SRE as software engineering principles applied to IT operations at scale. What does this mean in practice? It means running operations according to a few key principles, such as managing risk with error budgets, reducing toil and automating repetitive work. It also frequently involves technologies that may be new to some SysAdmins.

In this blog we look at the growth areas and skills a SysAdmin needs to pick up to become an SRE. The transition requires some changes in mindset and some new technical skills, but it shouldn't be difficult for an experienced SysAdmin. Here are the changes you need to make to move successfully into an SRE role.

Mindset Changes

Embracing Risk

As a SysAdmin, the primary focus of your work has been to maintain order and keep the systems under your care running smoothly. SysAdmins have traditionally focused on keeping their infrastructure stable and secure and on eliminating any risk of failure. SREs, on the other hand, recognize that some amount of failure is inevitable. The error budget is an SRE concept that quantifies how much downtime your infrastructure can have before you are in breach of an SLO (service level objective). Armed with that knowledge, an SRE can decide either to support agility and allow riskier changes or to be more safety conscious and risk averse. This allows SREs to leverage risk for the benefit of the product rather than futilely attempting to eliminate it and potentially becoming a bottleneck.
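
To make the error budget idea concrete, here is a minimal sketch of the arithmetic. The class name, the 30-day window and the 25% reserve policy are assumptions made for illustration; they are not taken from Google's guidance or from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Downtime allowance implied by an availability SLO over a window."""

    slo_target: float      # e.g. 0.999 for "99.9% available"
    window_minutes: float  # length of the SLO window, in minutes

    @property
    def allowed_downtime_minutes(self) -> float:
        # The budget is the share of the window that the SLO does not cover.
        return (1.0 - self.slo_target) * self.window_minutes

    def remaining_minutes(self, downtime_so_far: float) -> float:
        # Negative values mean the SLO has already been breached.
        return self.allowed_downtime_minutes - downtime_so_far

    def can_take_risk(self, downtime_so_far: float,
                      reserve_fraction: float = 0.25) -> bool:
        # Hypothetical policy: allow riskier changes only while more than
        # `reserve_fraction` of the budget is still unspent.
        reserve = reserve_fraction * self.allowed_downtime_minutes
        return self.remaining_minutes(downtime_so_far) > reserve


# Assumed example: a 99.9% SLO measured over a 30-day window.
budget = ErrorBudget(slo_target=0.999, window_minutes=30 * 24 * 60)
print(round(budget.allowed_downtime_minutes, 1))  # 43.2
print(budget.can_take_risk(downtime_so_far=10))   # True
print(budget.can_take_risk(downtime_so_far=40))   # False
```

With roughly 43 minutes of budget for the month, a team that has used 10 minutes still has room for a risky release, while a team that has used 40 minutes would probably hold back.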

Reducing Toil

Much of SRE concerns itself with removing toil. In this context, toil refers to tasks that are repetitive and don't add any enduring value to the upkeep of your infrastructure. Removing it often means automating jobs that are repetitive and time-consuming. By capping toil at half of their workload, SREs free up time to improve other aspects of the system. Improvements in system stability and performance are encouraged, and creative solutions can materialise. SysAdmins are all too familiar with the repetitive configuration of hardware and software to fit the needs of their organisation. Most mature SysAdmins have developed automation practices that work well within their organisation but are not standardised. As an SRE, you are expected to know standardisation practices that work for organisations of all types and for all major tech stacks. Automation using software such as Puppet, Chef and Ansible helps minimise repetitive steps and frees SysAdmins for more substantive and thorough work.
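
As a rough illustration of keeping toil to half of the work, the sketch below computes the toil share from a log of working hours. The `(hours, is_toil)` log format and the sample week are assumptions made for the example.

```python
from typing import Sequence, Tuple

TOIL_CAP = 0.5  # the "half of the work" limit described above


def toil_fraction(entries: Sequence[Tuple[float, bool]]) -> float:
    """Return the share of logged hours that were spent on toil.

    `entries` is a sequence of (hours, is_toil) pairs.
    """
    total = sum(hours for hours, _ in entries)
    if total == 0:
        return 0.0
    toil = sum(hours for hours, is_toil in entries if is_toil)
    return toil / total


# Hypothetical week: 24 of 40 hours went to repetitive manual work.
week = [(16, True), (8, True), (12, False), (4, False)]
share = toil_fraction(week)
print(share > TOIL_CAP)  # True: this week exceeded the cap
```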

Automate all the things

Automation is a substantial aspect of good SRE practice. It targets the tasks that have been identified as toil in the system. This can include running scripts when certain events occur, monitoring clusters, automating full-scale deployments (for example with Infrastructure as Code) and automatically configuring virtual machines in the cloud. SREs automate to keep their workload under control and to ensure that it does not grow linearly with the number of users or machines they maintain. Other benefits of automation include more reliable deployments, improved performance and lower costs all around.
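
"Running scripts when certain events occur" can be sketched as a small dispatcher that maps event types to remediation handlers. This is an illustrative pattern only; the event shape, the `disk_almost_full` event type and the `rotate_logs` handler are all hypothetical.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], str]

# Maps an event type to the remediation handlers registered for it.
_handlers: Dict[str, List[Handler]] = {}


def on_event(event_type: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a handler for the given event type."""
    def register(func: Handler) -> Handler:
        _handlers.setdefault(event_type, []).append(func)
        return func
    return register


@on_event("disk_almost_full")
def rotate_logs(event: dict) -> str:
    # A real handler would trigger a cleanup job here; this one only
    # reports what it would have done.
    return f"rotated logs on {event['host']}"


def dispatch(event: dict) -> List[str]:
    """Run every handler registered for the event; unknown types do nothing."""
    return [handler(event) for handler in _handlers.get(event["type"], [])]


print(dispatch({"type": "disk_almost_full", "host": "web-1"}))
```

Adding a new automated response is then a matter of registering one more handler, so the manual workload does not have to grow with the number of machines emitting events.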

Dealing with failure: Understanding SLOs and blameless postmortems

SysAdmins are familiar with the RCA (Root Cause Analysis) process: when a failure occurs, the root cause is identified and a solution is put in place. The best practices Google has created for SREs go beyond root causes and aim to understand the weaknesses in the system that led to the breakdown. Blameless postmortems encourage teams to look for flaws in existing reporting and operational processes rather than in individuals. Good SRE practices insist on keeping people in the loop when failure occurs, including your customers. This is a cultural shift for SysAdmins, who rarely tend to keep customers in the loop when things go down. These practices also include a formal written incident postmortem process. The conclusions from an incident postmortem must then be fed back into the planning process for future deployments. Failure takes on a fresh perspective from an SRE's viewpoint: it is an opportunity to learn from your mistakes and do better next time around.

Soft Skills

SRE culture demands much greater collaboration with other parts of the organisation. While SLOs bring greater transparency to operations, achieving consensus on those objectives and deciding on the next step can often be challenging. Business teams, product management, developers and SREs all have slightly different goals and incentives. Bridging the gap between these stakeholder perspectives may require conflict resolution skills. Explaining the trade-off between feature development and stability, and how error budgets can help decide between them, requires strong communication skills. Finally, good negotiation skills will ensure that SRE goals are accepted in the face of pressure from Business, Product or Development.

Technical Skills

Transitioning from SysAdmin to SRE requires brushing up on or acquiring various technical skills.

  • Programming & Testing Skills: The emphasis on toil reduction and automation in SRE requires significantly stronger programming and testing skills. Typically, an SRE should know one highly productive scripting language like Python and one high-performance systems language like Go.
  • Infrastructure as Code: Traditionally, infrastructure deployment is a slow, manual, labour-intensive process. Because of this, it is expensive, inelastic, inconsistent and unreliable. Infrastructure as Code (IaC) is an automation technique that brings the rigour of software engineering to infrastructure management. Tools like Ansible, Terraform, Puppet or Chef can be used to power an IaC initiative.
  • Cloud, Containers & Container Orchestration: Cloud and container services make something that was previously difficult to automate, physical hardware, manageable via standardised APIs. As an added benefit, they are usually far cheaper, more flexible and faster to provision than traditional hardware. They have also made the IaC technique far more powerful and useful. Knowledge of Amazon Web Services (AWS), Kubernetes and Docker is now considered a basic skill for SREs.
  • Modern Monitoring Tools: Active checking systems, metrics collection and log aggregation have been the traditional mainstays of monitoring. More recently, code instrumentation and distributed tracing have been added to this arsenal. Older de facto standard tools like Nagios, Ganglia and rsyslog have been surpassed by tools like Prometheus, Datadog and the ELK stack. APMs like New Relic are now key for instrumentation, and OpenTelemetry seems very promising as a distributed tracing tool. Familiarity with these platforms is a significant requirement for a good SRE.
  • Statistical Analysis: SRE culture demands hard data to support decision making. With the vast volumes of data being generated by monitoring tools, some basic statistical analysis is necessary to turn it into actionable information. This information can be used for capacity planning, release planning, continuous improvement and incident response.
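
As a small example of the basic statistical analysis mentioned in the last point, the sketch below summarises request latencies using Python's standard `statistics` module. The sample values and the 300 ms threshold are invented for illustration.

```python
import statistics
from typing import Dict, Sequence


def latency_summary(samples_ms: Sequence[float],
                    threshold_ms: float) -> Dict[str, float]:
    """Summarise latencies: mean, median, tail percentiles, share within threshold."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1 to 99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    within = sum(1 for s in samples_ms if s <= threshold_ms)
    return {
        "mean": statistics.mean(samples_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "within_threshold": within / len(samples_ms),
    }


# Hypothetical sample: nine ordinary requests and one very slow outlier.
samples = [120, 95, 110, 105, 2300, 98, 101, 130, 99, 115]
summary = latency_summary(samples, threshold_ms=300)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

A single outlier pulls the mean well above the median here, which is one reason percentiles are usually a better basis for latency decisions than averages.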

Conclusion

SysAdmins and SREs are both expected to be drivers of reliability and of change that benefits customers. If you are a SysAdmin, you have doubtless carried out many operations at the systems level that will be invaluable to you as an SRE. The necessary areas of growth include learning to adapt to change, since the SRE practices in vogue today may very well change tomorrow. An SRE is someone who brings practices that have been a mainstay of software development at scale to the operations side. This crossover pays dividends for the organisation, because SREs find solutions to recurrent problems without further investment in manpower and hardware. The future of SRE is bright as more organisations seek to cut costs and streamline their IT operations.

Written By:
Biju Chacko
Nir Sharma