Ever since I started tinkering with computers in high school, I have enjoyed both programming and systems administration tasks. During my time at university, I was contributing to open source projects, and I was also a Gentoo Linux developer. There, I had the opportunity to contribute in many areas, from infrastructure projects to various software tools. I always liked the idea of approaching operations through a software engineering lens. After learning about Site Reliability Engineering and its core principles, I was intrigued. I started exploring the skills required, and I realized that it was a good discipline for me to pursue.
SRE seeks to strike a balance between software release velocity and production stability, and it can influence how engineering teams work. So, introducing new concepts such as Service Level Objectives (SLOs), Error Budgets, and Production Readiness Reviews to the teams and the organization is a big challenge. Leadership might be reluctant and question the value of SRE because it challenges conventional wisdom. Realistically speaking, though, without buy-in from management, SRE cannot work and thrive. Therefore, it's important to introduce such concepts gradually, with a concrete proposal and proper KPIs to measure progress. Otherwise, it becomes quite difficult for management to sponsor such a cultural transformation.
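As a rough illustration of how an SLO turns into an error budget that can be tracked as a KPI, here is a minimal sketch. It assumes a simple time-based availability SLO over a rolling window; the function names, the 30-day window, and the example numbers are hypothetical and not taken from any particular team's setup.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) for an availability SLO over the window.

    Assumption: availability is measured as uptime / total time.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Hypothetical service: 99.9% availability target, 21.6 minutes of downtime so far.
    print(f"budget: {error_budget_minutes(0.999):.1f} min")
    print(f"remaining: {budget_remaining(0.999, 21.6):.0%}")
```

Under these assumptions, a 99.9% target over 30 days allows roughly 43 minutes of downtime, and a number like "budget remaining" is the kind of concrete figure that can go into a proposal to management.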
Currently, I believe my text editor (vim) and Spotify are just enough to keep me productive. In my opinion, the tools are only the means to let us do our work, and from time to time, we should read articles and assess new tools and techniques to see if they'll make us more productive.
In late 2013, I received an email from a recruiter about a Google SRE internship, which was a job title that I had never heard of before. So, I started to collect related resources to read in a document. At that time, online reading material was quite scarce. However, after the first SRECon in 2014 and the release of the first Google SRE book in 2016, the web exploded with articles and conference talks. Also, companies started to adopt SRE, and they were publishing blog posts about their experience. Throughout all that time, I continued reading and collecting resources about the role.
Then, I saw this “awesome” repository with a massive list of sub-lists that included resources on various computer science topics, and I realized that it would make sense to share my resource list on GitHub in the same form. That way, more folks interested in SRE would benefit from it, and they would contribute back with resources that I wasn't aware of. Thus, I published the awesome-sre repository some weeks after the release of the first Google SRE book. The first commit is from April 2016.
So far, I am happy with how the repo has evolved, and I get positive feedback about it when I meet people at meetups and conferences. I'm looking forward to seeing what the future holds.
Although this applies to all the engineering disciplines, in general, you need to be curious, learn from failures, and have a growth mindset. SRE is a broad discipline and requires you to have an end-to-end view of the stack and its underlying infrastructure and, additionally, participate in incident response. So, every day is a new learning experience.
Another piece of advice for people interested in SRE would be to start reading about reliability engineering in other industries such as Aviation, Maritime, and Health Care. That includes research papers, books, and even incident reports from accidents. I find reading such material very eye-opening, and in the process, you might also learn new techniques to implement in your environment. Exploring other industries allows you to create a more holistic view of Reliability Engineering that will help you become a well-rounded SRE.
With the speed at which technology and the IT industry are advancing, it's hard for me to predict what SRE will be in the far future. However, based on what we know, I believe that the foundations for SRE are already laid down. We've got plenty of conferences, books, and articles at our disposal. SRE is growing steadily. We already hear success and failure stories about implementing SRE, and I believe that many companies that initially were reluctant about SRE will begin evaluating it soon. Consequently, as the adoption rate increases, SRE will influence organizations in such a positive way that the layer between SRE and feature development will become thinner, and the whole organization will adopt an SRE mentality.
Another thing that I believe will have a significant impact on SRE is the current effort to apply learnings from Resilience Engineering, Human Factors, and Safety Management to complex software systems. That effort contributes towards changing how we approach Incident Response, how we create useful metrics for it, and how we eventually move away from classic root cause analysis and focus on the contributing factors instead. I believe we'll be in a position to get more insights from our incident response protocol, which will also enable us to write better post-incident reviews.
Our job is quite interruption-driven, so context switching and task delegation might affect our productivity. How to tackle this issue is very subjective, and it depends on your workload. I found that checklists can be used as an external mental buffer to reduce cognitive overload and help you regain your focus quickly. You can create a list to track your task progress or write down things that you'd like to implement in the future. You could use a list to create an agenda with things you want to discuss with your manager in your 1:1, or lists in the form of a curriculum for your daily learning process. Checklists require minimal effort and offer significant productivity gains.
Another thing that I find useful is setting (digital) reminders, either in the corporate chat or in the calendar. That way, if you get interrupted by a customer or an incident, you can set some reminders accompanied by a note to make sure you don't forget things while switching context.
I've seen that some companies, for marketing reasons and to attract more candidates, rename their sysadmin roles to SRE without any cultural transformation from the inside. Although SRE is trending nowadays, companies should avoid cargo culting SRE.
SRE is not a purely operational role, and establishing an SRE culture takes more than just renaming the job title.
Finally, I believe that people should not see SRE as the draconian head of production. Instead, they should see it as a team that influences feature development and is also influenced by it. Both should share common incentives: to make reliability a first-class citizen and to make customers happy.
There are plenty of practices that I could talk about for hours, but I will name some of my favorites. Otherwise, the interview will become boring :) :
- Design Documents: I can't describe how important it is to have design documents in place. A design document is a way to make sure everyone in the team and the organization is on the same page. It acts as a record of design decisions and offers context for them. Moreover, it can be used to propose ideas before we jump to the implementation.
- Wheel of Misfortune: For those who are not familiar with it, it's a role-playing game for incident management training, and its goal is to build the confidence of engineers who are part of an on-call rotation through simulated outage scenarios. I find gamification an essential element of SRE training, as it increases the motivation to participate and provides a good engagement model. I have created an open-source version of Wheel of Misfortune, which teams can fork, add their own outage scenarios to, and start practicing with.
- Production Readiness Review: A Production Readiness Review or PRR is used to assess services and decide whether they meet an organization's reliability practices and standards. A PRR usually comes as a checklist or in the form of a questionnaire. Practically, it is a set of reasonable practices that an organization has identified to maintain stability and improve the reliability of services. SREs are required to have an understanding of all components of the infrastructure, and they work with production environments, which require special considerations. Therefore, it is important to keep practices consistent across services in production and also to close potential gaps between the application and production. PRR is also a powerful tool for knowledge exchange.
Recently, I read the book "The Checklist Manifesto", which focuses on the origins and uses of checklists in various industries. It also provides tips on how to create useful checklists, which are a powerful tool for SRE, used in a wide range of areas, from incident response and troubleshooting to Production Readiness Reviews. Another insightful read lately was a research paper named "Nines are not enough: meaningful metrics for clouds" from Google. It describes how hard it is to define SLOs, offers tips on how to create fine-grained SLOs, and finally explains how to become a good "SLOgician".
Last month, I attended SRECon EMEA 2019 in Dublin, which had a vast collection of very inspiring talks that are now available online. Besides being a great place to meet fellow SREs and exchange ideas, SRECon has a committee that does a great job of setting the theme of the conference and focusing primarily on engineering principles rather than specific tech that might be irrelevant after three years.
I don't believe that SRE is competing with DevOps. SRE originated at Google around 2003, while DevOps emerged somewhere around 2008. As has already been discussed in plenty of talks, practically, class SRE implements DevOps. We could say that DevOps is more of a generalized set of practices and cultural guidelines, and SRE is a set of opinionated practices or an opinionated implementation of DevOps. Both have the same driving force: to break down organizational silos and enable faster releases without sacrificing production stability.
Follow the journey of more such inspiring SREs from around the globe through our SRE Speak Series.