Balancing fast-paced business requirements with the demands of keeping production services stable is not an easy task. Site Reliability Engineering (SRE) is an opinionated implementation of DevOps and is defined by Ben Sloss, VP of Engineering at Google, as “what happens when you ask a software engineer to design an operations function”. And it even comes with a completely free manual and workbook.
Although SRE aims to be a “prescription” on how to run complex systems the right way, reliability can mean different things in different contexts. And, usually, unless things go wrong it’s hard to prioritize reliability work ahead of features and bug fixes.
How can SREs encourage teams to think about their operational excellence? How can SREs go about making reliability part of everyone’s daily practices? How can SREs effectively influence people to take reliability seriously and incorporate SRE concepts and practices into their routines? It turns out these are among the most important questions every SRE faces.
When trying to disseminate a practice and cultivate change, you can usually go one of two routes: influence or authority. Influence is “the capacity to have an effect on the character, development, or behavior of someone or something, or the effect itself”. In your context, the goal is to provide best practices, resources, and tools in the hope teams adopt them.
In contrast, authority is “the power or right to give orders, make decisions, and enforce obedience”. Applied to your context, it would be to dictate that teams must follow and adopt several operational practices.
Both approaches can be valuable, depending on the context. For example, for critical systems (e.g. healthcare, aviation) authority might be required to ensure a certain level of safety. On the other hand, authority can be a detriment for teams, making them not feel part of the decision process, not taking into consideration their unique context, and alienating them. With that in mind, you want to be as influential as possible regarding reliability.
When it comes to improving influence there are several ways you can go about it. Applied to an SRE context, there are several tactics you can employ that will help you make reliability part of the discussion when building complex systems.
When people don’t understand your motivations and goals it’s natural for them to become defensive; it’s human nature. It will be a lot easier to spread a reliability mindset if people understand what you’re trying to achieve.
If you’re a central SRE team, make it crystal clear what your team does. Clarify whether it’s an operational team that takes care of services in production or a team more focused on guidelines and tooling. Create artifacts that people can access asynchronously (e.g. internal documentation, blog posts) and that make it easy to understand what you’re working towards. Give internal presentations about your work and roadmap, and make your team available (e.g. mailing list, Slack channel) for consultation or even just for informal chats.
In general, people will be more receptive and address your concerns if they know what you’re all about and that they can reach you when necessary.
A system’s reliability is determined, fundamentally, by its ability to do what its users need it to do. In practice, that means it is measured by how happy users are, and happy users are good for business. If you accept that reliability is one of the most important requirements of any service, then it is the users who determine whether that requirement is being met.
SRE work will be intimately tied to business goals. Satisfied users will generate value (e.g. revenue, product popularity) and reliability is a huge contributor to that perception. It will be important to drive this understanding that you’re not focusing on reliability just to be picky but because that concern will make your business prosper.
You’ve probably been in a situation where you’re trying to communicate with someone who does not speak the same language as you. Maybe you’ve gone to a foreign country, you’re asking for directions, and you don’t speak the local language. Eventually, you make yourself understood: you just want directions to that awesome attraction, and they do their best to point you the right way. More often than not, though, it’s a difficult exchange.
Similarly, if you approach product development teams without a clear way to talk about and measure reliability, it will be hard to reason about it. Creating a shared language to talk about reliability, assess it, and prioritize work will be instrumental to the success of your quest. The reliability stack will give you the basis for a framework that will make reliability conversations a lot easier. SLIs will provide you with the necessary reliability measurements, while SLOs will allow you to assess, within a certain period of time, how reliable your system is. With those pieces in place, Error Budgets will make it easier to prioritize work that addresses reliability concerns.
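To make the three layers of the reliability stack concrete, here is a minimal sketch of how they relate arithmetically. It assumes a simple request-based availability SLI (good requests divided by total requests); the `SLO` class and the function names are hypothetical and only for illustration, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A hypothetical availability SLO for one measurement window."""
    target: float  # e.g. 0.999 means 99.9% of requests should be good


def sli(good_events: int, total_events: int) -> float:
    """SLI: fraction of events that were good (1.0 when there was no traffic)."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(slo: SLO, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # Either no traffic or a 100% target: any failure exhausts the budget.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


def slo_met(slo: SLO, good_events: int, total_events: int) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli(good_events, total_events) >= slo.target
```

With a 99.9% target and 1,000,000 requests in the window, 500 failed requests would still meet the SLO and leave roughly half of the error budget, while 2,000 failed requests would overspend it. A number like that gives teams a shared, concrete signal for when to prioritize reliability work over features.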
Because there will always be bugs to fix and features to deliver, reliability will often be an afterthought. Getting buy-in will ensure teams have reliability in mind and will advocate for it on your behalf.
Identify key stakeholders that will help you “spread the message” and treat reliability like an obvious requirement. This will be highly dependent on your organization, but before diving into processes and tools, you should first focus on people. Maybe you need to get product development teams on board first, so that management sees not only that reliability is important but also that teams are taking it seriously and are receptive to prioritizing reliability work. Or maybe you need to address it the other way around, getting buy-in from executives so that development teams understand the need for reliability and feel safe spending time working on it.
Whatever route makes more sense in your context, it will definitely help you improve reliability awareness when key stakeholders are “on your side” and they themselves drive those discussions with peers as well as management.
SRE work is, effectively, work. There’s engineering work, but there’s also a lot of work related to advocating, coaching, or consulting. And it should be clear to everyone what you’re working on, what your goals are, what you’re trying to achieve, and what your roadmap is.
Whatever tools and processes product development teams are using, you should be using similar ones. This will help standardize how work is tracked and prioritized. It will allow teams to easily understand what you’re focusing on, if they require something from you that you’re not prioritizing or if your shared goals might be compromised.
Making sure your work is visible will assure teams you don’t have any hidden agenda and that you’re all working with the same goals in mind.
It’s very important to build bridges. People are a lot more receptive to hearing your thoughts and ideas if they trust you. If you start, out of the gate, telling people what they need to fix and prioritize, you’ll be met with a lot of resistance.
Start by sitting down with teams and understanding what they do, what their products are, what pain points they have, and what they would like to improve. Actively listening can take you a long way. Ask questions, clarify issues, and understand what’s at stake. You’ll want teams to see you as a partner who wants to make their lives easier, not as an adversary.
An “us vs them” mentality can take hold quickly if you start imposing reliability gates. Whether through lengthy manual processes or automated blocking checks, enforcement might come at the expense of the team’s goodwill. Instead, work with them to make sure those concerns are valid, share the engineering work that will address those concerns, and communicate well in advance about whether and when any blocking gate will be put in place.
SRE work is never-ending. It’s a day-to-day practice that needs to scale with the organization. And one of the best ways to scale it is to be self-service.
You should be building tools that would make it easy for teams to incorporate reliability concerns into their work. For example, you could be building and maintaining libraries that export the necessary metrics to your monitoring system or that standardize logging. Or you could be building automation capabilities that would help diagnose problems in your systems. Or you could be building tools that automate manual tasks and address known issues within your systems. Most importantly, you don’t want to be a bottleneck. You want teams to be as independent as possible to do their work and deliver value to the business.
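As one illustration of the "library that standardizes logging" idea, here is a minimal sketch built only on Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `get_logger` helper, and the chosen field names are assumptions made for this example; a real shared library would follow whatever format your own log pipeline expects.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON object with a fixed set of fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "service": self.service,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def get_logger(service: str) -> logging.Logger:
    """Return a logger for `service` that emits standardized JSON lines.

    Calling this repeatedly for the same service reuses the same logger
    and does not attach duplicate handlers.
    """
    logger = logging.getLogger(service)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False  # avoid double output via the root logger
    return logger
```

A team would then only need `log = get_logger("checkout")` followed by `log.info("order %s placed", order_id)` to produce logs in the shared format, without involving the SRE team at all. That is the self-service property the paragraph above is after.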
SRE involves a lot of engineering work encompassing, at the same time, a lot of communication. A lot of that communication is targeted at influencing teams to take reliability seriously and make it part of their work.
Making sure that teams understand what you're working towards is critical to getting people on board. You should communicate extensively, create artifacts, give internal talks, partner with teams, and make your work visible. Make sure teams understand that reliability is measured by how happy users are with your services, that reliability work is focused on keeping those users satisfied, and that the business will thrive as a result.
Get people on board, get their buy-in. Your message and goals will be easier to spread if you have advocates on your side. Identify key stakeholders, work with them, understand their concerns, and build with them a common understanding of what reliability looks like, one that they would happily share with others. Having a shared language or framework, like the reliability stack, will make it a lot easier to talk about, assess, and prioritize reliability work.
At the end of the day, you want to deliver value and enable teams to focus on delivering value to the business. Make yourself self-service. Build tools that help improve reliability and that are a no-brainer to use. Help teams solve pain points through automation. Build self-remediation tools to help address known issues in your system. And make sure you don’t become just another 'pain' or 'gate' that teams complain about and dread. You should be seen as a partner, as a team that works with the same goals in mind and that is there to help when necessary.