About 5 years ago I read Release It! Coming from extremely small startup environments, I had begun to learn a lot of these lessons just from experience (defensive programming, bounded resource control, operational visibility, critical signals for monitoring). I hadn’t realized that smart people were thinking about this, or that Resilience Engineering was a formal engineering domain with lots of research into how to detect and recover from errors, and eventually how to prevent them from ever occurring. When the Site Reliability Engineering book was published it reinforced that this was serious work and that the industry was about to take it seriously. I had been walking the line between engineering and operations and keeping services running, but didn’t realize that Google was doing the same thing and had a name for it: Site Reliability Engineering (SRE). Only in my most recent role do I have the official title of Site Reliability Engineer!
Site Reliability is still a young discipline, and the most challenging part is being an advocate: convincing others of its importance and demonstrating its impact through objective metrics (usually Service Level Objectives; SLOs). Feature and product focused engineering teams have a different focus, so the hardest work is coming up with low-friction processes and tooling that let those teams start incorporating SRE practices and principles into their day to day.
Service Level Objectives (SLOs), hands down. Service Level Objectives are the cornerstone of Site Reliability Engineering. They connect engineering, the client, and the customer, and provide a really elegant, easy to understand, and quantifiable feedback loop.
Next is some system for gathering these metrics, either time series (Prometheus, Datadog, etc.) or logging (Elastic). The final piece is some sort of alerting system. Alerts are what close the feedback loop and make SLOs actionable (make them enforceable); most of the listed tools have some solution for alerting. I’ve found that without alerts you’re just shipping metrics into the void and hoping that people are trained, and disciplined, enough to consult them.
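To make that feedback loop concrete, here is a minimal Python sketch of the SLO-to-alert path, assuming a hypothetical availability SLO and stand-in request counts; in a real setup the numbers would come from a system like Prometheus or Datadog and the alert would go to a pager or chat tool rather than stdout.

```python
# Minimal sketch of the SLO -> error budget -> alert feedback loop.
# The metric source and alert sink are stand-ins (hypothetical).

from dataclasses import dataclass


@dataclass
class SLO:
    name: str
    objective: float  # e.g. 0.999 means 99.9% of requests should succeed


def error_budget_remaining(slo: SLO, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget left for the window (can go negative)."""
    if total_requests == 0:
        return 1.0
    allowed_failures = (1.0 - slo.objective) * total_requests
    if allowed_failures == 0:
        return 0.0 if failed_requests else 1.0
    return 1.0 - (failed_requests / allowed_failures)


def check_and_alert(slo: SLO, total: int, failed: int, alert_threshold: float = 0.25) -> None:
    """Close the feedback loop: page when the remaining budget drops below the threshold."""
    remaining = error_budget_remaining(slo, total, failed)
    if remaining < alert_threshold:
        # Stand-in for a real alerting integration (PagerDuty, Opsgenie, Slack, ...).
        print(f"ALERT: {slo.name} has {remaining:.0%} of its error budget left")
    else:
        print(f"OK: {slo.name} has {remaining:.0%} of its error budget left")


if __name__ == "__main__":
    availability = SLO(name="checkout-availability", objective=0.999)
    # 30-day window: 1,000,000 requests, 800 failures -> 80% of the budget consumed.
    check_and_alert(availability, total=1_000_000, failed=800)
```

The exact burn-rate math varies by team; the point is simply that the SLO, the measurement, and the alert form one loop rather than three disconnected tools.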
I’ve been working on a platform named ValueStream to help teams understand their software delivery performance. Google SRE outlines the concept of “Four Golden Signals” for monitoring systems: throughput, latency, error rate, and saturation. These are useful signals for monitoring any system and also closely align with traditional manufacturing signals, stemming from lean manufacturing and the Toyota Production System. Additionally, the State of DevOps report and Accelerate rank the efficacy of delivery for teams using these kinds of metrics. The problem is that these simple metrics aren’t uniformly available in the most common software project management tools (Jira, Trello, GitHub, GitLab, Jenkins, etc.), and the tools that do expose some of them obviously require that every team and engineer fully adopt the tool. It’s very difficult to get basic performance metrics for software delivery, and this is compounded in multi-tool environments: suppose one team uses Trello to track their tasks, another team uses GitHub, and product uses something else for milestones.
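As a rough illustration of what those signals look like when applied to delivery rather than serving systems, here is a short Python sketch; the event shape and numbers are invented, and a real integration would normalize webhooks or API data from the tools above into something similar.

```python
# Rough sketch: computing delivery-focused "golden signals" from completed work items.
# The (started_at, finished_at, succeeded) tuples are hypothetical.

from datetime import datetime, timedelta
from statistics import median

events = [
    (datetime(2020, 5, 1, 9, 0), datetime(2020, 5, 1, 9, 20), True),
    (datetime(2020, 5, 1, 10, 0), datetime(2020, 5, 1, 11, 5), True),
    (datetime(2020, 5, 2, 14, 0), datetime(2020, 5, 2, 14, 45), False),
]

window = timedelta(days=7)

throughput = len(events) / window.days                              # items completed per day
latencies = [(end - start).total_seconds() / 60 for start, end, _ in events]
latency_p50 = median(latencies)                                     # median cycle time, minutes
error_rate = sum(1 for *_, ok in events if not ok) / len(events)    # share of failed items
# Saturation is the hardest to translate; work-in-progress count is one common proxy.
wip = 5  # hypothetical number of items currently in flight

print(f"throughput={throughput:.2f}/day latency_p50={latency_p50:.0f}m "
      f"error_rate={error_rate:.0%} wip={wip}")
```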
I created ValueStream to enable teams to uniformly measure their software delivery across all of their tools, while also enabling them to trace where work originates by modeling work as an actual graph. While ValueStream is starting as a place to centralize delivery metrics, its goal is to provide teams with actionable insights that help them succeed at their DevOps and delivery transformations.
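Here is a toy Python sketch of what linking work across tools into a graph might look like; the node names and edge scheme are made up for illustration and are not ValueStream’s actual data model.

```python
# Toy illustration: cross-tool work items linked as a directed graph.

from collections import defaultdict

# Edges point from originating work to the downstream work it produced (hypothetical IDs).
edges = [
    ("jira:EPIC-12", "trello:card-88"),      # product epic broke down into a team task
    ("trello:card-88", "github:pr-431"),     # the task produced a pull request
    ("github:pr-431", "jenkins:build-9001"), # the PR triggered a CI build
]

children = defaultdict(list)
for parent, child in edges:
    children[parent].append(child)


def downstream(node: str) -> list[str]:
    """Walk the graph to find everything a piece of work ultimately produced."""
    result, stack = [], [node]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            result.append(child)
            stack.append(child)
    return result


print(downstream("jira:EPIC-12"))
# ['trello:card-88', 'github:pr-431', 'jenkins:build-9001']
```

Once work is connected this way, delivery metrics can be rolled up from a Jenkins build all the way back to the epic that motivated it, regardless of which tool each team happens to use.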
I think in the near term (< 3 years) we are going to see products for managing Service Level Objectives (SLOs) and incident response, as well as metadata catalogs to inventory services and teams and dynamically calculate maturity (capability maturity model) scores. I’m super excited about this because a lot of companies are spending huge amounts of time and money developing these solutions in house, and the companies that are succeeding at this aren’t diffusing their successes throughout the industry.
In the long term I think there is going to be a fundamental shift in how we model systems. We’ll have enough compute and storage to model system state as a time series graph (the data structure). Graphs are the natural structure for systems, and we’ll see our system representations slowly start to be modeled as graphs. We’re already seeing this operationally with the increased popularity of tracing: instead of individual events, we have causally connected events and are able to see the state of a transaction as a timeline (i.e., the transaction’s system state as a time series). These new systems will be able to model the links between our physical and logical systems. For example, when an SLO is breached, such a system would model the relationships between the target service, its dependencies, and the events affecting them. For an SLO error-rate alert that fires, it would show the recent infrastructure changes (deploys, scale-up events, upstream/downstream dependency events), the recent tickets affecting the service and its upstream/downstream dependencies, and context around all connected services. I explored what this might look like using current tools in a recent blog post I published on SRE Knowledge Graphs.
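As a hedged sketch of the kind of query such a graph-backed system could answer when an SLO alert fires (“what changed recently on this service and its neighbors?”), here is a small Python example; the services, dependencies, and events are all invented.

```python
# Sketch: gather recent events on a service and its direct dependencies after an SLO alert.
# Dependencies and events are hypothetical; a real system would back this with a graph store.

from datetime import datetime, timedelta

dependencies = {
    "checkout": ["payments", "inventory"],
    "payments": ["bank-gateway"],
}

# (service, timestamp, description)
events = [
    ("payments", datetime(2020, 6, 1, 12, 5), "deploy v2.14.0"),
    ("inventory", datetime(2020, 6, 1, 11, 50), "autoscaling: scaled up 3 -> 6"),
    ("bank-gateway", datetime(2020, 5, 30, 9, 0), "ticket OPS-77: planned maintenance"),
]


def slo_breach_context(service: str, fired_at: datetime, lookback: timedelta) -> list[tuple]:
    """Collect recent events on the service and its direct dependencies."""
    scope = {service, *dependencies.get(service, [])}
    return [
        e for e in events
        if e[0] in scope and fired_at - lookback <= e[1] <= fired_at
    ]


fired = datetime(2020, 6, 1, 12, 30)
for svc, ts, desc in slo_breach_context("checkout", fired, timedelta(hours=2)):
    print(f"{ts:%H:%M} {svc}: {desc}")
# 12:05 payments: deploy v2.14.0
# 11:50 inventory: autoscaling: scaled up 3 -> 6
```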
To summarize: in the short term I think we’ll see cloud offerings for common SRE tools, and in the long term I think we’ll see those tools converge into graph-based intelligent systems that are able to surface important insights and anomalies (“Debugging as a service”) automatically.
Don’t make assumptions about the system. In my experience, errors happen when there is a mismatch between our mental models and reality, and I find myself to be exponentially more productive when I invest in learning what’s real (what’s actually happening) before making assumptions about the system. I never once thought, “that time spent learning the system was a waste.” Conversely, when I don’t make this initial investment I find myself far less productive and more likely to produce errors in my work.
SRE is already a really broad role that benefits from tons of diverse backgrounds and requires various levels of technical understanding. It can also be really specialized. Two years ago I focused on metric standardization, teaching about monitoring, metrics, and alerting. I then got to dig in for 6 months on distributed tracing, and now I’m able to dig in for 6 months on Service Level Objectives. It’s been very focused, deep work as opposed to generalist or ad hoc work.
The most important thing is to establish some sort of overarching KPI or objective measurement for each task being performed. It’s important to make the shortcomings of these metrics explicit and outline what they aren’t able to measure. Service Level Objectives are among the most important, but time to recovery, performance benchmarks, and resource usage are other examples. Establishing these metrics, instrumenting them, and then surfacing them demonstrates impact and gives you objective hooks for communicating with stakeholders and non-technical coworkers.
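For example, time to recovery can be turned into a simple, explicitly defined KPI with only a few lines of Python; the incident records below are hypothetical, and the caveat in the final comment is exactly the kind of shortcoming worth stating alongside the number.

```python
# Sketch: mean time to recovery as a surfaced KPI (incident records are hypothetical).

from datetime import datetime
from statistics import mean

# (detected_at, resolved_at) for each incident in the reporting period
incidents = [
    (datetime(2020, 4, 2, 3, 15), datetime(2020, 4, 2, 4, 0)),
    (datetime(2020, 4, 11, 16, 30), datetime(2020, 4, 11, 16, 55)),
    (datetime(2020, 4, 25, 9, 0), datetime(2020, 4, 25, 11, 10)),
]

recovery_minutes = [(end - start).total_seconds() / 60 for start, end in incidents]
mttr = mean(recovery_minutes)

print(f"incidents={len(incidents)} mean_time_to_recovery={mttr:.0f} minutes")
# Caveat worth stating alongside the number: MTTR says nothing about incident
# frequency or customer impact, which is why it complements SLOs rather than replacing them.
```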
The Google SRE book is the most important because it formalized the role, and I constantly reference it when discussing concepts, approaches, or the reasons behind doing certain things. The most inspiring book I’ve read recently has been “Thinking in Systems: A Primer” by Donella Meadows. It offers tools, heuristics, and approaches for understanding systems and interconnected components, which I’ve found especially relevant for Site Reliability Engineering. When errors happen they aren’t one-off events; they have many interconnected dependencies and relationships. Thinking in Systems is a toolkit for understanding those relationships and reasoning about their effects in a structured way.
Follow the journey of more such inspiring SREs from around the globe through our SRE Speak Series.