Danny Mican on his experience as an SRE at Auth0

December 2, 2019

How did you become an SRE?

About 5 years ago I read Release It! Coming from extremely small startup environments, I had begun to learn a lot of these lessons just from experience (defensive programming, bounded resource control, operational visibility, critical signals for monitoring). I hadn’t realized that smart people were thinking about this, or that Resilience Engineering was a formal engineering domain with lots of research around how to detect and recover from errors, which then evolved into how to prevent them from ever occurring. When the Site Reliability Engineering book was published, it further reinforced that this was serious and that the industry was about to take it seriously. I had been walking the line between engineering and operations and keeping services running, but didn’t realize that Google was doing it and had a name for it: Site Reliability Engineering (SRE). Only in my most recent role do I have the official title of Site Reliability Engineer!

What's the most challenging part of your job?

Site Reliability is still a young discipline, and the most challenging part is being an advocate: convincing others of its importance and demonstrating its impact through objective metrics (usually Service Level Objectives, or SLOs). Feature- and product-focused engineering teams have different priorities, so the hard part is coming up with low-friction ways, through process or tooling, for those teams to start incorporating SRE practices and principles into their day-to-day work.

What processes, tools and techniques can't you live without?

Service Level Objectives (SLOs), hands down. Service Level Objectives are the cornerstone of Site Reliability Engineering. They connect engineering, the client and the customer, and provide a really elegant, easy-to-understand and quantifiable feedback loop.

Next is some system for gathering these metrics, either time series (Prometheus, Datadog, etc.) or logging (Elastic). The final piece is some sort of alerting system. Alerts are what close the feedback loop and make the SLOs actionable (they make them enforceable); most of the listed tools have some solution for alerting. I’ve found that without alerts it’s just like putting metrics into the void and hoping that people are trained and disciplined enough to consult them.
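To make that feedback loop concrete, here is a minimal sketch of how an availability SLO, its error budget and an alert decision fit together. The `SLO` class, the function names and the alert threshold are illustrative assumptions, not something taken from the interview or from any particular monitoring tool; in practice the request counts would come from a metrics system such as the ones named above.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """Hypothetical availability objective over a window of requests."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def error_budget(slo: SLO, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates at this request volume."""
    return (1.0 - slo.target) * total_requests


def budget_consumed(slo: SLO, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget already used (1.0 means fully spent)."""
    budget = error_budget(slo, total_requests)
    if budget == 0:
        # A 100% target leaves no budget: any failure overspends it.
        return float("inf") if failed_requests else 0.0
    return failed_requests / budget


def should_alert(slo: SLO, total_requests: int, failed_requests: int,
                 threshold: float = 1.0) -> bool:
    """Fire once the consumed share of the budget reaches the threshold."""
    return budget_consumed(slo, total_requests, failed_requests) >= threshold


if __name__ == "__main__":
    slo = SLO(name="login-availability", target=0.999)
    print(budget_consumed(slo, total_requests=1_000_000, failed_requests=500))
    print(should_alert(slo, total_requests=1_000_000, failed_requests=1500))
```

As a rough illustration, if the 99.9% target and roughly 2.5 billion monthly logins mentioned elsewhere in this post were measured per login, the budget would work out to about 2.5 million failed logins per month. The alert is what turns that number from a dashboard figure into something a team acts on.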

How did ValueStream come about? Would love to hear more about this.

I’ve been working on a platform named ValueStream to help teams understand their software delivery performance. Google SRE outlines the concept of “Four Golden Signals” for monitoring systems: Throughput, Latency, Error rate and Saturation. These are useful signals for monitoring any system and also closely align with traditional manufacturing signals, stemming from lean manufacturing and the Toyota Production System. Additionally, the State of DevOps report and Accelerate rank the efficacy of delivery for teams using these metrics. The problem is that these simple metrics aren’t uniformly available in the most common software project management tools (Jira, Trello, GitHub, GitLab, Jenkins, etc.), and the tools that do expose some of these metrics require that every team and engineer fully use the tool. It’s very difficult to get basic performance metrics for software delivery, and this is compounded in multi-tool environments: suppose one team uses Trello to track their tasks, another team uses GitHub, and product uses something else for milestones.

I created ValueStream to enable teams to uniformly measure their software delivery across all of their tools, while also enabling teams to link where work is originating from by modeling work as an actual graph. While ValueStream is starting as a place to centralize delivery metrics, its goal is to be able to provide teams with actionable insights in order to help them succeed at their DevOps and delivery transformations.
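As a rough illustration of the idea (not ValueStream's actual implementation, which the interview does not describe), delivery signals become comparable once work items from different tools are normalized into one record shape. The `WorkItem` fields and the `delivery_metrics` function below are hypothetical names chosen for this sketch, and saturation is left out for brevity.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable


@dataclass(frozen=True)
class WorkItem:
    """Hypothetical tool-agnostic record for one finished unit of work."""
    source: str         # tool the item came from, e.g. "trello" or "github"
    started_at: float   # epoch seconds when work began
    finished_at: float  # epoch seconds when work completed
    succeeded: bool     # False for a failed build, reverted change, etc.


def delivery_metrics(items: Iterable[WorkItem], window_seconds: float) -> Dict[str, float]:
    """Compute throughput, mean latency and error rate over one time window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    finished = list(items)
    if not finished:
        return {"throughput_per_day": 0.0, "mean_latency_seconds": 0.0, "error_rate": 0.0}
    days = window_seconds / 86400.0
    failures = sum(1 for item in finished if not item.succeeded)
    return {
        "throughput_per_day": len(finished) / days,
        "mean_latency_seconds": mean(i.finished_at - i.started_at for i in finished),
        "error_rate": failures / len(finished),
    }


if __name__ == "__main__":
    items = [
        WorkItem("trello", started_at=0.0, finished_at=3600.0, succeeded=True),
        WorkItem("github", started_at=0.0, finished_at=7200.0, succeeded=False),
    ]
    print(delivery_metrics(items, window_seconds=86400.0))
```

The point of the common record is that the same three numbers can be reported whether a team tracks work in Trello, GitHub or something else.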

What, according to you, is the future of SRE?

I think in the near term (< 3 years) we are going to see products for managing Service Level Objectives (SLOs) and Incident Response, as well as Metadata products to inventory services and teams and to dynamically calculate Maturity (capability maturity model) scores. I’m super excited about this because a lot of companies are spending huge amounts of time and money developing these solutions in-house. Also, the companies that are succeeding at these aren’t diffusing their successes throughout the industry.

In the long term I think there is going to be a fundamental shift in how we model systems. We’ll have enough compute and storage to model system state as a time series graph (the data structure). Graphs are the natural structure for systems, and we’ll see our system representations slowly start to be modeled as graphs. We’re already seeing this operationally with the increased popularity of tracing. Instead of individual events, we have causally connected events and are able to see the state of transactions as a timeline (i.e., the transaction’s system state as a time series). These new systems will be able to model the links between our physical and logical systems. For example, when an SLO is breached, such a system will model the relationships between the target service, its dependencies, and the events affecting them. For an SLO error rate alert that fires, it would show the recent infrastructure changes (deploys, scale-up events, upstream and downstream dependency events), the recent tickets affecting the service and its upstream and downstream dependencies, and context around all connected services. I explored what this might look like using current tools in a recent blog post I published on SRE Knowledge Graphs.
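A small sketch of what that lookup could involve: given a service dependency graph and a stream of timestamped events, collect everything that recently touched the breached service or anything it depends on. The `Event` shape, the function names and the one-hour lookback are assumptions made for illustration, and for brevity the walk only follows downstream dependencies rather than both directions.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    """Hypothetical timestamped event attached to a service node."""
    service: str
    kind: str       # e.g. "deploy", "scale_up", "ticket"
    timestamp: int  # seconds since epoch


def related_services(dependencies: Dict[str, List[str]], start: str) -> List[str]:
    """Breadth-first walk from the breached service through its dependencies."""
    seen = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for dep in dependencies.get(current, []):
            if dep not in seen:
                seen.append(dep)
                queue.append(dep)
    return seen


def breach_context(dependencies: Dict[str, List[str]], events: Iterable[Event],
                   service: str, breach_time: int, lookback: int = 3600) -> List[Event]:
    """Events on the service or its dependencies within the lookback window."""
    in_scope = set(related_services(dependencies, service))
    recent = [
        e for e in events
        if e.service in in_scope and breach_time - lookback <= e.timestamp <= breach_time
    ]
    return sorted(recent, key=lambda e: e.timestamp)


if __name__ == "__main__":
    deps = {"login": ["db", "cache"], "db": ["storage"]}
    events = [
        Event("storage", "deploy", 900),
        Event("login", "ticket", 950),
        Event("billing", "deploy", 990),
    ]
    for event in breach_context(deps, events, "login", breach_time=1000, lookback=200):
        print(event)
```

Here a deploy on `storage` surfaces for a `login` breach even though the two are only connected through `db`, which is the kind of link that is hard to see when events are stored as unrelated rows.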

To summarize: in the short term I think we’ll see cloud offerings for common SRE tools, and in the long term I think we’ll see those tools converge into graph-based intelligent systems that are able to surface important insights and anomalies automatically (“Debugging as a service”).

Any productivity hacks that you would give to new SREs?

Don’t make assumptions about the system. In my experience, errors happen when there is a mismatch between our mental models and reality, and I find myself to be exponentially more productive when I invest in learning what’s real (what’s actually happening) before making assumptions about the system. I have never once thought: “that time learning the system was a waste of time”. Conversely, when I don’t make this initial investment I find myself way less productive and more likely to produce errors in my work.

What are some of the things people get wrong about this role?

SRE is already a really broad role that benefits from tons of diverse backgrounds and requires various levels of technical understanding. It can also be really specialized. Two years ago I focused on metric standardization and on teaching about monitoring, metrics and alerting. I got to dig in for 6 months on distributed tracing, and now I’m able to dig in for 6 months on Service Level Objectives. It’s been very focused, deep work as opposed to generalist or ad hoc work.

What are some of the best practices you’ve picked up along the way?

The most important thing is to establish some sort of overarching KPI or objective measurement for each task being performed. It’s important to make the shortcomings of these metrics explicit and outline what they aren’t able to measure. Service Level Objectives are one of the most important, but time to recovery, performance benchmarks and resource usage are other examples. Establishing these metrics, instrumenting them and then surfacing them demonstrates impact and gives objective hooks for communicating with stakeholders and non-technical coworkers.

Is there any book, video, talk, or tech that has inspired you lately, and why?

The Google SRE book is the most important because it formalized the role, and I constantly reference it when referring to concepts, approaches, or the reasons behind doing certain things. The most inspiring book I’ve read recently has been “Thinking in Systems: A Primer” by Donella Meadows. It has tools, heuristics and approaches for understanding systems and interconnected components, which I’ve found especially relevant for Site Reliability Engineering. When errors happen, they aren’t one-off events but have many interconnected dependencies and relationships. Thinking in Systems is a toolkit for understanding these relationships and reasoning about their effects in a structured way.

Follow the journey of more such inspiring SREs from around the globe through our SRE Speak Series.

Written By: Prakya Vasudevan

Last Updated: November 20, 2024
Disclaimer: The views discussed in this article are personally held by the author and do not in any way represent those of his/her employer.

Danny Mican loves learning about systems and making changes that positively impact client happiness, employee happiness and long-term stability and growth.

In his role as an SRE, he gets to help his organization measure the client experience and ensure that it exceeds their expectations. He is passionate about helping organizations and teams deliver outstanding customer experiences faster and smoother, with less overhead. He does so by understanding their problems and goals, measuring, coaching and sometimes even hands-on software engineering! :)

In his free time, you can catch him working on ValueStream, a platform he's building to help organizations measure their DevOps maturity. You can also find him sharing his SRE experiences on his blog.

Twitter & Medium: @dm03514

Stack Overflow: Danny Mican

You can find him on pretty much all sites as @dm03514 :)

Danny Mican, an SRE from Auth0, shares his thoughts on SRE and being SLO-driven to deliver outstanding customer experiences. Danny currently manages the reliability of systems that authenticate over 2.5 billion logins per month and are expected to have 99.9% (3 Nines) availability.
