Golden Signals - Monitoring from first principles

September 30, 2021 (Last Updated: November 20, 2024)

Building a successful monitoring process for your application is essential for high availability. In the first of this three-part blog series, Safeer discusses the four key SRE Golden Signals for metrics-driven measurement and the role they play in the overall context of monitoring.

Monitoring is the cornerstone of operating any software system or application effectively. The more visibility you have into the software and hardware systems, the better you are at serving your customers. It tells you whether you are on the right track and, if not, by how much you are missing the mark.

This is the first article in a three-part blog series covering the SRE Golden Signals.

So what should we expect from a monitoring system? Most of the monitoring concepts that apply to information systems apply to other projects and systems as well. Any monitoring system should be able to collect information about the system under monitoring, analyze and/or process it, and then share the derived data in a way that makes sense for the operators and consumers of the system.

The meaningful information we are trying to gather from the system is called a signal. The focus should always be on gathering signals relevant to the system. But just as in the radio communication technology this terminology is borrowed from, noise will interfere with signals - noise being the unwanted and often irrelevant information that gets gathered as a side effect of monitoring.

Traditional monitoring was built around active and passive checks and near-real-time metrics - good old Nagios and RRDtool worked this way. The practice gradually matured to favor metrics-based monitoring, which gave rise to popular platforms like Prometheus and Grafana.

Centralized log analysis and deriving metrics from logs became mainstream - the ELK stack was at the forefront of this change. The focus is now shifting to traces, and the term monitoring is increasingly being replaced by observability. Beyond this, APM (Application Performance Monitoring) and synthetic monitoring vendors offer various levels of observability and control.

All these platforms provide you with the tools to monitor anything, but they don’t tell you what to monitor. So how do we choose the relevant metrics from all this clutter and confusion? The crowded landscape of monitoring and observability makes the job harder, not to mention the effort needed to identify the right metrics and separate the signal from the noise. When things get complicated, one way to find a solution is to reason from first principles: deconstruct the problem, identify the fundamentals, and build on those. In this context, that means identifying the absolute minimum we need to monitor and building a strategy on top of that. On that note, let’s look at the popular strategy used to choose the right metrics.

SRE Golden Signals

SRE Golden Signals were first introduced in the Google SRE book, which defines them as the basic minimum metrics required to monitor any service. This model is about thinking of metrics from first principles, and it serves as a foundation for building monitoring around applications. The strategy is simple: for any system, monitor at least these four metrics - Latency, Traffic, Errors, and Saturation.
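
To make this concrete, the sketch below gathers one illustrative PromQL query per golden signal for a hypothetical HTTP service. The metric names assumed here (an http_requests_total counter with a code label, an http_request_duration_seconds histogram, and node_exporter CPU metrics) are common Prometheus conventions, not something this article prescribes; adjust them to whatever your own instrumentation exposes.

```python
# A hedged sketch: one PromQL query per golden signal for a hypothetical,
# Prometheus-instrumented HTTP service. Metric names are assumptions.
GOLDEN_SIGNAL_QUERIES = {
    # Latency: 99th percentile over the last 5 minutes, from histogram buckets
    "latency_p99": (
        "histogram_quantile(0.99, "
        "sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
    ),
    # Traffic: requests per second
    "traffic_rps": "sum(rate(http_requests_total[5m]))",
    # Errors: fraction of requests answered with a 5xx status
    "error_ratio": (
        'sum(rate(http_requests_total{code=~"5.."}[5m])) '
        "/ sum(rate(http_requests_total[5m]))"
    ),
    # Saturation: CPU busy fraction as reported by node_exporter
    "cpu_saturation": '1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))',
}
```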

Latency

Latency is the time taken to serve a request. While the definition seems simple enough, latency has to be measured from the perspective of either the client or the server application. For an application that serves a web request, the latency it can measure is the time delta between the moment it receives the first byte of the request and the moment the last byte of the response leaves the application. This includes the time the application took to process the request and build the response, and everything in between - disk seek latencies, downstream database queries, time spent in the CPU queue, and so on. Things get a little more complicated when measuring latency from the client perspective, because now the network between client and server also influences the latency. The client can be of two types: the first is another upstream service within your infrastructure, and the second - and more complex - is real users sitting somewhere on the internet, with no way of ensuring an always-stable network between them and the server. For the first kind, you are in control and can measure latencies from the upstream application. For internet users, employ synthetic monitoring or Real User Monitoring (RUM) to get an approximation of latencies. These measurements get more complicated when there is an array of firewalls, load balancers, and reverse proxies between the client and the server.
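
On the server side, the simplest version of this measurement is a timer wrapped around the request handler. The sketch below is a minimal illustration assuming a Python service; handle_request() and the record_latency sink are hypothetical placeholders, and timing inside the application only approximates the first-byte-in to last-byte-out delta described above.

```python
import time
from functools import wraps

# A minimal sketch (not a prescription): wrap a request handler and record the
# time from when the application starts processing a request to when the
# response is ready to leave it.
def measure_latency(record_latency):
    def decorator(handler):
        @wraps(handler)
        def wrapper(request):
            start = time.monotonic()                   # request received
            response = handler(request)                # process and build response
            record_latency(time.monotonic() - start)   # response about to leave
            return response
        return wrapper
    return decorator

observations = []  # in-memory sink, purely for illustration

@measure_latency(observations.append)
def handle_request(request):
    return {"status": 200, "body": "ok"}

print(handle_request({"path": "/"}), observations)
```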

There are certain things to keep in mind when measuring latencies. The first is to identify and segregate good latency from bad latency, i.e. the latency of successful requests versus failed requests. As the SRE book notes, the latency of an HTTP 500 error should be measured as bad latency and should not be allowed to pollute the HTTP 200 latencies - mixing the two could cause an error in judgment when you plan improvements to your request latencies.

Another important matter is the choice of metric type for latency. Averages and rates are not good choices for latency metrics, as a large latency outlier can get averaged out and blindside you. These outliers - otherwise called the “tail” - can be caught if latency is measured in buckets of requests: pick a reasonable set of latency buckets and count the number of requests per bucket. This allows the buckets to be plotted as histograms and surfaces the outliers as percentiles (or other quantiles).
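
As one possible implementation, the sketch below uses the Prometheus Python client (an assumed choice of stack): latencies are recorded into an explicit set of buckets and labelled by HTTP status code, so error latencies stay separate from success latencies and tail percentiles can be computed from the bucket counters. The bucket boundaries and metric name are illustrative.

```python
from prometheus_client import Histogram, start_http_server

# Latency histogram with explicit buckets, labelled by status code so that
# 500s do not pollute the 200 latencies. Bucket boundaries are illustrative.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Time taken to serve a request",
    ["code"],
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0),
)

def record_request(code: int, duration_seconds: float):
    # Each observation increments the matching bucket counter.
    REQUEST_LATENCY.labels(code=str(code)).observe(duration_seconds)

if __name__ == "__main__":
    start_http_server(8000)      # exposes /metrics; a real service keeps running
    record_request(200, 0.042)
    record_request(500, 3.1)

# Example PromQL to surface the tail, for successful requests only:
#   histogram_quantile(0.99,
#     sum by (le) (rate(http_request_duration_seconds_bucket{code="200"}[5m])))
```

Because the buckets are what a quantile estimate is later computed from, they need to cover the range of latencies you realistically expect to see.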

Traffic

Traffic refers to the demand placed on your system by its clients. The exact metric varies based on what the system serves, and there can be more than one traffic metric for a system. For most web applications, it could be the number of requests served in a specific time frame. For a streaming service like YouTube, it can be the amount of video content served. For a database, it would be the number of queries served, and for a cache, it could be the number of cache hits and misses.

A traffic metric can be further broken down based on the nature of requests. For a web request, this could be by HTTP code, HTTP method, or even the type of content served. For a video streaming service, content downloads could be categorized by resolution; for YouTube, the number and size of video uploads are traffic metrics as well. Traffic can also be categorized by geography or other common characteristics. One way to measure traffic is to record it as a monotonically increasing value - usually a metric of type “counter” - and then calculate the rate of this metric over a defined interval, say 5 minutes.
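
A minimal sketch of that counter-plus-rate approach, again assuming the Prometheus Python client and illustrative metric and label names:

```python
from prometheus_client import Counter

# Traffic as a monotonically increasing counter, broken down by HTTP method
# and status code.
HTTP_REQUESTS = Counter(
    "http_requests_total",
    "Total number of HTTP requests served",
    ["method", "code"],
)

def on_request_served(method: str, code: int):
    HTTP_REQUESTS.labels(method=method, code=str(code)).inc()

# The raw counter only ever grows; the traffic signal is its rate, e.g. in PromQL:
#   sum by (method) (rate(http_requests_total[5m]))   # requests/second over 5 minutes
```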

Errors

Errors are measured by counting the number of errors from the application and then calculating the rate of errors over a time interval. Errors per second is a common metric for most web applications. For example, errors could be 5xx server-side errors, 4xx client-side errors, or 2xx responses carrying an application-level error - wrong content, no data found, and so on. Like traffic, this typically uses a counter-type metric with a rate calculated over a defined interval.

An important decision to make here is what to consider an error. It might seem like errors are always obvious - 5xx responses, database access errors, and so on. But there is another kind of error, defined by your business logic or system design. For example, serving the wrong content for a perfectly valid customer request would still be an HTTP 200 response, but as per your business logic and the contract with the customer, it is an error. Or consider a downstream service that eventually returns a response to an upstream server, but only after the latency threshold defined by the upstream has timed out. The upstream will consider this an error - as it should - but the downstream may not be aware that it breached an SLO with its upstream (an SLO that is subject to change and may not be part of the downstream application’s design) and will count it as a successful request, unless the necessary contract is added to the code itself.
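
The sketch below illustrates one way such a contract could be encoded, assuming the Prometheus Python client: errors are counted with a type label that covers both obvious failures and business-logic errors, and a hypothetical deadline passed in by the upstream is checked so that an SLO breach is counted as an error on the downstream side too. fetch_content() and the request fields are placeholders.

```python
import time
from prometheus_client import Counter

# Errors beyond HTTP status codes: the "type" label distinguishes transport-level
# failures from errors defined by business logic or an upstream contract.
APP_ERRORS = Counter(
    "app_errors_total",
    "Errors observed by the application",
    ["type"],
)

def fetch_content(request):
    # Placeholder for a downstream lookup; returns None to simulate "no data found".
    return None

def handle_request(request):
    content = fetch_content(request)

    if content is None:                               # business-logic error, even on HTTP 200
        APP_ERRORS.labels(type="no_data_found").inc()
    elif content.get("id") != request.get("id"):
        APP_ERRORS.labels(type="wrong_content").inc()

    if request.get("deadline") and time.time() > request["deadline"]:
        # We finished, but after the upstream's latency budget: count the SLO
        # breach here too, so both sides see it.
        APP_ERRORS.labels(type="deadline_exceeded").inc()

    return content

print(handle_request({"id": 42, "deadline": time.time() - 1}))

# Error rate per second over a 5-minute window, per type, in PromQL:
#   sum by (type) (rate(app_errors_total[5m]))
```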

Saturation

Saturation is a sign of how utilized or “full” the system is. 100% utilization of a resource might sound ideal in theory, but a system nearing full utilization of its resources usually suffers performance degradation. The tail latencies we discussed earlier could be a side effect of a resource constraint at the application or system level. Saturation can happen to any resource the application needs: system resources like memory, CPU, or I/O; open file counts hitting the limit set by the operating system; or disk and network queues filling up. At the application level, it could be request queues filling up, the number of database connections hitting the maximum, or thread contention for a shared resource in memory.

Saturation is usually measured with a “gauge” metric type, which can go up or down, usually within a defined upper and lower bound. While not a saturation metric itself, the 99th percentile request latency (or other outlier metrics) of your service can act as an early warning signal. Saturation can have a ripple effect in a multi-tiered system: an upstream waits on a downstream response indefinitely or eventually times out, causing additional requests to queue up and resulting in resource starvation.
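
A minimal sketch of saturation exported as gauges, assuming the Prometheus Python client; the queue and connection-pool numbers would come from whatever your application actually uses, and the metric names are illustrative.

```python
from prometheus_client import Gauge

# Saturation as gauge-type metrics: "how full" each constrained resource is.
REQUEST_QUEUE_DEPTH = Gauge(
    "app_request_queue_depth",
    "Number of requests currently waiting in the internal queue",
)
DB_CONNECTIONS_IN_USE = Gauge(
    "app_db_connections_in_use",
    "Database connections currently checked out of the pool",
)
DB_CONNECTIONS_MAX = Gauge(
    "app_db_connections_max",
    "Maximum size of the database connection pool",
)

def export_saturation(queue_len: int, pool_in_use: int, pool_size: int):
    REQUEST_QUEUE_DEPTH.set(queue_len)
    DB_CONNECTIONS_IN_USE.set(pool_in_use)
    DB_CONNECTIONS_MAX.set(pool_size)

# Alerting on "how close to full", e.g. in PromQL:
#   app_db_connections_in_use / app_db_connections_max > 0.9
```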

While the Golden Signals covered in this blog are metrics-driven and a good starting point for detecting that something is going wrong, they are not the only things to consider. There are various other metrics that you may not need to track on a daily basis, but that become important places to investigate when an incident takes place. We will cover these in Part 2 of this blog series.

Irrespective of your strategy, understanding why a system exists and what services and business use cases it serves is vital. This will lead you to identify the critical paths in your business logic and help you model the metrics collection system around them.

Written By: Safeer CM
SRE, Monitoring, Observability