Kubernetes can be installed with a variety of tools, including open-source installers, third-party vendor distributions, and managed services in a public cloud. In most cases, default installations have limited monitoring capabilities. Therefore, once a Kubernetes cluster is running, administrators must implement monitoring solutions to meet their requirements.
Typical use cases for Kubernetes monitoring include:
Effective Kubernetes monitoring requires a mix of tools, strategy, and technical expertise. To help you get it right, this article will explore seven essential Kubernetes monitoring best practices in detail.
The table below summarizes the Kubernetes monitoring best practices we will explore in this article.
Before we go into more detail, let’s unpack an often confusing topic: monitoring vs. observability. The term “monitoring” is more traditional and covers the collection of metrics and logs from the application infrastructure components. The idea is to “monitor” a workload by constantly evaluating the real-time performance of its underlying infrastructure.
Observability is a relatively new concept. Even though it overlaps with monitoring, its end goal is to isolate a performance bottleneck along a transaction path rather than to watch the application infrastructure. Observability gained traction in application environments built around microservices, where an application comprises modular services that are hosted in ephemeral containers and interact with each other via application programming interfaces (APIs). In such an environment, monitoring the servers and containers in isolation isn’t meaningful, so a new perspective was required, giving rise to the notion of observability.
In addition to metrics and logs, observability also includes distributed tracing to follow the path of a transaction through the application infrastructure. Distributed tracing enables operations engineers to understand the path a user’s request takes, including:
Observability allows operations engineers to quickly understand the upstream and downstream impact of application services on each other. Typically, observability tools combine metrics, logs, and tracing to give engineers a coherent view of the entire transaction path across the infrastructure. Read this article if you want to learn more about observability (also called “O11y”).
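To make the idea of isolating a bottleneck along a transaction path more concrete, here is a minimal sketch in Python. It does not use any real tracing library. The `Span` fields, the service names, and the assumption that child spans run one after another (never in parallel) are all illustrative choices for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """One unit of work in a trace. Field names are illustrative only."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the request
    service: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_time_by_service(spans: List[Span]) -> Dict[str, float]:
    """Return the time each service spent on its own work.

    Self time is a span's duration minus the duration of its direct
    children. This assumes children run sequentially, not in parallel.
    """
    child_time: Dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_time[span.parent_id] = (
                child_time.get(span.parent_id, 0.0) + span.duration_ms
            )

    totals: Dict[str, float] = {}
    for span in spans:
        own = span.duration_ms - child_time.get(span.span_id, 0.0)
        totals[span.service] = totals.get(span.service, 0.0) + own
    return totals


def slowest_service(spans: List[Span]) -> str:
    """Name the service with the largest self time in the trace."""
    totals = self_time_by_service(spans)
    return max(totals, key=totals.get)


# A hypothetical request: frontend calls checkout, which calls payments-db.
EXAMPLE_TRACE = [
    Span("a", None, "frontend", 0.0, 120.0),
    Span("b", "a", "checkout", 10.0, 110.0),
    Span("c", "b", "payments-db", 20.0, 100.0),
]
```

In this example the frontend span is the longest overall, yet most of that time is spent waiting on downstream calls. Looking at self time per service points to `payments-db` instead, which is the kind of upstream and downstream insight that per-container metrics alone would not reveal.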
The seven Kubernetes monitoring best practices below can help DevOps and SRE (site reliability engineering) teams achieve SLOs (service level objectives) and improve overall infrastructure observability.
Determining business goals is the first (and arguably the most important) Kubernetes monitoring best practice. Examples of such goals are:
While planning is important, it’s also essential not to overthink it. Teams just getting started with monitoring should avoid analysis paralysis and instead take an iterative approach to developing a plan. Requirements can be added or refined later as new information emerges.
Once you’ve identified your business goals, you can identify which metrics you need to collect to achieve those goals. This step also includes defining related configuration parameters, such as the collection rate and how long you need to store the metrics data.
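Collection rate and retention period together drive how much storage the metrics will need, so it can help to estimate this early. The sketch below is a back-of-the-envelope calculation in Python. The function names are made up for this example, and the default of 2 bytes per sample is only a placeholder assumption; the real figure depends on your storage backend and should be measured.

```python
SECONDS_PER_DAY = 86_400


def samples_per_series(scrape_interval_s: float, retention_days: float) -> int:
    """Number of samples one time series accumulates over the retention period."""
    return int(retention_days * SECONDS_PER_DAY / scrape_interval_s)


def estimated_storage_bytes(
    active_series: int,
    scrape_interval_s: float,
    retention_days: float,
    bytes_per_sample: float = 2.0,  # placeholder assumption, measure your own
) -> float:
    """Rough storage estimate: series x samples per series x bytes per sample."""
    per_series = samples_per_series(scrape_interval_s, retention_days)
    return active_series * per_series * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical cluster: 100,000 series, scraped every 30 s, kept 15 days.
    size = estimated_storage_bytes(100_000, 30, 15)
    print(f"{size / 1e9:.2f} GB")
```

Halving the collection interval doubles the estimate, and so does doubling the retention period, which is why both parameters are worth deciding alongside the list of metrics rather than afterwards.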
Some metrics, typically system metrics, are readily available. These metrics include:
System metrics are usually necessary as part of any monitoring strategy and tend to show the overall stress the cluster is under. However, they are quite basic and usually won’t give enough actionable information beyond telling whether the cluster seems healthy.
Additionally, more complex metrics are often required. These metrics are usually tied to the software you run. For example, they could measure:
Our next Kubernetes monitoring best practice is selecting the right tools based on the metrics you need and the monitoring goals you want to achieve.
Kubernetes monitoring tools are commonly divided into two categories: Free and Open Source Software (FOSS) and commercial third-party software. Some examples of FOSS monitoring solutions include:
While plenty of open-source options are available, you will need in-house expertise and a significant amount of your DevOps engineers’ time to build and maintain a FOSS monitoring solution. If you don’t have in-house experts, you can hire consultants to build a solution, but this will likely be expensive. On the other hand, developing your own monitoring solution could save you a considerable amount of money in the long run.
The alternative is to pay for third-party software, which is usually offered as a turn-key, software-as-a-service (SaaS) solution. Commercial options typically have more advanced features, such as machine learning to detect suspicious trends and patterns or to perform offline data analysis. Additionally, most commercial solutions come with a level of support that FOSS projects lack.
When evaluating solutions, remember that using third-party tools (especially SaaS products) can create compliance issues, for example when monitoring data includes personal information that must be safeguarded under regulations such as HIPAA or GDPR. You might also need to open your cluster to allow routes from the internet for the third-party SaaS products, which increases the attack surface and could create other security issues.
Unless you run a non-production workload, you probably want every element of your monitoring solution to be highly available and scalable. Achieving high-availability monitoring requires monitoring of the monitoring system itself. At a minimum, you must be able to detect critical failures in your monitoring system and send notifications when they occur. Ideally, you should also configure automated remediation of such problems.
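One common way to detect a failed monitoring system is a heartbeat check, sometimes called a dead man's switch: the monitoring stack signals regularly that it is alive, and a separate watchdog raises the alarm when the signal stops. The Python sketch below illustrates the logic only. The class name, the 60-second threshold, and the injectable clock are assumptions for this example, and a real watchdog would need to run outside the system it watches.

```python
import time
from typing import Callable, Optional


class HeartbeatWatchdog:
    """Flags a monitoring system as stale when its heartbeats stop arriving."""

    def __init__(
        self,
        max_silence_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_silence_s = max_silence_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_beat: Optional[float] = None

    def beat(self) -> None:
        """Called whenever the monitoring system reports that it is alive."""
        self._last_beat = self._clock()

    def is_stale(self) -> bool:
        """True if no heartbeat was ever seen, or the last one is too old."""
        if self._last_beat is None:
            return True
        return self._clock() - self._last_beat > self.max_silence_s


if __name__ == "__main__":
    watchdog = HeartbeatWatchdog(max_silence_s=60)
    watchdog.beat()
    if watchdog.is_stale():
        print("Monitoring system is silent: notify the on-call engineer")
    else:
        print("Monitoring system is alive")
```

The key design choice is that silence, not an explicit error, triggers the notification. A crashed monitoring system cannot report its own failure, so the absence of a signal is the only reliable indicator.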
Generally speaking, this extra level of monitoring is required only for in-house solutions, as third-party SaaS vendors usually have monitoring systems for their own platforms. Some FOSS products incorporate their own monitoring systems. For example, Loki comes with Loki Canary, which regularly sends dummy logs to Loki and reads them back to verify that the logging pipeline is working.
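The write-then-read-back idea behind a canary can be sketched in a few lines of Python. This is not Loki Canary's implementation and it does not call any Loki API. `InMemoryLogStore` is a hypothetical stand-in for a log backend, with a `drop_every` option that simulates lost entries.

```python
import uuid
from typing import List, Set


class InMemoryLogStore:
    """Hypothetical stand-in for a log backend, used only for this example."""

    def __init__(self, drop_every: int = 0) -> None:
        self._lines: List[str] = []
        self._writes = 0
        self._drop_every = drop_every  # 0 means never drop an entry

    def write(self, line: str) -> None:
        self._writes += 1
        if self._drop_every and self._writes % self._drop_every == 0:
            return  # simulate an entry lost somewhere in the pipeline
        self._lines.append(line)

    def read_all(self) -> List[str]:
        return list(self._lines)


def run_canary(store: InMemoryLogStore, count: int) -> Set[str]:
    """Write `count` uniquely tagged lines, read them back, return the missing ones."""
    markers = [f"canary-{uuid.uuid4()}" for _ in range(count)]
    for marker in markers:
        store.write(marker)
    found = set(store.read_all())
    return {marker for marker in markers if marker not in found}


if __name__ == "__main__":
    missing = run_canary(InMemoryLogStore(drop_every=5), count=10)
    print(f"{len(missing)} canary entries were lost")
```

An empty result means every test entry made the round trip. Any missing marker indicates that the pipeline is dropping data, even if every individual component still reports itself as healthy.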
Your monitoring system will accumulate data over time, and this data should be managed like any other data. You will need to determine how long to keep it and whether to move it to cold storage after a while. Be sure to consider any regulations or legal requirements applicable to your organization so that the data can be accessed and provided quickly if requested. Defining the retention requirements for your monitoring data should be part of your overall requirements-gathering exercise, after which you will need to implement them accordingly.
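A retention policy like the one described above can be expressed as a simple age-based rule. The Python sketch below is illustrative only. The tier names and the 30-day and 365-day thresholds are placeholder assumptions, and your own values should come from your requirements and any applicable regulations.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Tier(Enum):
    HOT = "hot"          # kept in fast storage for dashboards and alerting
    COLD = "cold"        # moved to cheaper storage, still retrievable
    EXPIRED = "expired"  # past the retention period, eligible for deletion


def tier_for(
    recorded_at: datetime,
    now: datetime,
    hot_for: timedelta = timedelta(days=30),    # placeholder threshold
    keep_for: timedelta = timedelta(days=365),  # placeholder threshold
) -> Tier:
    """Decide where a piece of monitoring data belongs based on its age."""
    age = now - recorded_at
    if age > keep_for:
        return Tier.EXPIRED
    if age > hot_for:
        return Tier.COLD
    return Tier.HOT


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for days in (1, 90, 400):
        print(days, "days old:", tier_for(now - timedelta(days=days), now).value)
```

Keeping the thresholds as parameters makes it straightforward to apply different rules to different kinds of data, for example a longer retention period for audit logs than for high-resolution metrics.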
Do not neglect monitoring your control plane! All the best practices we have listed apply to the control plane as well as the data plane. Some managed Kubernetes services, such as Amazon EKS, take care of this for you. If yours does not, you will need to add the control plane nodes and the various control plane components to your monitoring strategy.
Once your monitoring system is up and running and able to send alerts to your team, you must consider how to respond to such alerts. Squadcast can help coordinate incident response so that your team can work as efficiently as possible while dealing with the problem.
Integrating monitoring data into a robust incident response strategy helps teams detect and recover from outages and other production-disrupting incidents faster. As a result, MTTR decreases, and uptime improves.
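To track whether MTTR (mean time to recovery) is actually decreasing, you can compute it from your incident records. The Python sketch below assumes each incident is recorded as a pair of timestamps, detection time and resolution time. That record format and the function name are assumptions for this example, not the format of any particular tool.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Each incident is (detected_at, resolved_at). This shape is an assumption.
Incident = Tuple[datetime, datetime]


def mean_time_to_recovery(incidents: List[Incident]) -> timedelta:
    """Average time between detecting an incident and resolving it."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so that timedeltas can be summed
    )
    return total / len(incidents)


if __name__ == "__main__":
    history = [
        (datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 30)),
        (datetime(2024, 3, 9, 22, 0), datetime(2024, 3, 9, 23, 30)),
    ]
    print("MTTR:", mean_time_to_recovery(history))
```

Comparing this figure across time periods, for example before and after introducing a new alerting rule or response process, gives a simple measure of whether incident response is improving.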
Monitoring your production workloads is necessary, but working towards true observability across your Kubernetes infrastructure is just as important. If you’re just starting, the most important of our Kubernetes monitoring best practices is gathering your requirements and defining your business goals.
Once you have requirements and goals, identify which metrics will help fulfill them before you move on to tooling. Selecting the right tools is an important step, especially the choice between FOSS, which requires your team to spend time and effort implementing an in-house monitoring solution, and a paid third-party solution, which usually offers more features and better support. Depending on your project’s requirements, compliance and security may also need to factor into your choice of tools.
Finally, especially after building an in-house solution, ensure your monitoring system is reliable, which means monitoring the monitoring system itself. And don’t forget that Squadcast can help with the coordination of incident responses within your team.